PROJECT REPORT ON CLUSTERING
CS-267
Submitted To: Dr. T. Y. Lin
Submitted By: SUMITKUMAR S. GHOSH (SJSU ID: 006522360)

A REPORT ON CLUSTERING BY SUMITKUMAR S. GHOSH

Abstract:
This report focuses on k-means clustering and in particular on Lloyd's algorithm, the popular heuristic for solving it. It begins with an introduction to clustering, its benefits and its classification. It then describes the k-means algorithm, demonstrates it with the help of an example, presents the approach used to implement the algorithm and the results of the implementation in the form of graphs, and closes with the limitations of clustering.

CLUSTERING

What is Clustering?
Clustering is the grouping of data. Unlike classification, the groups are not predefined; instead they are formed according to similarities in the characteristics of the data. This forming of groups is called clustering.

Some proposed definitions of clustering:
1) A set of like elements.
2) A group in which the distance between points is less than the distance to points outside the group.

Basic features of clustering:
1) The (best) number of clusters is not known.
2) There may not be any prior knowledge concerning the clusters.
3) Cluster results are dynamic.

General examples:
1) Clustering the customers of a company according to their merits.
2) Clustering houses according to attributes such as size, density and distance.

Similarity and distance measures for clustering:
The centroid is the middle of a cluster; it need not be an actual point in the cluster. The radius is the square root of the average squared distance from the points of the cluster to its centroid. The diameter is the square root of the average mean squared distance between all pairs of points in the cluster.

Classification of clustering algorithms:
Clustering algorithms are broadly divided into hierarchical algorithms, which build a nested set of clusters, and partitional algorithms, which create a single set of clusters.

Mathematical definition of clustering:
Given a database D = {t1, t2, ..., tn} of tuples and an integer value k, the clustering problem is to define a mapping f : D -> {1, ..., k} where each ti is assigned to one cluster Kj, 1 <= j <= k, and

    Kj = { ti | f(ti) = j, 1 <= i <= n, and ti ∈ D }

Hierarchical Algorithms:
These algorithms differ in how the clustering sets are created. A key data structure, the dendrogram, can be used to illustrate the hierarchical techniques and the resulting sets of clusters. The root of the dendrogram tree is a single cluster containing all elements; each leaf is a single-element cluster. Each internal node represents a new cluster formed by merging the clusters that appear as its children in the tree, and each level of the tree is associated with the distance measure that was used to merge the clusters at that level.

Agglomerative Algorithm:
It starts with each individual item in its own cluster and iteratively merges clusters until all items belong to one cluster. Algorithms differ in how the clusters are merged at each level. A set of elements and the distances between them is assumed as input. The output of the algorithm is a dendrogram, represented as a set of ordered triples (d, k, K), where d is the threshold distance, k is the number of clusters and K is the set of clusters.

This algorithm uses a procedure called NewCluster to determine how to create the next level of clusters from the previous level; this is where the different types of agglomerative algorithms differ. The most well-known agglomerative techniques are:
I. Single link
II. Complete link
III. Average link

Single link technique:
It is based on the idea of finding maximal connected components in a graph; a maximal connected component is a subgraph in which there exists a path between any two vertices. Clusters are merged if there is at least one edge that connects the two clusters.
That is, two clusters are merged if the minimum distance between any two of their points is less than or equal to the threshold distance being considered.

The single link approach has several problems:
1) It is not very efficient: each iteration costs O(n^2) in both time and space. A modification can be developed that looks at which clusters from an earlier level can be merged at each step.
2) It tends to create clusters with long chains.
An alternative variation of the single link algorithm is based on the use of a minimum spanning tree (MST).

MST Single Link Algorithm:
It produces a minimum spanning tree given an adjacency matrix as input. The clusters are merged in increasing order of the distances found in the MST. Once two clusters are merged, the distance between them in the tree is set to ∞.

Complete Link Algorithm:
Similar to the single link algorithm, but it looks for cliques rather than connected components. A clique is a maximal subgraph in which there is an edge between every two vertices. Here the maximum distance between any two points of the two clusters is examined, so that the clusters are merged only if this maximum distance is less than or equal to the distance threshold. Time and space complexity are both O(n^2).

Average Link Algorithm:
It merges two clusters if the average distance between any two points in the two clusters is below the distance threshold. Here the complete graph is examined at each stage, not just the threshold graph.

Divisive Clustering:
All items are initially placed in one cluster. The idea is to split up clusters in which some elements are not sufficiently close to the other elements; the process works top-down, in contrast to the bottom-up agglomerative approach.

Partitional Algorithms:
Only one set of clusters is created.
The various algorithms internally produce different sets of clusters. The user must input the desired number of clusters, and some criterion function or metric must be used to determine the goodness of any proposed solution; the metric could be, for example, the average distance between clusters.

Minimal Spanning Tree:
This is a very simplistic approach which illustrates how partitional algorithms work. The clustering problem is again to define a mapping; the output of this algorithm is a set of ordered pairs (ti, j) where f(ti) = Kj. Time complexity is O(n^2).

Squared Error Clustering Algorithm:
Given a cluster Ki = {ti1, ti2, ..., tim} with centroid Ci, the squared error for the cluster is

    se(Ki) = sum over j = 1..m of ||tij - Ci||^2

and for a clustering K = {K1, K2, ..., Kk} the total squared error is

    se(K) = sum over i = 1..k of se(Ki)

Nearest Neighbor Algorithm:
Similar to single link: items are iteratively merged into the existing cluster that is closest, provided the distance is within a threshold t.

PAM (Partitioning Around Medoids) Algorithm:
A medoid is the most centrally located object in a cluster. PAM handles outliers well. At each step it determines whether there is an item that should replace one of the existing medoids: by looking at all pairs of medoid and non-medoid objects, the algorithm chooses the pair whose exchange improves the overall quality of the clustering the most, and exchanges them. Quality here is the sum of all distances from non-medoid objects to the medoid of their cluster.

PAM Cost Calculation:
At each step of the algorithm, medoids are changed if the overall cost is improved. Cjih denotes the cost change for an item tj associated with swapping medoid ti with non-medoid item th. Four cases are to be considered while calculating this cost, depending on whether tj currently belongs to the cluster of ti and on which medoid tj is nearest after the swap. The total change to the quality caused by the medoid change is then

    TCih = sum over j of Cjih

PAM Algorithm:
PAM does not scale well to large data sets because of its computational complexity.

BEA (Bond Energy Algorithm):
BEA is used in database design (physical and logical), for example for vertical fragmentation, where the affinity (bond) between attributes is determined based on common usage. The basic steps of the algorithm are:
1) Create an attribute affinity matrix in which each entry indicates the affinity between the two associated attributes. The entries in this similarity matrix are based on the frequency of common usage of attribute pairs.
2) Convert the similarity matrix to a BOND matrix in which the entries represent a type of nearest-neighbor bonding based on the probability of co-access. The BEA algorithm rearranges rows and columns so that similar attributes appear close together in the matrix.
3) Finally, the designer draws boxes around regions of the matrix with high similarity. The result is a clustered affinity matrix.

Clustering with Genetic Algorithms:
Consider the items {A,B,C,D,E,F,G,H} and randomly choose the initial solution {A,C,E} {B,F} {D,G,H}, encoded as the bit strings 10101000, 01000100, 00010011. Suppose crossover is performed at point four between the 1st and 3rd individuals: this yields 10100011 and 00011000, giving the new solution 00011000, 01000100 and 10100011.

K-Means Clustering:
This is the most common algorithm using an iterative refinement technique. Due to its ubiquity it is often called simply the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly in the computer science community. Items are moved among the sets of clusters until the desired set is reached, so that a high degree of similarity among elements within a cluster, and dissimilarity between clusters, is obtained. A separate criterion function is not necessary. Given a cluster Ki = {ti1, ti2, ..., tim}, the cluster mean is mi = (1/m)(ti1 + ... + tim).

The k-means algorithm needs a starting point, which depends on the data that is provided.
If the order of the data is changed, the resulting cluster values can change.

Example: Given {2,4,10,12,3,20,30,11,25} and k = 2, randomly assign the means m1 = 3, m2 = 4. The algorithm then proceeds:

    K1={2,3},            K2={4,10,12,20,30,11,25},  m1=2.5,  m2=16
    K1={2,3,4},          K2={10,12,20,30,11,25},    m1=3,    m2=18
    K1={2,3,4,10},       K2={12,20,30,11,25},       m1=4.75, m2=19.6
    K1={2,3,4,10,11,12}, K2={20,30,25},             m1=7,    m2=25

Stop, as the clusters obtained with these means are the same as the previous ones.

Limitations of k-means clustering: It does not handle outliers well, and it is not time efficient and does not scale well, though it often produces good results.

Standard K-Means Algorithm Demonstration:
1) k initial "means" (in this case k = 3, the number of clusters) are randomly selected from the data set.
2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
3) The centroid of each of the k clusters becomes the new mean.
4) Steps 2 and 3 are repeated until convergence is reached and the final clusters are formed.

Steps to write the code for the k-means clustering algorithm:

Step 1. Input the value of k, the number of clusters, and ask for the data file (or input the data directly).

Step 2. Using the provided value of k, initially partition the data into k clusters. Training samples may be assigned randomly or systematically:
1. Take the first k training samples provided and assign them as single-element clusters.
2. Assign each of the remaining (N - k) training samples to the cluster with the nearest centroid. Each time a training sample is assigned to a cluster, recalculate the centroid of the cluster gaining the sample.

Step 3. Compute the distance of each sample from the centroid of each of the clusters in a sequential manner.
If a sample is closer to the centroid of another cluster than to the centroid of its own, switch the sample to that cluster and update the centroids of both the cluster gaining the sample and the cluster losing it.

Step 4. Repeat Step 3 until convergence is achieved, that is, until the assignment becomes steady: a pass through the training samples causes no new assignments.

Exceptions: If the number of clusters is greater than or equal to the number of data items, assign each data item as the centroid of its own cluster; each centroid carries a cluster number. If the number of data items is greater than the number of clusters, then for each data item calculate the distance to all centroids and take the minimum; the item is said to belong to the cluster whose centroid is at minimum distance. Since we are not sure about the final location of the centroids, we adjust each centroid location based on the currently assigned data, and then reassign all the data to the new centroids. This process is repeated until no data item moves to another cluster anymore. Mathematically this loop can be proved to be convergent; convergence will always occur if the following conditions are satisfied:
1. Each switch in Step 3 decreases the sum of the distances from each training sample to that sample's group centroid.
2. There are only finitely many partitions of the training samples into k clusters.

Source Code Developed in C:

/* This k-means code was developed by Ashutosh Singh, Graduate Student,
   Computer Science Department, and Sumit Ghosh, Graduate Student,
   Software Engineering Department, as part of the Data Mining project
   in course CS267 under the guidance of Professor Dr. T. Y. Lin,
   Computer Science Department, San Jose State University.
*/

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

#define TRUE  1
#define FALSE 0

//Number of data items
int N;
//Number of clusters to form
int K;
//Centroid index array
int * CenterIndex;
//Set of all centroids
double * Center;
//Copy of the centroids (previous iteration)
double * CenterCopy;
//Data collection
double * AllData;
//The set of clusters
double ** Cluster;
//Number of elements in each cluster; used for stack-style processing
int * Top;

FILE *fp;
int lcount;
char line[100];
char *key;
char file_name[80] = "\0";

//Randomly generate k distinct indices x (0 <= x <= n-1)
//to serve as the starting centers of the clusters
void CreateRandomArray(int n, int k, int * center){
    int i=0;
    int j=0;
    for(i=0;i<k;++i){
        //random index selected from the input
        int a=rand()%n;
        //check for a repeat
        for(j=0;j<i;j++)
            if(center[j]==a)
                break;
        //add if unrepeated
        if(j>=i)
            center[i]=a;
        //if repeated, generate this center again
        else
            i--;
    }
}

//Return the index of the cluster center nearest to value
int GetIndex(double value, double * center){
    int i=0;
    //index of the nearest center found so far
    int index=i;
    //minimum distance to a cluster center found so far
    double min=fabs(value-center[i]);
    for(i=0;i<K;i++){
        //if this distance is smaller than the current minimum,
        //update the index and the minimum distance
        if(fabs(value-center[i])<min){
            index=i;
            min=fabs(value-center[i]);
        }
    }
    return index;
}

//Copy the centroid array
void CopyCenter(){
    int i=0;
    for(i=0;i<K;i++)
        CenterCopy[i]=Center[i];
}

//Centroid initialization by random generation
void InitCenter(){
    int i=0;
    //Generate K random indices
    CreateRandomArray(N,K,CenterIndex);
    for(i=0;i<K;i++){
        //Assign the corresponding data items as centroids
        Center[i]=AllData[CenterIndex[i]];
    }
    CopyCenter();//copy the centroid array
}

//Add a data item to the Cluster[index] collection
void AddToCluster(int index, double value){
    //push onto the cluster's stack
    Cluster[index][Top[index]++]=value;
}

//Update the cluster sets
void UpdateCluster(){
    int i=0;
    int tindex;
    //empty every cluster
    for(i=0;i<K;i++)
        Top[i]=0;
    for(i=0;i<N;i++){
        //index of the cluster center nearest to this data item
        tindex=GetIndex(AllData[i],Center);
        //add it to the appropriate collection
        AddToCluster(tindex,AllData[i]);
    }
}

//Update the centroid of each cluster from the elements in its collection
void UpdateCenter(){
    int i=0;
    int j=0;
    double sum=0;
    for(i=0;i<K;i++){
        sum=0;
        for(j=0;j<Top[i];j++)
            sum+=Cluster[i][j];
        //if the cluster is not empty, take the average
        if(Top[i]>0)
            Center[i]=sum/Top[i];
    }
}

//Verify whether the elements of the two arrays are the same
int IsEqual(double * center1, double * center2){
    int i;
    for(i=0;i<K;i++){
        if(center1[i]!=center2[i])
            return FALSE;
    }
    return TRUE;
}

//Print the clustering result
void Print(){
    int i,j,count=1,centroid_num=0;
    double temp;
    double f[100];
    float p,hun;
    FILE* fw;
    //note: the output is plain text, despite the .xlsx extension
    fw=fopen("object_output.xlsx","w+");
    fprintf(fw,"No of objects: %d\n", N);
    for(i=0;i<K;i++){
        printf("\nS%d Group: Centroid(%f)\n",i,Center[i]);
        fprintf(fw,"\nS%d Group: Centroid(%f)\n",i,Center[i]);
        centroid_num++;
        f[i]= Center[i];
        for(j=0;j<Top[i];j++){
            printf("%f\n",Cluster[i][j]);
            fprintf(fw,"%f\n", Cluster[i][j]);
        }
    }
    fclose(fw);
    //sort the centroids into f[] in descending order (bubble sort)
    for(i=K-2;i>=0;i--){
        for(j=0;j<=i;j++){
            if(f[j]<f[j+1]){
                temp=f[j];
                f[j]=f[j+1];
                f[j+1]=temp;
            }
        }
    }
    //count the clusters whose gap to the next centroid exceeds
    //20% of the total centroid range
    hun= f[0]-f[centroid_num-1];
    for(i=0;i<centroid_num-1;i++){
        p= ((f[i]-f[i+1])/hun)*100;
        if(p>20){
            count++;
        }
    }
}

//Initialize the data and the clusters
void InitData(){
    int i=0;
    char* fname = file_name;
    printf("Enter the no. of clusters: ");
    scanf("%d",&K);
    //first pass: count the number of values in the input file
    fp=fopen(fname, "r");
    i = 0;
    while(fgets(line, sizeof(line), fp) != NULL ) {
        key = strtok(line, " \n");
        while(key){
            key = strtok(NULL, " \n");
            i++;
        }
    }
    N = i;
    fclose(fp);
    printf("****************************************************\n");
    printf("No of Objects tested: %d\n", N);
    //memory allocation for the cluster centers
    Center=(double *)malloc(sizeof(double)*K);
    //memory for the indexes of the cluster centers
    CenterIndex=(int *)malloc(sizeof(int)*K);
    //memory for the copy of the cluster centers
    CenterCopy=(double *)malloc(sizeof(double)*K);
    Top=(int *)malloc(sizeof(int)*K);
    //memory for the data inputs
    AllData=(double *)malloc(sizeof(double)*N);
    //memory for the clusters themselves
    Cluster=(double **)malloc(sizeof(double *)*K);
    //initialize the K clusters
    for(i=0;i<K;i++){
        Cluster[i]=(double *)malloc(sizeof(double)*N);
        Top[i]=0;
    }
    //second pass: read the data values from the input file
    fp=fopen(fname,"r");
    i = 0;
    while(fgets(line, sizeof(line), fp) != NULL ) {
        key = strtok(line, " \n");
        while(key){
            AllData[i]= atof(key);
            key = strtok(NULL, " \n");
            i++;
        }
    }
    fclose(fp);
    //initialize the means of the clusters
    InitCenter();
    //build the initial clusters
    UpdateCluster();
}

/*Given the number of clusters K, assign the N objects to clusters,
  maximizing the similarity within clusters and the dissimilarity
  between them.*/
int main(int argc, char* argv[]){
    int Flag = 1;
    if(argc != 2){
        //argv[0] is the program name
        printf("Usage: %s\t <Input_File>\n", argv[0]);
        exit(0);
    }
    strcpy(file_name, argv[1]);
    //initialize with the data values inputted
    InitData();
    //start the iteration
    while(Flag){
        //update the clusters
        UpdateCluster();
        //update the centers
        UpdateCenter();
        /*if the previous means and the current means of the
          clusters are the same, stop*/
        if(IsEqual(Center,CenterCopy)){
            Flag=0;
        }
        /*otherwise keep a copy of the centers from the current
          iteration and continue*/
        else{
            CopyCenter();
        }
    }
    //print the outputs
    Print();
    return 0;
}

Test Input Files and their Output Results:

1. First input data; output of the clusters formed:

No of objects: 56

S0 Group: Centroid(7.745833)
2 2 4.4 5 6 7 7 8 11.4 12 12.15 16

S1 Group: Centroid(35.245200)
22.2 22.3 22.9 23 23.7 24 27.26 28 28 28.8 29 32 32.8 33.3 37 38.8 39.9 43 45 45 47 47.47 49.9 55 55.8

S2 Group: Centroid(80.878125)
62 66 72.6 73 74.4 76 77 77.2 81 83 84 85.85 91 95 97 99

S3 Group: Centroid(313.666667)
266 292 383

2. Second input data: the output file has been provided separately with the report (not included here due to the space constraint).

Graphs:

[Graph: curves plotted for the clusters in the first output]

[Graph: curves plotted for the clusters in the second output]

In both graphs we can see the clusters formed. The desired number of clusters passed is 4, so we find the cluster groups S0, S1, S2 and S3.

[Graph: the result derived for a larger set of data]

Applications:

Image segmentation: k-means is used in computer vision as a form of image segmentation.
The results of the segmentation are used to aid border detection and object recognition. In this context, the standard Euclidean distance is usually insufficient for forming the clusters; instead, a weighted distance measure utilizing pixel coordinates, RGB pixel color and/or intensity, and image texture is commonly used.

k-means clustering in image retrieval systems: Traditional retrieval ranks images by feature similarity with respect to the query alone, ignoring the similarities among the images in the database. By combining low-level visual features with high-level concepts, the proposed approach fully explores the similarities among images in the database: it first clusters the similar images in the image database with such a clustering algorithm, which improves the efficiency of the retrieval system, and then optimizes the relevance results from the traditional image retrieval system. Results on test images show that the proposed approach can greatly improve the efficiency and performance of image retrieval.

Limitations and Challenges in Clustering:

1) Outlier handling: Outliers are those elements which generally do not belong to any cluster. If a clustering algorithm attempts to force outliers into a cluster, that cluster becomes very large, and the result may be poor clusters with respect to the attributes.

2) Dynamic data: If data is dynamic or changing continuously, membership must be reassessed, and this may lead to re-forming the clusters over a period of time.

3) Interpretation of the semantic meaning of a cluster: The labeling of the classes is unknown in advance, so when the clustering process finishes creating a set of clusters, the exact interpretation of each cluster may not be obvious.

4) No unique solution to a clustering problem: The exact number of clusters required is not easy to determine; when attempting to divide the data into similar groupings, it is not clear in advance how many groups should be created.
5) No supervised prior learning: In clustering there is no prior knowledge concerning what the attributes of each classification should be, so clustering can be viewed as a form of unsupervised learning.

My Contribution to the Project:
Studied the various clustering algorithms.
Studied, designed and implemented the basic k-means algorithm in C.
Tested the code with the help of an already implemented reference.
Modified the code to improve the clustering results as per the requirements and test results.
Researched and implemented a technique to find the actual number of clusters based on the data provided.
Generated a spreadsheet from the results obtained by the algorithm and created the graphs based on it.

References:
1. Margaret H. Dunham, Data Mining: Introductory and Advanced Topics.
2. Jiawei Han, Data Mining: Concepts and Techniques.
3. http://en.wikipedia.org/wiki/Cluster_analysis
4. http://people.revoledu.com/kardi/tutorial/kMean/Algorithm.htm
5. http://www.rob.cs.tu-bs.de/content/04-teaching/06interactive/Kmeans/Kmeans.html
6. library.witpress.com/pages/PaperInfo.asp?PaperID=16701