PROJECT REPORT
ON
CLUSTERING
CS-267
Submitted To:
Dr. T. Y. Lin
Submitted By:
SUMITKUMAR S. GHOSH
(SJSU ID: 006522360)
Abstract:
This report focuses mainly on k-means clustering and on Lloyd's algorithm, the popular heuristic used to compute it. It opens with an introduction to clustering, its benefits, and the classification of clustering algorithms. It then describes the k-means algorithm, demonstrates it with an example, presents an approach to implementing the algorithm along with the result of the implementation in the form of a graph, and closes with the limitations of clustering.
CLUSTERING:
What is Clustering?
Clustering is the grouping of data. Unlike classification, however, the groups are not predefined; instead they are formed according to similarities in the characteristics of the data. This forming of groups is called clustering.
Some proposed definitions of clustering:
1) A set of like elements.
2) A group in which the distance between any two points is less than the distance from any point in the group to any point outside it.
Basic features of clustering:
1) The (best) number of clusters is not known.
2) There may not be any prior knowledge concerning the clusters.
3) Cluster results are dynamic.
General Examples:
1) Clustering of the customers of a company according to their merits.
2) Clustering of houses according to attributes (size, density and distance).
Similarity and distance measures for clustering:
The centroid is the middle of a cluster; it need not be an actual point in the cluster.
The radius is the square root of the average mean squared distance of all points in the cluster from the centroid.
The diameter is the square root of the average mean squared distance between all pairs of points in the cluster.
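In symbols, for a cluster $K_m = \{t_{m1}, \dots, t_{mN}\}$ of $N$ points, these measures can be written as follows (a standard formulation supplied here for clarity; the report states them only in words):

$$C_m = \frac{1}{N}\sum_{i=1}^{N} t_{mi}, \qquad R_m = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(t_{mi}-C_m\right)^2}, \qquad D_m = \sqrt{\frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j=1}^{N}\left(t_{mi}-t_{mj}\right)^2}$$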
Classifications of clustering algorithms:
Mathematical definition of clustering:
Given a database D = {t1, t2, …, tn} of tuples and an integer value k, the clustering problem is to define a mapping f : D -> {1, …, k} where each ti is assigned to one cluster Kj, 1 <= j <= k,
where Kj = { ti | f(ti) = j, 1 ≤ i ≤ n, and ti ∈ D }.
Hierarchical Algorithms:
• These algorithms differ in how the clustering sets are created.
• A key data structure, the dendrogram, can be used to illustrate the hierarchical techniques and the set of different clusters.
• The root of the dendrogram tree contains one cluster in which all elements are together.
• Each leaf consists of a single-element cluster.
• Each internal node represents a new cluster formed by merging the clusters that appear as its children in the tree.
• Each level in the tree is associated with the distance measure that was used to merge the clusters.
Agglomerative Algorithms:
• They start with each individual item in its own cluster and iteratively merge clusters until all items belong to one cluster.
• Algorithms differ in how the clusters are merged at each level.
• They assume that a set of elements and the distances between them are given as input.
• The output of the algorithm is a dendrogram, represented as a set of ordered triples (d, k, K) where d is the threshold distance, k is the number of clusters and K is the set of clusters.
• The algorithm uses a procedure called NewCluster to determine how to create the next level of clusters from the previous level. This is where the different types of agglomerative algorithms differ.
• The most well-known agglomerative techniques are:
I. Single link
II. Complete link
III. Average link
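As a small data-structure illustration, one dendrogram level, i.e. one output triple (d, k, K), could be represented in C roughly as below; the type and field names are hypothetical, not taken from the report's implementation.

/* One dendrogram level: the ordered triple (d, k, K). */
typedef struct {
    double d;      /* threshold distance at which this level was formed */
    int    k;      /* number of clusters at this level */
    int  **K;      /* K[c] lists the indices of the items in cluster c */
    int   *size;   /* size[c] = number of items in cluster c */
} DendrogramLevel;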
Single link technique:
• It is based on the idea of finding maximal connected components in a graph.
• A maximal connected component is a subgraph in which there exists a path between any two vertices.
• Clusters are merged if there is at least one edge that connects the two clusters, that is, if the minimum distance between any two points is less than or equal to the threshold distance being considered.
The single link approach has several problems:
1) It is not very efficient: it requires O(n^2) time and O(n^2) space for each iteration. A modification can be developed by looking at which clusters from an earlier level can be merged at each step.
2) It tends to create clusters with long chains.
One alternative variation of the single link algorithm is based on the use of a minimum spanning tree (MST).
MST Single Link Algorithm:
• It produces a minimum spanning tree given an adjacency matrix as input.
• The clusters are merged in increasing order of the distances found in the MST.
• Once two clusters are merged, the distance between them in the tree is set to ∞.
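A minimal sketch of this idea in C is given below, assuming the pairwise distances have already been flattened into an edge list; the union-find representation and all names are illustrative, not part of the report's code. Edges are processed in increasing order of distance, as in Kruskal's MST algorithm, and each accepted edge merges two clusters.

#include <stdlib.h>

typedef struct { int u, v; double d; } Edge;

#define MAX_ITEMS 1000            /* fixed capacity, for brevity */
static int parent[MAX_ITEMS];

static int find(int x) {          /* root lookup with path compression */
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

static int cmp_edge(const void *a, const void *b) {
    double da = ((const Edge *)a)->d, db = ((const Edge *)b)->d;
    return (da > db) - (da < db);
}

/* Merge clusters while the next-shortest edge is within the threshold;
   returns how many clusters remain among the n items. */
int single_link_mst(Edge *edges, int m, int n, double threshold) {
    int i, clusters = n;
    for (i = 0; i < n; i++)
        parent[i] = i;
    qsort(edges, m, sizeof(Edge), cmp_edge);
    for (i = 0; i < m && edges[i].d <= threshold; i++) {
        int ru = find(edges[i].u), rv = find(edges[i].v);
        if (ru != rv) {           /* edge joins two different clusters */
            parent[ru] = rv;
            clusters--;
        }
    }
    return clusters;
}

Raising the threshold step by step reproduces the successive dendrogram levels; once two clusters are merged, later edges between them are simply skipped, which plays the role of setting their distance to ∞.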
Complete Link Algorithm:
• Similar to the single link algorithm, but it looks for cliques rather than connected components.
• A clique is a maximal subgraph in which there is an edge between every two vertices.
• Here the maximum distance between any two points in the two clusters is examined, so that the clusters are merged only if that maximum distance is less than or equal to the distance threshold.
• Time and space complexity are both O(n^2).
Average Link Algorithm:
• It merges two clusters if the average distance between pairs of points in the two clusters is below the distance threshold.
• Here the complete graph is examined at each stage (not just the threshold graph).
Divisive Clustering:
• Items are initially placed in one cluster.
• The idea is to split up clusters in which some elements are not sufficiently close to other elements.
• Proceeding from the top (one cluster) to the bottom (single-element clusters) in this fashion is an example of divisive clustering.
Partitional Algorithms:
• Only one set of clusters is created.
• Different algorithms internally produce different sets of clusters.
• The user must input the desired number of clusters.
• In addition, some criterion function or metric must be used to determine the goodness of any proposed solution.
• The metric could be the average distance between clusters.
Minimal Spanning Tree:
• This is a very simplistic approach that illustrates how partitional algorithms work.
• The clustering problem is to define a mapping, as described earlier.
• The output of this algorithm is a set of ordered pairs (ti, j) where f(ti) = j.
• Time complexity is O(n^2).
Squared Error Clustering Algorithm:
• Given a cluster Ki = {ti1, ti2, …, tim} with centroid Ci, the squared error for Ki is the sum of the squared Euclidean distances from each point in Ki to Ci.
• The squared error for the whole clustering K = {K1, K2, …, Kk} is the sum of the squared errors of the individual clusters, and the algorithm seeks clusters that minimize it.
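Written out (reconstructed here in standard notation, since the original equation did not survive extraction):

$$se_{K_i} = \sum_{j=1}^{m} \lVert t_{ij} - C_i \rVert^{2}, \qquad se_{K} = \sum_{i=1}^{k} se_{K_i}$$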
Nearest Neighbor Algorithm:
• Similar to single link.
• Items are iteratively merged into the existing cluster that is closest: each new item joins the cluster of its nearest already-clustered item if that distance is within the threshold, and otherwise starts a new cluster.
• The threshold is taken as t.
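A minimal sketch of this procedure in C follows, simplified to one-dimensional data like that used later in the report; the function and variable names are illustrative assumptions, not the report's code.

#include <math.h>

/* Each new item joins the cluster of its nearest already-placed item
   if that distance is within the threshold t; otherwise it starts a
   new cluster. Returns the number of clusters formed; label[i] gets
   the cluster id of item i. */
int nearest_neighbor(const double *data, int n, double t, int *label) {
    int i, j, clusters = 0;
    for (i = 0; i < n; i++) {
        int best = -1;
        double bestd = 0.0;
        for (j = 0; j < i; j++) {          /* nearest already-placed item */
            double d = fabs(data[i] - data[j]);
            if (best < 0 || d < bestd) { best = j; bestd = d; }
        }
        if (best >= 0 && bestd <= t)
            label[i] = label[best];        /* join nearest item's cluster */
        else
            label[i] = clusters++;         /* start a new cluster */
    }
    return clusters;
}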
PAM (Partitioning Around Medoids) Algorithm:
• A medoid is the most centrally located object in the cluster.
• It handles outliers well.
• It determines whether there is an item that should replace one of the existing medoids.
• By looking at all pairs of medoid and non-medoid objects, the algorithm chooses the pair whose exchange best improves the overall quality of the clustering and exchanges them.
• Quality is the sum of all distances from non-medoid objects to the medoid of their cluster.
PAM Cost Calculation:
• At each step in the algorithm, medoids are changed if the overall cost is improved.
• Cjih denotes the cost change for an item tj associated with swapping medoid ti with non-medoid th.
• Four cases are considered when calculating this cost, depending on which cluster tj belongs to before the swap and which medoid it is closest to after the swap.
• The total change to the quality caused by a medoid change is the sum of the Cjih over all items tj.
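Reconstructed in the standard PAM notation (the original equation did not survive extraction), the total change for swapping medoid $t_i$ with non-medoid $t_h$ is

$$TC_{ih} = \sum_{j} C_{jih},$$

and the swap is performed only for a pair with negative $TC_{ih}$, i.e. one that actually improves the quality.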
PAM Algorithm:
• PAM does not scale well to large data sets because of its computational complexity.
BEA (Bond Energy Algorithm):
• Used in database design (physical and logical) and vertical fragmentation.
• It determines the affinity (bond) between attributes based on common usage.
• The basic steps of the algorithm are:
1) Create an attribute affinity matrix in which each entry indicates the affinity between the two associated attributes. The entries in the similarity matrix are based on the frequency of common usage of attribute pairs.
2) The BEA then converts the similarity matrix to a BOND matrix in which the entries represent a type of nearest-neighbor bonding based on the probability of co-access. The algorithm rearranges rows and columns so that similar attributes appear close together in the matrix.
3) Finally, the designer draws boxes around regions of the matrix with high similarity.
Clustered affinity matrix for BEA.
Clustering with Genetic Algorithms:
Consider the items {A,B,C,D,E,F,G,H}.
Randomly choose an initial solution: {A,C,E} {B,F} {D,G,H}, encoded as the bit strings 10101000, 01000100, 00010011.
Suppose crossover is performed at point four between the 1st and 3rd individuals: 1010|1000 and 0001|0011 exchange their last four bits. This gives the new solution 10100011, 01000100 and 00011000.
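As a sketch, the bit-level exchange can be written in C as below (illustrative; the report gives no implementation). With the strings read most-significant-bit first, point four means the first four bits are kept and the last four are exchanged.

#include <stdio.h>

/* Keep the first `point` bits of a, take the remaining bits from b. */
unsigned char crossover(unsigned char a, unsigned char b, int point) {
    unsigned char low = (unsigned char)((1u << (8 - point)) - 1);
    return (unsigned char)((a & ~low) | (b & low));
}

int main(void) {
    /* 10101000 = 0xA8 ({A,C,E}), 00010011 = 0x13 ({D,G,H}) */
    printf("%02X %02X\n", crossover(0xA8, 0x13, 4),   /* A3 = 10100011 */
           crossover(0x13, 0xA8, 4));                 /* 18 = 00011000 */
    return 0;
}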
GA Algorithm:
K-Means Clustering:
• The most common clustering algorithm; it uses an iterative refinement technique. Due to its ubiquity it is often called the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly in the computer science community.
• Items are moved among the sets of clusters until the desired set is reached.
• A high degree of similarity among elements within a cluster, and of dissimilarity between clusters, is obtained.
• A criterion function is not necessary.
• Given a cluster Ki = {ti1, ti2, …, tim}, the cluster mean is mi = (1/m)(ti1 + … + tim).
• The k-means clustering algorithm has a starting point that depends on the data provided.
• If the order of the data is changed, then the resulting clusters can change.
Example:
Given: {2,4,10,12,3,20,30,11,25}, k=2
Randomly assign means: m1=3,m2=4
K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5,m2=16
K1={2,3,4},K2={10,12,20,30,11,25}, m1=3,m2=18
K1={2,3,4,10},K2={12,20,30,11,25}, m1=4.75,m2=19.6
K1={2,3,4,10,11,12},K2={20,30,25}, m1=7,m2=25
Stop as the clusters with these means are the same.
Limitations of K-means clustering:
• It does not handle outliers well.
• It is not time efficient and does not scale well, though it often produces good results.
Standard K-means Algorithm Demonstration:
1) k initial "means" (in this case k=3, the number of clusters) are randomly selected from the data set (shown in color).
2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
3) The centroid of each of the k clusters becomes the new mean.
4) Steps 2 and 3 are repeated until convergence has been reached and the final clusters are formed.
Steps to write the code for the k-means clustering algorithm:
Step 1. Input the value of k, the number of clusters. Ask for the data file, else input the data.
Step 2. Use the provided value k and initially partition the data into k clusters. Training samples may be assigned randomly or systematically:
1. Take the first k training samples provided and assign them as single-element clusters.
2. Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. Each time a training sample is assigned to a cluster, recalculate the centroid of the cluster that gains the sample.
Step 3. Compute the distance of each sample from the centroid of each of the clusters in a sequential manner. If a sample is closer to the centroid of another cluster than to that of its own, switch the sample to the cluster whose centroid is closer, and update the centroids of both the cluster gaining the sample and the cluster losing it.
Step 4. Repeat step 3 until convergence is achieved, that is, until the assignment becomes steady: a pass through the training samples causes no new assignments.
Exceptions: If the number of clusters is greater than or equal to the number of data points, we assign each data point as the centroid of its own cluster. Each of the centroids is given a cluster number.
If the number of data points is greater than the number of clusters, we calculate the distance from each data point to every centroid and find the minimum. A data point is said to belong to the cluster whose centroid is at minimum distance from it.
Since we are not sure about the location of each centroid, we adjust the centroid locations based on the currently assigned data and then reassign all the data to the new centroids. This process is repeated until no data point moves to another cluster anymore. Mathematically this loop can be proved to be convergent; convergence will always occur if the following conditions are satisfied:
1. Each switch in step 3 decreases the sum of the distances from each training sample to its group's centroid.
2. There are only finitely many partitions of the training samples into k clusters.
Source Code Developed in C:
/*
This k-means code was developed by
Ashutosh Singh, Graduate Student, Computer Science Department,
&
Sumit Ghosh, Graduate Student, Software Engineering Department,
as part of the Data Mining Project in course CS267 under the
guidance of Professor Dr. T. Y. Lin, Computer Science Department,
San Jose State University.
*/
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

#define TRUE 1
#define FALSE 0

// Number of data points
int N;
// Number of clusters to form
int K;
// Indices of the data points chosen as the initial centroids
int *CenterIndex;
// Set of all centroids
double *Center;
// Copy of the centroids from the previous iteration
double *CenterCopy;
// The input data
double *AllData;
// The set of clusters
double **Cluster;
// Number of elements in each cluster;
// each cluster is filled stack-style.
int *Top;
FILE *fp;
int lcount;
char line[100];
char *key;
char file_name[80] = "\0";
/* Randomly generate k distinct indices x (0 <= x <= n-1)
   to serve as the starting centers of the clusters. */
void CreateRandomArray(int n, int k, int *center) {
    int i = 0;
    int j = 0;
    // Generate k random indices
    for (i = 0; i < k; ++i) {
        // Pick a random data point from the input
        int a = rand() % n;
        // Check whether this index is repeated
        for (j = 0; j < i; j++)
            if (center[j] == a)
                break;
        // Add it if it is not repeated
        if (j >= i)
            center[i] = a;
        // If repeated, generate this center again
        else
            i--;
    }
}
// Return the index of the cluster center nearest to the given value
int GetIndex(double value, double *center) {
    int i = 0;
    // Index of the nearest center found so far
    int index = i;
    // Minimum distance to a cluster center found so far
    double min = fabs(value - center[i]);
    for (i = 0; i < K; i++) {
        /* If this distance is smaller than the current minimum,
           update the nearest index and the minimum distance. */
        if (fabs(value - center[i]) < min) {
            index = i;
            min = fabs(value - center[i]);
        }
    }
    return index;
}
// Copy the centroid array so the previous iteration can be compared
void CopyCenter() {
    int i = 0;
    for (i = 0; i < K; i++)
        CenterCopy[i] = Center[i];
}
// Centroid initialization by random selection
void InitCenter() {
    int i = 0;
    // Generate K random indices
    CreateRandomArray(N, K, CenterIndex);
    for (i = 0; i < K; i++) {
        // Assign the corresponding data values as centroids
        Center[i] = AllData[CenterIndex[i]];
    }
    CopyCenter(); // keep a copy of the centroid array
}
// Add a data value to the Cluster[index] collection
void AddToCluster(int index, double value) {
    // Push the value onto the cluster's stack
    Cluster[index][Top[index]++] = value;
}

// Update the cluster sets
void UpdateCluster() {
    int i = 0;
    int tindex;
    // Empty all clusters by resetting their counts to 0
    for (i = 0; i < K; i++)
        Top[i] = 0;
    for (i = 0; i < N; i++) {
        /* Find the index of the cluster center
           nearest to the current data value. */
        tindex = GetIndex(AllData[i], Center);
        // Add the value to the appropriate collection
        AddToCluster(tindex, AllData[i]);
    }
}
/* Update the centroid of each cluster from
   the elements currently in its collection. */
void UpdateCenter() {
    int i = 0;
    int j = 0;
    double sum = 0;
    for (i = 0; i < K; i++) {
        sum = 0;
        for (j = 0; j < Top[i]; j++)
            sum += Cluster[i][j];
        // If the cluster is not empty,
        if (Top[i] > 0)
            // take the average as the new centroid.
            Center[i] = sum / Top[i];
    }
}
/* Check whether two centroid arrays
   contain the same elements. */
int IsEqual(double *center1, double *center2) {
    int i;
    for (i = 0; i < K; i++) {
        if (center1[i] != center2[i])
            return FALSE;
    }
    return TRUE;
}
// Print the clustering result
void Print() {
    int i, j, count = 1, centroid_num = 0;
    double temp;
    double f[100];
    float p, hun;
    FILE *fw;
    // Plain-text output, despite the .xlsx extension
    fw = fopen("object_output.xlsx", "w+");
    fprintf(fw, "No of objects: %d\n", N);
    for (i = 0; i < K; i++) {
        printf("\nS%d Group: Centroid(%f)\n", i, Center[i]);
        fprintf(fw, "\nS%d Group: Centroid(%f)\n", i, Center[i]);
        centroid_num++;
        f[i] = Center[i];
        for (j = 0; j < Top[i]; j++) {
            printf("%f\n", Cluster[i][j]);
            fprintf(fw, "%f\n", Cluster[i][j]);
        }
    }
    fclose(fw);
    /* Sort the centroids in descending order;
       f[] is the sorted array. */
    for (i = K - 2; i >= 0; i--) {
        for (j = 0; j <= i; j++) {
            if (f[j] < f[j + 1]) {
                temp = f[j];
                f[j] = f[j + 1];
                f[j + 1] = temp;
            }
        }
    }
    /* Count centroid gaps larger than 20% of the full range,
       as an estimate of the number of well-separated clusters. */
    hun = f[0] - f[centroid_num - 1];
    for (i = 0; i < centroid_num - 1; i++) {
        p = ((f[i] - f[i + 1]) / hun) * 100;
        if (p > 20) {
            count++;
        }
    }
}
// Read the input data and initialize the cluster structures
void InitData() {
    int i = 0;
    char *fname = file_name;
    printf("Enter the no. of clusters: ");
    scanf("%d", &K);
    // First pass: count the number of values in the file
    fp = fopen(fname, "r");
    i = 0;
    key = NULL;
    line[0] = '\0';
    while (fgets(line, sizeof(line), fp) != NULL) {
        // Get each line from the input file
        key = strtok(line, " \n");
        while (key) {
            key = strtok(NULL, " \n");
            i++;
        }
    }
    N = i;
    fclose(fp);
    printf("****************************************************\n");
    printf("No of Objects tested: %d\n", N);
    // Memory allocation for the cluster centers
    Center = (double *)malloc(sizeof(double) * K);
    // Memory for the indexes of the initial centers
    CenterIndex = (int *)malloc(sizeof(int) * K);
    // Memory allocation for the copy of the cluster centers
    CenterCopy = (double *)malloc(sizeof(double) * K);
    // Memory for the per-cluster element counts
    Top = (int *)malloc(sizeof(int) * K);
    // Memory allocation for the input data
    AllData = (double *)malloc(sizeof(double) * N);
    // Memory allocation for the clusters themselves
    Cluster = (double **)malloc(sizeof(double *) * K);
    // Initialize the K clusters
    for (i = 0; i < K; i++) {
        Cluster[i] = (double *)malloc(sizeof(double) * N);
        Top[i] = 0;
    }
    // Second pass: read the values into AllData
    fp = fopen(fname, "r");
    i = 0;
    key = NULL;
    line[0] = '\0';
    while (fgets(line, sizeof(line), fp) != NULL) {
        // Get each line from the input file
        key = strtok(line, " \n");
        while (key) {
            AllData[i] = atof(key);
            key = strtok(NULL, " \n");
            i++;
        }
    }
    fclose(fp);
    // Initialize the means of the clusters
    InitCenter();
    UpdateCluster(); // build the initial clusters
}
/* Given the number of clusters K, assign the N
   objects to K clusters so that similarity within
   a cluster is high and similarity between
   clusters is low. */
int main(int argc, char *argv[]) {
    int Flag = 1;
    if (argc != 2) {
        // argv[0] is the program name
        printf("Usage: %s\t <Input_File>\n", argv[0]);
        exit(0);
    }
    strcpy(file_name, argv[1]);
    // Initialize from the input data
    InitData();
    // Start the iteration
    while (Flag) {
        // Update the clusters
        UpdateCluster();
        // Update the centers
        UpdateCenter();
        /* If the previous means and the current means
           of the clusters are the same, stop. */
        if (IsEqual(Center, CenterCopy)) {
            Flag = 0;
        }
        /* Otherwise keep a copy of the centers
           from the current iteration. */
        else {
            CopyCenter();
        }
    }
    // Print the output
    Print();
    return 0;
}
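Assuming the source above is saved as kmeans.c, it can be built and run with commands such as the following (the file names are placeholders); the input file should contain whitespace-separated numeric values, and the program then prompts for the number of clusters:

gcc -o kmeans kmeans.c -lm
./kmeans input.txt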
Test Input Files and their Output results:
1. First Input Data:
Output of the clusters formed:
No of objects: 56
S0 Group: Centroid(7.745833)
2
2
4.4
5
6
7
7
8
11.4
12
12.15
16
S2 Group: Centroid(80.878125)
62
66
72.6
73
74.4
76
77
77.2
81
83
84
85.85
91
95
97
99
S1 Group: Centroid(35.245200)
22.2
22.3
22.9
23
23.7
24
27.26
28
28
28.8
29
32
32.8
33.3
37
38.8
39.9
43
45
45
47
47.47
49.9
55
55.8
S3 Group: Centroid(313.666667)
266
292
383
2. Second Input Data:
The output file has been provided separately with the report. (Not included here due to space constraints.)
Graphs:
The curves plotted for both of the cluster outputs are as below.
First Case:
[Graph: cluster output values plotted against object index; y-axis 0 to 450, x-axis 1 to 25.]
Second Case:
Here, in both graphs we can see the clusters formed. The desired number of clusters passed is 4, so we find the cluster groups formed: S0, S1, S2 and S3.
The result derived for a larger set of data:
Applications:
Image segmentation: k-means is used in computer vision as a form of image segmentation. The results of the segmentation are used to aid border detection and object recognition. In this context, the standard Euclidean distance is usually insufficient for forming the clusters. Instead, a weighted distance measure utilizing pixel coordinates, RGB pixel color and/or intensity, and image texture is commonly used.
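As an illustration of such a weighted measure, a sketch in C follows; the feature set (coordinates plus RGB color) and the weight parameters are assumptions chosen for the example, not values from any particular system.

#include <math.h>

/* One pixel described by its coordinates and RGB color. */
typedef struct { double x, y, r, g, b; } Pixel;

/* Weighted distance combining spatial position and color; ws and wc
   trade off spatial coherence against color similarity (illustrative
   weights, chosen per application). */
double pixel_distance(Pixel p, Pixel q, double ws, double wc) {
    double ds = (p.x - q.x) * (p.x - q.x) + (p.y - q.y) * (p.y - q.y);
    double dc = (p.r - q.r) * (p.r - q.r) + (p.g - q.g) * (p.g - q.g)
              + (p.b - q.b) * (p.b - q.b);
    return sqrt(ws * ds + wc * dc);
}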
k-means Clustering Algorithm in Image Retrieval Systems:
Traditional retrieval ranks images by feature similarity with respect to the query, ignoring the similarities among the images in the database. By combining low-level visual features with high-level concepts, the proposed approach fully explores the similarities among the images in the database: it first clusters the similar images in the image database with such a clustering algorithm, and thereby optimizes the relevance results from a traditional image retrieval system and improves its efficiency. The results on the test images show that the proposed approach can greatly improve the efficiency and performance of image retrieval.
Limitations and Challenges in Clustering:
1) Outlier handling:
Outliers are elements that generally do not belong to any cluster. If a clustering algorithm nevertheless attempts to place outliers in a cluster, the result may be a very large cluster, and this may result in the formation of poor clusters with respect to the attributes.
2) Dynamic data:
If data is dynamic or changing continuously, then cluster membership must be reassessed, and this may lead to re-forming the clusters over a period of time.
3) Interpretation of the semantic meaning of a cluster:
The labeling of the classes is unknown in advance. So, when the clustering process finishes creating a set of clusters, the exact interpretation of each cluster may not be obvious.
4) No unique solution to a clustering problem:
The exact number of clusters required is not easy to determine. If attempts are made to divide the data into similar groupings, it is not clear in advance how many groups should be created.
5) No supervised prior learning:
In clustering there is no prior knowledge concerning what the attributes of each classification should be. So, clustering can be viewed as similar to unsupervised learning.
My Contribution to the Project:
• Studied the various clustering algorithms.
• Studied, designed and implemented the basic k-means algorithm in C.
• Tested the code with the help of already implemented code.
• Modified the code to improve the clustering result as per the requirements and test results.
• Researched and implemented a technique to find the actual number of clusters based on the data provided.
• Generated a spreadsheet from the result obtained by the algorithm and created the graph based on it.
References:
• Data Mining: Introductory and Advanced Topics by Margaret H. Dunham.
• Data Mining: Concepts and Techniques by Jiawei Han.
• http://en.wikipedia.org/wiki/Cluster_analysis
• http://people.revoledu.com/kardi/tutorial/kMean/Algorithm.htm
• http://www.rob.cs.tu-bs.de/content/04-teaching/06interactive/Kmeans/Kmeans.html
• library.witpress.com/pages/PaperInfo.asp?PaperID=16701