Unsupervised Learning: Clustering
					Cluster Analysis
 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Grouping a set of data objects into clusters
 Clustering is unsupervised classification: no predefined
classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
General Applications of Clustering
 Pattern Recognition
 Spatial Data Analysis
 create thematic maps in GIS by clustering feature spaces
 detect spatial clusters and explain them in spatial data
mining
 Image Processing
 Economic Science (especially market research)
 WWW
 Document classification
 Cluster Weblog data to discover groups of similar access
patterns
Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
 Land use: Identification of areas of similar land use in an
earth observation database
 Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
 City-planning: Identifying groups of houses according to
their house type, value, and geographical location
What Is Good Clustering?
 A good clustering method will produce high quality clusters
with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
Requirements of Clustering in Data Mining
 Scalability: many clustering algorithms work well only on small data sets, so methods must scale to large databases
 Ability to deal with different types of attributes
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 Ability to handle high dimensionality
 Interpretability and usability
Major Clustering Approaches
 Partitioning algorithms: Construct various partitions and
then evaluate them by some criterion
 Hierarchical algorithms: Create a hierarchical decomposition of
the set of data (or objects) using some criterion
 Density-based: based on connectivity and density functions
 Model-based: A model is hypothesized for each of the
clusters, and the idea is to find the best fit of the data to
that model
Data Mining: Concepts and Techniques
Partitioning Algorithms: Basic Concept
 Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters
 Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
 k-means: each cluster is represented by the center of the
cluster.
The K-Means Clustering Method
The k-means algorithm proceeds in five steps:
 Step 1: Ask the user how many clusters k the data set should be
partitioned into.
 Step 2: Randomly assign k records to be the initial cluster center
locations.
 Step 3: For each record, find the nearest cluster center. Thus, in a sense,
each cluster center “owns” a subset of the records, thereby representing a
partition of the data set. We therefore have k clusters, C1,C2, . . . ,Ck .
 Step 4: For each of the k clusters, find the cluster centroid, and update the
location of each cluster center to the new value of the centroid.
 Step 5: Repeat steps 3 and 4 until convergence or termination.
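The five steps above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the function name kmeans and its parameters initial_centers, max_passes, and seed are assumptions introduced here, records are assumed to be numeric tuples, and a cluster that ends up empty simply keeps its previous center.

```python
import math
import random


def kmeans(points, k, initial_centers=None, max_passes=100, seed=0):
    """Hypothetical helper: cluster numeric tuples following steps 1 to 5."""
    rng = random.Random(seed)
    # Step 1 is the caller's choice of k.
    # Step 2: pick k records at random as the initial centers,
    # unless the caller supplies them (an addition made here for reproducibility).
    if initial_centers is not None:
        centers = list(initial_centers)
    else:
        centers = rng.sample(points, k)

    assignment = None
    for _ in range(max_passes):
        # Step 3: assign each record to its nearest center (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Step 5: stop once no record changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: move each center to the centroid of the records it owns.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment
```

Calling kmeans(points, 2) picks the two starting centers at random; passing initial_centers makes a run reproducible.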
The K-Means Clustering Method
 Example
[Figure: four scatter plots of the k-means example; both axes run from 0 to 10]
Equations required
Euclidean distance: used to find the cluster center nearest to each record.
Sum of squared errors
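For reference, the two quantities named above can be written out as follows. The symbols are assumptions chosen for illustration: x and y are records with n attributes, C_i is the i-th of the k clusters, and m_i is its center.

```latex
d(x, y) = \sqrt{\sum_{j=1}^{n} (x_j - y_j)^2}
\qquad
\mathrm{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2
```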
K-Means Steps
 Step 1: Ask the user how many clusters k the data set should be
partitioned into. We have already indicated that we are interested in k =
2 clusters.
 Step 2: Randomly assign k records to be the initial cluster center
locations. For this example, we assign the cluster centers to be m1 = (1,1)
and m2 = (2,1).
 Step 3: For each record, find the nearest cluster center.
 Step 4: For each of the k clusters, find the cluster centroid and update the
location of each cluster center to the new value of the centroid.
 Step 5: Repeat steps 3 and 4 until convergence or termination. The
centroids have moved, so we go back to step 3 for our second pass
through the algorithm.
Example
 Suppose that we have the eight data points in two-dimensional space shown in the following table:
Point  a  b  c  d  e  f  g  h
x      1  3  4  5  1  4  1  2
y      3  3  3  3  2  2  1  1
 Let k = 2 clusters.
1- Take c1 = (1,1) and c2 = (2,1) as the initial center points for the 2 clusters.
2- Calculate the distance between each point and the 2 centers. For example, for point a(1,3):
Distance(a, c1) = $\sqrt{(1-1)^2 + (3-1)^2} = \sqrt{4} = 2$
Distance(a, c2) = $\sqrt{(1-2)^2 + (3-1)^2} = \sqrt{5} \approx 2.24$
Example
Step 3 results:
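The step 3 distances can be recomputed with a short Python sketch. It assumes the eight points are a(1,3), b(3,3), c(4,3), d(5,3), e(1,2), f(4,2), g(1,1), and h(2,1); only a(1,3) is stated explicitly, and the rest are inferred from the centroid arithmetic in step 4. The variable names are illustrative.

```python
import math

# Assumed coordinates: inferred from the step 4 centroid sums, not given directly.
points = {"a": (1, 3), "b": (3, 3), "c": (4, 3), "d": (5, 3),
          "e": (1, 2), "f": (4, 2), "g": (1, 1), "h": (2, 1)}
centers = [(1, 1), (2, 1)]  # initial centers c1 and c2 from step 2

clusters = {1: [], 2: []}
for name, p in points.items():
    d1, d2 = (math.dist(p, c) for c in centers)
    nearest = 1 if d1 <= d2 else 2  # a tie would go to cluster 1 (arbitrary choice)
    clusters[nearest].append(name)
    print(f"{name}{p}: d(c1) = {d1:.2f}, d(c2) = {d2:.2f} -> cluster {nearest}")

print(clusters)
```

Under these assumed coordinates the assignment agrees with the cluster memberships used in step 4.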
Example
 Step 4 (first pass): For each of the k clusters find the cluster centroid
and update the location of each cluster center to the new value of
the centroid.
 Cluster 1 points = {a, e, g}; cluster 2 points = {b, c, d, f, h}.
 The centroid for cluster 1 is [(1 + 1 + 1)/3, (3 + 2 + 1)/3] = (1, 2).
 The centroid for cluster 2 is [(3 + 4 + 5 + 4 + 2)/5, (3 + 3 + 3 + 2 + 1)/5] = (3.6, 2.4).
 Step 5: Repeat steps 3 and 4 until convergence or termination. The
centroids have moved, so we go back to step 3 for our second pass
through the algorithm.
Example
Since there is no change in cluster membership, we stop here.
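The stopping rule can be traced with a small Python sketch that repeats steps 3 and 4 from the initial centers (1,1) and (2,1) and records the sum of squared errors after each update. The point coordinates are an assumption inferred from the step 4 centroid arithmetic, and the helper names assign and centroid are hypothetical. Under these assumptions the centers are updated twice, with h joining the first cluster on the second pass, and the run stops on the third pass because no membership changes.

```python
import math

# Assumed coordinates: inferred from the step 4 centroid sums, not given directly.
points = {"a": (1, 3), "b": (3, 3), "c": (4, 3), "d": (5, 3),
          "e": (1, 2), "f": (4, 2), "g": (1, 1), "h": (2, 1)}
centers = [(1.0, 1.0), (2.0, 1.0)]  # initial centers from step 2


def assign(centers):
    """Step 3: map each point name to the index of its nearest center."""
    return {n: min((0, 1), key=lambda j: math.dist(p, centers[j]))
            for n, p in points.items()}


def centroid(names):
    """Step 4: coordinate-wise mean (assumes the cluster is not empty)."""
    xs, ys = zip(*(points[n] for n in names))
    return (sum(xs) / len(names), sum(ys) / len(names))


history = []       # one (groups, centers, sse) entry per updating pass
membership = None
while True:
    new_membership = assign(centers)
    if new_membership == membership:  # step 5: nothing moved, so stop
        break
    membership = new_membership
    groups = [[n for n in points if membership[n] == j] for j in (0, 1)]
    centers = [centroid(g) for g in groups]
    sse = sum(math.dist(points[n], centers[membership[n]]) ** 2 for n in points)
    history.append((groups, centers, sse))
    print(f"pass {len(history)}: {groups} centers={centers} SSE={sse:.2f}")
```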