Download Unsupervised Learning: Clustering

Cluster Analysis  What is Cluster Analysis?  Types of Data in Cluster Analysis  A Categorization of Major Clustering Methods  Partitioning Methods What is Cluster Analysis?  Cluster: a collection of data objects  Similar to one another within the same cluster  Dissimilar to the objects in other clusters  Cluster analysis  Grouping a set of data objects into clusters  Clustering is unsupervised classification: no predefined classes  Typical applications  As a stand-alone tool to get insight into data distribution  As a preprocessing step for other algorithms General Applications of Clustering  Pattern Recognition  Spatial Data Analysis  create thematic maps in GIS by clustering feature spaces  detect spatial clusters and explain them in spatial data mining  Image Processing  Economic Science (especially market research)  WWW  Document classification  Cluster Weblog data to discover groups of similar access patterns Examples of Clustering Applications  Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs  Land use: Identification of areas of similar land use in an earth observation database  Insurance: Identifying groups of motor insurance policy holders with a high average claim cost  City-planning: Identifying groups of houses according to their house type, value, and geographical location What Is Good Clustering?  A good clustering method will produce high quality clusters with  high intra-class similarity  low inter-class similarity  The quality of a clustering result depends on both the similarity measure used by the method and its implementation. Requirements of Clustering in Data Mining  Scalability : work good on small sets only  Ability to deal with different types of attributes  Minimal requirements for domain knowledge to determine input parameters  Able to deal with noise and outliers  Insensitive to order of input records  High dimensionality  Interpretability and usability Cluster Analysis  What is Cluster Analysis?  Types of Data in Cluster Analysis  A Categorization of Major Clustering Methods  Partitioning Methods Major Clustering Approaches  Partitioning algorithms: Construct various partitions and then evaluate them by some criterion  Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion  Density-based: based on connectivity and density functions  Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of that model to each other Data Mining: Concepts and Techniques Data Mining: Concepts and Techniques Cluster Analysis  What is Cluster Analysis?  Types of Data in Cluster Analysis  A Categorization of Major Clustering Methods  Partitioning Methods Partitioning Algorithms: Basic Concept  Partitioning method: Construct a partition of a database D of n objects into a set of k clusters  Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion  k-means : Each cluster is represented by the center of the cluster. The K-Means Clustering Method k-means algorithm is implemented in 5 steps:  Step 1: Ask the user how many clusters k the data set should be partitioned into.  Step 2: Randomly assign k records to be the initial cluster center locations.  Step 3: For each record, find the nearest cluster center. Thus, in a sense, each cluster center “owns” a subset of the records, thereby representing a partition of the data set. We therefore have k clusters, C1,C2, . . . ,Ck .  Step 4: For each of the k clusters, find the cluster centroid, and update the location of each cluster center to the new value of the centroid.  Step 5: Repeat steps 3 to 5 until convergence or termination. The K-Means Clustering Method  Example 10 10 9 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1 1 0 0 0 1 2 3 4 5 6 7 8 9 10 0 10 10 9 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1 1 0 1 2 3 4 5 6 7 8 9 10 0 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 Equations required Euclidean : to calculate the nearest value to the center of cluster. Sum of squared errors Data Mining: Concepts and Techniques K Mean Steps  Step 1: Ask the user how many clusters k the data set should be partitioned into. We have already indicated that we are interested in k = 2 clusters.  Step 2: Randomly assign k records to be the initial cluster center locations. For this example, we assign the cluster centers to be m1 = (1,1) and m2 = (2,1).  Step 3: For each record, find the nearest cluster center.  Step 4 : For each of the k clusters find the cluster centroid and update the location of each cluster center to the new value of the centroid.  Step 5: Repeat steps 3 and 4 until convergence or termination. The centroids have moved, so we go back to step 3 for our second pass through the algorithm. Data Mining: Concepts and Techniques Example  Suppose that we have the eight data points in two- dimensional space shown in the following table:  lets say k = 2 clusters. 1  22  3  12  5  2.23 1-Take c1=(1,1) and c2=(2,1) as initial center points for the 2 clusters 2- calculate the distance between each point and the 2 centers for example :Point a(1,3): Distance (a,c1)= Distance (a,c2)= 1  1  3  1 2 2 1  22  3  12 Data Mining: Concepts and Techniques  4 2  5  2.24 Example Step 3 results: Data Mining: Concepts and Techniques Example  Step 4 (first pass): For each of the k clusters find the cluster centroid and update the location of each cluster center to the new value of the centroid.  Cluster1 points= {a,e,g} , Cluster 2 Points ={b,c,d,f,h}  centroid for cluster 1 is [(1 + 1 + 1) /3, (3 + 2 + 1) /3] = (1,2).  The centroid for cluster 2 is [(3 + 4 + 5 + 4 + 2) /5, (3 + 3 + 3 + 2 + 1) /5] = (3.6, 2.4).  Step 5: Repeat steps 3 and 4 until convergence or termination. The centroids have moved, so we go back to step 3 for our second pass through the algorithm. Data Mining: Concepts and Techniques Example Since there is no change in the cluster points , we stop here Data Mining: Concepts and Techniques

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Unsupervised Learning: Clustering