Clustering methods
• Partitional clustering in which clusters are represented by their centroids (proc FASTCLUS)
• Agglomerative hierarchical clustering in which the closest clusters are repeatedly merged (proc CLUSTER)
• Density-based clustering in which core points and associated border points are clustered (proc MODECLUS)

Data mining and statistical learning lecture 14

Proc FASTCLUS
Select k initial centroids
Repeat the following until the clusters remain unchanged:
• Form k clusters by assigning each point to its nearest centroid
• Update the centroid of each cluster

Identification of water samples with incorrect total nitrogen levels
[Figure: scatter plot; y-axis: Total nitrogen (Kjeldahl), mg/l; x-axis: Total nitrogen (persulfate), mg/l]

Identification of water samples with incorrect total nitrogen levels - 2-means clustering
[Figure: scatter plot with points marked as Cluster 1 or Cluster 2; y-axis: Total nitrogen (Kjeldahl); x-axis: Total nitrogen (persulfate digestion)]
Initialization problems?

Limitations of K-means clustering
1. Difficult to detect clusters with non-spherical shapes
2. Difficult to detect clusters of widely different sizes
3. Difficult to detect clusters of different densities

Proc MODECLUS
• Use a smoother to estimate the (local) density of the given dataset
• A cluster is loosely defined as a region surrounding a local maximum of the probability density function

Identification of water samples with incorrect total nitrogen levels - proc MODECLUS, R = 1000
[Figure: scatter plot, smoothing parameter R = 1000; y-axis: Tot_N (ps), mg/l; x-axis: Tot_N (Kj), mg/l; legend: Cluster 1, Cluster 2, Cluster 3, Cluster 4, Cluster 5, Other clusters]
What will happen if R is increased?

Identification of water samples with incorrect total nitrogen levels - proc MODECLUS, R = 4000
[Figure: scatter plot, smoothing parameter R = 4000; y-axis: Tot_N (ps), mg/l; x-axis: Tot_N (Kj), mg/l; legend: Cluster 1, Cluster 2]

Identification of water samples with incorrect total nitrogen levels - proc MODECLUS, method 6
[Figure: scatter plot; y-axis: Total nitrogen (Kjeldahl); x-axis: Total nitrogen (persulfate digestion); legend: Cluster 1, Cluster 2, Cluster 3, Cluster 4, Cluster 5, Clusters 6 - 18, No cluster assigned]
Why did the clustering fail?

Limitations of density-based clustering
1. Difficult to control (requires repeated runs)
2. Collapses in high dimensions

Strength of density-based clustering
Given a sufficiently large sample, nonparametric density-based clustering methods are capable of detecting clusters of unequal size and dispersion and with highly irregular shapes

Identification of water samples with incorrect total nitrogen levels - transformed data
[Figure: scatter plot; y-axis: Total N (ps) - Total N (Kj); x-axis: Total N (Kj)]

Identification of water samples with incorrect total nitrogen levels - proc MODECLUS, R = 2000, transformed data
[Figure: scatter plot; y-axis: Total N (ps) - Total N (Kj); x-axis: Total N (Kj); legend: Cluster 1, Cluster 2, Cluster 3-6]

Preprocessing
1. Standardization
2. Linear transformation
3. Dimension reduction

Postprocessing
1. Split a cluster
   • Usually, the cluster with the largest SSE is split
2. Introduce a new cluster centroid
   • Often the point that is farthest from any cluster center is chosen
3. Disperse a cluster
   • Remove one centroid and reassign the points to other clusters
4. Merge two clusters
   • Typically, the clusters with the closest centroids are chosen

Profiling website visitors
1. A total of 296 pages at a Microsoft website are grouped into 13 homogeneous categories
   • Initial
   • Support
   • Entertainment
   • Office
   • Windows
   • Othersoft
   • Download
   • …..
2. For each of 32711 visitors we have recorded how many times they have visited the different categories of pages
3. We would like to make a behavioural segmentation of the users (a cluster analysis) that can be used in future marketing decisions

Profiling website visitors - the dataset
[Table: number of visits per page category for clients 10001 to 10021; columns: client_code, initial, help, entertainment, office, windows, othersft, download, otherint, development, hardware, business, information, area]
Why is it necessary to group the pages into categories?
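A count table like this one is a natural place to illustrate the standardization step listed under Preprocessing. The sketch below is only an illustration: the `standardize` helper and the small `visits` table are hypothetical and not taken from the lecture data, and dividing by the population standard deviation is one possible choice among several.

```python
import statistics

def standardize(rows):
    """Column-wise z-scores; a constant column is mapped to 0.0."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation (an assumption; a sample SD would also work).
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stdevs)]
        for row in rows
    ]

# Hypothetical visit counts: rows are visitors, columns are page categories.
visits = [[1, 0, 10], [3, 0, 30], [5, 0, 20]]
z = standardize(visits)
for row in z:
    print([round(v, 3) for v in row])
```

After this step every non-constant column has mean 0 and unit spread, so a frequently visited category no longer dominates the distances used by a clustering procedure.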
Profiling website visitors - 10-means clustering
Profiling website visitors - cluster proximities
Profiling website visitors - profiles
Profiling website visitors - Kohonen Map of cluster frequencies
Profiling website visitors - Kohonen Maps of means by variable and grid cell

Characteristics of Kohonen maps
The centroids vary smoothly over the map
   • The set of clusters having unusually large (or small) values of a given variable tend to form connected spatial patterns
Clusters with similar centroids need not be close to each other in a Kohonen map
The sizes of the clusters in Kohonen maps tend to be less variable than those obtained by K-means clustering
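The K-means loop described for proc FASTCLUS (select k initial centroids, then alternate assignment and centroid update until the clusters stop changing) can be sketched in a few lines of plain Python. This is a minimal illustration of the textbook iteration, not the SAS implementation; the `kmeans` function, the toy `points`, and the hand-picked initial centroids are all assumptions made for the example.

```python
import math

def kmeans(points, initial_centroids, max_iter=100):
    """Basic k-means iteration; returns (centroids, labels)."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop when the clusters remain unchanged.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels

# Toy data with two well-separated groups (hypothetical values).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

Because the result depends on the initial centroids, rerunning with a different starting pair is a simple way to probe the initialization question raised on the 2-means slide.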
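The idea behind density-based clustering, and the question of what happens when the smoothing parameter R is increased, can be explored with a simple mode-seeking sketch. It is not the algorithm used by proc MODECLUS: it assumes a uniform kernel (density = number of points within `radius`) and lets every point climb to the densest point in its neighbourhood. The function name `mode_clusters` and the one-dimensional toy data are hypothetical.

```python
import math

def mode_clusters(points, radius):
    """Label each point by the index of the density mode it climbs to."""
    n = len(points)
    # Neighbours within `radius` (every point is its own neighbour).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= radius]
        for i in range(n)
    ]
    # Uniform-kernel density estimate: number of points inside the ball.
    density = [len(nb) for nb in neighbours]

    def rank(i):
        # Density ties are broken by index so that climbing always terminates.
        return (density[i], -i)

    # Each point links to the densest point in its neighbourhood.
    parent = [max(nb, key=rank) for nb in neighbours]

    labels = []
    for start in range(n):
        node = start
        while parent[node] != node:  # climb until a local maximum is reached
            node = parent[node]
        labels.append(node)
    return labels

# Toy data: two groups on a line (hypothetical values).
points = [(0,), (1,), (2,), (10,), (11,), (12,)]
counts = {r: len(set(mode_clusters(points, r))) for r in (0.5, 1.5, 20)}
print(counts)
```

On this toy data a very small radius leaves every point as its own cluster, an intermediate radius recovers the two groups, and a large radius merges everything into one cluster, which is the qualitative pattern seen when going from R = 1000 to R = 4000 on the slides.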