Session 13 Clustering

Section 1 Background
Section 2 Distance Measures
  2.1 Euclidean Distance
  2.2 Distance Measures with Weights
Section 3 Clustering Methods
  3.1 Hierarchical Clustering
  3.2 Partition Based Clustering
Section 4 Variable Clustering
Section 5 Practical Considerations
Section 6 Clustering with SAS Enterprise Miner
Section 7 Case Study 1: K-Means Clustering with the Clustering Node
Section 8 Case Study 2: Clustering with the KBM Data Set
Appendix 1 Distance Measures after Standardization
Appendix 2 Covariance and the Pearson Correlation Coefficient
Appendix 3 Data Used in Session 13, Section 7
Appendix 4 Data Used in Session 13, Section 8
Appendix 5 Cubic Clustering Criterion
Appendix 6 References
Appendix 7 Exercises

Section 1 Background

Cluster analysis (also known as data segmentation in the data mining community) has a variety of goals, all of which involve grouping or segmenting a collection of objects into disjoint subsets, or "clusters", such that objects within each cluster are similar to each other while objects assigned to different clusters are dissimilar.

Data mining applications typically deal with very large volumes of data, and the data are likely to be heterogeneous: they may fall into several distinct groups, with objects within each subgroup similar to each other but different from objects in other groups. Because a different model or pattern may be pertinent to each group, it is very difficult to find any single pattern or model that applies to the whole data set. Creating clusters of similar objects reduces the model complexity within each cluster, which improves the chances that data mining techniques will perform well within each cluster.

Even if the data have no natural groupings, partitioning the data into homogeneous groups (empirically, without regard to a specific explanation for each cluster) can be very useful. For example, it is well known that customer preferences for products depend on geographic and demographic factors. We can therefore use geographic and demographic factors to group customers into several segments and develop marketing strategies for each segment. Although customers do not form these marketing segments naturally, it is much easier and more effective to develop an efficient marketing strategy for each segment separately than to find a one-size-fits-all strategy that targets all customers.

Clustering is an important "unsupervised" data mining tool. Unlike supervised data mining tools, which are guided by a target variable, cluster analysis makes no a priori assumptions about the number of clusters or the cluster structure. The basic objective in clustering is to discover natural groupings of the cases or variables based on some similarity or distance (dissimilarity) measure. Although no target variable is predicted explicitly at the clustering stage, a clustering technique can be used in many ways. First, it can be used in missing value imputation. This method was illustrated in the SHOES example in the previous session, in which we used clustering to impute missing values for the interval input variables (age, miles/week, races/year, years running).
Although we could have replaced the missing values with the population averages of the non-missing variables, such an approach would likely mask the basic structure among the variables while weakening the relationship between the input variables and the target variable Number of Purchases (one pair or at least two pairs). Second, clustering can be used to detect outliers, because outliers typically end up in clusters containing only one case. Third, clustering can be used to discover the number of clusters when one suspects that there are meaningful groupings of cases. After finding these meaningful groups, one can develop different ways to deal with each group, such as the target marketing discussed later. Fourth, cluster analysis can be used to partition a complex data structure into several subsets in order to give supervised data mining techniques, such as decision trees or neural networks, a better chance of finding a good predictive model.

Because the benefits of cluster analysis are evident, there have been extensive research efforts over the past several decades, and many "automatic clustering" techniques are available. However, different clustering techniques lead to different types of clusters, and because cluster analysis is an unsupervised data mining exercise, it is very difficult to tell whether a particular analysis has been successful. To use clustering techniques effectively, we need to understand several important aspects of choosing a clustering technique, especially the selection of a similarity (or dissimilarity) measure.

Select a Similarity Measure

In order to decide whether a set of objects can be split into subgroups, where members of the same group are more "similar" to one another than they are to members of different groups, we need to define what "similar" means. Suppose we want to group a deck of ordinary playing cards into clusters. There are many ways to do so, for example:
• Two clusters: one cluster has all the face cards and the other has all remaining cards.
• Four clusters: each suit of thirteen cards forms a cluster.
• Two clusters: red suits are in one cluster and black suits are in the other.
• Thirteen clusters: each cluster contains all cards with the same face value.
Obviously, the clusters obtained differ greatly depending on the "similarity measure" used. Thus, we need to know how to choose a similarity measure before selecting a clustering technique.

Select the Right Number of Clusters

Another important question in cluster analysis is "what is the right number of clusters?" The choice depends on the goal of the study. For example, in market segmentation analysis, the appropriate number of clusters may depend on the number of divisions in the company. The number of "natural clusters", however, can be decided using descriptive statistics to guide the choice. In the well-known k-means clustering algorithm, the chosen value of k determines the number of clusters that will be found. If this number does not match the natural structure of the data, the technique will produce poor results. Unless there is good prior knowledge of how many clusters exist in the data, it is very difficult to choose k before applying k-means clustering. We discuss how to choose the right number in Section 5.
In general, the best set of clusters is the one that does the best job of keeping the distance between objects within the same cluster small and the distance between objects in different clusters large. However, if the purpose of clustering is to detect unexpected patterns, the right number of clusters might instead be the one that reveals those unexpected patterns in the data.

Cluster Interpretation

Clustering is a powerful, unsupervised knowledge discovery technique; however, it has some weaknesses and limitations. For example, if one does not know what one is looking for, one may not recognize it when one finds it. Although a clustering technique can help to find clusters, it is up to the user to interpret them. The following approaches can help the user understand the clusters:
• Use graphical tools or summary statistics to examine the within-cluster distribution of each variable (StatExplore node).
• Use graphical tools such as box plots to study the within-cluster distribution of each continuous variable.
• Use graphical tools such as mosaic plots to study the within-cluster distribution of each categorical variable.
• Study the within-cluster statistics for each variable.
• Use other visualization tools, such as the normalized mean plot (Clustering node), to see how the clusters are affected by changes in each variable.
• Build a decision tree (or another supervised data mining tool) with the cluster label as the target variable and use it to derive rules explaining how to assign new records to the correct cluster.

Section 2 Distance Measures (or Dissimilarity Measures)

Many data mining techniques, such as association analysis, cluster analysis, multidimensional scaling, and classification, are based on a similarity measure between objects. One can sometimes measure similarity directly through a survey; for example, one can conduct a survey to find out how similar two different brands of beer are. Typically, however, the similarity between two objects cannot be measured directly. Instead, one measures it through the objects' vectors of property measurements.

Section 2.1 Euclidean Distance

Let x(i) = [x_1(i), x_2(i), \ldots, x_m(i)] and x(j) = [x_1(j), x_2(j), \ldots, x_m(j)] be any two objects measured on m variables (features). If all variables are quantitative (interval type), i.e., can be represented by continuous real-valued numbers, the most common choice of distance measure is the Euclidean distance, a special case of the Minkowski metric (Lp norm). The Minkowski metric is defined as

d(x(i), x(j)) = d(i, j) = \left[ \sum_{k=1}^{m} |x_k(i) - x_k(j)|^p \right]^{1/p}.   (1)

For p = 2, d(i, j) is the Euclidean distance. For p = 1, d(i, j) is the city-block (Manhattan) distance, the sum of the absolute deviations between the two objects. For p = ∞, d(i, j) becomes the maximum absolute deviation between the two objects. The Minkowski metric satisfies the following properties:
• d(i, j) = d(j, i)
• d(i, j) > 0 if i ≠ j (positive valued)
• d(i, j) = 0 if i = j
• d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)

Another popular similarity measure for numerical variables is the Pearson correlation coefficient, defined by

\rho(x(i), x(j)) = \frac{\sum_{k=1}^{m} (x_k(i) - \bar{x}_i)(x_k(j) - \bar{x}_j)}{\left[ \sum_{k=1}^{m} (x_k(i) - \bar{x}_i)^2 \; \sum_{k=1}^{m} (x_k(j) - \bar{x}_j)^2 \right]^{1/2}},   (2)

where \bar{x}_i = \frac{1}{m} \sum_{k=1}^{m} x_k(i) and \bar{x}_j = \frac{1}{m} \sum_{k=1}^{m} x_k(j) are the averages over the variables for objects i and j, respectively.
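For readers who want to experiment with these measures, the following short Python sketch implements equations (1) and (2) directly with NumPy. It is a minimal illustration, not part of the course software; the function names and the two example vectors a and b are made up for demonstration.

    import numpy as np

    def minkowski(x_i, x_j, p=2.0):
        """Minkowski distance of equation (1); p=2 gives Euclidean, p=1 city-block."""
        diff = np.abs(np.asarray(x_i, float) - np.asarray(x_j, float))
        if np.isinf(p):
            return diff.max()              # L-infinity: maximum absolute deviation
        return (diff ** p).sum() ** (1.0 / p)

    def pearson_similarity(x_i, x_j):
        """Pearson correlation of equation (2) between two measurement vectors."""
        x_i, x_j = np.asarray(x_i, float), np.asarray(x_j, float)
        ci, cj = x_i - x_i.mean(), x_j - x_j.mean()
        return (ci * cj).sum() / np.sqrt((ci ** 2).sum() * (cj ** 2).sum())

    # Two hypothetical objects measured on m = 4 interval variables
    a = [25.0, 30.0, 2.0, 5.0]
    b = [47.0, 10.0, 8.0, 9.0]
    print(minkowski(a, b, p=2))        # Euclidean distance
    print(minkowski(a, b, p=1))        # city-block distance
    print(minkowski(a, b, p=np.inf))   # maximum absolute deviation
    print(pearson_similarity(a, b))    # correlation-based similarity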
It is worth noting that clustering based on correlation is equivalent to clustering based on Euclidean distance if the inputs are standardized first (Problem 1 in Appendix 7).

If some features are not quantitative (say they are nominal or ordinal), Euclidean distance may not be appropriate (and equation (2) is likewise unsuitable). Typically, we create a dummy variable for each level of a nominal variable and replace each of the ordered categories of an ordinal variable with

\frac{i - 1/2}{M}, \quad i = 1, 2, \ldots, M,   (3)

where M is the number of categories. The transformed variables can then be treated as quantitative variables on this scale.

Section 2.2 Distance Measures with Weights

Distance measures presume some degree of commensurability between the different variables. They work best when each variable is measured in the same units, so that each variable is equally important. However, it is very unlikely that all variables in a data mining exercise are measured in the same units. Recalling the SHOES example from an earlier session, one variable was given in miles per week while another was the respondent's age in years. One way to deal with this incommensurability is to standardize the data by dividing each variable by its sample standard deviation. Standardization gives all variables, irrespective of data type, the same influence on the overall dissimilarity between pairs of cases. Although this approach is reasonable and often recommended from a purely statistical perspective, it can cause problems because the variables might not contribute equally to the notion of dissimilarity between cases. Thus, if we have a notion of the relative importance of the variables based on domain knowledge, we can weight them (after standardization) to obtain the weighted standardized distance measure (Appendix 1),

d_{ws}(i, j) = \left[ \sum_{k=1}^{m} w_k \, |x'_k(i) - x'_k(j)|^p \right]^{1/p}.   (4)

If the goal of clustering is to discover the natural grouping of the data, some variables may exhibit more of a grouping tendency than others, and variables that are more relevant in separating the groups should be assigned higher weights. If the goal of clustering is to segment the data into groups of "similar" cases, the variables might not contribute equally to the (problem-dependent) notion of dissimilarity between cases. The domain expert should play an important role in assigning the weight to each variable.
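A hedged Python sketch of how equations (3) and (4) might be put into practice is shown below. The standardization follows Appendix 1 (subtract the column mean and divide by the sample standard deviation), the weight vector w is purely illustrative and would in practice come from domain knowledge, and the toy data are not from the course.

    import numpy as np

    def standardize(X):
        """Standardize each column as in Appendix 1 (center, then divide by the sample std)."""
        X = np.asarray(X, float)
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def weighted_distance(x_i, x_j, w, p=2.0):
        """Weighted standardized distance of equation (4); x_i, x_j are rows of the standardized data."""
        diff = np.abs(np.asarray(x_i) - np.asarray(x_j))
        return ((w * diff ** p).sum()) ** (1.0 / p)

    def ordinal_scores(M):
        """Scores (i - 1/2)/M of equation (3) for an ordinal variable with M categories."""
        return (np.arange(1, M + 1) - 0.5) / M

    # Hypothetical data: 4 cases measured on 3 interval variables (age, miles/week, income)
    X = np.array([[25, 30, 40.0],
                  [47, 10, 85.0],
                  [35, 22, 52.0],
                  [52,  5, 60.0]])
    Z = standardize(X)
    w = np.array([1.0, 2.0, 1.0])       # illustrative weights based on assumed domain knowledge
    print(weighted_distance(Z[0], Z[1], w))
    print(ordinal_scores(4))            # a 4-level ordinal variable -> [0.125 0.375 0.625 0.875]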
Section 3 Clustering Methods

The guaranteed way to find the best set of clusters is to examine all possible clusterings. However, complete enumeration is computationally prohibitive, even with the fastest computers and optimized algorithms. Because of this, a wide variety of heuristic clustering algorithms have emerged that find "reasonable" clusters without exhaustively examining all possibilities.

3.1 Hierarchical Clustering

Hierarchical clustering techniques proceed by either a series of successive merges or a series of successive partitions. Agglomerative hierarchical clustering methods start with the individual cases, so initially there are as many clusters as cases. The most "similar" cases are merged first to form a reduced number of clusters, and this is repeated until a single cluster containing all cases remains. Let D = {x(1), x(2), ..., x(n)} be the n cases and let D(Ci, Cj) be the distance measure between any two clusters Ci and Cj. An agglomerative algorithm for clustering can then be described as follows:

    for i = 1 to n
        let Ci = {x(i)};
    while there is more than one cluster left do
        let Ci and Cj be the two clusters minimizing the distance D(Ci, Cj);
        Ci = Ci ∪ Cj;
        remove Cj;
    end;

In the "single linkage" method, the distance between two clusters is defined as

D_{SL}(C_i, C_j) = \min \{ d(x, y) : x \in C_i \text{ and } y \in C_j \}.

The clusters formed by single linkage are not affected by the particular distance measure used, as long as the candidate measures have the same relative ordering. Single linkage also has the property that if two pairs of clusters are equidistant, it does not matter which pair is merged first; the overall result is the same. Among the methods discussed here, single linkage is the only one that can find non-ellipsoidal clusters. Its tendency to pick up long, string-like clusters is known as chaining. This tendency, combined with its sensitivity to outliers and to perturbations of the data, makes it less useful in customer segmentation applications than the methods described next.

In the "complete linkage" method, the distance between two clusters is defined as

D_{CL}(C_i, C_j) = \max \{ d(x, y) : x \in C_i \text{ and } y \in C_j \}.

In other words, the distance between two clusters is the distance between their two most dissimilar points (one from each cluster). As with single linkage, the clusters formed by complete linkage are not affected by the distance measure used as long as the measures have the same relative ordering. Complete linkage tends to find clusters of roughly equal size in terms of the volume of space occupied, which makes it particularly suitable for customer segmentation applications.

In the "average linkage" method, the distance between two clusters is the average of the pairwise distances between their members,

D_{AL}(C_i, C_j) = \frac{1}{n_{C_i} n_{C_j}} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y).

The clusters formed by average linkage are affected by the distance measure used even if the measures have the same relative ordering, which makes average linkage less appealing in data mining applications.

Agglomerative clustering requires only a distance matrix to initiate the clustering procedure; it does not need to store all variable values for each case. If we can compute a "distance" between variables, these methods can be applied to variable clustering as well. We address this issue in Section 4.

The methods mentioned here have several drawbacks. First, once a case has been assigned to a cluster it stays in that cluster; reallocation is not allowed during the clustering process even if a case has been assigned wrongly. Second, these methods are sensitive to outliers and "noise". Thus, we should try several different clustering methods and, within each method, several distance measures. If the outcomes from all methods are roughly consistent with one another, a good set of clusters has probably been identified. We can also add a small random error to each case before clustering to see how stable the clusters are.

Besides the linkage methods, there are centroid methods (the distance between two clusters is the Euclidean distance between their centroids) and Ward's method (the distance between two clusters is the between-cluster ANOVA sum of squares, added up over all variables).
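The pseudocode above translates almost line for line into runnable code. The sketch below is an illustrative, naive Python implementation (the function name and structure are my own, not from the course materials); the linkage argument selects single, complete, or average linkage. Applied to the five-case distance matrix of Example 1 below, with linkage="single" it reproduces the merge sequence computed by hand there.

    import numpy as np

    def agglomerate(dist, linkage="single"):
        """Naive agglomerative clustering on a symmetric distance matrix.
        Returns the merge history as (cluster_a, cluster_b, distance) tuples."""
        n = dist.shape[0]
        clusters = [[i] for i in range(n)]          # start: one cluster per case
        link = {"single": min, "complete": max,
                "average": lambda ds: sum(ds) / len(ds)}[linkage]
        history = []
        while len(clusters) > 1:
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    # linkage distance between clusters a and b, from the original pairwise distances
                    d = link([dist[i, j] for i in clusters[a] for j in clusters[b]])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            d, a, b = best
            history.append((clusters[a][:], clusters[b][:], d))
            clusters[a] = clusters[a] + clusters[b]  # merge Cj into Ci
            del clusters[b]                          # remove Cj
        return history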
Example 1 Single Linkage, Complete Linkage, and Average Linkage

The distances between pairs of five cases are given below.

         1    2    3    4    5
    1    0
    2    9    0
    3    3    7    0
    4    6    5    9    0
    5   11   10    2    8    0

Cluster the five cases using each procedure, draw the dendrograms (tree structures), and compare the results.
(a) Single linkage hierarchical procedure.
(b) Complete linkage hierarchical procedure.
(c) Average linkage hierarchical procedure.

<Solutions>:

(a) Single linkage.
Step 1: Merge cases 3 and 5, since min(d_ij) = d_35 = 2.
Step 2: d_(35),1 = min(d_31, d_51) = 3; d_(35),2 = min(d_32, d_52) = 7; d_(35),4 = min(d_34, d_54) = 8. The new distance matrix is

          (35)   1    2    4
    (35)    0
     1      3    0
     2      7    9    0
     4      8    6    5    0

Step 3: Merge cluster (3,5) and case 1, since the minimum distance is 3.
Step 4: The new distance matrix is

          (135)   2    4
    (135)    0
      2      7    0
      4      6    5    0

Step 5: Merge cases 2 and 4, since the minimum distance is 5.
Step 6: Merge all cases together; the minimum distance is 6.

(b) Complete linkage.
Step 1: Merge cases 3 and 5, since min(d_ij) = d_35 = 2.
Step 2: d_(35),1 = max(d_31, d_51) = 11; d_(35),2 = max(d_32, d_52) = 10; d_(35),4 = max(d_34, d_54) = 9. The new distance matrix is

          (35)   1    2    4
    (35)    0
     1     11    0
     2     10    9    0
     4      9    6    5    0

Step 3: Merge cases 2 and 4, since the minimum distance is 5.
Step 4: The new distance matrix is

          (35)  (24)   1
    (35)    0
    (24)   10     0
     1     11     9    0

Step 5: Merge cluster (2,4) and case 1, since the minimum distance is 9.
Step 6: Merge all cases together; the minimum distance is 11.

(c) Average linkage.
Step 1: Merge cases 3 and 5, since min(d_ij) = d_35 = 2.
Step 2: d_(35),1 = avg(d_31, d_51) = 7; d_(35),2 = avg(d_32, d_52) = 8.5; d_(35),4 = avg(d_34, d_54) = 8.5. The new distance matrix is

          (35)    1     2    4
    (35)    0
     1      7     0
     2      8.5   9     0
     4      8.5   6     5    0

Step 3: Merge cases 2 and 4, since the minimum distance is 5.
Step 4: The new distance matrix is

          (35)  (24)    1
    (35)    0
    (24)    8.5   0
     1      7     7.5   0

Step 5: Merge cluster (3,5) and case 1, since the minimum distance is 7.
Step 6: Merge all cases together; the minimum distance is 8.5.
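Example 1 can also be checked with SciPy's hierarchical clustering routines. The snippet below is a hedged illustration (not part of the original session materials): the single and complete linkage merge heights match the hand computation above; note that SciPy's "average" method (UPGMA) averages over all pairs of cases, so its final merge height can differ slightly from a step-wise hand update.

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    # Distance matrix from Example 1 (cases 1-5)
    D = np.array([[ 0,  9,  3,  6, 11],
                  [ 9,  0,  7,  5, 10],
                  [ 3,  7,  0,  9,  2],
                  [ 6,  5,  9,  0,  8],
                  [11, 10,  2,  8,  0]], dtype=float)

    condensed = squareform(D)                     # SciPy expects the condensed form
    for method in ("single", "complete", "average"):
        Z = linkage(condensed, method=method)     # column 2 holds the merge distances
        print(method, Z[:, 2])                    # single linkage: [2. 3. 5. 6.]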
3.2 Partition Based Clustering

In partition based clustering, the task is to partition the data set into k disjoint clusters of cases such that the cases within each cluster are as homogeneous as possible. Given a set of n cases D = {x(1), x(2), ..., x(n)}, our task is to find k clusters C = {C1, C2, ..., Ck} such that each case x(i) is assigned to a unique cluster. There are many score functions that can be used to measure the quality of a clustering. The centroid method uses the distance between the two cluster centroids; the average method uses the average distance between all pairs of points (one point from each cluster); Ward's statistic uses the between-cluster sum of squares. Once the score function is selected, finding the clusters is an optimization process, and many optimization procedures can be applied. Here we introduce only the popular K-means clustering method.

The K-means algorithm is intended for situations in which all variables are quantitative and squared Euclidean distance is chosen as the dissimilarity measure. Let D = {x(1), x(2), ..., x(n)} be the n cases; our task is to find K clusters C = {C1, C2, ..., CK}. Let {r(k): k = 1, 2, ..., K} be K randomly selected points in D.

    repeat
        (form clusters)
        for k = 1, 2, ..., K do
            Ck = { x ∈ D : d(r(k), x) ≤ d(r(j), x) for all j = 1, 2, ..., K, j ≠ k }
        end;
        (compute the new cluster centers)
        for k = 1, 2, ..., K do
            r(k) = the vector mean of the cases in Ck
        end;
    until the cluster centers no longer change;

The time complexity of the K-means algorithm is O(K·I·n), where I is the number of iterations. Since K, the number of clusters, is fixed in partition based clustering methods, the selection of K is critical: if the number of natural clusters differs from K, the algorithm cannot obtain good results. We address how to select the right value of K in Section 5.
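A minimal NumPy sketch of the K-means iteration just described is given below. It is an illustration only: the function name, the random seeding, and the toy data are assumptions for demonstration, not part of the course materials.

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        """Plain Lloyd's algorithm: alternate assignment and centroid update."""
        rng = np.random.default_rng(seed)
        X = np.asarray(X, float)
        centers = X[rng.choice(len(X), size=K, replace=False)]   # K randomly selected cases as seeds
        for _ in range(n_iter):
            # assign each case to its nearest center (squared Euclidean distance)
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # recompute each center as the mean of the cases assigned to it
            new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                    else centers[k] for k in range(K)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers

    # Illustrative data: two well-separated groups of standardized measurements
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.3, size=(20, 2)), rng.normal(3, 0.3, size=(20, 2))])
    labels, centers = kmeans(X, K=2)
    print(centers)   # roughly (0, 0) and (3, 3)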
Section 4 Variable Clustering

All of the methods discussed in Section 3.1, except the average linkage method, can be used for variable clustering, provided the distance between variables can be computed. The most popular distance measure between two quantitative variables is the Pearson correlation coefficient (Appendix 2). For categorical variables, we typically use a measure of association as the distance between them. It can be shown that the correlation and the association are equivalent if both variables are binary (see Problem 2 in Appendix 7).

Example 2
Suppose the correlation matrix for eight variables X1, ..., X8 is

          X1      X2      X3      X4      X5      X6      X7      X8
    X1   1.000
    X2   0.643   1.000
    X3  -0.103  -0.348   1.000
    X4  -0.820  -0.086   0.100   1.000
    X5  -0.259  -0.260   0.435   0.034   1.000
    X6  -0.152  -0.010   0.028  -0.288   0.176   1.000
    X7   0.045   0.211   0.115  -0.164  -0.019  -0.374   1.000
    X8  -0.013  -0.328   0.005   0.486  -0.007  -0.561  -0.185   1.000

Use single linkage and complete linkage to find the clusters. In this worked example the correlation values themselves serve as the "distances" between variables, and at each step the pair of clusters whose distance is smallest in absolute value is merged.

<Solutions>:

Single linkage. The merges proceed as follows:
Step 1: merge X3 and X8, since the smallest entry in absolute value is d_38 = 0.005.
Step 2: merge X5 with (3,8) at d((3,8),5) = -0.007.
Step 3: merge X2 and X6 at d_26 = -0.010.
Step 4: merge X1 with (3,5,8) at -0.013.
Step 5: merge X7 with (1,3,5,8) at -0.019.
Step 6: merge (2,6) with (1,3,5,7,8) at 0.028.
Step 7: merge X4 with the remaining cluster at 0.034.

Complete linkage. The merges proceed as follows:
Step 1: merge X3 and X8 at d_38 = 0.005.
Step 2: merge X2 and X6 at -0.010.
Step 3: merge X5 and X7 at -0.019.
Step 4: merge X1 with (3,8) at -0.103.
Step 5: merge X4 with (2,6).
Step 6: merge (2,4,6) with (5,7) at -0.374.
Step 7: merge (2,4,5,6,7) with (1,3,8) at -0.820.

Section 5 Practical Considerations

Besides the selection of a distance measure between cases and a distance measure between clusters, there are four important practical issues: imputing missing values, converting qualitative variables to quantitative variables, selecting the initial seeds, and deciding on the right number of clusters.
• To impute missing values: we can either exclude cases with one or more missing values from the analysis or impute the missing values with the methods discussed in the data preparation course.
• To convert qualitative variables to quantitative variables: for an ordinal variable, replace the ordered categories by the numerical values defined in equation (3); for a nominal variable, replace each category by a binary dummy variable; for a group of related binary variables, use a method similar to the missing value pattern (MVP) approach to group the related binary variables first and then treat the resulting pattern as an ordinary variable.
• To select the initial seeds: the initial seeds must be complete cases, that is, cases with no missing values, and they should be chosen to be as far apart as possible. They can be selected either at random or with one of the optimization strategies recommended by Hastie et al. (2001, p. 470).
• To decide on the right number of clusters: the choice depends on the goal of the clustering. Data-driven methods for estimating the right number K typically examine the within-cluster dissimilarity measure W_K as a function of the number of clusters K. The values W_1, W_2, ..., W_{Kmax} are calculated for K ∈ {1, 2, ..., Kmax}. This sequence is monotone decreasing, with large drops up to the optimal number of clusters K* and much smaller drops thereafter; that is, {W_K − W_{K+1} : K < K*} >> {W_K − W_{K+1} : K ≥ K*}. Consequently, an estimate of K* can be obtained by identifying a sharp drop in W_K − W_{K+1}, or equivalently a kink ("elbow") in the plot of W_K against K. The gap statistic proposed by Tibshirani et al. (2001) is based on this idea.
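The elbow idea described above is easy to explore numerically. The sketch below is illustrative only: it uses scikit-learn's KMeans and its inertia_ attribute (total within-cluster sum of squares) as the within-cluster dissimilarity W_K, and the synthetic three-group data are an assumption for demonstration.

    import numpy as np
    from sklearn.cluster import KMeans

    # Illustrative data: three synthetic groups of standardized cases
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.4, size=(50, 2)) for c in (0.0, 3.0, 6.0)])

    # W_K here is the total within-cluster sum of squares (KMeans.inertia_)
    for K in range(1, 8):
        km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
        print(K, round(km.inertia_, 1))
    # The successive differences W_K - W_{K+1} drop sharply after K = 3,
    # the number of groups actually generated above.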
Section 6 Clustering with SAS Enterprise Miner

The Clustering node uses two SAS procedures, FASTCLUS and CLUSTER. The FASTCLUS procedure is designed to handle much larger data sets than PROC CLUSTER can accommodate. PROC FASTCLUS performs nonhierarchical cluster analysis, which means the clusters obtained do not have the tree structure produced by hierarchical algorithms such as PROC CLUSTER. To obtain hierarchical clusters for a very large data set, one can use PROC FASTCLUS to find initial clusters and then use those initial clusters as input to PROC CLUSTER to build the final tree structure.
• By default, the FASTCLUS procedure uses the K-means clustering method discussed in Section 3.2. K, the number of clusters, can be determined either in advance by the user or as part of the clustering procedure. By default, the Clustering node uses the CLUSTER procedure with the Cubic Clustering Criterion (CCC, Appendix 5), based on a sample of 5,000 observations, to estimate the appropriate number of clusters (between 2 and 50). If you do not want to use the default setting to choose the number of clusters, change the "Specification Method" property from "Automatic" to "User Specify".
• Enterprise Miner has three methods for calculating cluster distances:
  Average: the distance between two clusters is the average distance between pairs of observations, one in each cluster.
  Centroid: the distance between two clusters is the Euclidean distance between their centroids (means).
  Ward: the distance between two clusters is the ANOVA sum of squares between the two clusters, added up over all the variables.
• K-means clustering is very sensitive to the scale of measurement of the input variables. Consequently, it is advisable to use one of the standardization options if the data have not been standardized. The two standardization methods discussed in Appendix 1 are available in Enterprise Miner.
• Dummy variable representation of nominal variables can be problematic in K-means clustering, since the dummies tend to dominate the analysis. One way to reduce their dominance is to use the rank representation discussed in Section 5.
• Five "Seed Initialization Methods" are available in the Clustering node:
  First: select the first k complete cases as the initial seeds.
  MacQueen: select the initial seeds based on MacQueen's k-means algorithm.
  Full Replacement: select initial seeds that are very well separated using a full replacement algorithm.
  Princomp: select evenly spaced seeds along the first principal component.
  Partial Replacement: select initial seeds that are well separated using a partial replacement algorithm.
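Outside SAS, the same two-stage strategy (pre-cluster the large data set, then apply hierarchical clustering to the preliminary cluster centers) can be imitated with open-source tools. The sketch below is a rough analogue using scikit-learn and SciPy, not the SAS procedures themselves; the cluster counts and synthetic data are assumptions for illustration.

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100_000, 5))          # stand-in for a large standardized data set

    # Stage 1: reduce the data to a manageable number of preliminary cluster centers
    pre = MiniBatchKMeans(n_clusters=100, random_state=0).fit(X)

    # Stage 2: hierarchical (Ward) clustering on the preliminary centers
    Z = linkage(pre.cluster_centers_, method="ward")
    center_labels = fcluster(Z, t=5, criterion="maxclust")   # cut the tree into 5 clusters

    # Map every original case to a final cluster through its preliminary center
    final_labels = center_labels[pre.labels_]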
Section 7 Case Study 1: K-Means Clustering with the Clustering Node

This case study shows how to use the popular K-means clustering method in the Clustering node and how to use the clustering results browser to identify interesting patterns. The diagram suggests the steps that will take place in the course of this case study.

Data Source:
• Select PROSPECT from the Lec12 library.
• Since the variable Climate is a grouping of the variable Location, set the model role of Location to "rejected"; Climate supersedes Location.

StatExplore Node: We can use the StatExplore node to perform data exploration.

Impute Node:
• Since the amount of missing values is only about 2% of the data, missing value indicator variables are not very important, so we do not need to create them.
• Set the default imputation method for both class and interval variables to "Tree".

Clustering Node:
• Set the "Internal Standardization" property to "Standardization", because K-means clustering is very sensitive to the scale of measurement of the input variables. This option gives all variables the same influence on the overall dissimilarity between pairs of cases.
• If we want either to put different weights on the standardized variables or to change the values assigned to the levels of an ordinal variable, we can add a SAS Code node to do so.
• Since we do not know the optimal number of clusters, we keep the defaults for both "Number of Clusters" and "Selection Criterion". This allows Enterprise Miner to use the CCC to pick the optimal number of clusters (between 2 and 50).
• We keep the defaults for all options in "Training Options".
• Since we already imputed all missing values in the Impute node, we keep the defaults for all "Missing Values" properties.

Results:
1. Examine the Segment Size pie chart. The observations are divided fairly evenly among the four segments.
2. Maximize the Segment Plot in the Results window to begin examining the differences between the clusters. When you point the cursor at a particular section of a graph, information on that section appears in a pop-up window. Some initial conclusions you might draw from the segment plot are:
   • The segments appear to be similar with respect to the variables CLIMATE and FICO.
   • The individuals in segment 3 are all homeowners. There are some homeowners in segment 4 and a few in segment 2.
   • Most of the individuals in segment 1 are married, while most of those in segment 4 are unmarried.
   • Younger individuals are in segment 4.
   • Most of the individuals in segment 1 are males, while most of those in segment 2 are females.
   • Income appears to be lower in segment 2.
3. Restore the Segment Plot window to its original size and maximize the Mean Statistics window. This window gives descriptive statistics and other information about the clusters, such as the frequency of each cluster, the nearest cluster, and the average value of each input variable in the cluster. Scroll to the right to view the statistics for each variable for each cluster. These statistics confirm some of the conclusions drawn from the graphs. For example, the average age of individuals in cluster 4 is approximately 35.5, while the average ages in clusters 1, 2, and 3 are approximately 49.2, 45.5, and 48.8, respectively. You can use the Plot feature to graph some of these mean statistics, for example, to compare the average income across the clusters.
4. With the Mean Statistics window selected, select View → Graph Wizard, or select the plot button.
5. Select Bar as the Chart Type and then select Next>.
6. Select Response as the Role for the variable IMP_INCOME.
7. Select Category as the Role for the variable _SEGMENT_.
8. Select Finish.
Another way to examine the clusters is with the cluster profile tree.
9. Select View → Cluster Profile → Tree.
You can use the ActiveX features of the graph to see the statistics for each node, and you can control what is displayed with the Edit menu. The tree lists the percentages and numbers of cases assigned to each cluster and the threshold values of each input variable, displayed as a hierarchical tree. It enables you to see which input variables are most effective in grouping cases into clusters.
10. Close the clustering results window when you have finished exploring the results.

In summary, the four clusters can be described as follows:
Cluster 1: married males
Cluster 2: lower-income females
Cluster 3: married homeowners
Cluster 4: younger, unmarried people
These clusters may or may not be useful for marketing strategies, depending on the line of business and the planned campaigns.

Section 8 Case Study 2: Clustering with the KBM Data Set

This data set is described in Appendix 4. Briefly, it contains descriptive information on educational institutions ranging from large, fully accredited institutions to some for-profit schools.

Data Tab: Be careful: the measurement levels of many variables were changed in the course of the analysis, and some variables were excluded from the clustering after examining the clustering results. The subsequent screen shots are self-explanatory and consistent with how we proceeded in Case Study 1.

Clusters Tab, Seeds Tab, Missing Tab: settings as shown in the corresponding screen shots.

Clustering Results (selected screen shots):
(1) CCC Plot
(2) Variable Importance
(3) Cluster Statistics
(4) Distance Plot
(5) Means for Numerical Variables: a table of cluster-by-cluster means of the numerical inputs for the four segments (_FREQ_ = 1016, 1069, 100, and 1383 institutions, respectively), covering typical board charge for the academic year, graduate and undergraduate credit-hour activity, corrected fall enrollment, generated faculty total (9/10-month contract), degree of urbanization, number of meals per week in the board charge, percentages of Black non-Hispanic, American Indian/Alaskan Native, Asian/Pacific Islander, and Hispanic students, combined charge for room and board, total dormitory capacity, 12-month unduplicated undergraduate and graduate counts, full-time first-time degree-seeking undergraduates, typical credit hours for a FTFY undergraduate student, and tuition and fees (FTFY undergraduate and graduate, in-state and out-of-state).
(6) Cluster Definitions and Interpretation:

Segment 1: Private institutions that do not provide a "Board and Meal Plan" (1016 institutions)
• Low degree of urbanization
• Most institutions do not have a "Board and Meal Plan"
• Among the private institutions in this segment that do provide a "Board and Meal Plan": high percentage of graduate students
• In-state and out-of-state students pay the same tuition
• Most institutions are not ranked by "US News and World Report"

Segment 2: Public institutions that do not have a "Board and Meal Plan", or non-state public institutions that have one (1069 institutions)
• Highest degree of urbanization
• Most institutions do not have a "Board and Meal Plan"
• In-state tuition is much cheaper than out-of-state tuition
• Most institutions are not ranked by "US News and World Report"

Segment 3: Historically Black Colleges and Universities (100 institutions)
• High percentage of African American students and low percentages of other minority students
• Most institutions provide a "Board and Meal Plan"
• Low percentage of graduate students
• High dormitory-to-student ratio

Segment 4: State and private institutions that provide a "Board and Meal Plan" (1383 institutions)
• Second-highest degree of urbanization
• More meals provided per week than at the historically Black colleges and universities

Appendix 1 Distance Measures after Standardization

The measure of distance (dissimilarity) typically assumes some degree of commensurability between variables. It works best when the variables are measured in the same units and are equally important. However, it is very unlikely that all variables in a data mining exercise are measured in the same units. One way to deal with this incommensurability is to standardize the data by dividing each variable by its standard deviation,

\sigma_k = \left[ \frac{1}{n} \sum_{i=1}^{n} (x_k(i) - \mu_k)^2 \right]^{1/2},

or, if the population mean is unknown, by the sample standard deviation,

\hat{\sigma}_k = \left[ \frac{1}{n} \sum_{i=1}^{n} (x_k(i) - \bar{x}_k)^2 \right]^{1/2}.

Another way to perform standardization is to divide each variable by its sample range,

\mathrm{range}_k = \max_i x_k(i) - \min_i x_k(i).

The distance measure after standardization then becomes

d_{std}(i, j) = \left[ \sum_{k=1}^{m} |x'_k(i) - x'_k(j)|^p \right]^{1/p}, \quad \text{where} \quad x'_k(i) = \frac{x_k(i) - \bar{x}_k}{\hat{\sigma}_k}.

Appendix 2 Covariance and the Pearson Correlation Coefficient

The covariance is a measure of how two numerically valued variables X_i and X_j vary together. Large values of X_i tend to be associated with large values of X_j when the covariance is large and positive, and with small values of X_j when the covariance is large and negative. Because the covariance depends on the units in which X_i and X_j are measured, the definition of "large" is problematic; to overcome this weakness, one can use the correlation instead of the covariance. Let x_i = [x_i(1), ..., x_i(n)] and x_j = [x_j(1), ..., x_j(n)] be the values of two variables X_i and X_j measured on n cases. The covariance between X_i and X_j is defined as

\mathrm{Cov}(x_i, x_j) = \frac{1}{n} \sum_{k=1}^{n} (x_i(k) - \bar{x}_i)(x_j(k) - \bar{x}_j),

where \bar{x}_i = \frac{1}{n} \sum_{k=1}^{n} x_i(k) and \bar{x}_j = \frac{1}{n} \sum_{k=1}^{n} x_j(k).
The correlation between two variables X_i and X_j is defined as

\rho(x_i, x_j) = \frac{ \sum_{k=1}^{n} (x_i(k) - \bar{x}_i)(x_j(k) - \bar{x}_j) }{ \left[ \sum_{k=1}^{n} (x_i(k) - \bar{x}_i)^2 \; \sum_{k=1}^{n} (x_j(k) - \bar{x}_j)^2 \right]^{1/2} }.

Appendix 3 Data Used in Section 7

The data set PROSPECT has 5,055 observations and 9 variables from a catalog company. The company periodically purchases demographic information from outside sources. It wants to use this data set to design a test mail campaign to learn the preferences of its potential customers for several new products. Based on experience, the company knows that customer preference for its products depends on several geographic and demographic variables, so it wants to segment its customers with respect to those variables. After the potential customers have been segmented, a random sample of prospective customers within each segment will be mailed one or more offers. The results of the test mail campaign give the company an estimate of its potential profits for the new products. The output from PROC CONTENTS is as follows:

Alphabetic List of Variables and Attributes
#  Variable   Type  Len  Format   Informat  Label
2  Age        Num   8    BEST12.  F12.
9  Climate    Char  2    $F2.     $F2.      Climate Code for Residence
6  FICO       Num   8    BEST12.  F12.      Credit Score
4  Gender     Char  1    $F1.     $F1.
7  HomeOwner  Num   8    BEST12.  F12.
1  ID         Char  9    $F9.     $F9.      Identification Code
3  Income     Num   8    BEST12.  F12.      Income ($K)
8  Location   Char  1    $F1.     $F1.      Location Code for Residence
5  Married    Num   8    BEST12.  F12.

Appendix 4 Data Used in Section 8

This data set is part of IC98_HD from IPEDS (the Integrated Postsecondary Education Data System). Interested students can check the IPEDS web site to find out more about this data set.

#   Variable   Type  Len  Pos  Format     Label
18  ACCRD1     Num   8    136  YESNOA.    National or specialized accrediting
19  ACCRD2     Num   8    144  YESNOA.    Regional accrediting agency
20  ACCRD3     Num   8    152  YESNOA.    State accrediting or approval agency
3   AFFIL      Num   8     16  PRIFMTA.   Affiliation of institution
33  BOARD      Num   8    256  YESNOA.    Institution provides board or meal plan
36  BOARDAMT   Num   8    280             Typical board charge for academic year
22  CALSYS     Num   8    168             Calendar system
43  CDACTGA    Num   8    336             Graduate credit hour activity
42  CDACTUA    Num   8    328             Undergraduate credit hour activity
38  ENROLMNT   Num   8    296             Corrected fall enrollment count
44  GSAA154    Num   8    344             Generated total for faculty 9/10 month contract
10  HBCU       Num   8     72  YESNOB.    Historically Black College or University
4   HLOFFER    Num   8     24  HLOFFERF.  Highest level of offering
1   ID         Num   8      0
11  LOCALE     Num   8     80             Degree of Urbanization
35  MEALSVRY   Num   8    272  FIX.       Number meals/wk/BORDAMT/ROOMAMT
34  MEALSWK    Num   8    264             Number of meals per week in board charge
24  MIL1INSL   Num   8    184  YESNOA.    MILI in states and/or territories
25  MIL2INSL   Num   8    192  YESNOA.    MILI at military installations abroad
23  MILI       Num   8    176  YESNOA.    Courses at military installations
6   PCTMIN1    Num   8     40             Percent Black, non-Hispanic
7   PCTMIN2    Num   8     48             Percent American Indian/Alaskan Native
8   PCTMIN3    Num   8     56             Percent Asian/Pacific Islander
9   PCTMIN4    Num   8     64             Percent Hispanic
12  PEO1ISTR   Num   8     88  YESNOB.    Occupational
13  PEO4ISTR   Num   8     96  YESNOB.    Recreational or avocational
14  PEO5ISTR   Num   8    104  YESNOB.    Adult basic remedial or HS equivalent
26  PG300      Num   8    200  YESNOA.    Programs at least 300 contact hrs.
17  PRIVATE    Num   8    128  PRIFMT.    Private control
15  PUBLIC1    Num   8    112  YESNOA.    Federal
16  PUBLIC2    Num   8    120  YESNOA.    State
37  RMBRDAMT   Num   8    288             Combined charge for room and board
32  ROOMCAP    Num   8    248             Total dormitory capacity
21  SACCR      Num   8    160  YESNOA.    Accrd by US Dept Ed recognized agency
2   SECTOR     Num   8      8  SECTORF.   Sector of institution
41  TOSTUCG    Num   8    320             Graduate unduplicated count in 12-month
39  TOSTUCU    Num   8    304             UG 12-month unduplicated count
40  TOSTUFR    Num   8    312             FT 1st time degree seek UG
29  TPUGCRED   Num   8    224             Typical # of crd. hrs. FTFY UG student
27  TUITION2   Num   8    208             Tuition & fees FTFY UG in-state
28  TUITION3   Num   8    216             Tuition & fees FTFY UG out-of-state
30  TUITION6   Num   8    232             Tuition & fees FTFY Grad in-state
31  TUITION7   Num   8    240             Tuition & fees FTFY Grad out-of-state
5   UGOFFER    Num   8     32  YESNOA.    Undergraduate offering
45  USTIER     Num   8    352             US News and World Report Rating

Appendix 5 Cubic Clustering Criterion

The best way to use the CCC is to plot its value against the number of clusters, ranging from one cluster up to about one-tenth the number of observations. The CCC may not behave well if the average number of observations per cluster is less than ten. The following guidelines should be used for interpreting the CCC:
• Peaks on the plot with the CCC greater than 2 or 3 indicate good clusterings.
• Peaks with the CCC between 0 and 2 indicate possible clusters but should be interpreted cautiously.
• There may be several peaks if the data have a hierarchical structure.
• Very distinct nonhierarchical spherical clusters usually show a sharp rise before the peak followed by a gradual decline.
• Very distinct nonhierarchical elliptical clusters often show a sharp rise to the correct number of clusters followed by a further gradual increase and eventually a gradual decline.
• If all values of the CCC are negative and decreasing for two or more clusters, the distribution is probably unimodal or long-tailed.
• Very negative values of the CCC, say -30, may be due to outliers. Outliers generally should be removed before clustering and their removal documented.
• If the CCC increases continually as the number of clusters increases, the distribution may be grainy or the data may have been excessively rounded or recorded with just a few digits.
A final and very important warning: neither the CCC nor R² is an appropriate criterion for clusters that are highly elongated or irregularly shaped. If you do not have prior substantive reasons for expecting compact clusters, use a nonparametric clustering method such as Wong and Lane's (1983) rather than Ward's method or k-means clustering.

Appendix 6 References

Hand, D., Mannila, H., and Smyth, P. (2001) Chapter 9 of "Principles of Data Mining", MIT Press: Cambridge, Massachusetts.
Berry, M. J. A. and Linoff, G. S. (2000) Chapter 5 of "Mastering Data Mining", John Wiley & Sons, Inc.: New York, New York.
Johnson, R. A. and Wichern, D. W. (1982) Chapter 11 of "Applied Multivariate Statistical Analysis", Prentice-Hall, Inc.: Englewood Cliffs, New Jersey.
Rud, O. P. (2001) "Data Mining Cookbook", John Wiley & Sons, Inc.: New York, New York.
Hastie, T., Tibshirani, R., and Friedman, J. (2001) Chapter 14 of "The Elements of Statistical Learning", Springer.
Tibshirani, R., Walther, G., and Hastie, T. (2001) Estimating the Number of Clusters in a Data Set via the Gap Statistic, Journal of the Royal Statistical Society, Series B.
Wong, M. A. and Lane, T. (1983) A kth Nearest Neighbor Clustering Procedure, Journal of the Royal Statistical Society, Series B, 45, 362-368.
(1983), "A kth Nearest Neighbor Clustering Procedure," Journal of the Royal Statistical Society, Series B, 45, 362-368. © Morgan C. Wang and Mark E. Johnson 36 Appendix 7 Exercises Problem 1 Suppose x(i ) = ⎡⎣ x1 ( i ) , x2 ( i ) , , xm ( i ) ⎤⎦ and x( j ) = ⎡⎣ x1 ( j ) , x2 ( j ) , , xm ( j ) ⎤⎦ m x (i ) m x ( j) and x j = ∑ k be any two objects with m variables (features). Let xi = ∑ k k =1 m k =1 m be the average over variables for objects i, and j, respectively. Also, let ( k =1 m si = ∑ xk (i ) − xi variables m ) m −1 for ( k =1 m 2 and s j = objects i, ∑ xk ( j ) − x j m −1 and j, ) 2 be the standard deviation over respectively. Show that 2 ∑ ( wk ( i ) − wk ( j ) ) = 2(1 − ρ ( w(i ), w( j ) ) if we first standardized all inputs, i.e., k =1 x ( i ) − xi x ( j) − x j wk ( i ) = k and wk ( j ) = k . si sj Problem 2 Show that the sample correlation coefficient, r, can be written as r= ad − bc ⎡⎣( a + b )( a + c )( b + d )( c + d ) ⎤⎦ 1/ 2 For two binary variables with the contingency table 0 1 © Morgan C. Wang and Mark E. Johnson 0 a c 1 b d 37