6. Cluster Analysis (6 hrs)

6.1 What is Cluster Analysis?
6.2 Types of Data in Cluster Analysis
6.3 A Categorization of Major Clustering Methods
6.4 Partitioning Methods
6.5 Grid-Based Methods
6.6 Model-Based Methods
6.7 Clustering High-Dimensional Data
6.8 Outlier Analysis
6.9 Summary

Key Points: Clustering, Partition, Hierarchical method, Outlier Analysis
Reading: Chapter 8

Q&A:

1. Briefly outline how to compute the dissimilarity between objects described by categorical variables.

Answer: A categorical variable is a generalization of the binary variable in that it can take on more than two states. The dissimilarity between two objects i and j can be computed based on the ratio of mismatches (the simple matching approach): d(i, j) = (p - m) / p, where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables. Alternatively, we can use a large number of binary variables by creating a new binary variable for each of the M nominal states. For an object with a given state value, the binary variable representing that state is set to 1, while the remaining binary variables are set to 0.

2. Briefly outline how to compute the dissimilarity between objects described by ratio-scaled variables.

Answer: Three methods include:
• Treat ratio-scaled variables as interval-scaled variables, so that the Minkowski, Manhattan, or Euclidean distance can be used to compute the dissimilarity.
• Apply a logarithmic transformation to a ratio-scaled variable f having value x_if for object i by using the formula y_if = log(x_if). The y_if values can be treated as interval-valued.
• Treat x_if as continuous ordinal data, and treat their ranks as interval-scaled variables.

3. Given the following measurements for the variable age: 18, 22, 25, 42, 28, 43, 33, 35, 56, 28, standardize the variable by the following:
(a) Compute the mean absolute deviation of age.
(b) Compute the z-score for the first four measurements.

Answer:
(a) The mean of age is (18 + 22 + 25 + 42 + 28 + 43 + 33 + 35 + 56 + 28) / 10 = 330 / 10 = 33. The mean absolute deviation of age is 8.8, which is derived as follows:
s = (|18 - 33| + |22 - 33| + |25 - 33| + |42 - 33| + |28 - 33| + |43 - 33| + |33 - 33| + |35 - 33| + |56 - 33| + |28 - 33|) / 10 = 88 / 10 = 8.8
(b) According to the z-score computation formula z = (x - mean) / s, with the mean absolute deviation s = 8.8 as the measure of spread:
z(18) = (18 - 33) / 8.8 = -1.70
z(22) = (22 - 33) / 8.8 = -1.25
z(25) = (25 - 33) / 8.8 = -0.91
z(42) = (42 - 33) / 8.8 = 1.02

4. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using p = 3.

Answer: The attribute-wise differences are 2, 1, 6, and 2.
(a) Euclidean distance: sqrt(2^2 + 1^2 + 6^2 + 2^2) = sqrt(45) ≈ 6.71
(b) Manhattan distance: 2 + 1 + 6 + 2 = 11
(c) Minkowski distance with p = 3: (2^3 + 1^3 + 6^3 + 2^3)^(1/3) = 233^(1/3) ≈ 6.15

5. Briefly describe the concept of clustering and list several approaches to clustering.

Answer: Clustering is the process of grouping data objects into classes, or clusters, so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. There are several approaches to clustering: partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, and constraint-based methods.

6. What is the main concept of model-based methods for clustering?

Answer: This approach hypothesizes a model for each of the clusters and finds the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. It also leads to a way of automatically determining the number of clusters based on standard statistics. It takes "noise" or outliers into account, therein contributing to the robustness of the approach. COBWEB and self-organizing feature maps are examples of model-based clustering.

7. Suppose that the data mining task is to cluster the following eight points (with (x, y) representing location) into three clusters: A1 (2, 10), A2 (2, 5), A3 (8, 4), B1 (5, 8), B2 (7, 5), B3 (6, 4), C1 (1, 2), C2 (4, 9). The distance function is Euclidean distance. Suppose initially we assign A1, B1, and C1 as the center of each cluster, respectively. Use the k-means algorithm to show only
(a) The three cluster centers after the first round of execution, and
(b) The final three clusters.

Answer:
(a) After the first round, the three new clusters are (1) {A1}, (2) {B1, A3, B2, B3, C2}, (3) {C1, A2}, and their centers are (1) (2, 10), (2) (6, 6), (3) (1.5, 3.5).
(b) The final three clusters are (1) {A1, C2, B1}, (2) {A3, B2, B3}, (3) {C1, A2}.

8. Both k-means and k-medoids algorithms can perform effective clustering. Illustrate the strength and weakness of k-means in comparison with the k-medoids algorithm. Also, illustrate the strength and weakness of these schemes in comparison with a hierarchical clustering scheme (such as AGNES).

Answer:
(a) Strength and weakness of k-means in comparison with the k-medoids algorithm: The k-medoids algorithm is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method.
(b) Strength and weakness of these schemes in comparison with a hierarchical clustering scheme (such as AGNES): Both k-means and k-medoids perform partitioning-based clustering. An advantage of such partitioning approaches is that they can undo previous clustering steps (by iterative relocation), unlike hierarchical methods, which cannot make adjustments once a split or merge has been executed. This weakness of hierarchical methods can cause the quality of their resulting clustering to suffer.

9. Clustering has been popularly recognized as an important data mining task with broad applications. In the context of business, give an application example that takes clustering as a major data mining function.

Answer: An example that takes clustering as a major data mining function could be a system that identifies groups of houses in a city according to house type, value, and geographical location. More specifically, a clustering algorithm like CLARANS can be used to discover that, say, the most expensive housing units in Vancouver can be grouped into just a few clusters.

10. Why is outlier mining important? Briefly describe the different approaches behind statistical-based outlier detection.

Answer: Data objects that are grossly different from, or inconsistent with, the remaining set of data are called "outliers". Outlier mining is useful for detecting fraudulent activity (such as credit card or telecom fraud), as well as customer segmentation and medical analysis. Computer-based outlier analysis may be statistical-based, distance-based, or deviation-based. The statistical-based approach assumes a distribution or probability model for the given data set and then identifies outliers with respect to the model using a discordancy test. The discordancy test is based on the data distribution, the distribution parameters (e.g., mean, variance), and the number of expected outliers. The drawbacks of this method are that most tests are for single attributes, and in many cases, the data distribution may not be known.
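
The numerical exercises above (questions 3, 4, and 7) can be checked with a short script. The following is a minimal Python sketch, not part of the original notes: the names `minkowski`, `kmeans`, and `history` are illustrative choices, the k-means loop is the plain textbook iteration (assign each point to its nearest center, recompute the means, stop when the centers no longer move), and it assumes Python 3.8 or later for `math.dist`.

```python
import math

# Question 3: standardize age using the mean absolute deviation.
ages = [18, 22, 25, 42, 28, 43, 33, 35, 56, 28]
mean_age = sum(ages) / len(ages)
mad_age = sum(abs(a - mean_age) for a in ages) / len(ages)
z_scores = [(a - mean_age) / mad_age for a in ages[:4]]  # first four values


# Question 4: Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean).
def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


obj1, obj2 = (22, 1, 42, 10), (20, 0, 36, 8)
euclidean = minkowski(obj1, obj2, 2)
manhattan = minkowski(obj1, obj2, 1)
minkowski3 = minkowski(obj1, obj2, 3)


# Question 7: plain k-means on named 2-D points.
def kmeans(points, centers, max_rounds=100):
    """Return a list with one (clusters, new_centers) pair per round."""
    history = []
    for _ in range(max_rounds):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for name, pt in points.items():
            nearest = min(range(len(centers)),
                          key=lambda k: math.dist(pt, centers[k]))
            clusters[nearest].append(name)
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for k, members in enumerate(clusters):
            if members:
                xs = [points[m][0] for m in members]
                ys = [points[m][1] for m in members]
                new_centers.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[k])
        history.append((clusters, new_centers))
        if new_centers == list(centers):  # centers stopped moving
            break
        centers = new_centers
    return history


points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
    "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9),
}
history = kmeans(points, [points["A1"], points["B1"], points["C1"]])

print("mean:", mean_age, "mean abs. deviation:", mad_age)
print("z-scores:", [round(z, 2) for z in z_scores])
print("Euclidean:", euclidean, "Manhattan:", manhattan, "p=3:", minkowski3)
print("after round 1:", history[0])
print("final clusters:", history[-1][0])
```

The first entry of `history` corresponds to part (a) of question 7 and the last entry to part (b). The z-scores here divide by the mean absolute deviation rather than the standard deviation, following the wording of question 3.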