Cluster Description

A. D. Gordon
Mathematical Institute, University of St Andrews, North Haugh, St Andrews KY16 9SS, Scotland
[email protected]

For the purposes of data simplification, storage and retrieval, it can be useful to partition a set of n 'objects' into c disjoint classes so as to ensure that objects in the same class are similar to one another. Methodology for carrying out this exercise has usually been referred to as 'classification' or 'cluster analysis', although parts of the more recent topic of 'data mining' are also concerned with similar investigations. Such methodology is becoming increasingly important owing to the growth in the size of data sets that are collected and stored electronically.

There are many stages in a classification study, as described by, amongst others, Milligan (1996) and Gordon (1999). An overview is provided of one of the later stages: the description of valid classes or clusters that have been found to exist in the data. Such descriptions facilitate the efficient storage and retrieval of information, and allow the assignment of new objects to one of the existing classes.

The level of detail possible in cluster description depends first on the information provided about the objects. The two main input formats are (i) an (n × p) pattern matrix, whose (i, k)th entry describes the value or state of the kth variable for the ith object (i = 1, ..., n; k = 1, ..., p), or (ii) a symmetric (n × n) (dis)similarity matrix, whose (i, j)th entry measures the (dis)similarity between the ith and jth objects (i, j = 1, ..., n).

Only a limited set of cluster descriptions is available if the only information provided about a set of objects is that contained in a (dis)similarity matrix. Each cluster can then be described by k (≥ 1) 'representative' objects and by a measure of its 'spread'.
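The two input formats above are related: a pattern matrix can always be reduced to a (dis)similarity matrix by choosing a dissimilarity measure. The following sketch uses Euclidean distance, which is one common choice rather than one prescribed here; the function name and toy data are illustrative only.

```python
import numpy as np

def dissimilarity_matrix(X):
    """Build the symmetric (n x n) dissimilarity matrix from an
    (n x p) pattern matrix X, using Euclidean distance between rows
    (an assumed, common choice of dissimilarity measure)."""
    diff = X[:, None, :] - X[None, :, :]    # all pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy pattern matrix: three objects measured on two variables.
X = np.array([[0., 0.],
              [3., 4.],
              [0., 1.]])
D = dissimilarity_matrix(X)   # D[i, j] = distance between objects i and j
```

The resulting matrix is symmetric with a zero diagonal, as required of a dissimilarity matrix; any of the cluster-description methods that operate on (dis)similarity data can then be applied to it.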
An example of a single representative for a cluster is its 'star centre' or 'medoid': the object for which the sum of its dissimilarities to all other objects in the cluster is a minimum (Kaufman and Rousseeuw, 1990; Hansen and Jaumard, 1997). Plots of the dissimilarity between each object and each cluster's representative object (Gnanadesikan et al., 1977; Fowlkes et al., 1988), and other measures of the strength with which an object is perceived as belonging to its cluster (Rousseeuw, 1987), can assist in distinguishing between 'core' and 'outlying' members of a cluster. Each cluster can then be described by one or more representative objects and by selected quantiles of the distribution of the dissimilarities between its members and a representative object.

When the information about the set of objects is provided in a pattern matrix, a wider range of options is available. Methods of describing classes, possibly after their outlying members have been deleted, can be categorized in several different ways, reflecting the fact that the aims of cluster description and of assignment of new objects are not identical. Thus, one may seek (i) to derive rules for assigning objects to the class which they most resemble; (ii) to specify, for each class, properties that are satisfied by at least α% of the members of the class; or (iii) to specify, for each class, properties that are satisfied by at least α% of the members of the class and by no more than β% of the members of the other classes in the partition. Within category (i), the set of methods known collectively as decision trees (Breiman et al., 1984) also provides a parsimonious description of class properties. Relevant work within categories (ii) and (iii) has been proposed within the fields of machine learning, conceptual clustering, and knowledge discovery and data mining (e.g., Michalski et al., 1983; Fisher, 1987; Ho et al., 1988; Fayyad et al., 1996). An overview of the relevant methodology is presented.
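The medoid and a quantile-based measure of spread can be computed directly from a dissimilarity matrix. The sketch below follows the definition in the text; the function name, toy matrix and choice of quantiles are illustrative assumptions, not from the paper.

```python
import numpy as np

def medoid_and_spread(D, members, quantiles=(0.5, 0.9)):
    """Describe one cluster given a symmetric (n x n) dissimilarity
    matrix D and the indices of the cluster's members.

    Returns the medoid (the member minimizing the sum of its
    dissimilarities to the other members) and selected quantiles of
    the members' dissimilarities to that medoid, as a 'spread' measure."""
    sub = D[np.ix_(members, members)]           # within-cluster block
    medoid_pos = int(np.argmin(sub.sum(axis=1)))  # row with smallest sum
    medoid = members[medoid_pos]
    spread = np.quantile(sub[medoid_pos], quantiles)
    return medoid, spread

# Toy dissimilarity matrix for five objects; describe the cluster {0, 1, 2}.
D = np.array([[0., 1., 5., 6., 9.],
              [1., 0., 4., 5., 8.],
              [5., 4., 0., 1., 4.],
              [6., 5., 1., 0., 3.],
              [9., 8., 4., 3., 0.]])
medoid, spread = medoid_and_spread(D, [0, 1, 2], quantiles=(0.5, 1.0))
```

Here object 1 is the medoid of the cluster {0, 1, 2}, and the returned quantiles summarize how far the cluster's members lie from it, in the spirit of the description by representative object plus spread discussed above.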
REFERENCES

Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.

Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R. (Eds) (1996). Advances in Knowledge Discovery and Data Mining. AAAI Press / MIT Press, Menlo Park, CA.

Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning 2, 139-172.

Fowlkes, E. B., Gnanadesikan, R. and Kettenring, J. R. (1988). Variable selection in clustering. Journal of Classification 5, 205-228.

Gnanadesikan, R., Kettenring, J. R. and Landwehr, J. M. (1977). Interpreting and assessing the results of cluster analyses. Bulletin of the International Statistical Institute 47(2), 451-463.

Gordon, A. D. (1999). Classification (Second Edition). Chapman & Hall, London.

Hansen, P. and Jaumard, B. (1997). Cluster analysis and mathematical programming. Mathematical Programming 79, 191-215.

Ho, T. B., Diday, E. and Gettler-Summa, M. (1988). Generating rules for expert systems from observations. Pattern Recognition Letters 7, 265-271.

Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.

Michalski, R. S., Carbonell, J. G. and Mitchell, T. M. (Eds) (1983). Machine Learning: An Artificial Intelligence Approach. Tioga Publishing Company, Palo Alto, CA.

Milligan, G. W. (1996). Clustering validation: results and implications for applied analyses. In Clustering and Classification (eds P. Arabie, L. J. Hubert and G. De Soete), 341-375. World Scientific, Singapore.

Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53-65.

RÉSUMÉ

We review methods whose aim is to describe the classes produced by a classification. When the data take the form of a matrix of (dis)similarities, only a limited number of methods is available for describing the classes. By contrast, when the objects to be classified are described by a set of variables, the choice of class-description methods is wider; some of these methods provide a relevant and detailed description of the classes.