Cluster Description
A. D. Gordon
Mathematical Institute, University of St Andrews
North Haugh, St Andrews KY16 9SS, Scotland
[email protected]
For the purposes of data simplification, storage and retrieval, it can be useful to partition a
set of n ’objects’ into c disjoint classes of objects so as to ensure that objects in the same class are
similar to one another. Methodology for carrying out this exercise has usually been referred to as
’classification’ or ’cluster analysis’, although parts of the recent topic of ’data mining’ are also
concerned with similar investigations. Such methodology is becoming increasingly important
due to the growth in the size of data sets that are being collected and stored electronically. There
are many stages in a classification study, as described by, amongst others, Milligan (1996) and
Gordon (1999). An overview is provided of one of the later stages, that of the description of
valid classes or clusters that have been found to exist in the data. Such descriptions facilitate the
efficient storage and retrieval of information and allow the assignment of new objects to one of
the existing classes.
The level of detail in cluster description depends first on the information provided about
the objects. The two main input formats for the objects are as
(i) an (n × p) pattern matrix, describing the value or state of the kth variable for the ith object (i = 1, ..., n; k = 1, ..., p), or
(ii) a symmetric (n × n) (dis)similarity matrix, containing measures of the (dis)similarity between the ith and jth objects (i, j = 1, ..., n).
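The two formats are linked: a pattern matrix can always be reduced to a dissimilarity matrix (with some loss of information), for example by taking Euclidean distances between the rows. A minimal sketch, with a hypothetical 4 × 2 pattern matrix chosen for illustration:

```python
import math

# Hypothetical pattern matrix X: n = 4 objects measured on p = 2 variables.
X = [[1.0, 2.0],
     [1.5, 1.8],
     [8.0, 8.0],
     [8.2, 7.6]]

def dissimilarity_matrix(X):
    """Build the symmetric (n x n) matrix of Euclidean distances
    between every pair of objects in the pattern matrix X."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(X[i], X[j])
            D[i][j] = D[j][i] = d
    return D

D = dissimilarity_matrix(X)
```

The reverse step is not possible in general, which is why the dissimilarity-matrix format supports only the more limited set of cluster descriptions discussed next.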
A limited set of cluster descriptions is available if the only information provided about a
set of objects is that contained in a (dis)similarity matrix. Each cluster can be described by k (≥ 1) 'representative' objects and by a measure of its 'spread'. An example of a single representative for a cluster is its 'star centre' or 'medoid', the object for which the sum of its dissimilarities with all other objects in the cluster is a minimum (Kaufman and Rousseeuw, 1990; Hansen and
Jaumard, 1997). Plots of the dissimilarity between each object and each cluster’s representative
object (Gnanadesikan et al., 1977; Fowlkes et al., 1988) and other measures of the strength with
which an object is perceived as belonging to its cluster (Rousseeuw, 1987) can assist in
distinguishing between ’core’ and ’outlying’ members of a cluster. Each cluster can then be
described by one or more representative objects and by selected quantiles of the distribution of
dissimilarity values of its members with a representative object.
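The medoid and the quantile-based spread measure can be sketched directly from a dissimilarity matrix. The matrix and cluster membership below are hypothetical values chosen so the medoid is unambiguous:

```python
# Hypothetical (n x n) dissimilarity matrix for n = 4 objects; the last
# object is far from the first three, which form one cluster.
D = [[0.0, 1.0, 2.0, 9.0],
     [1.0, 0.0, 1.5, 9.0],
     [2.0, 1.5, 0.0, 9.0],
     [9.0, 9.0, 9.0, 0.0]]

def medoid(D, members):
    """Return the 'star centre' (medoid) of a cluster: the member whose
    summed dissimilarity to the other cluster members is smallest."""
    return min(members, key=lambda i: sum(D[i][j] for j in members))

cluster = [0, 1, 2]     # indices of the cluster's members
m = medoid(D, cluster)  # row sums are 3.0, 2.5, 3.5, so object 1 wins

# Sorted dissimilarities to the medoid; selected quantiles of this
# distribution (e.g. the median or 90th percentile) describe the spread.
dists = sorted(D[m][j] for j in cluster if j != m)
```

This is only a sketch of the description step; the cited references treat the selection of representatives and spread measures in more depth.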
When the information about the set of objects is provided in a pattern matrix, a wider
range of options is available. Methods of providing descriptions of classes, possibly after their
outlying members have been deleted, can be categorized in several different ways, reflecting the
fact that the aims of cluster description and assignment of new objects are not identical. Thus,
one may seek
(i) to derive rules for assigning objects to the class which they most resemble;
(ii) to specify, for each class, properties that are satisfied by at least α% of the members of the class;
(iii) to specify, for each class, properties that are satisfied by at least α% of the members of the class and by no more than β% of the members of other classes in the partition.
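The test behind category (iii) — and, by dropping the β condition, category (ii) — is simple to state in code. A minimal sketch, assuming hypothetical cluster labels and a Boolean flag recording whether each object satisfies some candidate property:

```python
# Hypothetical data: labels[i] is the cluster of object i, and has_prop[i]
# records whether object i satisfies a candidate descriptive property.
labels   = [0, 0, 0, 0, 1, 1, 1]
has_prop = [True, True, True, False, False, True, False]

def describes(cls, labels, has_prop, alpha=75.0, beta=40.0):
    """Category (iii) test: the property must hold for at least alpha %
    of class `cls` and for no more than beta % of the other objects.
    Dropping the second condition gives the category (ii) test."""
    inside  = [p for l, p in zip(labels, has_prop) if l == cls]
    outside = [p for l, p in zip(labels, has_prop) if l != cls]
    cover_in  = 100.0 * sum(inside) / len(inside)
    cover_out = 100.0 * sum(outside) / len(outside) if outside else 0.0
    return cover_in >= alpha and cover_out <= beta
```

Here the property covers 75% of class 0 but only a third of the remaining objects, so it qualifies as a class-0 description at these thresholds; searching for such properties efficiently is the concern of the machine-learning and data-mining work cited below.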
Within category (i), the set of methods known collectively as decision trees (Breiman et al.,
1984) also provides a (parsimonious) description of class properties. Relevant work within
categories (ii) and (iii) has been proposed within the fields of machine learning, conceptual
clustering, and knowledge discovery and data mining (e.g., Michalski et al., 1983; Fisher, 1987;
Ho et al., 1988; Fayyad et al., 1996). An overview of relevant methodology is presented.
REFERENCES
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and
Regression Trees. Wadsworth, Belmont, CA.
Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R. (Eds) (1996). Advances in
Knowledge Discovery and Data Mining. AAAI Press / MIT Press, Menlo Park, CA.
Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine
Learning 2, 139-172.
Fowlkes, E. B., Gnanadesikan, R. and Kettenring, J. R. (1988). Variable selection in clustering.
Journal of Classification 5, 205-228.
Gnanadesikan, R., Kettenring, J. R. and Landwehr, J. M. (1977). Interpreting and assessing the
results of cluster analyses. Bulletin of the International Statistical Institute 47(2), 451-463.
Gordon, A. D. (1999). Classification (Second Edition). Chapman & Hall, London.
Hansen, P. and Jaumard, B. (1997). Cluster analysis and mathematical programming.
Mathematical Programming 79, 191-215.
Ho, T. B., Diday, E. and Gettler-Summa, M. (1988). Generating rules for expert systems from
observations. Pattern Recognition Letters 7, 265-271.
Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster
Analysis. Wiley, New York.
Michalski, R. S., Carbonell, J. G. and Mitchell, T. M. (Eds) (1983). Machine Learning: An
Artificial Intelligence Approach. Tioga Publishing Company, Palo Alto, CA.
Milligan, G. W. (1996). Clustering validation: results and implications for applied analyses. In
Clustering and Classification (eds P. Arabie, L. J. Hubert and G. De Soete), 341-375. World
Scientific, Singapore.
Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of
cluster analysis. Journal of Computational and Applied Mathematics 20, 53-65.
RÉSUMÉ
We review methods whose aim is to describe the classes produced by a classification. When the data are presented as a (dis)similarity matrix, only a limited number of methods for describing the classes is available. By contrast, when the objects to be classified are described by a set of variables, the choice of class-description methods is wider; some of these methods provide a relevant and detailed description of the classes.