CHAPTER 8A: GLOBAL DESCRIPTIVE MODELS

8a.1 Introduction

In Chapter 1XXX we defined what is meant, in the context of data mining, by the terms 'model' and 'pattern'. A model is a high-level description, summarising a large collection of data and describing its important features. In contrast, a 'pattern' is a local description, perhaps showing how just a few data points behave or characterising some persistent but unusual structure within the data. We described the distinction in more detail in other chapters - for example, in Section 5.1XXX.

As well as distinguishing between models and patterns, earlier chapters also noted the distinction between descriptive and predictive models. A descriptive model presents, in convenient form, the main features of the data. It is, essentially, a summary of the data, permitting one to study the most important aspects of the data without these being obscured by the sheer size of the data set. In contrast, a predictive model has the specific objective of allowing one to predict the value of some target characteristic of an object on the basis of observed values of other characteristics. This chapter is concerned with descriptive models, presenting outlines of some of those which are most important in data mining contexts. Chapter 8BXXX describes descriptive patterns and Chapter 9XXX describes predictive models.

In Chapter 5 we noted the distinction between mechanistic and empirical models - the former being based on some underlying theory about the mechanism through which the data arose and the latter being simply a description. Data mining is usually concerned with the latter situation. The fundamental objective is to produce insight and understanding about the structure of the data, and to enable one to see what its important features are. Beyond this, of course, one hopes one might discover unsuspected structure - structure which is interesting and valuable in some sense. A good model can also be thought of as 'generative', in the sense that data randomly generated according to the model will have the same characteristics as the real data from which the model was produced. If such randomly generated data has features not possessed by the original data, or does not possess features which the original data does (such as, for example, correlations between variables), then the model is a poor one: it is failing in its job of adequately summarising the data.

This chapter builds on Chapter 5. There we described how one goes about building a model for data, how one decides whether a model is good or not, and the nature of fundamental problems such as overfitting, and we illustrated these ideas with some basic model forms. In this chapter we explore more complex model forms, of the kind needed to handle large multivariate data sets. There are, in fact, many different types of model, each related to the others in various ways (special cases, generalisations, different ways of looking at the same structure, and so on). In a single chapter we cannot hope to examine all possible model types in detail. What we have done is look at just some of the more important types.

One point is worth making at the start. Since we are concerned here with global models - with structures which are representative of a mass of objects in some sense - we do not need to worry about failing to detect just a handful of objects possessing some property (we are not concerned with patterns). This means that we can apply the methods to a (random) sample from the data set and hope to obtain good results.
8a.2 Mixture models

In Chapter 5XXX we described how to summarise univariate samples in terms of just a handful of numbers (such as the mean and standard deviation), and also how to estimate parameters describing the shapes of overall distributions, such as the Poisson and normal distributions. Details of particular important distributions are given in Appendix XXX. In Chapter 7XXX we illustrated the use of histograms and kernel smoothing methods to provide simple graphical displays of univariate samples. Both of these ideas - simple summarising distributions and smoothing methods - can be generalised to multivariate situations (indeed, Appendix XXX includes an outline of the multivariate normal distribution). In particular, smoothing methods have been widely used for predictive models based on multiple predictors, and these are examined in Chapter 9XXX. Here, however, we look at models which are intermediate between simple distributions and nonparametric smoothing methods. In particular, we study models in which each distribution is assumed to be composed of several component distributions, each relatively simple (so-called mixture distributions).

Distributions of the kind discussed in Chapter 5XXX and Appendix XXX are very useful, but they do not solve all problems. Firstly, they need to be extended to the multivariate case, and secondly, they may not be flexible enough to describe situations which occur in practice. To illustrate the latter, consider Figure XXX2 in Chapter 7. This is a histogram of the number of weeks owners of a particular credit card used that card to make supermarket purchases in 1996. As we pointed out there, the histogram appears to be bimodal, with a large and obvious mode to the left and a smaller, but nevertheless possibly important, mode to the right. An initial stab at a model for such data might be that it follows a Poisson distribution (despite being bounded above by 52), but this would not have a sufficiently heavy tail and would fail to pick up the right-hand mode. Likewise, a binomial model would also fail to follow the right-hand mode. Something more sophisticated and flexible is needed.

An obvious suggestion here is that the empirical distribution should be modelled by a theoretical distribution which has two components. Perhaps there are two kinds of people: those who are unlikely to use their credit card in a supermarket and those who do so most weeks. The first set of people could be modelled by a Poisson distribution with a small mean. The second set could be modelled by a reversed Poisson distribution with its mode around 45 or 46 weeks (the position of the mode would be a parameter to be estimated in fitting the model to the data). This leads us to an overall distribution of the form

$$ f(x) = p\,\frac{\lambda_1^{x} e^{-\lambda_1}}{x!} + (1-p)\,\frac{\lambda_2^{52-x} e^{-\lambda_2}}{(52-x)!} \qquad (1) $$

Here p is the probability that a person belongs to the first group. Then, given that they do, the expression $\lambda_1^{x} e^{-\lambda_1}/x!$ gives the probability that they will use their card x times in the year. Likewise, (1-p) is the probability that they belong to the second group, and $\lambda_2^{52-x} e^{-\lambda_2}/(52-x)!$ gives the probability that such a person will use their card x times in the year.

Expression (1) is an example of a mixture distribution. The overall distribution consists of a mixture of two Poisson components. Clearly it leads to a much more flexible model than a simple single Poisson distribution - at the very least, it involves three parameters instead of just one.
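As a concrete illustration of expression (1), the following short sketch (in Python; the parameter values are purely illustrative and are not estimates from the credit card data) evaluates the two-component mixture probability for each possible number of weeks.

```python
import math

def mixture_pmf(x, p, lam1, lam2, n_weeks=52):
    """Two-component mixture of expression (1): a Poisson component for the
    'rarely use the card' group and a reversed Poisson component, counted
    down from n_weeks, for the 'use it most weeks' group."""
    comp1 = lam1 ** x * math.exp(-lam1) / math.factorial(x)
    comp2 = lam2 ** (n_weeks - x) * math.exp(-lam2) / math.factorial(n_weeks - x)
    return p * comp1 + (1 - p) * comp2

# Illustrative values only: p = 0.7, lambda_1 = 2, lambda_2 = 6, giving a
# left-hand mode near 1-2 weeks and a right-hand mode near 46 weeks.
probs = [mixture_pmf(x, p=0.7, lam1=2.0, lam2=6.0) for x in range(53)]
print(probs.index(max(probs)), sum(probs))  # overall mode; total is close to 1
```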
However, by virtue of the argument which led to it, it may also be a more realistic description of what is underlying the data. (In fact, as it happens, even this model is not a very good fit, and deeper exploration is required.) These two aspects - the extra flexibility of the models consequent on the larger number of parameters, and arguments based on a suspicion that the underlying population is heterogeneous - mean that mixture models are widely used for modelling distributions which are more complicated than simple standard forms. (Mixture models are also often used to yield a flexible family of conjugate distributions in Bayesian analyses.)

The general form of a mixture distribution is

$$ f(x) = \sum_{k=1}^{c} p_k f_k(x; \theta_k) $$

where $p_k$ is the probability that an observation will come from the kth component (the so-called kth mixing proportion), c is the number of components, $f_k(x; \theta_k)$ is the distribution of the kth component, and $\theta_k$ is the vector of parameters describing the kth component (in the Poisson mixture example above, each $\theta_k$ consisted of a single term, $\lambda_k$). In most applications the $f_k(x)$ have the same form, but there are situations where this is not the case. The most widely used form of mixture distribution has normal components. Note that the $p_k$ must lie between 0 and 1 and sum to 1.

Some examples of situations in which mixture distributions might be expected on theoretical grounds are the length distribution of fish (since they hatch at a specific time of the year), failure data (where there may be different causes of failure, and each cause results in a distribution of failure times), time to death, and the distribution of characteristics of heterogeneous populations of people.

Over the years, many methods have been applied in estimating the parameters of mixture distributions. Nowadays the most popular seems to be the EM approach (Chapter XXX), leading to maximum likelihood estimates. Display 8a.1XXX illustrates the application of the EM algorithm in estimating the parameters of a normal mixture.

DISPLAY 8a.1XXX

We wish to fit a normal mixture distribution

$$ f(x) = \sum_{k=1}^{c} p_k f_k(x; \mu_k, \sigma_k) $$

where $\mu_k$ is the mean of the kth component and $\sigma_k$ is the standard deviation of the kth component. Suppose for the moment that we knew the values of the $\mu_k$ and the $\sigma_k$. Then, for a given value of x, the probability that it arose from the kth class would be

$$ P(k \mid x) = \frac{p_k f_k(x; \mu_k, \sigma_k)}{f(x)} \qquad (2) $$

From this, we could then estimate the values of the $p_k$, $\mu_k$ and $\sigma_k$ as

$$ \hat{p}_k = \frac{1}{n} \sum_{i=1}^{n} P(k \mid x_i) \qquad (3a) $$

$$ \hat{\mu}_k = \frac{1}{n \hat{p}_k} \sum_{i=1}^{n} P(k \mid x_i)\, x_i \qquad (3b) $$

$$ \hat{\sigma}_k^2 = \frac{1}{n \hat{p}_k} \sum_{i=1}^{n} P(k \mid x_i)\, (x_i - \hat{\mu}_k)^2 \qquad (3c) $$

where the summations are over the n points in the data set. This set of equations leads to an obvious iterative procedure. We pick starting values for the $\mu_k$ and $\sigma_k$, plug them into (2) to yield estimates $P(k \mid x)$, use these estimates in (3a) to (3c), and then iterate back using the updated estimates of the $\mu_k$ and $\sigma_k$, cycling round until some convergence criterion has been satisfied. Equations (3) are very similar to those involved in estimating the parameters of a single normal distribution, except that the contribution of each point is split across the separate components, in proportion to the estimated size of that component at that point. Of course, procedures such as the above will give results which depend on the chosen starting values. Because of this it is worthwhile repeating the procedure from a number of different starting positions.
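The iteration of Display 8a.1XXX is easy to sketch in code. The following is a minimal illustration (Python with numpy; the starting values, fixed iteration count, and simulated usage data are arbitrary choices, not part of the display), and, as noted above, in practice it would be run from several different starting positions.

```python
import numpy as np

def em_normal_mixture(x, c=2, n_iter=200, seed=0):
    """Minimal sketch of the EM iteration in Display 8a.1 for a c-component
    univariate normal mixture."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    p = np.full(c, 1.0 / c)                    # mixing proportions p_k
    mu = rng.choice(x, size=c, replace=False)  # starting means
    sigma = np.full(c, x.std())                # starting standard deviations
    for _ in range(n_iter):
        # Equation (2): P(k | x_i), the probability that x_i arose from component k.
        dens = np.array([p[k] * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2)
                         / (sigma[k] * np.sqrt(2 * np.pi)) for k in range(c)])
        post = dens / dens.sum(axis=0)         # shape (c, n)
        # Equations (3a)-(3c): updated p_k, mu_k and sigma_k.
        p = post.mean(axis=1)
        mu = (post @ x) / (n * p)
        sigma = np.sqrt(np.array([(post[k] * (x - mu[k]) ** 2).sum()
                                  for k in range(c)]) / (n * p))
    return p, mu, sigma

# Usage: two well-separated components are recovered from simulated data.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.0, 200)])
print(em_normal_mixture(data))
```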
DISPLAY 8a.2XXX

For a Poisson mixture

$$ f(x) = \sum_{k=1}^{c} p_k \frac{\lambda_k^{x} e^{-\lambda_k}}{x!} $$

the equations for the iterative estimation procedure, analogous to those in Display 8a.1XXX, take the form

$$ P(k \mid x_i) = \frac{p_k\, \lambda_k^{x_i} e^{-\lambda_k} / x_i!}{f(x_i)} $$

$$ \hat{p}_k = \frac{1}{n} \sum_{i=1}^{n} P(k \mid x_i) $$

$$ \hat{\lambda}_k = \frac{1}{n \hat{p}_k} \sum_{i=1}^{n} P(k \mid x_i)\, x_i $$

Sometimes caution has to be exercised with maximum likelihood estimates of mixture distributions. For example, in a normal mixture, if we put the mean of one component equal to one of the sample points and let its standard deviation tend to zero, then the likelihood will increase without limit. The maximum likelihood solution in this case is likely to be of limited value. There are various ways round this. The largest finite value of the likelihood might be chosen to give the estimated parameter values. Alternatively, if the standard deviations are constrained to be equal, the problem does not arise.

Another problem which can arise is due to lack of identifiability. A family of mixture distributions is said to be identifiable if and only if the fact that two members of the family are equal,

$$ \sum_{k=1}^{c} p_k f(x; \theta_k) = \sum_{j=1}^{c'} p'_j f(x; \theta'_j), $$

implies that c = c', and that for every k there is some j such that $p_k = p'_j$ and $\theta_k = \theta'_j$. If a family is not identifiable, then two different members of it may be indistinguishable, which can lead to problems in estimation. Non-identifiability is more of a problem with discrete distributions than continuous ones because, with m categories, only m-1 independent equations can be set up. Think of a mixture of several Bernoulli components. The only observation here is the single proportion of 1s, so how can we estimate the mixing proportions and the parameters of the separate components?

Although mixture distributions are useful for analysing single variables, they are also useful in multivariate situations. In Chapter 1XXX we briefly noted the difference between mixture decomposition, segmentation, and cluster analysis. Here, however, we want to draw attention to one important distinction which can characterise the difference between mixture models and cluster analysis. The aim of cluster analysis is to divide the data into naturally occurring regions in which the points are closely or densely clustered, with relatively sparse regions between them. From a probability density perspective, this will correspond to regions of high density separated by valleys of low density, so that the probability density function is fundamentally multimodal. However, mixture distributions, even though they are composed of several components, need not be multimodal. Consider the case of a two-component univariate normal mixture. Clearly, if the means are equal, then this will be unimodal. More interestingly, a sufficient condition for the mixture to be unimodal (for all values of the mixing proportions) when the means are different is $|\mu_1 - \mu_2| \le 2 \min(\sigma_1, \sigma_2)$. Furthermore, for every choice of values of the means and standard deviations in a two-component normal mixture there exist values of the mixing proportions for which the mixture is unimodal.

DISPLAY 8a.3XXX

The shape of the distribution of red blood cell volumes depends on the state of health of the individual. People with chronic iron deficiency anaemia have a lognormal distribution of microcytic cells, while healthy people have a lognormal distribution of normocytic cells. The distribution of red blood cell volumes in the patients may still be unimodal even after iron therapy has started. Mixture decomposition, using lognormal components, can detect the effect of the therapy early on.
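The sufficient condition for unimodality quoted above is easy to explore numerically. A small sketch (Python; the particular means, standard deviations, and mixing proportion are arbitrary illustrative values) shows a two-component mixture with different means that nevertheless has a single mode:

```python
import numpy as np

def unimodal_for_all_p(mu1, mu2, s1, s2):
    """Sufficient condition quoted above for a two-component normal mixture
    to be unimodal whatever the mixing proportions."""
    return abs(mu1 - mu2) <= 2 * min(s1, s2)

def mixture_density(x, p, mu1, mu2, s1, s2):
    phi = lambda t, m, s: np.exp(-0.5 * ((t - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return p * phi(x, mu1, s1) + (1 - p) * phi(x, mu2, s2)

# Two components with different means, yet only one mode:
print(unimodal_for_all_p(0.0, 1.5, 1.0, 1.0))           # True
xs = np.linspace(-6.0, 8.0, 4001)
dens = mixture_density(xs, 0.4, 0.0, 1.5, 1.0, 1.0)
modes = np.sum((dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:]))
print(modes)                                             # 1: the mixture is unimodal
```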
In many situations where one wants to fit a mixture distribution one will be uncertain about how many components are appropriate. After all, in data mining one is hunting the novel, so one might hope to throw up interesting and unexpected insights about the population structure. One way to choose the number of components would be to fit models with different numbers of components and choose between them using likelihood ratio tests (Chapter 5XXX): the size of the increase in likelihood between one model and another with more components will indicate whether the extra component reflects a real aspect of the underlying structure. Although fine in principle, such an approach is not valid because the conditions required for the likelihood ratio test are contravened (essentially arising from identifiability problems). Various other proposals have been made (see the 'further reading' in Section XXX below), but in data mining contexts perhaps more attention should be paid to attempted interpretations of the components than to significance tests (with large data sets, all the data are likely to deviate significantly from all models except for the most complex).

8a.3 Cluster analysis

This section discusses techniques for decomposing or partitioning a (usually multivariate) data set into groups so that the points in one group are similar to each other and are as different as possible from the points in other groups. Although the same techniques may often be applied, we should distinguish between two different objectives. In one, which we might call segmentation or dissection, the aim is simply to partition the data in a way which is convenient. 'Convenient' here might refer to administrative convenience, practical convenience, or any other kind. For example, a manufacturer of shirts might want to choose just a few sizes and shapes so as to maximise coverage of the male population. He will have to choose those sizes in terms of collar size, chest size, arm length, and so on, so that no man has a shape too different from that of a well-fitting shirt. To do this, he will partition the population of men into a few groups in terms of the variables collar size, chest size, and arm length, and shirts of one size will be made for each group.

In contrast to this, one might want to see if a sample of data is composed of 'natural' subclasses. For example, whiskies can be characterised in terms of colour, nose, body, palate, and finish, and one might want to see if they fall into distinct classes in terms of these variables. Here one is not partitioning the data for practical convenience, but rather is hoping to discover something about the nature of the sample or the population from which it arose - to discover whether the overall population is, in fact, heterogeneous. Technically, this is what cluster analysis seeks to do - to see if the data fall into distinct groups, with members within each group being similar to other members in that group but different from members of other groups. Having said that, the term 'cluster analysis' is often used in general to describe both segmentation and cluster analysis problems (and we shall also be a little lax in this regard). In each case the aim is to split the data into classes, so perhaps this is not too serious a misuse. It is resolved, as we shall see below, by the fact that there is now a huge number of methods for partitioning data in this way. The important thing is to match one's method with one's objective. This way, mistakes will not arise, whatever one calls the activity.
DISPLAY 8a.4XXX

Owners of credit cards can be split into subgroups according to how they use their card - what kind of purchases they make, how much money they spend, how often they use the card, where they use the card, and so on. It can be very useful for marketing purposes to identify the group to which a card owner belongs, since he or she can then be targeted with promotional material which might be of interest to them (this clearly benefits the owner of the card, as well as the card company). Market segmentation in general is, in fact, a heavy user of the kind of techniques discussed in this section. The segmentation may be in terms of lifestyle, past purchasing behaviour, demographic characteristics, or other features. A chain store might want to study whether outlets which are similar, in terms of social neighbourhood, size, staff numbers, vicinity to other shops, and so on, have similar turnovers and yield similar profits. A starting point here would be to partition the outlets in terms of the above variables, and then to examine the distributions of turnover within each group. Cluster analysis has been heavily used in some areas of medicine, such as psychiatry, to try to identify whether there are different subtypes of diseases lumped together under a single diagnosis. Cluster analysis methods are used in biology to see if superficially identical plants or creatures in fact belong to different species. Likewise, geographical locations can be split into subgroups on the basis of the species of plants or animals which live there.

As an example of where the difference between dissection and cluster analysis might matter, consider partitioning the houses in a town. If we are organising a delivery service, we might want to split them in terms of their geographical location. We would want to dissect the population of houses so that those within each group are as close as possible to each other. Delivery vans could then be packed with packages to go to just one group. On the other hand, a company marketing DIY products might want to split the houses into 'naturally occurring' groups of similar houses. One group might consist of small starter homes, another of three and four bedroom family homes, and another (presumably smaller) group of executive mansions.

It will be obvious from the above that cluster analysis (which, for convenience, we are taking to include dissection techniques) hinges on the notion of distance. In order to decide whether a set of points can be split into subgroups, with members of a group being closer to other members of their group than to members of other groups, we need to say what we mean by 'closer to'. The notion of 'distance', and different measures of it, have been discussed in Chapter 2XXX (this will be transferred from Chapter 7). Any of the measures described there, or, indeed, any other distance measure, can be used as the basis for a cluster analysis. As far as cluster analysis is concerned, the concept of distance is more fundamental than the coordinates of the points. In principle, to carry out a cluster analysis all we need to know is the set of interpoint distances, and not the values on any variables. However, some methods make use of 'central points' of clusters, and so require that the raw coordinates be available.

Cluster analysis has been the focus of a huge amount of research effort, going back several decades, so that the literature is now vast. It is also scattered.
Considerable portions of it exist in the statistical and machine learning literatures, but other publications may be found elsewhere. One of the problems is that new methods are constantly being developed, sometimes without an awareness of what has already been developed. More seriously, for very few of the methods is a proper understanding available of their properties and the way they behave with different kinds of data. This has been a problem for a long time. In the late 1970s it was suggested that a moratorium should be declared on the development of new methods while the properties of existing methods were studied. This did not happen. One of the reasons is that it is difficult to tell whether a cluster analysis has been successful. It is very rare indeed that a single application of a method (of exploratory data analysis or modelling in general, not merely cluster analysis) leads, by itself and with no other support, to an enlightening discovery about the data. More typically, multiple analyses are needed, looking at the data this way and that, while an interesting structure gradually comes to light. Cluster analysis may contribute to the discovery of such structure, but one cannot typically point to a single application and call it successful out of the context of the other ways the data have been examined. Bearing all this in mind, while we encourage the use of newly developed methods, some caution should be exercised, and the results examined with care, rather than being taken at face value.

As we shall see below, different methods of cluster analysis are effective at detecting different kinds of cluster, and one should consider this when choosing a method. That is, one should consider what it is one means by a 'cluster'. To illustrate, one might take a 'cluster' to be a collection of points such that the maximum distance between all pairs of points in the cluster is as small as possible. Then each point will be similar to each other point in the cluster. An algorithm will be chosen which seeks to partition the data so as to minimise this maximum interpoint distance (more on this below). One would clearly expect such a method to produce compact, roughly spherical, clusters. On the other hand, one might take a 'cluster' to be a collection of points such that each point is as close as possible to some other member of the cluster - although not necessarily to all other members. Clusters discovered by this approach need not be compact or roughly spherical, but could have long (and not necessarily straight) sausage shapes. The first approach would simply fail to pick up such clusters. The first approach would be appropriate in a segmentation situation, while the second would be appropriate if the objects within each hypothesised group could have been measured at different stages of some evolutionary process. For example, in a cluster analysis of people suffering from some illness, to see if there were different subtypes, one might want to allow for the possibility that the patients had been measured at different stages of the disease, so that they had different symptom patterns even though they belonged to the same subtype.

The important lesson to be learnt from this is that one must match the method to the objectives. In particular, one must adopt a cluster analytic tool which is effective at detecting clusters which conform to the definition of what one means by 'cluster' in the problem at hand.
Having said that, it is perhaps worth adding that one should not be too rigid about it. Data mining, after all, is about discovering the unexpected, so one must not be too determined in imposing one's preconceptions on the analysis. Perhaps a search for a different kind of cluster structure will throw up things one had not previously thought of.

Broadly speaking, we can identify two different kinds of cluster analysis method: those based on an attempt to find the optimal partition into a specified number of clusters, and those based on a hierarchical attempt to discover cluster structure. We discuss each of these in turn in the next two subsections.

8a.3.1 Optimisation methods

In Chapter 1XXX we described how data mining exercises are often conveniently thought of in four parts: the task, the tool, the criterion function, and the search method. When the task is to partition a data set, one approach is to define a criterion function measuring the quality of a given partition and then search the space of possible partitions to find the optimal, or at least a good, partition. A large number of different criteria have been proposed, and a wide range of algorithms adopted.

Many criteria are based on standard statistical notions of between and within cluster variation. The aim, in some form or another, is to find the partition which minimises within cluster variation while maximising between cluster variation. For example, we can define the within cluster variation as

$$ W = \sum_{k=1}^{c} W_k = \sum_{k=1}^{c} \sum_{x \in X_k} (x - \bar{x}_k)(x - \bar{x}_k)' $$

where c is the number of clusters, $X_k$ is the set of points in the kth cluster, and $\bar{x}_k$ is the vector of means of the points in the kth cluster. This matrix summarises the deviation of the points from the means of the clusters they are each in. Likewise, we can define the between cluster variation as

$$ B = \sum_{k=1}^{c} n_k (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})' $$

where $n_k$ is the number of points in the kth cluster and $\bar{x}$ is the vector of overall means of all the points. This matrix summarises the sum of squared differences between the cluster centres and the overall mean.

Traditional criteria based on W and B are the trace of W, tr(W), the determinant of W, |W|, and tr(BW^{-1}). A disadvantage of tr(W) is that it depends on the scaling adopted for the separate variables. Alter the units of one of them and a different cluster structure may result. (See the example in Section 2.XXX, on distances, where we compared the relative distances between three objects measured in terms of weight and length which resulted when different units were used.) Of course, this can be overcome by standardising the variables prior to analysis, but this is often just as arbitrary as any other choice. This criterion tends to yield compact spherical clusters. It also has a tendency to produce roughly equal groups. Both of these properties may make the criterion useful in a segmentation context, but they are less attractive for discovering natural clusters (where, for example, discovery of a distinct very small cluster may represent a major advance).

The |W| criterion does not have the same scale dependence as tr(W), so that it also detects elliptic structures as clusters, but it does also favour equal sized clusters. Adjustments which take cluster size into account have been suggested (for example, dividing by $n_k^{2/n_k}$), so that the equal sized cluster tendency is counteracted, but it might be better to go for a different criterion altogether than adjust an imperfect one.
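The quantities just defined are straightforward to compute for any proposed hard partition. A minimal sketch (Python with numpy; the simulated data and labels are illustrative only) computes W, B, and the three classical criteria mentioned above:

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-cluster (W) and between-cluster (B) scatter matrices, as defined
    above, for a data matrix X and a vector of cluster labels."""
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for k in np.unique(labels):
        Xk = X[labels == k]
        d = Xk - Xk.mean(axis=0)
        W += d.T @ d                                              # within-cluster scatter
        B += len(Xk) * np.outer(Xk.mean(axis=0) - grand_mean,
                                Xk.mean(axis=0) - grand_mean)     # between-cluster scatter
    return W, B

# Evaluate tr(W), |W|, and tr(BW^-1) for an illustrative two-group partition.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
labels = np.repeat([0, 1], 50)
W, B = scatter_matrices(X, labels)
print(np.trace(W), np.linalg.det(W), np.trace(B @ np.linalg.inv(W)))
```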
Note also that the original criterion, |W|, has optimality properties if the data are thought to arise from a mixture of multivariate normal distributions, and this is sacrificed by the modification. (Of course, if one's data are thought to be generated in that way, one might contemplate fitting a formal mixture model, as outlined in Section 8a.2XXX.) Finally, the tr(BW^{-1}) criterion also has a tendency to yield equal sized clusters, this time of roughly equal shape. Note that since this criterion is equivalent to summing the eigenvalues of BW^{-1}, it will place most emphasis on the largest eigenvalue and hence have a tendency to yield collinear clusters.

The property that the clusters obtained from the above criteria tend to have similar shape is not attractive in all situations (indeed, it is probably a rare situation in which it is attractive). Criteria based on other ways of combining the separate within cluster matrices $W_k$ can relax this, for example $\prod_k |W_k|^{n_k}$ and $\sum_k |W_k|^{1/p}$, where p is the number of variables. Even these criteria, however, have a tendency to favour similar sized clusters. (A modification to the $\prod_k |W_k|^{n_k}$ criterion, analogous to that for the |W| criterion, which can help to overcome this property, is to divide each $W_k$ by $n_k^{2/n_k}$. This is equivalent to letting the distance measure vary between different clusters.)

A variant of the above methods uses the sum of squared distances not from the cluster means, but from particular members of the cluster. The search (see below) then includes a search over cluster members to find the member which minimises the criterion. In general, of course, measures other than the sum of squared distances from the cluster 'centre' can be used. In particular, the influence of the outlying points of a cluster can be reduced by replacing the sum of squared distances by the sum of simple distances. The L1 norm has also been proposed as a measure of distance. Typically this will be used with the vector of medians as the cluster 'centre'.

Methods based on minimising a within cluster matrix of sums of squares can be regarded as minimising deviations from the centroids of the groups. Maximal predictive classification, developed for use with binary variables in taxonomy but applicable more widely, can also be regarded as minimising deviations from group 'centres', though with a different definition of centres. Suppose that each object has given rise to a binary vector, such as (0011...1), and suppose we have a proposed grouping into clusters. Then, for each group, we can define a binary vector which consists of the most common value, within the group, of each variable. This vector of modes (instead of means) will serve as the 'centre' of the group. The distance of a group member from this centre is then measured in terms of how many of its variables have values which differ from those in this central vector. The criterion optimised is the total number of differences between the objects and the centres of the groups they belong to. The 'best' grouping is that which minimises the overall number of such differences.

Hierarchical methods of cluster analysis, described in the next section, do not construct a single partition of the data, but rather construct a hierarchy of (typically) nested clusters. One can then decide where to cut the hierarchy so as to partition the data in such a way as to obtain the most convincing partition. For optimisation methods, however, it is necessary to decide at the start how many clusters one wants.
Of course, one can rerun the analysis several times, with different numbers of clusters, but this still requires one to be able to choose between the competing numbers. There is no 'best' solution to this problem. One can, of course, examine how the clustering criterion changes as one increases the number of clusters, but this may not be comparable across different numbers (for example, perhaps the criterion shows apparent improvement as the number increases, regardless of whether there is really a better cluster structure - in just the same way as the sum of squared deviations between a model and data decreases as the number of parameters increases, discussed in Section 5.XXX). For a multivariate uniform distribution divided optimally into c clusters, the criterion $c^2 |W|$ asymptotically takes the same value for all c, so this could be used to compare partitions into different numbers of clusters.

It will be apparent from the above that cluster analysis is very much a data-driven tool, with relatively little formal model-building underlying it. Some researchers have attempted to put it on a sounder model-based footing. For example, one can supplement the procedures by assuming that there is also a random process generating sparsely distributed points uniformly across the whole space. This makes the methods less susceptible to outliers.

So much for the criteria which one may adopt. Now what about the algorithms to optimise those criteria? In principle, at least, the problem is straightforward. One simply searches through the space of possible assignments of points to clusters to find that which minimises the criterion (or maximises it, depending on the chosen criterion). A little calculation shows, however, that this is infeasible except for the smallest of problems. (The number of possible allocations of n objects into c classes is

$$ \frac{1}{c!} \sum_{i=0}^{c-1} (-1)^i \binom{c}{i} (c-i)^n , $$

so that, for example, there are some $10^{30}$ possible allocations of 100 objects into 2 classes.) Since, by definition, data mining problems are not small, such crude exhaustive search methods are not applicable.

For some clustering criteria, methods have been developed which permit exhaustive coverage of all possible clusterings without actually carrying out an exhaustive search. These include branch and bound methods, which eliminate potential clusterings on the grounds that they have worse criterion values than alternatives already found, without actually evaluating the criterion values for those potential clusterings. Such methods, while extending the range over which exhaustive evaluation can be made, still break down for large data sets. For this reason, we do not examine them further here.

If exhaustive search is infeasible, one must resort to methods which restrict the search in some way. Iterative and sequential algorithms are particularly popular for cluster analysis, and they often make use of a stochastic component. For example, the k-means algorithm is based on the tr(W) criterion above, which gives the sum of squared deviations between the sample points and their respective cluster centres. There are several variants of the k-means algorithm. Essentially, it begins by picking cluster centres, assigns the points to clusters according to which cluster centre is closest, computes the mean vectors of the points assigned to each cluster, and uses these as new centres in an iterative approach.
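A minimal sketch of this basic k-means iteration (Python with numpy; the random initialisation, iteration cap, and simulated data are arbitrary choices) is given below. As discussed next, in practice one would usually repeat the procedure from several starting configurations.

```python
import numpy as np

def k_means(X, c, n_iter=100, seed=0):
    """Basic k-means iteration described above: assign each point to its
    nearest centre, recompute the centres as cluster means, and repeat."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=c, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # squared Euclidean distance from every point to every centre
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centres = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centres[k] for k in range(c)])
        if np.allclose(new_centres, centres):   # no centre moved: converged
            break
        centres = new_centres
    return labels, centres

# Usage on illustrative data with two well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
labels, centres = k_means(X, c=2)
print(centres)
```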
A variation of this is to examine each point in turn and update the cluster centres whenever a point is reassigned, repeatedly cycling through the points until the solution does not change. If the data set is very large, one can simply add in each data point, without the recycling. Further extensions (e.g. the ISODATA algorithm) include splitting and/or merging clusters. Since many of the algorithms hinge around individual steps in which single points are added to a cluster, updating formulae have often been developed. In particular, such formulae have been developed for all of the criteria involving W above. Although each step of such algorithms leads to an improvement in the clustering criterion, the search is still restricted to only a part of the space of possible partitions. It is possible that a good cluster solution will be missed. One way to alleviate (if not solve) this problem is to carry out multiple searches from different randomly chosen starting points for the cluster centres. One can even take this further and adopt a simulated annealing strategy, though these are slow at the best of times and will be infeasible if the data set is large.

Since cluster analysis is essentially a problem of searching over a huge space of potential solutions to find that which optimises some objective function, it will come as no surprise to learn that various kinds of mathematical programming methods have been applied. These include linear programming, dynamic programming, and linear and non-linear integer programming.

8a.3.2 Hierarchical methods

Whereas optimisation methods of cluster analysis begin with a specified number of clusters and search through possible allocations of points to clusters to find an allocation which optimises some clustering criterion, hierarchical methods gradually merge points or divide superclusters. In fact, on this basis we can identify two distinct types of hierarchical method: the agglomerative (which merge) and the divisive (which divide). We shall deal with each in turn. The agglomerative are the more important and widely used of the two. Note that hierarchical methods can be viewed as a particular (and particularly straightforward) way to reduce the size of the search. They are analogous to stepwise methods used for model building in other parts of this book (maybe refs to other sections hereXXX).

Hierarchical methods of cluster analysis permit a convenient graphical display, in which the entire sequence of merging (or splitting) of clusters is shown. Because of its tree-like nature, such a display is called a dendrogram. We illustrate below.

DISPLAY 8a.5XXX

Cluster analysis is of most use when there are more than two variables, of course. If there are only two, then one can eyeball a scatterplot and look for structure. However, to illustrate the ideas on a data set where we can see what is going on, we apply a hierarchical method to some two dimensional data. The data are extracted from a larger data set given in Azzalini and Bowman (1990). Figure XXX1 shows a scatterplot of the data. The vertical axis is the time between eruptions and the horizontal axis is the length of the following eruption, both measured in minutes. The points are given numbers in this plot merely so that we can relate them to the dendrogram in this exposition; the numbers have no other substantive significance. Figure XXX2 shows the dendrogram which results from successively merging the two clusters which leads to the smallest increase in the within cluster sum of squares.
The height of the crossbars in the dendrogram (where branches merge) shows the value of this criterion. Thus, initially, the smallest increase is obtained by merging points 11 and 33, and from Figure XXX1 we can see that these are indeed very close (in fact, the closest). The next merger comes from merging points 2 and 32. After a few more mergers of individual pairs of neighbouring points, point 31 is merged with the cluster consisting of the two points 8 and 15, this being the merger which leads to the least increase in the clustering criterion. This procedure continues until the final merger, which is of two large clusters of points. This structure is evident from the dendrogram. (It need not always be like this. Sometimes the final merger is of a large cluster with one single outlying point - as we shall see below.) The hierarchical structure displayed in the dendrogram also makes it clear that one could terminate the process at other points. This would be equivalent to making a horizontal cut through the dendrogram at some other level, and would yield a different number of clusters.

Figure XXX1: Time between eruptions versus duration of the following eruption for the Old Faithful geyser, with points labelled by case number. [Data in file SDF17]

Figure XXX2: Dendrogram for a cluster analysis applied to the data in Figure XXX1.

8a.3.2.1 Agglomerative methods

Agglomerative methods are based on measures of distance between clusters. Essentially, given an initial clustering, they merge the two clusters which are nearest, to form a reduced number of clusters. This is repeated, each time merging the two closest clusters, until just one cluster, containing all the data points, remains. Usually the starting point for the process is the initial clustering in which each cluster consists of a single data point, so that the procedure begins with the n points to be clustered. Analogously to optimisation methods, agglomerative methods will be seen to have two key parts: the measure of distance between clusters, and the search to find the two closest clusters. The latter part is less of a problem here than was the search in optimisation methods, since there are at most n(n-1)/2 pairs to examine.

Many measures of distance between clusters have been proposed. All of the criteria described in Section 8a.3.1XXX can be used, taking the distance between two clusters to be the difference between the criterion value before and after merging them. However, other distance measures are especially suited to hierarchical methods. One of the earliest and most important of these is the nearest neighbour or single link method. This defines the distance between two clusters as the distance between the two closest points, one from each cluster. The single link method is susceptible (which may be a good or bad thing, depending upon one's objectives) to the phenomenon of 'chaining', in which long strings of points are assigned to the same cluster (contrast the production of compact spherical clusters discussed in Section 8a.3.1XXX). This means that the single link method is of limited value for segmentation. It also means that the method is sensitive to small perturbations of the data and to outlying points (which, again, may be good or bad, depending upon what one is trying to do).
The single link method also has the property (for which it is unique - no other measure of distance between clusters possesses it) that if two pairs of clusters are equidistant it does not matter which is merged first. The overall result will be the same, regardless of the order of merger.

DISPLAY 8a.6XXX

The dendrogram from the single link method applied to the data in Figure XXX1 is shown in Figure XXX3. Note that although the initial merges are the same as those in Figure XXX2, the two methods soon start to differ, and the final high level structure is quite different. Note, in particular, that the final merge of the single link method combines a single point (number 30) with the cluster consisting of all of the other points.

Figure XXX3: Dendrogram of the single link method applied to the data in Figure XXX1.

At the other extreme from single link, furthest neighbour or complete link takes as the distance between two clusters the distance between the two most distant points, one from each cluster. This imposes a tendency for the groups to be of equal size in terms of the volume of space occupied (and not in terms of numbers of points), making this measure particularly appropriate for segmentation problems. Other important measures, intermediate between single link and complete link, include the centroid measure (the distance between two clusters is the distance between their centroids), the group average measure (the distance between two clusters is the average of all the distances between pairs of points, one from each cluster), and Ward's measure (the distance between two clusters is the difference between the total within cluster sum of squares for the two clusters separately and the within cluster sum of squares resulting from merging the two clusters - see the tr(W) criterion discussed in Section 8a.3.1XXX). Each such measure has slightly different properties. Other variants also exist - the median measure ignores the sizes of clusters, taking the 'centre' of a combination of two clusters to be the mid-point of the line joining the centres of the two components. Since one is seeking the novel in data mining, it may well be worthwhile experimenting with several measures, in case one throws up something unusual and interesting.

8a.3.2.2 Divisive methods

Just as stepwise methods of variable selection can start with no variables and gradually add those whose inclusion leads to most improvement (analogous to agglomerative cluster analysis methods), so they can also start with all the variables and gradually remove those whose removal leads to least deterioration in the model. This second approach is analogous to divisive methods of cluster analysis. Divisive methods begin with a single cluster composed of all of the data points, and seek to split this into components. These components are then further split, and the process is taken as far as necessary. Ultimately, of course, it will end with a partition in which each cluster consists of a single point. Monothetic divisive methods split clusters using one variable at a time (so they are analogous to the basic form of tree method of supervised classification discussed in Chapter XXX). This is a convenient (though restrictive) way to limit the number of possible partitions which must be examined.
It has the attraction that the result is easily described by the dendrogram - the split at each node is defined in terms of just a single variable. The term association analysis is sometimes used to describe monothetic divisive procedures applied to multivariate binary data. This is not the same as the 'association analysis' described in Chapter XXX. Polythetic divisive methods make splits on the basis of all of the variables together. Any inter-cluster distance measure can be used. The difficulty comes in deciding how to choose potential allocations to clusters - that is, how to restrict the search through the space of possible partitions. In one method, objects are examined one at a time, and the object selected for transfer from a main cluster to a subcluster is the one whose transfer leads to the greatest improvement in the clustering criterion. Divisive methods are more computationally intensive and seem to be less widely used than agglomerative methods.

8a.3.3 Fuzzy clusters and overlapping clusters

By definition, a partitioning of a set of points divides them into subsets such that each point belongs to only one subset. Sometimes it can be advantageous to relax this, and permit objects to belong to more than one group. This is clearly closely related to the ideas of mixture models, in which a given position in the space spanned by the variables can be given probabilities of belonging to each of the components. (Proponents of fuzzy set theory avoid the probability interpretation and adopt axioms which differ slightly from those of probability.) One such method generalises the sum of squared distances criterion

$$ \mathrm{tr}(W) = \sum_{k=1}^{c} \sum_{x \in X_k} \sum_{r} (x_r - \bar{x}_{kr})^2 $$

(where r indexes the variables) to

$$ \sum_{k=1}^{c} \sum_{x} u_{kx} \sum_{r} (x_r - \bar{x}_{kr})^2 $$

where $u_{kx}$ is the membership function for object x in the kth cluster. (The values of $u_{kx}$ are all positive and, for each point, the sum over the groups is 1.)

While fuzzy clustering procedures decide to what extent each point belongs to each of several clusters, another approach is to let each point (possibly) belong outright to more than one cluster. One method for this begins with a similarity matrix S and seeks to approximate it by the form DUD', where U is a diagonal matrix of weights and D is an n by c indicator matrix in which the columns correspond to clusters and the rows to points. $D_{ij}$ takes the value 1 whenever point i is in cluster j, and 0 otherwise. The clustering which leads to the best approximation of S is the solution, where the approximation is in terms of the sum of squared deviations of DUD' from S.

8a.4 Modelling multivariate distributions through graph structures

In many data mining problems, our aim is to identify and model relationships between variables. Graphical representations have obvious attractions for this purpose. The word 'graphical' here is intended in the mathematical sense: the models can be represented as graphs in which the nodes represent variables and the edges represent relationships between variables. Several different types of graphical representation have been developed, and we shall explore the more important ones in this section. They are characterised by the nature of the relationships that the edges of the graph represent. We begin, in the next section, with the earliest formal model of this type, that of path analysis. Path analysis decomposes relationships between variables which have been explicitly measured. In Section XXX we generalise these ideas to other covariance structure models.
More recently, there has been tremendous interest in another, related class of models, based not on the structure of the relationships between variables, but on decomposing the overall multivariate distribution of the variables. These models go under various names, including Bayesian belief networks and conditional independence models. They are described in Section XXX.

8a.4.1 Path analysis and causal models

A crude summary (exact for ellipsoidal distributions) of the relationships, in a population or sample, between multiple variables is given by the covariance matrix. With p variables, this is a p by p matrix in which the (i,j)th off-diagonal element contains the covariance between the ith and jth variables and the ith diagonal element contains the variance of the ith variable. The off-diagonal elements thus provide information about the two-dimensional marginal distributions of the variables: they tell us if larger values of one variable are associated with larger values of another, and so on. Such a matrix provides a quantitative summary of the information seen graphically in a scatterplot matrix (see Chapter 7XXX).

Two-dimensional marginal relationships are all very well, but they can be deceptive. It is entirely possible that such a relationship is spurious, in the sense that both variables are related to a third, so that the observed marginal relationship is a consequence of their joint relationship to this third variable. Perhaps, if the third variable is fixed at any particular value, the relationship between the two variables will vanish. What is needed is some way to tease apart the relationships between variables. What we need, then, is some way to find convenient, and realistic, summaries of the covariances between variables. Can these covariances be explained by a causal structure relating the variables, on which we can base improved understanding and prediction? Is there, in fact, just one single underlying variable (perhaps explicitly measured, perhaps not) which explains the non-zero covariances between a hundred other variables? And so on. Moreover, as we discussed in Chapter 5XXX, there are advantages to replacing a model with a large number of parameters by one with few (for example, replacing the 5050 parameters in an unconstrained 100 by 100 covariance matrix by a model with just 100 parameters will lead to much less variability in the estimates).

Path analysis, having its origins in the 1920s with work in genetics by Sewall Wright (1921, 1923, 1934), was the earliest formal model of this kind. It is a way of decomposing linear relationships between measured variables, in such a way that strengths can be attributed to causal paths between them. We should begin by saying what we mean by 'causal'. For us, in this section, x is a cause of y if and only if the value of y can be changed by manipulating x and x alone. Subtleties arise because the effect of x on y may be mediated by some other, intermediate, variable z. That is, x may influence z which in turn influences y. Or, even more complicated, x and z may both influence y and perhaps their effects cancel out so that there appears to be no marginal effect of x on y. Clearly some care is needed in teasing apart the relationships.

The basic form of path analysis is based on two fundamental assumptions:

(a) that the variables have a given weak causal order. That is, we can order the variables so that earlier ones may (but need not) influence later ones, but later ones cannot influence earlier ones.
The direction of cause is denoted by an arrow in a path model. This order is determined from outside the model, and is not derived from the analysis. The existence of this order means that the graphs of a path analysis model are directed.

(b) that the set of relationships between the observed variables is causally closed. This means that a non-zero covariance between two variables x and y can be explained either by the direct effect of one of them on the other, or by indirect effects due to other variables in the model.

We shall say more about these two requirements below.

The marginal effect of some variable x on another variable y can be determined from the simple regression equation y = bx + e. Here e is the error term and b is the regression coefficient of x on y. b tells us the expected difference between the y values of two observations which happen to differ by one unit on x. Note, however, that this does not necessarily mean that changing the value of x by one unit will cause a change of b units in y for a particular object. The regression relationship may or may not be a causal relationship. It might simply reflect the effect of a third variable: x and y happen to be correlated because of the way the sample was drawn. (For example, let x be the size of vocabulary and y the complexity of arithmetic problems that children in a sample covering various ages can solve. Then we will probably find a positive value for b. However, it would be unwise to claim that increasing a child's vocabulary would enhance their arithmetic ability. It is more likely that both vocabulary and arithmetic skill are causally related to the number of years of schooling.) In general, we may find that some component of the relationship between two variables x and y is attributable to the direct causal effect of one on the other, while the remainder is attributable indirectly to other variables. Path analysis separates these components.

Just as, in regression, one can work either with standardised or with unstandardised variables, so one can in path analysis. Conventionally, in path analysis, coefficients based on unstandardised variables are called effect coefficients, while those based on standardised variables are called path coefficients. The difference is not important for us here, though, of course, it can be important when interpreting the results.

The basic structures which arise in path analysis are shown in Figure XXX4, where the arrows indicate the direction of causation. Figure XXX4(a) shows a situation in which variable x may be a cause of y, which may be a cause of z. From the model we see that x may have an indirect effect on z. We also see that none of the reverse relationships hold - y cannot be a cause of x, and so on - as a result of the specified weak causal order. Note that, while y has a direct effect on z, x has only an indirect effect: the effect of x on z is via (and only via) y. There is no arrow going directly from x to z.

In Figure XXX4(b), variable x is a potential cause of y and also a potential cause of z. That is, if we change the value of x then the value of y and the value of z may change. The reverse does not hold - changing the value of y or z does not influence x. This model is interesting because there is no direct link between y and z, but there may be a nonzero marginal covariance between these two variables. (The vocabulary and arithmetic skill example above illustrates this, with a relabelling of the variables.) The relationship between y and z is, for obvious reasons, termed noncausal.
This sort of relationship is responsible for many misinterpretations in data mining, arising when variable x is not explicitly measured. If objects with a range of x values are measured, one will observe a covariance between y and z which cannot be explained (since x is not observed). One might, therefore, be tempted to deduce the existence of a causal relationship when in fact it is entirely spurious. (Of course, sometimes it is the marginal relationship between y and z which is of interest. In this case one clearly would not want to control for x, even if one could.)

In Figure XXX4(c) variables x and y each have a separate causal influence on z. Note that the lack of an edge linking x and y means that they are conditionally independent, given z. However, if z is uncontrolled, then there may be an induced relationship between x and y: in a random sample of objects, there may be a nonzero covariance between x and y. Yet again the path and regression coefficients are identical.

Finally, in Figure XXX4(d), x has a direct effect on y, both a direct and an indirect effect on z, and y has a direct effect on z. This is the most interesting of the four structures in Figure XXX4. The regression model y ~ x yields the path coefficient of x on y. The regression model z ~ x + y yields the path coefficients of x and y on z. The coefficient of x from this model tells us the direct effect of x on z. The indirect effect arises from the path via y and is the product of the effect of x on y and the effect of y on z.

Figure XXX4: Some basic causal structures involving three variables: (a) x -> y -> z; (b) x -> y and x -> z; (c) x -> z and y -> z; (d) x -> y, x -> z, and y -> z.

Figure XXX5: An example of direct and indirect causal relationships, with the path coefficients shown: x -> y (0.2), y -> z (0.5), x -> z (0.6).

DISPLAY 8a.7XXX

The numbers in Figure XXX5 were derived from the following regression models:

y = 0.2x + e1
z = 0.5y + 0.6x + e2
z = 0.7x + e3

The total effect of x on z is 0.7. That is, a unit change in x induces a total change of 0.7 units in z. On the other hand, the direct effect of x on z is only 0.6. This is the effect not mediated by other variables. The difference between the direct and total effect of x on z is the indirect effect, via y. The size of this is 0.2 (the direct effect of x on y) times 0.5 (the direct effect of y on z), that is 0.1. Thus the total effect of x on z is 0.7 = 0.5 x 0.2 + 0.6. The path coefficient between y and z is 0.5. This shows the direct causal effect of y on z. However, this is not what would be obtained from a simple regression of z on y. The covariance between these two variables has an additional component due to the fact that both are related to x. If we fit a simple linear regression, predicting z from y, we obtain z = 0.62y + e4. The coefficient here is composed of the direct effect of y on z plus the spurious effect induced by x, of magnitude 0.2 x 0.6 = 0.12.

DISPLAY 8a.8XXX

Figure XXX6 shows a more complicated situation with four variables. Here w is a direct cause of x, y, and z; x is a direct cause of y and z; and y is a direct cause of z. Or, to put it in a way which makes clearer the recursive regression relationships: x is caused by w; y is caused by x and w; and z is caused by w, x, and y. The various direct and indirect causal paths are decomposed in Table XXX1.
DISPLAY 8a.8 XXX
Figure XXX6 shows a more complicated situation with four variables. Here w is a direct cause of x, y, and z; x is a direct cause of y and z; and y is a direct cause of z. Or, to put it in a way which makes clearer the recursive regression relationships: x is caused by w; y is caused by x and w; and z is caused by w, x, and y. The various direct and indirect causal paths are decomposed in Table XXX1. In this table the notation r(x,y) stands for the total covariation between x and y, p(x,y) stands for the path coefficient between two variables (that part of the total covariation which is attributable to the direct causal effect of x on y), and b(x,z|w) stands for the coefficient of x in a regression model predicting z from x, controlling for w.

Figure XXX6: A more complicated causal model involving the four variables w, x, y, and z. [Figure not reproduced here.]

Table XXX1: Decomposition of causal paths in Figure XXX6, showing the direct and indirect contributions to the total covariation between each pair of variables.

Relation  Total covariation  Direct causal  Indirect causal                                    Total causal         Noncausal
wx        r(w,x)             r(w,x)         0                                                  r(w,x)               0
wy        r(w,y)             p(w,y)         p(x,y)p(w,x)                                       r(w,y)               0
wz        r(w,z)             p(w,z)         p(y,z)p(x,y)p(w,x) + p(y,z)p(w,y) + p(x,z)p(w,x)   r(w,z)               0
xy        r(x,y)             p(x,y)         0                                                  p(x,y)               r(x,y)-p(x,y) = p(w,y)p(w,x)
xz        r(x,z)             p(x,z)         p(y,z)p(x,y)                                       b(x,z|w)             r(x,z)-b(x,z|w)
yz        r(y,z)             p(y,z)         0                                                  p(y,z) = b(y,z|w,x)  r(y,z)-p(y,z)

The models we have discussed above are very simple. They require that assumptions (a) and (b) be satisfied, and, when this is the case, they decompose the overall covariation between two variables into direct causal influences, indirect causal influences, and noncausal relationships in which both variables are jointly influenced by other variables. When the assumptions are justified, there is no ambiguity or uncertainty about the model or its interpretation. However, assumptions (a) and (b) are often difficult to justify. As far as (a) is concerned, if x precedes y in time, then the direction of any possible causal relationship is clear, but when this is not the case it is common to find that one cannot unambiguously decide whether a change in one variable is causing a change in another or vice versa. Although path analysis has been extended to these more general situations, this is not always without difficulty. The clear interpretation of the model is sometimes lost. The relaxation of assumption (b), by seeking to explain relationships between observed variables in terms of other, unmeasured (or latent) variables, has led to a broad class of models called linear structural relational models or covariance structure models. We discuss these in the next section. Such models are not without their problems (of fitting, of interpretation, and so on), but then what models are? They have been described as 'the most important and influential statistical revolution to have occurred in the social sciences' and, handled with care and applied appropriately, can be immensely useful and revealing.

When the graph is complete, so that there is an edge connecting each pair of variables, each variable is potentially influenced by all those that precede it in the order given by assumption (a). Estimation of the path coefficients is then straightforward: one simply regresses each variable on those preceding it, obtaining the kind of decomposition illustrated in Display 8a.8XXX. With p variables, this means that p-1 multiple regressions provide enough information to compute all the path coefficients. (For this reason, such models are called univariate recursive regression models.)
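As a simple illustration of this recursive scheme, the following sketch simulates data from a complete graph on the order w, x, y, z (the coefficients are arbitrary values of our own choosing, and, since the variables are not standardised, what is recovered are effect coefficients in the terminology above) and estimates all of them with p-1 = 3 least squares regressions.

import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# a complete graph on the weak causal order w, x, y, z;
# all coefficients are arbitrary illustrative values
w = rng.standard_normal(n)
x = 0.5 * w + rng.standard_normal(n)
y = 0.4 * w + 0.3 * x + rng.standard_normal(n)
z = 0.2 * w + 0.3 * x + 0.4 * y + rng.standard_normal(n)

def effects(predictors, target):
    # coefficients of target regressed on all of its predecessors
    return np.linalg.lstsq(np.column_stack(predictors), target, rcond=None)[0]

# p - 1 = 3 regressions recover all of the coefficients
print(effects([w], x))         # approximately 0.5
print(effects([w, x], y))      # approximately 0.4, 0.3
print(effects([w, x, y], z))   # approximately 0.2, 0.3, 0.4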
Things are a little more subtle, however, if one wants to impose prior restrictions on some of the paths; if one knows, for example, that some pairs of variables are not directly linked (corresponding to a missing edge in the graph). Then the system involves more equations than there are unknowns, and becomes overidentified. The simplest illustration of this arises in Figure XXX4(a), where x may cause y, which, in turn, may cause z, but where it is known that there is no direct link from x to z. Suppose the sample covariation between x and y is c1, that between y and z is c2, and that between x and z is c3. The equations from which we must estimate the path coefficients p(a,b) are then p(x,y) = c1, p(y,z) = c2, and p(x,y)p(y,z) = c3. It seems that this could lead to a contradiction: p(x,y) = c1 and p(x,y) = c3/c2. Various ways of tackling this are used. One can simply estimate the parameters using the regressions of variables for which there are explicit links in the model. More generally, a measure of the discrepancy between the sample and theoretical covariance matrices is minimised - for example, a weighted sum of squares of the differences between the elements of the two matrices.

8a.4.2 Structural equation models

Broadly speaking, structural equation models seek to explain the relationships between observed variables in terms of unobserved or latent variables, which may play the roles of explanatory and response variables. The simplest model of this kind is a factor analysis model, which explains the covariances between observed variables in terms of their linear relationship to one or more unobserved 'factors'. That is, the factor analysis model has the form

x = Λf + u     (1)

Here x is the vector of p measurements on an object, f is a vector of length k containing the (unobserved) scores that this object has on the k latent factors, Λ is a p×k matrix of coefficients or 'factor loadings' linking x and f, and u is a random vector, typically taken to be multivariate normal with zero expectation and diagonal covariance matrix Ψ. (The diagonality means that the observed variables are conditionally independent given the factors.) From this model one can derive the theoretical form of the covariance matrix of the observed scores to be

Σ = ΛΛ′ + Ψ     (2)

Nowadays, maximum likelihood methods are probably the most popular ways of estimating the factor loadings (typically through iteratively reweighted least squares methods of minimising the discrepancy between the observed and theoretical covariance matrices), though software packages often provide alternatives.

Factor analysis has not had an altogether untroubled history. There are three main reasons for this. One is that, as mentioned in Chapter 7XXX, the solutions are not unique. This is easily seen, since the matrix ΛM, with M orthogonal (a rotation of the original solution in the space spanned by the factors), also satisfies (2): Σ = (ΛM)(ΛM)′ + Ψ. In fact, this can be an advantage, rather than a disadvantage, since one can choose the solution (by varying M) to satisfy some additional criterion. In particular, one can choose it so that the factor solutions facilitate interpretation. For example, one can try to arrange things so that each factor has either large or small weights on each of the variables (and few intermediate weights). Then a factor is clearly interpretable in terms of the variables to which it contributes substantially.
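The rotation indeterminacy in (2) is easy to demonstrate numerically. The fragment below is a minimal sketch with a loading matrix invented purely for illustration; it verifies that an orthogonal rotation of Λ leaves the implied covariance matrix Σ unchanged.

import numpy as np

# an arbitrary 4-variable, 2-factor loading matrix Lambda and diagonal Psi
Lam = np.array([[0.8, 0.1],
                [0.7, 0.2],
                [0.1, 0.9],
                [0.2, 0.6]])
Psi = np.diag([0.3, 0.4, 0.2, 0.5])

Sigma = Lam @ Lam.T + Psi

# any orthogonal M (here a rotation through 30 degrees) gives the same Sigma
t = np.pi / 6
M = np.array([[np.cos(t), -np.sin(t)],
              [np.sin(t),  np.cos(t)]])
Sigma_rot = (Lam @ M) @ (Lam @ M).T + Psi

print(np.allclose(Sigma, Sigma_rot))   # True: the two solutions are indistinguishable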
This notion of 'interpretability' is the second reason why the technique has, in the past, had its critics. There is sometimes a tendency to overinterpret the results, assigning descriptive names to the factors, which then take on an unjustified authority of their own. Finally, the third reason arises from the fact that, in the early days, factor analysis was an essentially exploratory tool (this is no longer the case - see below). It was often used as an attempt to see if observed relationships could be explained by latent factors, without there being theoretical reasons for this. As we have been at pains to point out in this book, this is fine, provided one does not assert too much about the reality of unearthed relationships: they need to be checked and verified to ensure that they are not chance phenomena associated with the particular data set to hand. Having said all of the above, factor analysis is now well understood and is a respectable data analytic technique. It is widely used in areas ranging from psychology and sociology, through botany, zoology, and ecology, to geology and the earth sciences.

The factor analysis model (1) can be represented graphically as in Figure XXX7. The common factors, at the bottom of the graph, each contribute to each of the manifest variables. Additional, specific components (the ui) also contribute to the manifest variables.

Figure XXX7: A graphical representation of a factor analysis model, with manifest variables X1, ..., Xp, common factors F1, ..., Fk, and specific components U1, ..., Up. [Figure not reproduced here.]

This representation immediately shows us how such models may be generalised. Note, in particular, that there are no links between the factors. These can easily be added to the graph, producing a different model, the parameters of which can again be estimated. (Again we do not go into details here - estimation will be handled by the software package.) Adding links between factors means that the factors are correlated. The basic factor analysis model is straightforward enough, but as additional complexities are introduced, so additional problems can arise. In particular, models can become non-identified. The issue is the same as that with mixture distributions discussed above: two different models, parameterised in different ways, lead to empirically observable structures which are indistinguishable. In this case one cannot choose between the models and, perhaps even worse, there may be estimation problems.

The basic factor model in (1) permitted all factors to load on all variables. There were no restrictions imposed. This is entirely appropriate in an exploratory context, where one is seeking to discover the structure in the data. In other situations, however, one has some idea of what structure may exist. For example, one might believe, a priori, that there are two factors, one of which loads on some of the variables, and the other on a disjoint set of variables. In this case one would want to restrict certain factor loadings to zero at the start. Such confirmatory factor analysis can be a very powerful tool.

DISPLAY 8a.9 XXX
Dunn, Everitt, and Pickles (1993) carried out a confirmatory factor analysis of the data on patterns of consumption of legal and illegal psychoactive substances described in Chapter 7XXX. They postulated a three-factor model for the observed data. Their three factors were:
Alcohol use: in which the model permitted non-zero loadings on beer, wine, liquor, and cigarettes.
Cannabis use: in which the model permitted non-zero loadings on marijuana, hashish, cigarettes, and wine.
Hard drug use: in which the model permitted non-zero loadings on amphetamines, tranquillizers, hallucinogenics, hashish, cocaine, heroin, drug-store medication, inhalants, and liquor.
If we assume that the factors are independent, we see that this model implies zero correlations between certain pairs of variables. For example, marijuana and beer consumption have no factor in common, and hence are predicted to have zero correlation. In fact the empirical correlation matrix shows them to have a correlation of 0.445. There are other pairs like this, so perhaps the model needs to be modified. Dunn et al suggest permitting non-zero correlations between the three factors.

The factor analysis model in Figure XXX7 shows how the latent factors are related to the observed, manifest variables, and hence permits us to estimate characteristics of the latent variables. But this principle can be applied more widely. In many situations no measurement is made without error. This suggests that any theory relating the concepts which we are trying to measure is really a theory relating latent variables. We can set up a system of equations showing the relationships between these latent variables (that is what our 'theory' is), but we then need another system showing how these latent variables are related to the things we actually measure. The system describing the relationships between the latent variables (the theoretical constructs) is called the structural model, since it describes the structure of our model. The system describing the relationships between the latent variables and the things we actually measure is the measurement model. Separate systems of equations are set up for the two models, interlocked via the latent variables. The term LISREL, standing for linear structural relational models, is used to describe the overall model which results when the measurement and structural models are put together (and also for a particular software package for fitting such models). These models can be extremely complicated, since the theoretical relationships can be complicated. An example is the MIMIC model, an acronym for 'multiple indicators, multiple causes', in which it is postulated that several causes influence some latent construct, and that the effects are observed on several indicators. Models of change over time can also be developed using structural relational models. Here variables are measured at repeated times and the models show the links between the times.

The theory of structural equation models has been developed with multivariate normal distributions in mind (it is, after all, based on second order statistics). If the distributions deviate substantially from normality, then the results should be interpreted with caution - prior transformation of the data is recommended, if this is possible. Alternatively, one can estimate the parameters using standard approaches, but then estimate their standard errors using more advanced techniques such as bootstrap methods (since it is typically the measures of accuracy of the estimates which are more susceptible to departures from normality).
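The bootstrap idea is simple to sketch. In the illustration below, the data and the statistic are stand-ins of our own devising (a correlation coefficient playing the role of whatever parameter the structural equation model estimates): resample the cases with replacement, re-estimate, and take the spread of the re-estimates as the standard error.

import numpy as np

rng = np.random.default_rng(3)
n = 500

# illustrative, deliberately non-normal data; the correlation coefficient
# stands in for whatever parameter the fitted model would provide
x = rng.exponential(size=n)
y = x + rng.exponential(size=n)
data = np.column_stack([x, y])

def estimate(d):
    return np.corrcoef(d[:, 0], d[:, 1])[0, 1]

# resample cases with replacement, re-estimate, and use the spread of the
# re-estimates as the standard error of the original estimate
boot = [estimate(data[rng.integers(0, n, size=n)]) for _ in range(2000)]
print(estimate(data), np.std(boot))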
8a.4.3 Bayesian belief networks

Recursive regression models have an attractive conditional independence interpretation (provided the residuals are uncorrelated). That is, the absence of an edge between two nodes of the graph means that the variables represented by those nodes are independent, conditional on the other variables in the graph. Unfortunately, this property does not hold for some of the more sophisticated extensions of linear structural relational models. However, since the property is so attractive, it has served as the motivating force for a different class of models, which have attracted much interest in recent years. These models are based on reducing the overall multivariate probability distribution by representing (or approximating) it in terms of interlocking distributions, each involving only a few variables. Obviously this is equivalent to assuming that some variables are conditionally independent (given the other variables in the graph), so that a simplified model results. Such models were developed in parallel for multivariate normal data and for categorical data (where they can be regarded as providing convenient representations of log-linear models). They essentially decompose the joint distribution of all of the variables in a way which leads to simplification.

Consider the example involving five variables, U, V, W, X, and Y, illustrated in Figure XXX9. The complete joint distribution is f(U, V, W, X, Y), but the conditional independences encoded in the graph lead to the factorisation

f(U, V, W, X, Y) = f(Y|W, X) f(W|U, V) f(X) f(U) f(V)

Decompositions of this kind may be made whatever the nature of the distribution. In the case of multivariate normal distributions, each factor is also a normal distribution. In the case of categorical data, the overall distribution, involving the cross-classification of all of the variables (and hence perhaps with a vast number of cells, making estimation of all of the probabilities infeasible), is replaced by a product of terms, each of which involves only a few variables. In the above example, the five-way table of probabilities is replaced by two three-way tables (for f(Y|W, X) and f(W|U, V)) and three one-way tables. In this example the improvement is not dramatic, but in real problems, which may involve hundreds of variables, the improvement may be from the completely impossible to the relatively straightforward.

Figure XXX9: A simple example of a conditional independence graphical model, with nodes U, V, W, X, and Y. [Figure not reproduced here.]
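The saving can be counted directly for binary variables. The sketch below uses made-up probability tables (its numbers have no significance beyond the illustration) to evaluate the joint distribution of Figure XXX9 from its factors; the factored form stores 2 + 2 + 2 + 8 + 8 = 22 numbers rather than the 2^5 = 32 cells of the full joint table, and the gap widens explosively as variables are added.

import numpy as np

rng = np.random.default_rng(4)

# made-up probability tables for five binary variables
fU = np.array([0.6, 0.4])                    # f(U)
fV = np.array([0.3, 0.7])                    # f(V)
fX = np.array([0.5, 0.5])                    # f(X)
fW_UV = rng.dirichlet([1, 1], size=(2, 2))   # f(W|U,V): indexed [u, v, w]
fY_WX = rng.dirichlet([1, 1], size=(2, 2))   # f(Y|W,X): indexed [w, x, y]

def joint(u, v, w, x, y):
    # f(U,V,W,X,Y) = f(Y|W,X) f(W|U,V) f(X) f(U) f(V)
    return fY_WX[w, x, y] * fW_UV[u, v, w] * fX[x] * fU[u] * fV[v]

# the factored form still defines a proper joint distribution
vals = [joint(u, v, w, x, y)
        for u in (0, 1) for v in (0, 1) for w in (0, 1)
        for x in (0, 1) for y in (0, 1)]
print(sum(vals))   # 1.0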
DISPLAY 8a.10 XXX
Figure XXX8 shows the graphical model derived from a sample of 23,216 applicants for unsecured personal loans, using data kindly provided by a major UK bank. The graph shows the relationships between the variables:
LNAMT: the total amount of the loan, in £;
MRTST: the marital status of the applicant, coded as married, divorced, single, widowed, or separated;
RESCODE: housing tenure status of the applicant, coded as home-owner, living with parents, tenant, or other;
DISPINC: disposable monthly income in £, calculated from data on the applicant's income, the income of the applicant's spouse, other income, monthly mortgage repayments, and other repayments;
CUSTAGE: the applicant's age in years;
INSIND: a binary indicator of whether the applicant has taken out the loan protection insurance sold by the bank alongside the loan;
BANKIND: a binary indicator of whether the applicant has a current account with the bank providing the loan;
CARDDEL: a binary indicator of whether the applicant's credit card account with the bank has ever been delinquent;
DEL: a binary indicator of whether the loan account was categorized as 'bad' using the bank's standard definition.
This model is quite complicated, in that relatively few of the possible edges between pairs of variables have been deleted. (That is, relatively few pairs of variables are conditionally independent, given the other variables in the model.) However, despite this, the model is revealing. It shows, for example, that the two 'delinquency' variables, CARDDEL and DEL, are conditionally independent of LNAMT, MRTST, and RESCODE, given the other four variables. Put another way, once we know the disposable income, the age, and whether or not the customer has a current account with the bank and took out the loan insurance, then information on the size of the loan, the applicant's marital status, and the nature of their living accommodation is irrelevant to their delinquency risk category. This separation of the two sets of nodes {CARDDEL, DEL} and {LNAMT, MRTST, RESCODE} by a third set {DISPINC, CUSTAGE, BANKIND, INSIND}, in the sense that all paths from the first set to the second pass through the third, is termed the Markov property. There are various versions of this property, including versions for directed graphs.

Figure XXX8: An undirected graphical model for bank loans, linking the variables MRTST, RESCODE, LNAMT, DISPINC, BANKIND, CUSTAGE, INSIND, CARDDEL, and DEL. [Figure not reproduced here.]

The multivariate normal distribution has the useful property that two variables are conditionally independent, given the other variables, if and only if the corresponding element of the inverse of the covariance matrix is zero. This means that the inverse covariance matrix reveals the pattern of relationships between the variables. (Or, at least, it does in principle: in practice, of course, it will be necessary to decide whether a small value in the inverse covariance matrix is sufficiently small to be regarded as zero.) It also means that we can draw a graph in which there are no edges linking the nodes corresponding to pairs of variables which have a sufficiently small value in this inverse matrix.

There is an important difference between the type of graph illustrated in Figure XXX8 and the ones described in the preceding subsections: this new graph is undirected. There are no natural directions to the edges linking the nodes (so they are denoted by plain lines, instead of arrows). Of course, if we have external information about the direction of causation (as was assumed in the case of path analysis, for example) then we might be able to use arrows, though this does lead to certain complications. As far as directed graphs go, ones without cycles (that is, 'acyclic' directed graphs) are the most important. A cycle is a loop of nodes and directed edges such that one is led back to the start: U → V → W → X → U. Acyclic directed graphs can be converted into univariate recursive regression graphs (though there may be more than one way to do this, and if there is then the different ways correspond to different statistical models), and we have already seen that such models are convenient.

We also referred to a 'Markov' property for undirected graphs in Display 8a.10XXX. A similar property exists for directed graphs, though it is slightly more complicated. A set of variables X is independent of a set of variables Y, given a set Z, if (i) there are no directed paths between X and Y lying wholly outside Z, (ii) there are no pairs of directed paths lying outside Z, one leading to X and one leading to Y, both starting from a common node W outside Z, and (iii) all nodes which have multiple parents on paths connecting X and Y neither lie in Z nor have a descendant in Z.
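Returning to the undirected multivariate normal case discussed above, the inverse covariance matrix can be inspected directly. The following minimal sketch (a three-variable chain of our own construction, in which x and z are conditionally independent given y) shows the corresponding entry of the estimated precision matrix falling near zero.

import numpy as np

rng = np.random.default_rng(5)
n = 500_000

# a chain x -> y -> z, so x and z are conditionally independent given y
x = rng.standard_normal(n)
y = 0.8 * x + rng.standard_normal(n)
z = 0.7 * y + rng.standard_normal(n)

S = np.cov(np.vstack([x, y, z]))   # sample covariance matrix
K = np.linalg.inv(S)               # its inverse, the precision matrix

# the (x, z) entry of K is near zero, so the undirected graph has no edge
# between x and z; the other off-diagonal entries are clearly non-zero
print(np.round(K, 2))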
Construction of conditional independence models involves essentially two steps: the first is the construction of the topology of the graph, and the second is the determination of the conditional distributions. There are many different approaches to both steps, and which is most convenient will depend in part on the software package the user adopts.

Discussion

Mixture models and factor analytic models are examples of latent structure models (we said at the start of the chapter that there were relationships between the various kinds of models). In each case we are trying to explain the observed pattern of data in terms of a postulated latent or underlying structure. There are also other forms of latent structure model. For example, latent class analysis deals with the situation in which both the observed and the postulated latent variables are categorical, latent trait analysis with the situation in which the observed variables are categorical but the latent ones are continuous, and latent profile analysis with the situation in which the observed variables are continuous but the latent ones are categorical.

Further reading

Books on mixture distributions include Everitt and Hand (1981), Titterington, Smith, and Makov (1985), and McLachlan and Basford (1988). Mixture models in haematology are described in McLaren (1996). Studies of tests for the number of components of mixture models are described in Everitt (1981), McLachlan (1987), and Mendell, Finch, and Thode (1993). There are now many books on cluster analysis. Recommended ones include Anderberg (1973), Späth (1985), and Kaufman and Rousseeuw (1990). The distinction between dissection and finding natural partitions is not always appreciated, and yet it can be an important one and should not be ignored. Examples of authors who have made the distinction include Kendall (1980), Gordon (1981), and Späth (1985). Banfield and Raftery (1993) proposed the idea of adding a 'cluster' which pervades the whole space, by superimposing a separate Poisson process generating a low level of random points throughout the entire space, so easing the problem of clusters being distorted by a handful of outlying points. Marriott (1971) showed that the criterion c²|W| was asymptotically constant for optimal partitions of a multivariate uniform distribution. Krzanowski and Marriott (1995), Table 10.6, give a list of updating formulae for clustering criteria based on W. Maximal predictive classification was developed by Gower (1974). The use of branch and bound to extend the range of exhaustive evaluation of all possible clusterings is described in Koontz, Narendra, and Fukunaga (1975) and Hand (1981).
The k-means algorithm is described in MacQueen (1967), and the ISODATA algorithm is described in Hall and Ball (1965). Kaufman and Rousseeuw (1990) describe a variant in which the 'central point' of each cluster is an element of that cluster, rather than the centroid of the elements. A review of early work on mathematical programming methods applied in cluster analysis is given by Rao (1971). For a cluster analytic study of whiskies, see Lapointe and Legendre (1994). One of the earliest references to the single link method of cluster analysis was Florek et al (1951), and Sibson (1973) was important in promoting the idea. Lance and Williams (1967) presented a general formula, useful for computational purposes, which includes single link and complete link as special cases. The median method of cluster analysis is due to Gower (1967). Lambert and Williams (1966) describe the 'association analysis' method of monothetic divisive partitioning. The polythetic divisive method of clustering described in Section 8a.3.2.2 is due to MacNaughton-Smith et al (1964). The fuzzy clustering criterion described in Section 8a.3 is due to Bezdek (1974) and Dunn (1974), and a review of fuzzy clustering algorithms is given in Bezdek (1987). The overlapping cluster methods summarised in Section 8a.3 are due to Shepard and Arabie (1979). A formal description of path analysis is given in Wermuth (1980). Recent books on factor analysis include Bartholomew (1987), Reyment and Jöreskog (1993), and Basilevsky (1994). Bollen (1989) is a recommended book on structural equation models. Books on graphical models include those by Whittaker (1990), Edwards (1995), Cox and Wermuth (1996), and Lauritzen (1996). The bank loan model described in Display 8a.10XXX is developed in more detail in Hand, McConway, and Stanghellini (1997).

References

Anderberg M.R. (1973) Cluster Analysis for Applications. New York: Academic Press.
Azzalini A. and Bowman A.W. (1990) A look at some data on the Old Faithful geyser. Applied Statistics, 39, 357-365.
Banfield J.D. and Raftery A.E. (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803-821.
Bartholomew D.J. (1987) Latent Variable Models and Factor Analysis. London: Charles Griffin and Co.
Basilevsky A. (1994) Statistical Factor Analysis and Related Methods. New York: Wiley.
Bezdek J.C. (1974) Numerical taxonomy with fuzzy sets. Journal of Mathematical Biology, 1, 57-71.
Bezdek J.C. (1987) Some non-standard clustering algorithms. In Developments in Numerical Ecology, ed. P. Legendre and L. Legendre, Berlin: Springer-Verlag, 225-287.
Bollen K.A. (1989) Structural Equations with Latent Variables. New York: Wiley.
Cox D.R. and Wermuth N. (1996) Multivariate Dependencies: Models, Analysis, and Interpretation. London: Chapman and Hall.
Dunn J.C. (1974) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3, 32-57.
Edwards D. (1995) Introduction to Graphical Modelling. New York: Springer Verlag.
Everitt B.S. (1981) A Monte Carlo investigation of the likelihood ratio test for the number of components in a mixture of normal distributions. Multivariate Behavioural Research, 16, 171-180.
Everitt B.S. and Hand D.J. (1981) Finite Mixture Distributions. London: Chapman and Hall.
Florek K., Lukasziwicz J., Perkal J., Steinhaus H., and Zubrzycki S. (1951) Sur la liaison et la division des points d'un ensemble fini. Colloquium Mathematicum, 2, 282-285.
Gordon A. (1981) Classification: Methods for the Exploratory Analysis of Multivariate Data. London: Chapman and Hall.
Gower J.C. (1967) A comparison of some methods of cluster analysis. Biometrics, 23, 623-628.
Gower J.C. (1974) Maximal predictive classification. Biometrics, 30, 643-654.
Hall D.J. and Ball G.B. (1965) ISODATA: a novel method of cluster analysis and pattern classification. Technical Report, Stanford Research Institute, Menlo Park, California.
Hand D.J. (1981) Discrimination and Classification. Chichester: Wiley.
Hand D.J., McConway K.J., and Stanghellini E. (1997) Graphical models of applicants for credit. IMA Journal of Mathematics Applied in Business and Industry, 8, 143-155.
Kaufman L. and Rousseeuw P.J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.
Kendall M.G. (1980) Multivariate Analysis. (2nd ed.) London: Griffin.
Koontz W.L.G., Narendra P.M., and Fukunaga K. (1975) A branch and bound clustering algorithm. IEEE Transactions on Computers, 24, 908-915.
Krzanowski W.J. and Marriott F.H.C. (1995) Multivariate Analysis vol. 2: Classification, Covariance Structures, and Repeated Measurements. London: Arnold.
Lambert J.M. and Williams W.T. (1966) Multivariate methods in plant ecology IV: comparison of information analysis and association analysis. Journal of Ecology, 54, 635-664.
Lance G.N. and Williams W.T. (1967) A general theory of classificatory sorting strategies: 1. Hierarchical systems. Computer Journal, 9, 373-380.
Lapointe F.-J. and Legendre P. (1994) A classification of pure malt Scotch whiskies. Applied Statistics, 43, 237-257.
Lauritzen S.L. (1996) Graphical Models. Oxford: Clarendon Press.
McLachlan G.J. (1987) On bootstrapping the likelihood ratio test for the number of components in a normal mixture. Applied Statistics, 36, 318-324.
McLachlan G.J. and Basford K.E. (1988) Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker.
McLaren C.E. (1996) Mixture models in haematology: a series of case studies. Statistical Methods in Medical Research, 5, 129-153.
MacNaughton-Smith P., Williams W.T., Dale M.B., and Mockett L.G. (1964) Dissimilarity analysis. Nature, 202, 1034-1035.
MacQueen J. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, ed. L.M. Le Cam and J. Neyman, 1, Berkeley, California: University of California Press, 281-297.
Marriott F.H.C. (1971) Practical problems in a method of cluster analysis. Biometrics, 27, 501-514.
Mendell N.R., Finch S.J., and Thode H.C. (1993) Where is the likelihood ratio test powerful for detecting two component normal mixtures? Biometrics, 49, 907-915.
Rao M.R. (1971) Cluster analysis and mathematical programming. Journal of the American Statistical Association, 66, 622-626.
Reyment R. and Jöreskog K.G. (1993) Applied Factor Analysis in the Natural Sciences. Cambridge: Cambridge University Press.
Shepard R.N. and Arabie P. (1979) Additive clustering: representation of similarities as combinations of discrete overlapping properties. Psychological Review, 86, 87-123.
Sibson R. (1973) SLINK: an optimally efficient algorithm for the single link method. Computer Journal, 16, 30-34.
Späth H. (1985) Cluster Analysis and Dissection. Chichester: Ellis Horwood.
Titterington D.M., Smith A.F.M., and Makov U.E. (1985) Statistical Analysis of Finite Mixture Distributions. New York: Wiley.
Wermuth N. (1980) Linear recursive equations, covariance selection and path analysis. Journal of the American Statistical Association, 75, 963-972.
Whittaker J. (1990) Graphical Models in Applied Multivariate Statistics. Chichester: Wiley.
Wright S. (1921) Correlation and causation. Journal of Agricultural Research, 20, 557-585.
Wright S. (1923) Theory of path coefficients: a reply to Niles' criticism. Genetics, 8, 239-255.
Wright S. (1934) The method of path coefficients. Annals of Mathematical Statistics, 5, 161-215.