CHAPTER 8A: GLOBAL DESCRIPTIVE MODELS
8a.1 Introduction
In Chapter 1XXX we defined what is meant, in the context of data mining, by the terms
‘model’ and ‘pattern’. A model is a high level description, summarising a large collection of
data and describing its important features. In contrast, a ‘pattern’ is a local description,
perhaps showing how just a few data points behave or characterising some persistent but
unusual structure within the data. We described the distinction in more detail in other
chapters - for example, in Section 5.1XXX. As well as distinguishing between models and
patterns, earlier chapters also noted the distinction between descriptive and predictive
models. A descriptive model presents, in convenient form, the main features of the data. It
is, essentially, a summary of the data, permitting one to study the most important aspects of
the data without these being obscured by the sheer size of the data set. In contrast, a
predictive model has the specific objective of allowing one to predict the value of some target
characteristic of an object on the basis of observed values of other characteristics.
This chapter is concerned with descriptive models, presenting outlines of some of those
which are most important in data mining contexts. Chapter 8BXXX describes descriptive
patterns and Chapter 9XXX describes predictive models.
In Chapter 5 we noted the distinction between mechanistic and empirical models - the former
being based on some underlying theory about the mechanism through which the data arose
and the latter being simply a description. Data mining is usually concerned with the latter
situation. The fundamental objective is to produce insight and understanding about the
structure of the data, and to enable one to see what are its important features. Beyond this,
of course, one hopes one might discover unsuspected structure and structure which is
interesting and valuable in some sense. A good model can also be thought of as
‘generative’, in the sense that data randomly generated according to the model will have the
same characteristics as the real data from which the model was produced. If such randomly
generated data has features not possessed by the original data, or does not possess features
which the original data does (such as, for example, correlations between variables), then the
model is a poor one: it is failing in its job of adequately summarising the data.
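This generative view can be made concrete with a small simulation check: fit a candidate summary model, generate synthetic data from it, and compare simple characteristics of the synthetic and real data. The sketch below is our own illustration, not part of the text; the skewed 'real' data, the single-normal summary model, and the chosen comparison statistics are all arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data (simulated here for illustration): skewed, so a single normal summary is dubious.
real = rng.gamma(shape=2.0, scale=3.0, size=1000)

# Fit a simple descriptive model: a single normal distribution.
mu, sigma = real.mean(), real.std()

# Generate data from the fitted model and compare characteristics with the real data.
synthetic = rng.normal(mu, sigma, size=real.size)

skew = lambda x: ((x - x.mean()) ** 3).mean() / x.std() ** 3
for name, stat in [("mean", np.mean), ("std", np.std), ("skewness", skew)]:
    print(f"{name:9s} real={stat(real):7.2f}  model={stat(synthetic):7.2f}")
# A large discrepancy (here, in skewness) signals that the model summarises the data poorly.
```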
This chapter is built on Chapter 5. There we described how one went about building a
model for data, how one decided whether a model was good or not, and the nature of
fundamental problems such as overfitting. We illustrated with some basic forms. In this
chapter we explore more complex model forms, of the kind needed to handle large
multivariate data sets. There are, in fact, many different types of model, each related to the
others in various ways (special cases, generalisations, different ways of looking at the same
structure, and so on). In a single chapter we cannot hope to examine all possible model
types in detail. What we have done is look at just some of the more important types.
One point is worth making at the start. Since we are concerned here with global models,
with structures which are representative of a mass of objects in some sense, then we do not
need to worry about failing to detect just a handful of objects possessing some property (we
are not concerned with patterns). This means that we can apply the methods to a (random)
sample from the data set and hope to obtain good results.
8a.2 Mixture models
In Chapter 5XXX, we described how to summarise univariate samples in terms of just a
handful of numbers (such as the mean and standard deviation), and also how to estimate
parameters describing the shapes of overall distributions, such as the Poisson and normal
distribution. Details of particular important distributions are given in Appendix XXX. In
Chapter 7XXX we illustrated the use of histograms and kernel smoothing methods to provide
simple graphical displays of univariate samples. Both of these ideas - simple summarising
distributions and smoothing methods - can be generalised to multivariate situations (indeed,
Appendix XXX includes an outline of the multivariate normal distribution). In particular,
smoothing methods have been widely used for predictive models based on multiple
predictors, and these are examined in Chapter 9XXX. Here, however, we look at models
which are intermediate between simple distributions and nonparametric smoothing methods.
In particular, we study models in which each distribution is assumed to be composed of
several component distributions, each relatively simple (so-called mixture distributions).
Distributions of the kind discussed in Chapter 5XXX and Appendix XXX are very useful, but
they do not solve all problems. Firstly, they need to be extended to the multivariate case,
and secondly, they may not be flexible enough to describe situations which occur in practice.
To illustrate the latter, consider Figure XXX2 in Chapter 7. This is a histogram of the
number of weeks owners of a particular credit card used that card to make supermarket
purchases in 1996. As we pointed out there, the histogram appears to be bimodal, with a
large and obvious mode to the left and a smaller, but nevertheless possibly important mode to
the right. An initial stab at a model for such data might be that it follows a Poisson
distribution (despite being bounded above by 52), but this would not have a sufficiently
heavy tail and would fail to pick up the right hand mode. Likewise, a binomial model would
also fail to follow the right hand mode. Something more sophisticated and flexible is
needed.
An obvious suggestion here is that the empirical distribution should be modelled by a
theoretical distribution which has two components. Perhaps there are two kinds of people,
those who are unlikely to use their credit card in a supermarket and those who do so most
weeks. The first set of people could be modelled by a Poisson distribution with a small
probability. The second set could be modelled by a reversed Poisson distribution with its
mode around 45 or 46 weeks (the position of the mode would be a parameter to be estimated
in fitting the model to the data). This leads us to an overall distribution of the form:
$$f(x) \;=\; p\,\frac{\lambda_1^{x} e^{-\lambda_1}}{x!} \;+\; (1-p)\,\frac{\lambda_2^{52-x} e^{-\lambda_2}}{(52-x)!} \qquad (1)$$
Here p is the probability that a person belongs to the first group. Then, given that they do,
the expression $\lambda_1^{x} e^{-\lambda_1}/x!$ gives the probability that they will use their card x times in the
year. Likewise, (1-p) is the probability that they belong to the second group and
$\lambda_2^{52-x} e^{-\lambda_2}/(52-x)!$ gives the probability that such a person will use their card x times
in the year.
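For concreteness, the following short Python sketch evaluates expression (1); the parameter values are placeholders chosen purely for illustration, not estimates from the credit card data.

```python
from math import exp, factorial

def mixture_pmf(x, p, lam1, lam2, n_weeks=52):
    """Expression (1): a Poisson component plus a 'reversed' Poisson component."""
    poisson = lambda k, lam: lam ** k * exp(-lam) / factorial(k)
    return p * poisson(x, lam1) + (1 - p) * poisson(n_weeks - x, lam2)

# Illustrative parameter values only.
p, lam1, lam2 = 0.8, 2.0, 6.0
probs = [mixture_pmf(x, p, lam1, lam2) for x in range(53)]
left_mode = probs[:26].index(max(probs[:26]))
right_mode = 26 + probs[26:].index(max(probs[26:]))
print(f"modes near x={left_mode} and x={right_mode}; total probability = {sum(probs):.4f}")
```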
Expression (1) is an example of a mixture distribution. The overall distribution consists of a
mixture of two Poisson components. Clearly it leads to a much more flexible model than a
simple single Poisson distribution - at the very least, it involves three parameters instead of
just one. However, by virtue of the argument which led to it, it may also be a more realistic
description of what is underlying the data. (In fact, as it happens even this model is not a
very good fit, and deeper exploration is required.) These two aspects - the extra flexibility of
the models consequent on the larger number of parameters, and arguments based on suspicion
of a heterogeneous underlying population - mean that mixture models are widely used for modelling distributions
which are more complicated than simple standard forms. (Mixture models are also often
used to yield a flexible family of conjugate distributions in Bayesian analyses.)
The general form of a mixture distribution is
$$f(x) \;=\; \sum_{k=1}^{c} p_k\, f_k(x; \theta_k)$$
where $p_k$ is the probability that an observation will come from the kth component (the
so-called kth mixing proportion), c is the number of components, $f_k(x; \theta_k)$ is the
distribution of the kth component, and $\theta_k$ is the vector of parameters describing the kth
component (in the Poisson mixture example above, each $\theta_k$ consisted of a single term, $\lambda_k$).
In most applications the $f_k(x)$ have the same form, but there are situations where this is not
the case. The most widely used form of mixture distribution has normal components. Note
that the $p_k$ must lie between 0 and 1 and sum to 1.
Some examples of situations in which mixture distributions might be expected on theoretical
grounds are the length distribution of fish (since they hatch at a specific time of the year),
failure data (where there may be different causes of failure, and each cause results in a
distribution of failure times), time to death, and the distribution of characteristics of
heterogeneous populations of people.
Over the years, many methods have been applied in estimating the parameters of mixture
distributions. Nowadays the most popular seems to be the EM approach (Chapter XXX),
leading to maximum likelihood estimates. Display 8a.1XXX illustrates the application of
the EM algorithm in estimating the parameters of a normal mixture.
 DISPLAY 8a.1XXX
We wish to fit a normal mixture distribution
$$f(x) \;=\; \sum_{k=1}^{c} p_k\, f_k(x; \mu_k, \sigma_k)$$
where $\mu_k$ is the mean of the kth component and $\sigma_k$ is the standard deviation of the kth component.
Suppose for the moment that we knew the values of the $\mu_k$ and the $\sigma_k$. Then, for a given value of x,
the probability that it arose from the kth class would be
$$P(k \mid x) \;=\; \frac{p_k\, f_k(x; \mu_k, \sigma_k)}{f(x)} \qquad (2)$$
From this, we could then estimate the values of the $p_k$, $\mu_k$ and $\sigma_k$ as
$$\hat{p}_k \;=\; \frac{1}{n} \sum_{i=1}^{n} P(k \mid x_i) \qquad (3a)$$
$$\hat{\mu}_k \;=\; \frac{1}{n\hat{p}_k} \sum_{i=1}^{n} P(k \mid x_i)\, x_i \qquad (3b)$$
$$\hat{\sigma}_k^2 \;=\; \frac{1}{n\hat{p}_k} \sum_{i=1}^{n} P(k \mid x_i)\,(x_i - \hat{\mu}_k)^2 \qquad (3c)$$
where the summations are over the n points in the data set. This set of equations leads to an obvious
iterative procedure. We pick starting values for the $\mu_k$ and $\sigma_k$, plug them into (2) to yield estimates
$P(k \mid x_i)$, use these estimates in (3a) to (3c), and then iterate back using the updated estimates of the $\mu_k$
and $\sigma_k$, cycling round until some convergence criterion has been satisfied.
Equations (3) are very similar to those involved in estimating parameters of a single normal distribution,
except that the contribution of each point is split across the separate components, in proportion to the
estimated size of that component at the point.
Of course, procedures such as the above will give results which depend on the chosen starting values.
Because of this it is worthwhile repeating the procedure from a number of different starting positions.

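A direct transcription of the iteration in Display 8a.1 might look like the sketch below; a fixed number of iterations stands in for a proper convergence criterion, and the test data and starting values are arbitrary.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_normal_mixture(x, p, mu, sigma, n_iter=200):
    """Iterate equations (2) and (3a)-(3c) for a c-component normal mixture."""
    x = np.asarray(x, dtype=float)
    for _ in range(n_iter):
        # Equation (2): posterior probability that each point arose from each component.
        dens = np.array([pk * normal_pdf(x, mk, sk) for pk, mk, sk in zip(p, mu, sigma)])
        post = dens / dens.sum(axis=0)                          # shape (c, n)
        # Equations (3a)-(3c): re-estimate proportions, means and standard deviations.
        p = post.mean(axis=1)
        mu = (post * x).sum(axis=1) / post.sum(axis=1)
        sigma = np.sqrt((post * (x - mu[:, None]) ** 2).sum(axis=1) / post.sum(axis=1))
    return p, mu, sigma

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1.5, 200)])
print(em_normal_mixture(data, p=[0.5, 0.5], mu=[-1.0, 6.0], sigma=[1.0, 1.0]))
```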
 DISPLAY 8a.2XXX
For a Poisson mixture
$$f(x) \;=\; \sum_{k=1}^{c} p_k\, \frac{\lambda_k^{x} e^{-\lambda_k}}{x!},$$
the equations for the iterative estimation procedure analogous to those in Display 8a.1XXX take the form
$$P(k \mid x_i) \;=\; \frac{p_k\, f_k(x_i; \lambda_k)}{f(x_i)}$$
$$\hat{p}_k \;=\; \frac{1}{n} \sum_{i=1}^{n} P(k \mid x_i)$$
$$\hat{\lambda}_k \;=\; \frac{1}{n\hat{p}_k} \sum_{i=1}^{n} P(k \mid x_i)\, x_i$$

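The Poisson case is even simpler, since each component has a single parameter; a minimal sketch, again with made-up data and a fixed number of iterations, follows.

```python
import numpy as np
from scipy.stats import poisson

def em_poisson_mixture(x, p, lam, n_iter=200):
    """EM updates for a c-component Poisson mixture, following Display 8a.2."""
    x = np.asarray(x)
    for _ in range(n_iter):
        dens = np.array([pk * poisson.pmf(x, lk) for pk, lk in zip(p, lam)])
        post = dens / dens.sum(axis=0)                      # P(k | x_i)
        p = post.mean(axis=1)                               # mixing proportions
        lam = (post * x).sum(axis=1) / post.sum(axis=1)     # component means
    return p, lam

rng = np.random.default_rng(2)
counts = np.concatenate([rng.poisson(2, 400), rng.poisson(45, 100)])
print(em_poisson_mixture(counts, p=[0.5, 0.5], lam=[1.0, 30.0]))
```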
Sometimes caution has to be exercised with maximum likelihood estimates of mixture
distributions. For example, in a normal mixture, if we put the mean of one component equal
to one of the sample points and let its standard deviation tend to zero then the likelihood will
increase without limit. The maximum likelihood solution in this case is likely to be of
limited value. There are various ways round this. The largest finite value of the likelihood
might be chosen to give the estimated parameter values. Alternatively, if the standard
deviations are constrained to be equal, the problem does not arise.
Another problem which can arise is due to lack of identifiability. A family of mixture
distributions is said to be identifiable if and only if the fact that two members of the family
are equal,
$$\sum_{k=1}^{c} p_k\, f(x; \theta_k) \;=\; \sum_{j=1}^{c'} p'_j\, f(x; \theta'_j),$$
implies that c = c', and that for all k there is some j such that $p_k = p'_j$ and $\theta_k = \theta'_j$. If a
family is not identifiable, then two different members of it may be indistinguishable, which
can lead to problems in estimation.
Non-identifiability is more of a problem with discrete distributions than continuous ones
because, with m categories, only m-1 independent equations can be set up. Think of a
mixture of several Bernoulli components. The only observation here is the single proportion
of 1s, so how can we estimate the mixing proportions and the parameters of the separate
components?
Although mixture distributions are useful for analysing single variables, they are also useful
in multivariate situations. In Chapter 1XXX we briefly noted the difference between
mixture decomposition, segmentation, and cluster analysis. Here, however, we want to draw
attention to one important distinction which can characterise the difference between the
mixture models and cluster analysis. The aim of cluster analysis is to divide the data into
naturally occurring regions in which the points are closely or densely clustered, with
relatively sparse regions between them. From a probability density perspective, this will
correspond to regions of high density separated by valleys of low density, so that the
probability density function is fundamentally multimodal. However, mixture distributions,
even though they are composed of several components, may not be multimodal.
Consider the case of a two-component univariate normal mixture. Clearly, if the means are
equal, then this will be unimodal. More interestingly, a sufficient condition for the mixture
to be unimodal (for all values of the mixing proportions) when the means are different is
1  2  2 min1 ,  2  . Furthermore, for every choice of values of the means and standard
deviations in a two-component normal mixture there exist values of the mixing proportions
for which the mixture is unimodal.
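This is easy to explore numerically: evaluate the mixture density on a fine grid and count its local maxima. The sketch below (our illustration, with arbitrary parameter values) does this for one setting that satisfies the condition and one with well separated means.

```python
import numpy as np
from scipy.stats import norm

def count_modes(p, mu1, sigma1, mu2, sigma2):
    """Count local maxima of a two-component normal mixture density on a fine grid."""
    x = np.linspace(min(mu1, mu2) - 4 * max(sigma1, sigma2),
                    max(mu1, mu2) + 4 * max(sigma1, sigma2), 10000)
    f = p * norm.pdf(x, mu1, sigma1) + (1 - p) * norm.pdf(x, mu2, sigma2)
    interior = (f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])
    return int(interior.sum())

# |mu1 - mu2| <= 2 min(sigma1, sigma2): unimodal whatever the mixing proportion.
print(count_modes(0.5, 0.0, 1.0, 1.8, 1.0))   # expect 1
# Well separated means: bimodal for this mixing proportion.
print(count_modes(0.5, 0.0, 1.0, 6.0, 1.0))   # expect 2
```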
 DISPLAY 8a.3XXX
The shape of the distribution of red blood cell volumes depends on the state of health of the individual.
People with chronic iron deficiency anaemia have a lognormal distribution of microcytic cells, while
healthy people have a lognormal distribution of normocytic cells. The distribution of red blood cell
volumes in the patients may still be unimodal even after iron therapy has started. Mixture
decomposition, using lognormal components, can detect the effect of the therapy early on.

In many situations where one wants to fit a mixture distribution one will be uncertain about
how many components are appropriate. After all, in data mining, one is hunting the novel, so
one might hope to throw up interesting and unexpected insights about the population
structure. One way to choose the number of components would be to fit models with
different numbers of components and choose between them using likelihood ratio tests
(Chapter 5XXX): the size of the increase in likelihood between one model and another with
more components will indicate whether the extra component reflects a real aspect of the
underlying structure. Although fine in principle, such an approach is not valid because
conditions required for the likelihood ratio test are contravened (essentially arising from
identifiability problems). Various other proposals have been made (see Section
XXX 'further reading' below), but in data mining contexts perhaps more attention should be
paid to attempted interpretations of the components than to significance tests (with large data
sets all the data are likely to deviate significantly from all models except for the most
complex).
8a.3 Cluster analysis
This section discusses techniques for decomposing or partitioning a (usually multivariate)
data set into groups so that those points in one group are similar to each other and are as
different as possible from the points in other groups. Although the same techniques may
often be applied, we should distinguish between two different objectives. In one, which we
might call segmentation or dissection, the aim is simply to partition the data in a way which is
convenient. ‘Convenient’ here might refer to administrative convenience, practical
convenience, or any other kind. For example, a manufacturer of shirts might want to choose
just a few sizes and shapes so as to maximise coverage of the male population. He will have
to choose those sizes in terms of collar size, chest size, arm length, and so on, so that no man
has a shape too different from that of a well-fitting shirt. To do this, he will partition the
population of men into a few groups in terms of the variables collar size, chest size, and arm
length and shirts of one size will be made for each group.
In contrast to this, one might want to see if a sample of data is composed of ‘natural’
subclasses. For example, whiskies can be characterised in terms of colour, nose, body,
palate, and finish and one might want to see if they fall into distinct classes in terms of these
variables. Here one is not partitioning the data for practical convenience, but rather is
hoping to discover something about the nature of the sample or the population from which it
arose - to discover if the overall population is, in fact, heterogeneous. Technically, this is
what cluster analysis seeks to do - to see if the data fall into distinct groups, with members
within each group being similar to other members in that group but different from members
of other groups. Having said that, the term ‘cluster analysis’ is often used in general to
describe both segmentation and cluster analysis problems (and we shall also be a little lax in
this regard). In each case the aim is to split the data into classes, so perhaps this is not too
serious a misuse. It is resolved, as we shall see below, by the fact that there is now a huge
number of methods for partitioning data in this way. The important thing is to match one’s
method with one’s objective. This way, mistakes will not arise, whatever one calls the
activity.
 DISPLAY 8a.4XXX
Owners of credit cards can be split into subgroups according to how they use their card - what kind of
purchases they make, how much money they spend, how often they use the card, where they use the card,
and so on. It can be very useful for marketing purposes to identify the group to which a card owner
belongs, since he or she can then be targeted with promotional material which might be of interest to
them (this clearly benefits the owner of the card, as well as the card company). Market segmentation in
general is, in fact, a heavy user of the kind of techniques discussed in this section. The segmentation
may be in terms of lifestyle, past purchasing behaviour, demographic characteristics, or other features.
A chain store might want to study whether outlets which are similar, in terms of social neighbourhood,
size, staff numbers, vicinity to other shops, and so on, have similar turnovers and yield similar profits.
A starting point here would be to partition the outlets, in terms of the above variables, and then to
examine the distributions of turnover within each group.
Cluster analysis has been heavily used in some areas of medicine, such as psychiatry, to try to identify
whether there are different subtypes of diseases lumped together under a single diagnosis.
Cluster analysis methods are used in biology to see if superficially identical plants or creatures in fact
belong to different species. Likewise, geographical locations can be split into subgroups on the basis of
the species of plants or animals which live there.
As an example of where the difference between dissection and cluster analysis might matter, consider
partitioning the houses in a town. If we are organising a delivery service, we might want to split them in
terms of their geographical location. We would want to dissect the population of houses so that those
within each group are as close as possible to each other. Delivery vans could then be packed with
packages to go to just one group. On the other hand, a company marketing DIY products might want
to split the houses into ‘naturally occurring’ groups of similar houses. One group might consist of small
starter homes, another of three and four bedroom family homes, and another (presumably smaller) of
executive mansions.

It will be obvious from the above that cluster analysis (which, for convenience, we are taking
as including dissection techniques) hinges on the notion of distance. In order to decide
whether a set of points can be split into subgroups, with members of a group being closer to
other members of their group than to members of other groups, we need to say what we mean
by ‘closer to’. The notion of ‘distance’, and different measures of it, have been discussed in
Chapter 2XXX (This will be transferred from Ch 7). Any of the measures described there,
or, indeed, any other distance measure, can be used as the basis for a cluster analysis. As far
as cluster analysis is concerned, the concept of distance is more fundamental than the
coordinates of the points. In principle, to carry out a cluster analysis all we need to know is
the set of interpoint distances, and not the values on any variables. However, some methods
make use of ‘central points’ of clusters, and so require the raw coordinates be available.
Cluster analysis has been the focus of a huge amount of research effort, going back for
several decades, so that the literature is now vast. It is also scattered. Considerable portions
of it exist in the statistical and machine learning literatures, but other publications may be
found elsewhere. One of the problems is that new methods are constantly being developed,
sometimes without an awareness of what has already been developed. More seriously, for
very few of the methods do we have a proper understanding of their properties and of the way they
behave with different kinds of data. This has been a problem for a long time. In
the late 1970s it was suggested that a moratorium should be declared on the development of
new methods while the properties of existing methods were studied. This did not happen.
One of the reasons is that it is difficult to tell if a cluster analysis has been successful. It is
very rare indeed that a single application of a method (of exploratory data analysis or
modelling, in general, not merely cluster analysis) leads, by itself and with no other support,
to an enlightening discovery about the data. More typically, multiple analyses are needed,
looking at the data this way and that, while an interesting structure gradually comes to light.
Cluster analysis may contribute to the discovery of such structure, but one cannot typically
point to the application and say this is an example of successful application, out of the
context of the other ways the data have been examined. Bearing all this in mind, while we
encourage the use of newly developed methods, some caution should be exercised, and the
results examined with care, rather than being taken at face value.
As we shall see below, different methods of cluster analysis are effective at detecting
different kinds of cluster, and one should consider this when choosing a method. That is,
one should consider what it is one means by a ‘cluster’. To illustrate, one might take a
‘cluster’ as being a collection of points such that the maximum distance between all pairs of
points in the cluster is as small as possible. Then each point will be similar to each other
point in the cluster. An algorithm will be chosen which seeks to partition the data so as to
minimise this maximum interpoint distance (more on this below). One would clearly expect
such a method to produce compact, roughly spherical, clusters. On the other hand, one
might take a ‘cluster’ as being a collection of points such that each point is as close as
possible to some other member of the cluster - although not necessarily to all other members.
Clusters discovered by this approach need not be compact or roughly spherical, but could
have long (and not necessarily straight) sausage shapes. The first approach would simply
fail to pick up such clusters. The first approach would be appropriate in a segmentation
situation, while the second would be appropriate if the objects within each hypothesised
group could have been measured at different stages of some evolutionary process. For
example, in a cluster analysis of people suffering from some illness, to see if there were
different subtypes, one might want to allow for the possibility that the patients had been
measured at different stages of the disease, so that they had different symptom patterns even
though they belonged to the same subtype.
The important lesson to be learnt from this is that one must match the method to the
objectives. In particular, one must adopt a cluster analytic tool which is effective at
detecting clusters which conform to the definition of what one wants to mean by cluster in the
problem at hand. Having said that, it is perhaps worth adding that one should not be too
rigid about it. Data mining, after all, is about discovering the unexpected, so one must not be
too determined in imposing one’s preconceptions on the analysis. Perhaps a search for a
different kind of cluster structure will throw up things one had not previously thought of.
Broadly speaking, we can identify two different kinds of cluster analysis method: those based
on an attempt to find the optimal partition into a specified number of clusters, and those
based on a hierarchical attempt to discover cluster structure. We discuss each of these in
turn in the next two subsections.
8a.3.1 Optimisation methods
In Chapter 1XXX we described how data mining exercises were often conveniently thought of
in four parts: the task, the tool, the criterion function, and the search method. When the task
is to partition a data set, one approach is to define a criterion function measuring the quality of
a given partition and then search the space of possible partitions to find the optimal, or at least
a good, partition. A large number of different criteria have been proposed, and a wide range
of algorithms adopted.
Many criteria are based on standard statistical notions of between and within cluster variation.
The aim, in some form or another, is to find that partition which minimises within cluster
variation, while maximising between cluster variation. For example, we can define the
within cluster variation as
$$W \;=\; \sum_{k=1}^{c} W_k \;=\; \sum_{k=1}^{c} \sum_{x \in X_k} (x - \bar{x}_k)(x - \bar{x}_k)'$$
where c is the number of clusters, $X_k$ is the set of points in the kth cluster, and $\bar{x}_k$ is the
vector of means of the points in the kth cluster. This matrix summarises the deviation of the
points from the means of the clusters they are each in.
Likewise, we can define the between cluster variation as
$$B \;=\; \sum_{k=1}^{c} n_k (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})'$$
where $n_k$ is the number of points in the kth cluster, and $\bar{x}$ is the vector of overall means of
all the points. This matrix summarises the sum of squared differences between the cluster
centres.
Traditional criteria based on W and B are the trace of W, tr(W), the determinant of W, |W|,
and tr(BW⁻¹). A disadvantage of tr(W) is that it depends on the scaling adopted for the
separate variables. Alter the units of one of them and a different cluster structure may result.
(See the example in Section 2.XXX (section on distances) where we compared the relative
distances between three objects measured in terms of weight and length which resulted when
different units were used.) Of course, this can be overcome by standardising the variables
prior to analysis, but this is often just as arbitrary as any other choice. This criterion tends to
yield compact spherical clusters. It also has a tendency to produce roughly equal groups.
Both of these properties may make this criterion useful in a segmentation context, but they
are less attractive for discovering natural clusters (where, for example, discovery of a distinct
very small cluster may represent a major advance).
The |W| criterion does not have the same scale dependence as tr(W), so that it also detects
elliptic structures as clusters, but does also favour equal sized clusters. Adjustments which
take cluster size into account have been suggested (for example, dividing by $\prod_k n_k^{2/n_k}$), so
that the equal sized cluster tendency is counteracted, but it might be better to go for a
different criterion altogether than adjust an imperfect one. Note also that the original
criterion, |W|, has optimality properties if the data are thought to arise from a mixture of
multivariate normal distributions, and this is sacrificed by the modification. (Of course, if
one’s data are thought to be generated in that way, one might contemplate fitting a formal
mixture model, as outlined in Section 8a.2.2XXX.)
Finally, the tr(BW⁻¹) criterion also has a tendency to yield equal sized clusters, and this time
of roughly equal shape. Note that since this criterion is equivalent to summing the
eigenvalues of BW⁻¹ it will place most emphasis on the largest eigenvalue and hence have a
tendency to yield collinear clusters.
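For concreteness, W, B and the three criteria can be computed for any labelled partition as in the sketch below; the data and the partition are arbitrary illustrations.

```python
import numpy as np

def cluster_criteria(X, labels):
    """Compute tr(W), |W| and tr(B W^-1) for a given partition of the rows of X."""
    X = np.asarray(X, dtype=float)
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    W, B = np.zeros((p, p)), np.zeros((p, p))
    for k in np.unique(labels):
        Xk = X[labels == k]
        d = Xk - Xk.mean(axis=0)
        W += d.T @ d                                        # within cluster scatter
        ck = Xk.mean(axis=0) - grand_mean
        B += len(Xk) * np.outer(ck, ck)                     # between cluster scatter
    return np.trace(W), np.linalg.det(W), np.trace(B @ np.linalg.inv(W))

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(cluster_criteria(X, labels))
```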
The property that the clusters obtained from using the above criteria tend to have similar
shape is not attractive in all situations (indeed, it is probably a rare situation in which it is
attractive). Criteria based on other ways of combining the separate within cluster matrices
$W_k$ can relax this, for example $\sum_k |W_k|^{n_k}$ and $\sum_k |W_k|^{1/p}$, where p is the number of
variables. Even these criteria, however, have a tendency to favour similar sized clusters.
(A modification to the $\sum_k |W_k|^{n_k}$ criterion, analogous to that to the |W| criterion, which
can help to overcome this property, is to divide each $W_k$ by $n_k^{2/n_k}$. This is equivalent
to letting the distance vary between different clusters.)
A variant of the above methods uses the sum of squared distances not from the cluster means,
but from particular members of the cluster. The search (see below) then includes a search
over cluster members to find that which minimises the criterion. In general, of course,
measures other than the sum of squared distances from the cluster ‘centre’ can be used. In
particular, the influence of the outlying points of a cluster can be reduced by replacing the
sum of squared distances by the simple distances. The L1 norm has also been proposed as a
measure of distance. Typically this will be used with the vector of medians as the cluster
‘centre’.
Methods based on minimising a within class matrix of sums of squares can be regarded as
minimising deviations from the centroids of the groups. Maximal predictive classification,
developed for use with binary variables in taxonomy but able to be applied more widely, can
also be regarded as minimising deviations from group ‘centres’, though with a different
definition of centres. Suppose that each object has given rise to a binary vector, such as
(0011…1), and suppose we have a proposed grouping into clusters. Then, for each group we
can define a binary vector which consists of the most common value, within the group, of
each variable. This vector of modes (instead of means) will serve as the ‘centre’ of the
group. Distance of a group member from this centre is then measured in terms of how many
of the variables have values which differ from those in this central vector. The criterion
optimised is then the total number of differences between the objects and the centres of the
groups they belong to. The ‘best’ grouping is that which minimises the overall number of
such differences.
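A sketch of this score (the criterion only, not a search over groupings) for data held as a 0/1 array might be as follows; the small data array is made up for illustration.

```python
import numpy as np

def maximal_predictive_score(X, labels):
    """Total number of disagreements between binary objects and the modal (most common
    value) vector of their group: the quantity minimised over groupings."""
    X = np.asarray(X)
    total = 0
    for k in np.unique(labels):
        Xk = X[labels == k]
        mode = (Xk.mean(axis=0) >= 0.5).astype(int)   # most common value of each variable
        total += int((Xk != mode).sum())
    return total

X = np.array([[0, 0, 1, 1],
              [0, 1, 1, 1],
              [1, 1, 0, 0],
              [1, 0, 0, 0]])
print(maximal_predictive_score(X, np.array([0, 0, 1, 1])))   # 2 disagreements
```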
Hierarchical methods of cluster analysis, described in the next section, do not construct a
single partition of the data, but rather construct a hierarchy of (typically) nested clusters.
One can then decide where to cut the hierarchy so as to partition the data in such a way as to
obtain the most convincing partition. For optimisation methods, however, it is necessary to
decide at the start how many clusters one wants. Of course, one can rerun the analysis
several times, with different numbers of clusters, but this still requires one to be able to
choose between competing numbers. There is no ‘best’ solution to this problem. One can,
of course, examine how the clustering criterion changes as one increases the number of
clusters, but this may not be comparable across different numbers (for example, perhaps the
criterion shows apparent improvement as the number increases, regardless of whether there is
really a better cluster structure - in just the same way as the sum of squared deviations
between a model and data decreases as the number of parameters increases, discussed in
Section 5.XXX). For a multivariate uniform distribution divided optimally into c clusters,
the criterion $c^2|W|$ asymptotically takes the same value for all c, so this could be used to
compare partitions into different numbers.
It will be apparent from the above that cluster analysis is very much a data-driven tool, with
relatively little formal model-building underlying it. Some researchers have attempted to put
it on a sounder model-based footing. For example, one can supplement the procedures by
assuming that there is also a random process generating sparsely distributed points uniformly
across the whole space. This makes the methods less susceptible to outliers.
So much for the criteria which one may adopt. Now what about the algorithms to optimise
those criteria? In principle, at least, the problem is straightforward. One simply searches
through the space of possible assignments of points to clusters to find that which minimises
the criterion (or maximises it, depending on the chosen criterion). A little calculation shows,
however, that this is infeasible except for the smallest of problems. (The number of possible
allocations of n objects into c classes is
$$\frac{1}{c!} \sum_{i=0}^{c} (-1)^{i} \binom{c}{i} (c-i)^{n},$$
so that, for example, there are some $10^{30}$ possible allocations of 100 objects into 2 classes.)
Since, by definition, data
mining problems are not small, such crude exhaustive search methods are not applicable.
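The count quoted above is easy to verify exactly with integer arithmetic (the helper name below is ours):

```python
from math import comb, factorial

def n_allocations(n, c):
    """Number of allocations of n objects into c classes, using the formula above."""
    return sum((-1) ** i * comb(c, i) * (c - i) ** n for i in range(c + 1)) // factorial(c)

print(n_allocations(100, 2))   # about 6.3 x 10^29, i.e. 'some 10^30'
print(n_allocations(10, 3))    # 9330: already sizeable for a tiny problem
```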
For some clustering criteria, methods have been developed which permit exhaustive coverage
of all possible clusterings without actually carrying out an exhaustive search. These include
branch and bound methods, which eliminate potential clusterings on the grounds that they
have worse criterion values than alternatives already found, without actually evaluating the
criterion values for the potential clusterings. Such methods, while extending the range over
which exhaustive evaluation can be made, still break down for large data sets. For this
reason, we do not examine them further here.
If exhaustive search is infeasible, one must resort to methods which restrict the search in
some way. Iterative and sequential algorithms are particularly popular for cluster analysis,
and they often make use of a stochastic component.
For example, the k-means algorithm is based on the tr(W) criterion above, which gives the
sum of squared deviations between the sample points and their respective cluster centres.
There are several variants of the k-means algorithm. Essentially, it begins by picking cluster
centres, assigns the points to clusters according to which is the closest cluster centre,
computes the mean vectors of the points assigned to each cluster, and uses these as new
centres in an iterative approach. A variation of this is to examine each point in turn and
update the cluster centres whenever a point is reassigned, repeatedly cycling through the
points until the solution does not change. If the data set is very large, one can simply add in
each data point, without the recycling. Further extensions (e.g. the ISODATA algorithm)
include splitting and/or merging clusters.
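A minimal version of the basic iteration (a full pass of assignments, then recompute the centres) might look like the following sketch; the random initialisation and simple stopping rule are placeholder choices rather than features of any particular published variant.

```python
import numpy as np

def k_means(X, c, n_iter=100, seed=0):
    """Basic k-means: assign each point to its nearest centre, then recompute the centres."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centres = X[rng.choice(len(X), size=c, replace=False)]   # random starting centres
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centres = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centres[k] for k in range(c)])
        if np.allclose(new_centres, centres):                # no change: stop
            break
        centres = new_centres
    return labels, centres

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
labels, centres = k_means(X, c=2)
print(centres)
```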
Since many of the algorithms hinge around individual steps in which single points are added
to a cluster, updating formulae have often been developed. In particular, such formulae have
been developed for all of the criteria involving W above.
Although each step of such algorithms leads to an improvement in the clustering criterion, the
search is still restricted to only a part of the space of possible partitions. It is possible that a
good cluster solution will be missed. One way to alleviate (if not solve) this problem is to
carry out multiple searches from different randomly chosen starting points for the cluster
centres. One can even take this further and adopt a simulated annealing strategy, though
these are slow at the best of times and will be infeasible if the data set is large.
Since cluster analysis is essentially a problem of searching over a huge space of potential
solutions to find that which optimises some objective function, it will come as no surprise to
learn that various kinds of mathematical programming methods have been applied. These
include linear programming, dynamic programming, and linear and non-linear integer
programming.
8a.3.2 Hierarchical methods
Whereas optimisation methods of cluster analysis begin with a specified number of clusters
and search through possible allocations of points to clusters to find an allocation which
optimises some clustering criterion, hierarchical methods gradually merge points or divide
superclusters. In fact, on this basis we can identify two distinct types of hierarchical method:
the agglomerative (which merge) and the divisive (which divide). We shall deal with each in
turn. The agglomerative are the more important and widely used of the two. Note that
hierarchical methods can be viewed as a particular (and particularly straightforward) way to
reduce the size of the search. They are analogous to stepwise methods used for model
building in other parts of this book. (maybe refs to other sections hereXXX)
Hierarchical methods of cluster analysis permit a convenient graphical display, in which the
entire sequence of merging (or splitting) of clusters is shown. Because of its tree-like nature,
such a display is called a dendrogram. We illustrate below.
 DISPLAY 8a.5XXX
Cluster analysis is of most use when there are more than two variables, of course. If there are only two,
then one can eyeball a scatterplot, and look for structure. However, to illustrate the ideas on a data set
where we can see what is going on, we apply a hierarchical method to some two dimensional data. The
data are extracted from a larger data set given in Azzalini and Bowman (1990). Figure XXX1 shows a
scatterplot of the data. The vertical axis is the time between eruptions and the horizontal axis is the
length of the following eruption, both measured in minutes. The points are given numbers in this plot
merely so that we can relate them to the dendrogram in this exposition, and have no other substantive
significance.
Figure XXX2 shows the dendrogram which results from merging the two clusters which leads to the
smallest increase in within cluster sum of squares. The height of the crossbars in the dendrogram (where
branches merge) shows the value of this criterion. Thus, initially, the smallest increase is obtained by
merging points 11 and 33, and from Figure XXX1 we can see that these are indeed very close (in fact,
the closest). The next merger comes from merging points 2 and 32. After a few more mergers of
individual pairs of neighbouring points, point 31 is merged with the cluster consisting of the two points 8
and 15, this being the merger which leads to least increase in the clustering criterion. This procedure
continues until the final merger, which is of two large clusters of points. This structure is evident from
the dendrogram. (It need not always be like this. Sometimes the final merger is of a large cluster with
one single outlying point - as we shall see below.) The hierarchical structure displayed in the
dendrogram also makes it clear that one could terminate the process at other points. This would be
equivalent to making a horizontal cut through the dendrogram at some other level, and would yield a
different number of clusters.

Figure XXX1: Time between eruptions versus duration of following eruption for Old
Faithful geyser.
[Data in file SDF17]
[Figure graphics omitted: the scatterplot's axes are labelled Wait and Duration, in minutes, and its points are numbered 1 to 35; the dendrogram's vertical scale shows the value of the merging criterion.]
Figure XXX2: Dendrogram for a cluster analysis applied to the data in Figure XXX1.
8a.3.2.1 Agglomerative methods
Agglomerative methods are based on measures of distance between clusters. Essentially,
given an initial clustering, they merge those two clusters which are nearest, to form a reduced
number of clusters. This is repeated, each time merging the two closest clusters, until just
one cluster, of all the data points, exists. Usually the starting point for the process is the
initial clustering in which each cluster consists of a single data point, so that the procedure
begins with the n points to be clustered.
Analogously to optimisation methods, the key to agglomerative methods will be seen to have
two parts: the measure of distance between clusters, and the search to find the two closest
clusters. The latter part is less of a problem here than was the search in optimisation
methods, since there are at most n(n-1)/2 pairs to examine.
Many measures of distance between clusters have been proposed. All of the criteria
described in Section XXX8a.3.1 can be used, using the difference between the criterion value
before merger and that after merging two clusters. However, other distance measures are
especially suited to hierarchical methods. One of the earliest and most important of these is
the nearest neighbour or single link method. This defines the distance between two clusters
as the distance between the two closest points, one from each cluster. The single link
method is susceptible (which may be a good or bad thing, depending upon one’s objectives)
to the phenomenon of ‘chaining’, in which long strings of points are assigned to the same
cluster (contrast the production of compact spherical clusters discussed in Section
XXX8a3.1). This means that the single link method is of limited value for segmentation. It
also means that the method is sensitive to small perturbations of the data and to outlying
points (which, again, may be good or bad, depending upon what one is trying to do). The
single link method also has the property (for which it is unique - no other measure of distance
between clusters possesses it) that if two pairs of clusters are equidistant it does not matter
which is merged first. The overall result will be the same, regardless of the order of merger.
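In practice one would normally call a library routine rather than code an agglomerative method from scratch. The sketch below assumes scipy is available and contrasts single link with complete link on data deliberately arranged in two elongated, sausage-shaped groups.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
# Two elongated, roughly parallel 'sausages': a structure single link can follow by chaining.
t = np.linspace(0, 3 * np.pi, 100)
X = np.vstack([np.c_[t, np.sin(t)] + rng.normal(0, 0.05, (100, 2)),
               np.c_[t, np.sin(t) + 2.5] + rng.normal(0, 0.05, (100, 2))])

Z_single = linkage(X, method='single')      # nearest neighbour distance between clusters
Z_complete = linkage(X, method='complete')  # furthest neighbour distance between clusters

# Cut each hierarchy into two clusters and compare the resulting partitions.
print(fcluster(Z_single, t=2, criterion='maxclust')[:10])
print(fcluster(Z_complete, t=2, criterion='maxclust')[:10])
```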
 DISPLAY 8a.6XXX
The dendrogram from the single link method applied to the data in Figure XXX1 is shown in Figure
XXX3. Note that although the initial merges are the same as those in Figure XXX2, the two methods
soon start to differ, and the final high level structure is quite different. Note in particular, that the final
merge of the single link method is to combine a single point (number 30) with the cluster consisting of
all of the other points.

Figure XXX3: Dendrogram of the single link method applied to the data in Figure XXX1.
At the other extreme from single link, furthest neighbour or complete link takes as the
distance between two clusters the distance between the two most distant points, one from
each cluster. This imposes a tendency for the groups to be of equal size in terms of the
volume of space occupied (and not in terms of numbers of points), so making this measure
particularly appropriate for segmentation problems.
Other important measures, intermediate between single link and complete link, include the
centroid measure (the distance between two clusters is the distance between their centroids),
the group average measure (the distance between two clusters is the average of all the
distances between pairs of points, one from each cluster), and Ward’s measure (the distance
between two clusters is the difference between the total within cluster sum of squares for the
two clusters separately, and the within cluster sum of squares resulting from merging the two
clusters - see the tr(W) criterion discussed in Section XXX8a3.1). Each such measure has
slightly different properties.
Other variants also exist - the median measure ignores the size of clusters, taking the ‘centre’
of a combination of two clusters to be the mid-point of the line joining the centres of the two
components.
Since one is seeking the novel in data mining, it may well be worthwhile experimenting with
several measures, in case one throws up something unusual and interesting.
8a.3.2.2 Divisive methods
Just as stepwise methods of variable selection can start with none and gradually add variables
according to which lead to most improvement (analogous to agglomerative cluster analysis
methods) so they can also start with all the variables and gradually remove those whose
removal leads to least deterioration in the model.
This second approach is analogous to
divisive methods of cluster analysis. Divisive methods begin with a single cluster composed
of all of the data points, and seek to split this into components. These further components
are then split, and the process is taken as far as necessary. Ultimately, of course, it will end
with a partition in which each cluster consists of a single point.
Monothetic divisive methods split clusters using one variable at a time (so they are analogous
to the basic form of tree method of supervised classification discussed in Chapter XXX).
This is a convenient (though restrictive) way to limit the number of possible partitions which
must be examined. It has the attraction that the result is easily described by the dendrogram
- the split at each node is defined in terms of just a single variable. The term association
analysis is sometimes used to describe monothetic divisive procedures applied to multivariate
binary data. This is not the same as the ‘association analysis’ described in Chapter XXX.
Polythetic divisive methods make splits on the basis of all of the variables together. Any
inter-cluster distance measure can be used. The difficulty comes in deciding how to choose
potential allocations to clusters - that is, how to restrict the search through the space of
possible partitions. In one method, objects are examined one at a time, and the object transferred
from the main cluster to a subcluster is the one whose transfer leads to the greatest improvement in the
clustering criterion.
Divisive methods are more computationally intensive and seem to be less widely used than
agglomerative methods.
8a.3.3 Fuzzy clusters and overlapping clusters
By definition, a partitioning of a set of points divides them into subsets such that each point
belongs to only one subset. Sometimes it can be advantageous to relax this, and permit
objects to belong to more than one group. This is clearly closely related to the ideas of
mixture models, in which a given position in the space spanned by the variables can be given
probabilities of belonging to each of the components. (Proponents of fuzzy set theory avoid
the probability interpretation and adopt axioms which differ slightly from those of
probability.) One such method generalises the sum of squared distances criterion
$$\mathrm{tr}(W) \;=\; \mathrm{tr}\!\left[\sum_{k=1}^{c} \sum_{x \in X_k} (x - \bar{x}_k)(x - \bar{x}_k)'\right] \;=\; \sum_{k=1}^{c} \sum_{x \in X_k} \sum_{r} (x_r - \bar{x}_{kr})^2$$
(where r refers to summation over the variables) to
$$\sum_{k=1}^{c} \sum_{x} u_{kx} \sum_{r} (x_r - \bar{x}_{kr})^2$$
where $u_{kx}$ is the membership function for object x in the kth cluster. (The values of $u_{kx}$ are
all positive and the sum over the groups, for each point, is 1.)
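The most widely used procedure of this kind is fuzzy c-means, which alternates updates of the memberships and the cluster centres. Note that it minimises a version of the criterion above in which the memberships are raised to a fuzziness exponent m > 1, an ingredient not present in the expression as written; the sketch below, with arbitrary test data, therefore illustrates the general idea rather than that exact criterion.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Standard fuzzy c-means: alternate the membership and centre updates."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    u = rng.dirichlet(np.ones(c), size=len(X))      # memberships sum to 1 for each point
    for _ in range(n_iter):
        w = u ** m
        centres = (w.T @ X) / w.sum(axis=0)[:, None]
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2) + 1e-12
        inv = d2 ** (-1.0 / (m - 1))
        u = inv / inv.sum(axis=1, keepdims=True)    # new memberships, summing to 1
    return u, centres

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
u, centres = fuzzy_c_means(X, c=2)
print(u[:3].round(2))
print(centres.round(2))
```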
While fuzzy clustering procedures decide to what extent each point belongs to several
clusters, another model is to let each point (possibly) belong to more than one cluster. One
method for this begins with a similarity matrix S and seeks to approximate it by the form
DUD’, where U is a diagonal matrix of weights and D is an n by c indicator matrix in which
the columns correspond to clusters and the rows to points. Dij takes the value 1 whenever
point i is in cluster j, and 0 otherwise. That clustering which leads to the best approximation
of S is the solution, where the approximation is in terms of the sum of squared deviations of
DUD’ from S.
8a.4 Modelling multivariate distributions through graph structures
In many data mining problems, our aim is to identify and model relationships between
variables. Graphical representations have obvious attractions for this purpose. The word
‘graphical’ here is intended in the mathematical sense: the models can be represented as
graphs in which the nodes represent variables and the edges represent relationships between
variables. Several different types of graphical representation have been developed, and we
shall explore the more important ones in this section. They are characterised by the nature of
the relationships that the edges of the graph represent. We begin, in the next section, with
the earliest formal model of this type, that of path analysis. Path analysis decomposes
relationships between variables which have been explicitly measured. In Section XXX we
generalise these ideas to other covariance structure models. More recently, there has been
tremendous interest in another, related class of models, based not on the structure of the
relationships between variables, but on decomposing the overall multivariate distribution of
the variables. These models go under various names, including Bayesian belief networks
and conditional independence models. These are described in Section XXX.
8a.4.1 Path analysis and causal models
A crude summary (exact for ellipsoidal distributions) of the relationships, in a population or
sample, between multiple variables is given by the covariance matrix. With p variables, this
is a p by p matrix in which the (i,j)th off-diagonal element contains the covariance between
the ith and jth variables and the ith diagonal element contains the variance of the ith variable.
The off-diagonal elements thus provide information about the two-dimensional marginal
distributions of the variables: they tell us if larger values of one variable are associated with
larger values of another, and so on. Such a matrix provides a quantitative summary of the
information seen graphically in a scatterplot matrix (see Chapter 7XXX).
Two dimensional marginal relationships are all very well, but they can be deceptive. It is
entirely possible that such a relationship is spurious, in the sense that both variables are
related to a third, so that the observed marginal relationship is a consequence of their joint
relationship to this third. Perhaps, if the third variable is fixed at any particular value, the
relationship between the two variables will vanish. What is needed is some way to tease
apart the relationships between variables. What we need then, is some way to find
convenient, and realistic, summaries of the covariances between variables. Can these
covariances be explained by a causal structure relating the variables, on which we can base
improved understanding and prediction? Is there, in fact, just one single underlying variable
(perhaps explicitly measured, perhaps not) which explains the non-zero covariances between
a hundred other variables? And so on. Moreover, as we discussed in Chapter 5XXX, there
are advantages to replacing a model with a large number of parameters by one with few (for
example, replacing the 5050 parameters in an unconstrained 100 by 100 covariance matrix by
one with just 100 parameters will lead to much less variability in the estimates.)
Path analysis, having its origins in the 1920s with work in genetics by Sewall Wright (1921,
1923, 1934), was the earliest formal model of this kind. It is a way of decomposing linear
relationships between measured variables, in such a way that strengths can be attributed to
causal paths between them. We should begin by saying what we mean by ‘causal’. For us,
in this section, x is a cause of y if and only if the value of y can be changed by manipulating x
and x alone. Subtleties arise because the effect of x on y may be mediated by some other,
intermediate, variable z. That is, x may influence z which in turn influences y. Or, even
more complicated, x and z may both influence y and perhaps their effects cancel out so that
there appears to be no marginal effect of x on y. Clearly some care is needed in teasing apart
the relationships.
The basic form of path analysis is based on two fundamental assumptions:
(a) that the variables have a given weak causal order. That is, we can order the
variables so that earlier ones may (but need not) influence later ones, but later ones
cannot influence earlier ones. The direction of cause is denoted by an arrow in a
path model. This order is determined from outside the model, and is not derived
from the analysis. The existence of this order means that the graphs of a path
analysis model are directed.
(b) the set of relationships between the observed variables is causally closed. This
means that a non-zero covariance between two variables x and y can be explained
either by the direct effect of one of them on the other, or by indirect effects due to
other variables in the model.
We shall say more about these two requirements below.
The marginal effect of some variable x on another variable y can be determined from the
simple regression equation y = bx + e. Here e is the error term and b is the regression
coefficient of x on y. b tells us the expected difference between y values for two
observations which happen to differ by one unit on x. Note, however, that this does not
necessarily mean that changing the value of x by one unit will cause a change of b units in y
for a particular object. The regression relationship may or may not be a causal relationship.
It might simply reflect the effect of a third variable: x and y happen to be correlated because
of the way the sample was drawn. (For example, let x be the size of vocabulary and y the
complexity of arithmetic problems that children in a sample covering various ages can solve.
Then we will probably find a positive value for b. However, it would be unwise to claim
that increasing a child’s vocabulary would enhance their arithmetic ability. It is more likely
that both vocabulary and arithmetic skill are causally related to number of years of
schooling.) In general, we may find that some component of the relationship between two
variables x and y is attributable to the direct causal effect of one on the other, while the
remainder is attributable indirectly to other variables. Path analysis separates these
components.
Just as, in regression, one can work either with standardised or with unstandardised variables,
so one can in path analysis. Conventionally, in path analysis, coefficients based on
unstandardised variables are called effect coefficients, while those based on standardised
variables are called path coefficients. The difference is not important for us, though, of
course, it can be important when interpreting the results.
The basic structures which arise in path analysis are shown in Figure XXX4, where the
arrows indicate the direction of causation. Figure XXX4(a) shows a situation in which
variable x may be a cause of y, which may be a cause of z. From the model we see that x
may have an indirect effect on z. We also see that none of the reverse relationships hold - y
cannot be a cause of x, and so on - as a result of the specified weak causal order. Note that,
while y has a direct effect on z, x only has an indirect effect: the effect of x on z is via (and
only via) y. There is no arrow going directly from x to z.
In Figure XXX4(b), variable x is a potential cause of y and also a potential cause of z. That
is, if we change the value of x then the value of y and the value of z may change. The
reverse does not hold - changing the value of y or z does not influence x. This model is
interesting because there is no direct link between y and z, but there may be a nonzero
marginal covariance between these two variables. (The vocabulary and arithmetic skill
example above illustrates this, with a relabelling of the variables.) The relationship between
y and z is, for obvious reasons, termed noncausal. This sort of relationship is responsible for
many misinterpretations in data mining, arising when variable x is not explicitly measured.
If objects with a range of x values are measured, one will observe a covariance between y and
z which cannot be explained (since x is not observed). One might, therefore, be tempted to
deduce the existence of a causal relationship when in fact it is entirely spurious. (Of course,
sometimes it is the marginal relationship between y and z which is of interest. In this case
one clearly would not want to control for x, even if one could.)
In Figure XXX4(c) variables x and y each have a separate causal influence on z. The lack of an
edge linking x and y means that they are marginally independent: in a random sample of objects,
the covariance between x and y will be zero, apart from sampling fluctuation. Conditioning on the
common effect z, however, can induce a relationship between x and y. Yet again the path and
regression coefficients are identical.
Finally, in Figure XXX4(d), x has a direct effect on y, both a direct and an indirect effect on
z, and y has a direct effect on z. This is the most interesting of the four structures in Figure
XXX4. The regression model y ~ x yields the path coefficient of x on y. The regression
model z ~ x + y yields the path coefficients of x and y on z; the coefficient of x in this model
tells us the direct effect of x on z. The indirect effect arises from the path via y and is the
product of the effect of x on y and the effect of y on z.
Figure XXX4: Some basic causal structures involving three variables.
[Diagram: (a) x → y → z; (b) y ← x → z; (c) x → z ← y; (d) x → y → z together with a direct arrow from x to z.]
Figure XXX5: An example of direct and indirect causal relationships. The sizes of the path
coefficients are shown.
[Diagram: x → y → z with a direct arrow from x to z; path coefficients 0.2 (x → y), 0.5 (y → z), and 0.6 (x → z).]
 DISPLAY 8a.7 XXX
The numbers in Figure XXX5 were derived from the following regression models
y = 0.2x + e1
z = 0.5y + 0.6x + e2
z = 0.7x + e3
The total effect of x on z is 0.7. That is, unit change in x induces a total change of 0.7 units in z. On
the other hand, the direct effect of x on z is only 0.6. This is the effect not mediated by other variables.
The difference between the direct and total effect of x on z is the indirect effect, via y. The size of this is
0.2 (the direct effect of x on y) times 0.5 (the direct effect of y on z), that is 0.1. Thus the total effect of x
on z is 0.7 = 0.5×0.2 + 0.6.
The path coefficient between y and z is 0.5. This shows the direct causal effect of y on z. However, this
is not what would be obtained from a simple regression of z on y. The covariance between these two
variables has an additional component due to the fact that both are related to x. If we fit a simple linear
regression, predicting z from y, we obtain z = 0.62y + e4. The coefficient here is composed of the direct
effect of y on z (0.5) plus the spurious effect induced by x, of magnitude 0.2×0.6 = 0.12.
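The decomposition can be checked numerically. The following short Python sketch is purely illustrative: only the coefficients 0.2, 0.5, and 0.6 come from the display, everything else is invented. It simulates standardised data from the path model of Figure XXX5 and recovers the direct, total, and spurious components from the corresponding regressions.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Generate standardised data consistent with the path model of Figure XXX5
# (path coefficients 0.2 for x -> y, 0.5 for y -> z, 0.6 for x -> z).
x = rng.standard_normal(n)
y = 0.2 * x + np.sqrt(1 - 0.2**2) * rng.standard_normal(n)
z = (0.6 * x + 0.5 * y
     + np.sqrt(1 - (0.6**2 + 0.5**2 + 2 * 0.6 * 0.5 * 0.2)) * rng.standard_normal(n))

def coeffs(target, *predictors):
    # Least-squares coefficients; the variables have zero mean, so no intercept.
    X = np.column_stack(predictors)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return beta

print(coeffs(y, x))      # ~0.2  : direct effect of x on y
print(coeffs(z, x, y))   # ~[0.6, 0.5] : direct effects of x and y on z
print(coeffs(z, x))      # ~0.7  : total effect of x on z = 0.6 + 0.2*0.5
print(coeffs(z, y))      # ~0.62 : direct effect 0.5 plus spurious 0.2*0.6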

 DISPLAY 8a.8 XXX
Figure XXX6 shows a more complicated situation with four variables. Here w is a direct cause of x, y,
and z; x is a direct cause of y and z; and y is a direct cause of z. Or, to put it in a way which makes
clearer the recursive regression relationships: x is caused by w; y is caused by x and w; and z is caused by
w, x, and y. The various direct and indirect causal paths are decomposed in Table XXX1. In this table
the notation r(x,y) stands for the total covariation between x and y, p(x,y) stands for the path coefficient
between the two variables (that part of the total covariation which is attributable to the direct causal effect of
x on y), and b(x,y|z) stands for the regression coefficient of x in a model predicting y from x, controlling for z.
Figure XXX6: A more complicated causal model involving four variables.
[Diagram: arrows w → x, w → y, w → z, x → y, x → z, and y → z.]
____________________________________________________________________________
Table XXX1: Decomposition of causal paths in Figure XXX6, showing the direct and indirect contributions
to the total covariation between each pair of variables.
____________________________________________________________________________
Relation   Total         Causal:   Causal:                    Causal:               Noncausal
between    covariation   Direct    Indirect                   Total
____________________________________________________________________________
wx         r(w,x)        r(w,x)    0                          r(w,x)                0
wy         r(w,y)        p(w,y)    p(x,y)p(w,x)               r(w,y)                0
wz         r(w,z)        p(w,z)    p(y,z)p(x,y)p(w,x)         r(w,z)                0
                                   + p(y,z)p(w,y)
                                   + p(x,z)p(w,x)
xy         r(x,y)        p(x,y)    0                          p(x,y)                r(x,y) - p(x,y)
                                                                                    = p(w,y)p(w,x)
xz         r(x,z)        p(x,z)    p(y,z)p(x,y)               b(x,z|w)              r(x,z) - b(x,z|w)
yz         r(y,z)        p(y,z)    0                          p(y,z) = b(y,z|w,x)   r(y,z) - p(y,z)
____________________________________________________________________________

The models we have discussed above are very simple. They require that assumptions (a) and
(b) be satisfied, and when this is the case, they decompose the overall covariation between
two variables into direct causal influences, indirect causal influences, and noncausal
relationships in which both variables are jointly influenced by other variables. When the
assumptions are justified, there is no ambiguity or uncertainty about the model or its
interpretation. However, assumptions (a) and (b) are often difficult to justify. As far as (a)
is concerned, if x precedes y in time, then the direction of any possible causal relationship is
clear, but when this is not the case it is common to find that one cannot unambiguously
decide whether a change in one variable is causing a change in another or vice versa.
Although path analysis has been extended to these more general situations, this is not always
without difficulty, and the clear interpretation of the model is sometimes lost.
The relaxation of assumption (b), by seeking to explain relationships between observed
variables in terms of other, unmeasured (or latent) variables, has led to a broad class of
models called linear structural relational models or covariance structure models. We
discuss these in the next section. Such models are not without their problems (of fitting, of
interpretation, and so on), but then what models are? They have been described as ‘the most
important and influential statistical revolution to have occurred in the social sciences’ and,
handled with care and applied appropriately, can be immensely useful and revealing.
When the graph is complete, so that there is an edge connecting each pair of variables, then
each variable is potentially influenced by all those that precede it in the order given by
assumption (a). Estimation of the path coefficients is then straightforward: one simply
regresses each variable on those preceding it - obtaining the kind of decomposition illustrated
in Display XXX8a.8. With p variables, this means that p-1 multiple regressions provide
enough information to compute all the path coefficients. (For this reason, such models are
called univariate recursive regression models.)
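As a sketch of how this might be coded (the function name and interface are invented for illustration), the p-1 regressions can be carried out in a single loop, provided the columns of the data matrix are arranged in the assumed weak causal order:

import numpy as np

def recursive_path_coefficients(data):
    # data: (n, p) array whose columns are centred (or standardised) and
    # arranged in the assumed weak causal order. Returns a dict mapping the
    # index of each variable to its path coefficients on all preceding ones.
    n, p = data.shape
    paths = {}
    for j in range(1, p):
        X = data[:, :j]                     # all variables preceding variable j
        beta, *_ = np.linalg.lstsq(X, data[:, j], rcond=None)
        paths[j] = beta
    return paths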
Things are a little more subtle, however, if one wants to impose prior restrictions on some of
the paths; if one knows, for example, that certain pairs of variables are not linked
(corresponding to missing edges in the graph). Then the system involves more equations
than there are unknowns and becomes overidentified. The simplest illustration of this arises
in Figure XXX4(a), where x may cause y, which, in turn, may cause z, but where it is known
that there is no direct link from x to z. Suppose the sample covariation between x and y is c1,
that between y and z is c2, and that between x and z is c3. The equations from which we
must estimate the path coefficients p(a,b) are then p(x,y) = c1, p(y,z)=c2, and p(x,y)p(y,z)=c3.
It seems that this could lead to a contradiction: p(x,y)=c1 and p(x,y)=c3/c2. Various ways of
tackling this are used. One can simply estimate the parameters using the regressions of
variables for which there are explicit links in the model. More generally, a measure of the
discrepancy between the sample and theoretical covariance matrices is minimised - for
example, a weighted sum of squares of the differences between the elements of the two
matrices.
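For the over-identified chain of Figure XXX4(a) this discrepancy-minimising approach is easy to sketch. In the fragment below (an illustrative sketch, not a standard routine) the two path coefficients are chosen to minimise the sum of squared differences between the three sample correlations and the correlations the chain model implies:

from scipy.optimize import minimize

def fit_chain(c1, c2, c3):
    # c1 = sample corr(x, y), c2 = sample corr(y, z), c3 = sample corr(x, z).
    # With standardised variables the chain x -> y -> z implies
    # corr(x, y) = a, corr(y, z) = b, and corr(x, z) = a * b.
    def discrepancy(params):
        a, b = params
        return (a - c1) ** 2 + (b - c2) ** 2 + (a * b - c3) ** 2
    result = minimize(discrepancy, x0=[c1, c2])
    return result.x

# e.g. fit_chain(0.2, 0.5, 0.12) gives estimates close to (0.2, 0.5).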
8a.4.2 Structural equation models
Broadly speaking, structural equation models seek to explain the relationships between
observed variables in terms of unobserved or latent variables, which may play the role of
explanatory or response variables. The simplest model of this kind is the factor analysis
model, which explains the
covariances between observed variables in terms of their linear relationship to one or more
unobserved ‘factors’. That is, the factor analysis model has the form
x = Λf + u     (1)
Here x is the vector of p measurements on an object, f is a vector of length k containing the
(unobserved) scores that this object has on the k latent factors, Λ is a p × k matrix of coefficients or
‘factor loadings’ linking x and f, and u is a random vector (typically taken to be multivariate
normal with zero expectation and diagonal covariance matrix Ψ; the diagonality means that the
observed variables are conditionally independent given the factors). From this model one can
derive the theoretical form of the covariance matrix of the observed scores to be
Σ = ΛΛ′ + Ψ     (2)
Nowadays, maximum likelihood methods are probably the most popular ways of estimating
the factor loadings (typically through iteratively reweighted least squares methods of
minimising the discrepancy between observed and theoretical covariance matrices), though
software packages often provide alternatives.
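As an illustration of the kind of routine such packages provide, the following sketch uses scikit-learn's maximum likelihood factor analysis (one of several possibilities; the data are simulated here and the variable names are invented) to estimate the loadings Λ and the specific variances Ψ, and to reconstruct the implied covariance matrix of (2):

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n, p, k = 500, 6, 2
Lambda_true = rng.normal(size=(p, k))       # hypothetical 'true' loadings
f = rng.standard_normal((n, k))             # latent factor scores
u = 0.5 * rng.standard_normal((n, p))       # specific (unique) components
X = f @ Lambda_true.T + u                   # model (1): x = Lambda f + u

fa = FactorAnalysis(n_components=k).fit(X)
Lambda_hat = fa.components_.T               # estimated loadings, shape (p, k)
Psi_hat = np.diag(fa.noise_variance_)       # estimated diagonal Psi
Sigma_hat = Lambda_hat @ Lambda_hat.T + Psi_hat   # implied covariance, as in (2)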
Factor analysis has not had an altogether untroubled history. There are three main reasons
for this. One is that, as mentioned in Chapter 7XXX, the solutions are not unique. This is
easily seen since the matrix ΛM, with M orthogonal (a rotation of the original solution in
the space spanned by the factors), also satisfies (2): Σ = (ΛM)(ΛM)′ + Ψ = ΛΛ′ + Ψ. In fact, this can be an
advantage, rather than a disadvantage, since one can choose the solution (by varying M) to
satisfy some additional criterion. In particular, one can choose it so that the factor solutions
facilitate interpretation. For example, one can try to arrange things so that each factor has
either large or small weights on each of the variables (and few intermediate weights). Then
a factor is clearly interpretable in terms of the variables it contributes substantially to. This
notion of ‘interpretability’ is the second reason why the technique has, in the past, had its
critics. There is sometimes a tendency to overinterpret the results, to assign descriptive
names to the factors, which then take on an unjustified authority of their own. Finally, the
third reason arises from the fact that, in the early days, factor analysis was an essentially
exploratory tool (this is no longer the case - see below). It was often used as an attempt to
see if observed relationships could be explained by latent factors, without there being
theoretical reasons for expecting this. As we have been at pains to point out in this book, this is fine,
provided one does not assert too much about the reality of unearthed relationships: they need
to be checked and verified to ensure that they are not chance phenomena associated with the
particular data set to hand. Having said all of the above, factor analysis is now well
understood and is a respectable data analytic technique. It is widely used in areas ranging
from psychology and sociology through botany, zoology, ecology, to geology and the earth
sciences.
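The rotation argument above is easy to verify numerically. The short check below (with arbitrary, invented loadings and specific variances) confirms that rotating the loadings by any orthogonal matrix M leaves the implied covariance matrix of (2) unchanged, which is why the two solutions cannot be distinguished from the data:

import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(2)
p, k = 6, 2
Lambda = rng.normal(size=(p, k))                 # arbitrary loadings
Psi = np.diag(rng.uniform(0.2, 1.0, size=p))     # arbitrary specific variances
M = ortho_group.rvs(dim=k, random_state=0)       # random orthogonal k x k matrix

Sigma = Lambda @ Lambda.T + Psi
Sigma_rotated = (Lambda @ M) @ (Lambda @ M).T + Psi
assert np.allclose(Sigma, Sigma_rotated)         # indistinguishable solutions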
The factor analysis model (1) can be represented graphically as in Figure XXX7. The common
factors, at the bottom of the graph, each contribute to each of the manifest variables.
Additional, specific components (the ui) also contribute to the manifest variables.
Figure XXX7: A graphical representation of a factor analysis model.
[Diagram: common factors F1, …, Fk each linked to each of the manifest variables X1, X2, …, Xp; each Xi also receives its own specific component Ui.]
This representation immediately shows us how such models may be generalised. Note, in
particular, that there are no links between the factors. These can easily be added to the
graph, producing a different model, the parameters of which can again be estimated. (Again
we do not go into details here - estimation will be handled by the software package.)
(PADHRAIC - you may have described iteratively reweighted least squares methods for
ML estimation in your chapters 6a, 6b) Adding links between factors means that the
factors are correlated.
The basic factor analysis model is straightforward enough, but as additional complexities are
introduced, so additional problems can arise. In particular, models can become
non-identified. The issue is the same as that with mixture distributions discussed above: two
different models, parameterised in different ways, lead to empirically observable structures
which are indistinguishable. In this case one cannot choose between the models and,
perhaps even worse, there may be estimation problems.
The basic factor model in (1) permitted all factors to load on all variables. There were no
restrictions imposed. This is entirely appropriate in an exploratory context, where one is
seeking to discover the structure in the data. In other situations, however, one has some idea
of what structure may exist. For example, one might believe, a priori, that there are two
factors, one of which loads on some of the variables, and the other on a disjoint set of
variables. In this case one would want to restrict certain factor loadings to zero at the start.
Such confirmatory factor analysis can be a very powerful tool.
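One simple way to sketch a confirmatory fit (a least squares version, for illustration only; in practice a dedicated package and maximum likelihood estimation would normally be used) is to fix the excluded loadings at zero and minimise the discrepancy between the sample and implied covariance matrices over the remaining free parameters:

import numpy as np
from scipy.optimize import minimize

def confirmatory_fa(S, pattern):
    # S: (p, p) sample covariance matrix.
    # pattern: boolean (p, k) array, True wherever a loading is left free;
    # loadings fixed at zero by the hypothesised structure stay at zero.
    p, k = pattern.shape
    n_free = int(pattern.sum())

    def unpack(theta):
        Lambda = np.zeros((p, k))
        Lambda[pattern] = theta[:n_free]
        Psi = np.diag(theta[n_free:] ** 2)   # squared to keep variances non-negative
        return Lambda, Psi

    def discrepancy(theta):
        Lambda, Psi = unpack(theta)
        return np.sum((S - Lambda @ Lambda.T - Psi) ** 2)

    theta0 = np.concatenate([0.5 * np.ones(n_free), np.ones(p)])
    return unpack(minimize(discrepancy, theta0).x)

In the example of Display 8a.9 below, pattern would have True entries only where the alcohol, cannabis, and hard drug factors are permitted non-zero loadings.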
 DISPLAY 8a.9 XXX
Dunn, Everitt, and Pickles (1993) carried out a confirmatory factor analysis of the data on patterns of
consumption of legal and illegal psychoactive substances described in Chapter 7. They postulated a three-
factor model for the observed data. Their three factors were:
Alcohol use: in which the model permitted non-zero loadings on beer, wine, liquor, and cigarettes.
Cannabis use: in which the model permitted non-zero loadings on marijuana, hashish, cigarettes, and
wine.
Hard drug use: in which the model permitted non-zero loadings on amphetamines, tranquillizers,
hallucinogenics, hashish, cocaine, heroin, drug-store medication, inhalants, and liquor.
If we assume that the factors are independent, we see that this model will imply zero correlations
between certain variables. For example, marijuana and beer consumption have no factor in common,
and hence will be predicted to have zero correlation. In fact the empirical correlation matrix shows them
to have a correlation of 0.445. There are other pairs like this, so perhaps the model needs to be
modified. Dunn et al. suggest permitting non-zero correlations between the three factors.

The factor analysis model of Figure XXX7 shows how the latent factors are related to
the observed, manifest variables, and hence permits us to estimate characteristics of the latent
variables. But this principle can be applied more widely. In many situations no
measurements are made without error. This suggests that any theory relating the concepts
which we are trying to measure is really a theory relating latent variables. We can set up a
system of equations showing the relationships between these latent variables (that is what our
‘theory’ is), but we then need another system showing how these latent variables are related
to the things we actually measure. The system describing the relationships between the
latent variables (the theoretical constructs) is called the structural model, since it describes
the structure of our model. The system describing the relationships between the latent
variables and the things we actually measure is the measurement model. Separate systems of
equations are set up for the two models, the two systems being interlocked via the latent variables. The term
LISREL, standing for linear structural relational models, is used to describe the overall model
which results when the measurement and structural models are put together (and also for a
particular software package for fitting such models). These models can be extremely
complicated, since the theoretical relationships can be complicated. An example is the
MIMIC model, an acronym for multiple indicator - multiple cause, in which it is postulated
that several causes influence some latent construct, and that the effects are observed on
several indicators. Models of change over time can also be developed using structural
relational models. Here variables are measured at repeated times and the models show the
links between the times.
The theory of structural equation models has been developed with multivariate normal
distributions in mind (it is, after all, based on second order statistics). If the distributions
deviate substantially from normal distributions, then the results should be interpreted with
caution - prior transformation of the data is to be recommended, if this is possible.
Alternatively, one can estimate the parameters using standard approaches, but then estimate
their standard errors using more advanced techniques such as bootstrap methods (since it is
typically the measures of accuracy of the estimates which are more susceptible to departures
from normality).
8a.4.3 Bayesian belief networks
Recursive regression models have an attractive conditional independence interpretation
(provided the residuals are uncorrelated). That is, the absence of an edge between two nodes
of the graph means that the variables represented by those nodes are independent, conditional
on the other variables in the graph. Unfortunately, this property does not hold for some of
the more sophisticated extensions of linear structural relational models. However, since the
property is so attractive, it has served as the motivating force for a different class of models,
which have attracted much interest in recent years. These models are based on reducing the
overall multivariate probability distribution by representing (or approximating) it in terms of
interlocking distributions each involving only a few variables. Obviously this is equivalent
to assuming that some variables are conditionally independent (given the other variables in
the graph), so that a simplified model results. They were developed in parallel for
multivariate normal data and for categorical data (where they can be regarded as providing
convenient representations of log-linear models).
Such models essentially decompose the joint distribution of all of the variables in a way
which leads to simplification. Consider the example involving five variables, U, V, W, X,
and Y, illustrated in Figure XXX9. The complete joint distribution is f(U, V, W, X, Y), but
judicious rearrangement leads to the factorisation
f(U, V, W, X, Y) = f(Y|W, X) f(W|U, V) f(X) f(U) f(V)
Such decompositions are perfectly general, and may be made whatever the nature of the
distribution. In the case of multivariate normal distributions, each factor is also a normal
distribution. In the case of categorical data the overall distribution, involving the
cross-classification of all of the variables (and hence perhaps with a vast number of cells,
making estimation of all of the probabilities infeasible) is replaced by a product of terms, each
of which involves only a few variables. In the above example, the five-way table of
probabilities is replaced by two three-way tables (f(Y|W, X) and f(W|U, V)) and three one-way
tables. In this example the improvement is not dramatic, but in real problems, which may
involve hundreds of variables, the improvement may be from the completely impossible to
the relatively straightforward.
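As a toy illustration (all the probability values below are invented), the decomposition for five binary variables can be coded directly: the full five-way table is never formed, yet any joint probability can be computed from the two three-way conditional tables and the three one-way tables.

import numpy as np

p_u = np.array([0.7, 0.3])                     # f(U)
p_v = np.array([0.6, 0.4])                     # f(V)
p_x = np.array([0.5, 0.5])                     # f(X)
p_w_given_uv = np.array([[[0.9, 0.1],          # f(W | U, V), indexed [u, v, w]
                          [0.7, 0.3]],
                         [[0.4, 0.6],
                          [0.2, 0.8]]])
p_y_given_wx = np.array([[[0.8, 0.2],          # f(Y | W, X), indexed [w, x, y]
                          [0.5, 0.5]],
                         [[0.3, 0.7],
                          [0.1, 0.9]]])

def joint(u, v, w, x, y):
    # Joint probability of one configuration under the factorisation
    # f(U,V,W,X,Y) = f(Y|W,X) f(W|U,V) f(X) f(U) f(V).
    return p_u[u] * p_v[v] * p_x[x] * p_w_given_uv[u, v, w] * p_y_given_wx[w, x, y]

# The probabilities of all 32 configurations still sum to one.
total = sum(joint(u, v, w, x, y)
            for u in (0, 1) for v in (0, 1) for w in (0, 1)
            for x in (0, 1) for y in (0, 1))
assert np.isclose(total, 1.0)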
Figure XXX9: A simple example of a conditional independence graphical model.
[Diagram: a directed graph in which U and V are the parents of W, and W and X are the parents of Y.]
 DISPLAY 8a.10 XXX
Figure XXX8 shows the graphical model derived from a sample of 23,216 applicants for unsecured
personal loans, using data kindly provided by a major UK bank. The graph shows the relationships
between the variables:
LNAMT: the total amount of the loan, in £;
MRTST: the marital status of the applicant, coded as married, divorced, single, widowed, or
separated;
RESCODE: housing tenure status of the applicant, coded as home-owner, living with parents, tenant,
or other;
DISPINC: disposable monthly income in £, calculated from data on the applicant’s income, the
income of the applicant’s spouse, other income, monthly mortgage repayments, and other
repayments;
CUSTAGE: the applicant’s age in years;
INSIND: a binary indicator of whether the applicant has taken out the loan protection insurance sold
by the bank alongside the loan;
BANKIND: a binary indicator of whether the applicant has a current account with the bank providing
the loan;
CARDDEL: a binary indicator of whether the applicant’s credit card account with the bank has ever
been delinquent;
DEL: a binary indicator of whether the loan account was categorized as ‘bad’ using the bank’s
standard definition.
This model is quite complicated, in that relatively few of the possible edges between pairs of variables
have been deleted. (That is, relatively few pairs of variables are conditionally independent, given the
other variables in the model.) However, despite this, the model is revealing. It shows, for example,
that the two ‘delinquency’ variables, CARDDEL and DEL, are conditionally independent of LNAMT,
MRTST, and RESCODE, given the other four variables. Put another way, once we know the disposable
income, age, and whether or not the customer has a current account with the bank and took out the
loan protection insurance, then information on the size of the loan, the applicant’s marital status, and the nature of
their living accommodation is irrelevant to their delinquency risk category. This separation of two sets
of nodes {CARDDEL, DEL} and {LNAMT, MRTST, RESCODE} by a third set {DISPINC,
CUSTAGE, BANKIND, INSIND}, in the sense that all paths from the first set to the second pass
through the third, is termed the Markov property. There are various versions of this property, including
versions for directed graphs.

Figure XXX8: An undirected model for bank loans.
[Diagram: an undirected graph on the nodes LNAMT, MRTST, RESCODE, DISPINC, CUSTAGE, INSIND, BANKIND, CARDDEL, and DEL.]
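The separation property described in Display 8a.10 is easy to check mechanically once the adjacency structure of a graph such as Figure XXX8 is available. The sketch below (a generic routine, not tied to any particular package) tests whether a set Z separates node sets A and B in an undirected graph by removing Z and looking for any remaining path:

from collections import deque

def separates(adjacency, A, B, Z):
    # adjacency: dict mapping each node to the set of its neighbours.
    # Returns True if every path from A to B passes through Z.
    blocked = set(Z)
    seen = set(A) - blocked
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        if node in B:
            return False                      # found a path that avoids Z
        for neighbour in adjacency[node] - blocked:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return True

With the adjacency sets of Figure XXX8, separates(adjacency, {'CARDDEL', 'DEL'}, {'LNAMT', 'MRTST', 'RESCODE'}, {'DISPINC', 'CUSTAGE', 'BANKIND', 'INSIND'}) would return True, reproducing the statement in the display.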
The multivariate normal distribution has the useful property that two variables are
conditionally independent, given the other variables, if and only if the corresponding element
of the inverse of the covariance matrix is zero. This means that the inverse covariance
matrix reveals the pattern of relationships between the variables. (Or, at least, it does in
principle: in fact, of course, it will be necessary to decide whether a small value in the inverse
covariance matrix is sufficiently small to be regarded as zero.) It also means that we can
draw a graph in which there are no edges linking the nodes corresponding to variables which
have a small value in this inverse matrix.
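In code this amounts to inverting the sample covariance matrix and thresholding the standardised off-diagonal entries (the partial correlations). The sketch below is illustrative only: the threshold is arbitrary, and in practice one would use a formal test or a penalised estimator such as the graphical lasso.

import numpy as np

def candidate_edges(X, threshold=0.05):
    # X: (n, p) data matrix. Returns the pairs (i, j) whose partial correlation
    # exceeds the threshold in absolute value; all other pairs are treated as
    # conditionally independent given the remaining variables.
    precision = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(precision))
    partial_corr = -precision / np.outer(d, d)
    p = X.shape[1]
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(partial_corr[i, j]) > threshold]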
There is an important difference between the type of graph illustrated in Figure XXX8 and
the ones described in the preceding subsections: this new graph is undirected.
There are no natural directions to the edges linking nodes (so they are denoted by mere lines,
instead of arrows). Of course, if we have external information about the direction of
causation (as was assumed in the case of path analysis, for example) then we might be able to
use arrows, though this does lead to certain complications. As far as directed graphs go, ones
without cycles (that is, ‘acyclic’ directed graphs) are the most important. A cycle is a loop of
nodes and directed edges such that one is led back to the start: U → V → W → X → U. Acyclic
directed graphs can be converted into univariate recursive regression graphs (though there
may be more than one way to do this, and if there is then they correspond to different
statistical models) and we have already seen that such models are convenient. We also
referred to a ‘Markov’ property for undirected graphs in Display 8a.10XXX. A similar
property exists for directed graphs, though it is slightly more complicated. A set of variables
X is independent of a set of variables Y, given a set Z, if (i) there are no directed paths
between X and Y lying wholly outside Z, (ii) there are no pairs of directed paths lying
outside Z, one leading to X and one leading to Y, both starting from a common node W
outside Z, and (iii) no node which has multiple parents on a path connecting X and Y either
lies in Z or has a descendant in Z.
Construction of conditional independence models involves essentially two steps. The first is
the construction of the topology, and the second is the determination of the conditional
distributions.
Padhraic, I was not sure how much we wanted to go into this, since there are many
different approaches, and it will depend on what package the user adopts. What is your
view?
Padhraic: We need an example of a directed graph needed here. I couldn’t find anything
convincing in the papers from the last two KDD meetings. I have one example which is
the continuation of the display example on loans above, but it’s very complicated (I think
too complicated for this chapter). Do you have anything suitable? As a last resort we
could use a toy example, like the one in Heckerman’s ‘Data Mining and Knowledge
Discovery’ paper.
Discussion
Mixture models and factor analytic models are examples of latent structure models (we said
at the start of the chapter that there were relationships between the various kinds of models).
We are trying to explain the observed pattern of data in terms of a postulated latent or
underlying structure. There are also other forms of latent structure model. For example,
latent class analysis deals with the situation in which both the observed and the postulated
latent variables are categorical, latent trait analysis the situation when the observed
variables are categorical but the latent ones are continuous, and latent profile analysis the
situation when the observed variables are continuous but the latent ones are categorical.
Padhraic, if you do decide to add something to this chapter about methods which make
Markov assumptions, then I would like to add a paragraph here about variance
components, multilevel (hierarchical) models, and repeated measures data. I didn’t
include it in the above (could easily have had separate sections) simply because we had to
stop somewhere and I’m not sure how important such methods are in the context of data
mining. (I agree they are very important in general.)
Further reading
Books on mixture distributions include Everitt and Hand (1981), Titterington, Smith, and
Makov (1985), and McLachlan and Basford (1988). Mixture models in haematology are
described in McLaren (1996). Studies of tests for numbers of components of mixture
models are described in Everitt (1981), McLachlan (1987), and Mendell, Finch, and Thode
(1993).
There are now many books on cluster analysis. Recommended ones include Anderberg
(1973), Späth (1985), and Kaufman and Rousseeuw (1990). The distinction between
dissection and finding natural partitions is not always appreciated, and yet it can be an
important one and it should not be ignored. Examples of authors who have made the
distinction include Kendall (1980), Gordon (1981), and Späth (1985). Banfield and Raftery
(1993) proposed the idea of adding a ‘cluster’ which pervades the whole space by
superimposing a separate Poisson process which generated a low level of random points
throughout the entire space, so easing the problem of clusters being distorted due to a handful
of outlying points. Marriott (1971) showed that the criterion c²|W| was asymptotically
constant for optimal partitions of a multivariate uniform distribution. Krzanowski and
Marriott (1995), Table 10.6, gives a list of updating formulae for clustering criteria based on
W. Maximal predictive classification was developed by Gower (1974). The use of branch
and bound to extend the range of exhaustive evaluation of all possible clusterings is described
in Koontz, Narendra, and Fukunaga (1975) and Hand (1981). The k-means algorithm is
described in MacQueen (1967) and the ISODATA algorithm is described in Hall and Ball
(1965). Kaufman and Rousseeuw (1990) describe a variant in which the ‘central point’ of
each cluster is an element of that cluster, rather than being the centroid of the elements. A
review of early work on mathematical programming methods applied in cluster analysis is
given by Rao (1971). For a cluster analytic study of whiskies, see Lapointe and Legendre
(1994). One of the earliest references to the single link method of cluster analysis was
Florek et al (1951), and Sibson (1973) was important in promoting the idea. Lance and
Williams (1967) presented a general formula, useful for computational purposes, which
included single link and complete link as special cases. The median method of cluster
analysis is due to Gower (1967). Lambert and Williams (1966) describe the ‘association
analysis’ method of monothetic divisive partitioning. The polythetic divisive method of
clustering described in Section 8a.3.2.2 is due to MacNaughton-Smith et al (1964). The
fuzzy clustering criterion described in Section 8a.3 is due to Bezdek (1974) and Dunn (1974),
and a review of fuzzy clustering algorithms is given in Bezdek (1987). The overlapping
cluster method summarised in Section 8a.3 is due to Shepard and Arabie (1979).
A formal description of path analysis is given in Wermuth (1980). Recent books on factor
analysis include Bartholomew (1987), Reyment and Jöreskog (1993), and Basilevsky (1994).
Bollen (1989) is a recommended book on structural equation models. Books on graphical
models include those by Whittaker (1990), Edwards (1995), Cox and Wermuth (1996), and
Lauritzen (1996). The bank loan model described in Display 8a.10XXX is developed in
more detail in Hand, McConway, and Stanghellini (1997).
References
Anderberg M.R. (1973) Cluster Analysis for Applications. New York: Academic Press.
Azzalini A. and Bowman A.W. (1990) A look at some data on the Old Faithful geyser.
Applied Statistics, 39, 357-365.
Banfield J.D. and Raftery A.E. (1993) Model-based Gaussian and non-Gaussian clustering.
Biometrics, 49, 803-821.
Bartholomew D.J. (1987) Latent Variable Models and Factor Analysis. London: Charles
Griffin and Co.
Basilevsky A. (1994) Statistical Factor Analysis and Related Methods. New York: Wiley.
Bezdek J.C. (1974) Numerical taxonomy with fuzzy sets. Journal of Mathematical Biology,
1, 57-71.
Bezdek J.C. (1987) Some non-standard clustering algorithms. In Developments in Numerical
Ecology, ed. P. Legendre and L. Legendre, Berlin: Springer-Verlag, 225-287.
Bollen K.A. (1989) Structural Equations with Latent Variables. New York: Wiley.
Cox D.R. and Wermuth N. (1996) Multivariate Dependencies: Models, Analysis, and
Interpretation. London: Chapman and Hall.
Dunn J.C. (1974) A fuzzy relative of the ISODATA process and its use in detecting compact
well-separated clusters. Journal of Cybernetics, 3, 32-57.
Edwards D. (1995) Introduction to Graphical Modelling. New York: Springer Verlag.
Everitt B.S. (1981) A Monte Carlo investigation of the likelihood ratio test for the number of
components in a mixture of normal distributions. Multivariate Behavioural Research, 16,
171-180.
Everitt B.S. and Hand D.J. (1981) Finite Mixture Distributions. London: Chapman and Hall.
Florek K., Lukasziwicz J., Perkal J., Steinhaus H., and Zubrzycki S. (1951) Sur la liaison et la
division des points d’un ensemble fini. Colloquium Mathematicum, 2, 282-285.
Gordon A. (1981) Classification: Methods for the Exploratory Analysis of Multivariate Data.
London: Chapman and Hall.
Gower J.C. (1967) A comparison of some methods of cluster analysis. Biometrics, 23, 623-628.
Gower J.C. (1974) Maximal predictive classification. Biometrics, 30, 643-654.
Hall D.J. and Ball G.B. (1965) ISODATA: a novel method of cluster analysis and pattern
classification. Technical Report, Stanford Research Institute, Menlo Park, California.
Hand D.J. (1981) Discrimination and Classification. Chichester: Wiley.
Hand D.J., McConway K.J., and Stanghellini E. (1997) Graphical models of applicants for
credit. IMA Journal of Mathematics Applied in Business and Industry, 8, 143-155.
Sibson R. (1973) SLINK: an optimally efficient algorithm for the single link method.
Computer Journal, 16, 30-34.
Kaufman L. and Rousseeuw P.J. (1990) Finding Groups in Data: An Introduction to Cluster
Analysis. New York: Wiley.
Kendall M.G. (1980) Multivariate Analysis. (2nd ed.) London: Griffin.
Koontz W.L.G., Narendra P.M., and Fukunaga K. (1975) A branch and bound clustering
algorithm. IEEE Transactions on Computers, 24, 908-915.
Krzanowski W.J. and Marriott F.H.C. (1995) Multivariate Analysis vol.2: Classification,
Covariance Structures, and Repeated Measurements. London: Arnold.
Lambert J.M. and Williams W.T. (1966) Multivariate methods in plant ecology IV:
comparison of information analysis and association analysis. Journal of Ecology, 54,
635-664.
Lance G.N. and Williams W.T. (1967) A general theory of classificatory sorting strategies: 1.
Hierarchical systems. Computer Journal, 9, 373-380.
Lapointe F.-J. and Legendre P. (1994) A classification of pure malt Scotch whiskies. Applied
Statistics, 43, 237-257.
Lauritzen S.L. (1996) Graphical Models. Oxford: Clarendon Press.
McLachlan G.J. (1987) On bootstrapping the likelihood ratio test for the number of
components in a normal mixture. Applied Statistics, 36, 318-324.
McLachlan G.J. and Basford K.E. (1988) Mixture Models: Inference and Applications to
Clustering. New York: Marcel Dekker.
McLaren C.E. (1996) Mixture models in haematology: a series of case studies. Statistical
Methods in Medical Research, 5, 129-153.
MacNaughton-Smith P., Williams W.T., Dale M.B., and Mockett L.G. (1964) Dissimilarity
analysis. Nature, 202, 1034-1035.
MacQueen J. (1967) Some methods for classification and analysis of multivariate
observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and
Probability, ed. L.M.Le Cam and J. Neyman, 1, Berkeley, California: University of
California Press, 281-297.
Marriott F.H.C. (1971) Practical problems in a method of cluster analysis. Biometrics, 27,
501-514.
Mendell N.R., Finch S.J., and Thode H.C. (1993) Where is the likelihood ratio test powerful
for detecting two component normal mixtures? Biometrics, 49, 907-915.
Rao M.R. (1971) Cluster analysis and mathematical programming. Journal of the American
Statistical Association, 66, 622-626.
Reyment R. and Jöreskog K.G. (1993) Applied Factor Analysis in the Natural Sciences.
Cambridge: Cambridge University Press.
Shepard R.N. and Arabie P. (1979) Additive clustering: representation of similarities as
combinations of discrete overlapping properties. Psychological Review, 86, 87-123.
Späth H. (1985) Cluster Analysis and Dissection. Chichester: Ellis Horwood.
Titterington D.M., Smith A.F.M., and Makov U.E. (1985) Statistical Analysis of Finite
Mixture Distributions. New York: Wiley.
Wermuth N. (1980) Linear recursive equations, covariance selection and path analysis.
Journal of the American Statistical Association, 75, 963-972.
Whittaker J. (1990) Graphical Models in Applied Multivariate Statistics. Chichester: Wiley.
Wright S. (1921) Correlation and causation. Journal of Agricultural Research, 20, 557-585.
Wright S. (1923) Theory of path coefficients: A reply to Niles’ criticism. Genetics, 8,
239-255.
Wright S. (1934) The method of path coefficients. Annals of Mathematical Statistics, 5,
161-215.