CHAPTER 5: PRINCIPLES OF MODEL FITTING
5.1 Introduction
The aim of this chapter is to outline the basic principles of model fitting. In doing so we will
introduce some important elementary model forms, leaving more advanced models which are
important in data mining to be described in Chapters 8 and 9. We begin by reminding you of
the distinction between a model and a pattern.
A model is a high level, global description of a data set. It takes a large sample perspective.
It may be descriptive - summarising the data in a database in a convenient and concise way - or it may be inferential, allowing one to make some statement about the population from
which the data were drawn or about likely future data values. Examples of models are
Box-Jenkins time series, Bayesian belief networks, and regression models.
In contrast, a pattern is a local feature of the data, perhaps holding for only a few records or
for a few variables (or both). Patterns are of interest because they represent departures from
the general run of the data: a pair of variables which have a particularly high correlation, a set
of items which have exceptionally high values on some variables, a group of records which
always score the same on some variables, and so on. As with models, one may want to find
patterns for descriptive reasons or for inferential reasons: one may want to identify members
of the existing database which have unusual properties, or one may want to predict which
future records are likely to have unusual properties. Examples of patterns are transient
waveforms in an EEG trace, a peculiarity of some types of customers in a database, and
outliers. Patterns are the primary focus of Chapter 7.
What precisely constitutes an interesting model or pattern depends entirely on the context.
Something striking, unusual, or worth knowing about in one context may be quite run of the
mill in another - and vice versa.
Although, of course, there is much overlap, methods of model building in data mining mainly
have their roots in statistics, while methods of pattern detection mainly have their roots in
machine learning. The word ‘roots’ here is important, since the use of the methods in data
mining contexts has introduced differences in emphasis - a key one arising from the large
sizes of the data sets typically dealt with in data mining contexts.
A very basic form of model is a single value summarising a set of data values. Take a
database in which one of the fields holds household income. If the database has a million
records (or, indeed, even a hundred values), we need to summarise these values in some way,
so that we can get an intellectual grasp on them. If the values are all identical, then
summarising them is easy. But what if they are all different, or, at least, come from a very
large set of possible values (as will typically be the case in real life for such a variable) so
that there are many different values in the database? What we need to do is replace the list
of a million values by one, or perhaps a few, numbers which describe the general shape of the
distribution of values. A popular summarising number is the average or mean. This is the
single number which is closest to those in the database in the sense that the sum of the
squared differences between this number and all the others is minimised.
Note what is going on here. To find our single summarising value we chose a criterion (we
used the sum of squared differences between the number which we would use as our
descriptive summary and the given data) and then found the number which optimised (here
minimised) this criterion. This is the essence of the model-fitting process (although, as we
shall see, this basic idea will need to be modified). We decide on the basic form of the
model (here a single summarising number), decide on the criterion by which we measure the
adequacy of the model (here a sum of squared differences between the single summarising
number and the data points), and find that model (here the single summarising number)
which optimises the criterion. Even in this simple example other criteria can be used. We
could, for example, have minimised the sum of absolute differences between the data and the
single summarising number (this would lead to the median of the data). Similarly (although
hardly relevant in so simple an example as this one) different algorithms might be used to
find that model which optimises the criterion.
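The point is easily checked numerically. The short Python sketch below is our own illustration (not part of the original text); it assumes the numpy library is available and uses a small set of made-up household incomes. It evaluates both criteria over a grid of candidate summary values and confirms that the sum of squared differences is minimised near the mean, while the sum of absolute differences is minimised near the median.

```python
import numpy as np

# A small sample of hypothetical household incomes.
incomes = np.array([12_000, 18_500, 22_000, 27_300, 31_000, 45_000, 250_000], dtype=float)

# Evaluate each criterion over a grid of candidate summary values.
candidates = np.linspace(incomes.min(), incomes.max(), 100_001)
sum_sq  = ((incomes[None, :] - candidates[:, None]) ** 2).sum(axis=1)
sum_abs = np.abs(incomes[None, :] - candidates[:, None]).sum(axis=1)

# The grid minimisers agree (up to grid resolution) with the mean and median.
print("minimiser of sum of squares :", candidates[sum_sq.argmin()],  "mean  :", incomes.mean())
print("minimiser of sum of abs diff:", candidates[sum_abs.argmin()], "median:", np.median(incomes))
```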
In general, numbers calculated from the data are called statistics. It is obvious that a single
summary statistic is of rather limited value. We can make comparative statements about the
mean values of datasets - the mean value of the income this year is greater than it was last
year, the mean value of income for this group of households is greater than for that group,
and so on - but it leaves a lot of questions unanswered. Certainly, on average this group has
a lower income than that group, but what about at the extremes? Is the spread of incomes
greater in this group than in that? Do the high earners in this group earn more than in that?
What is the most common value of household income? And so on. In view of this, it would
be nice to supplement the mean value with some further descriptive statistics. We might
choose to describe the variability of the data, and perhaps its skewness. We might also
choose to give the smallest and largest values, or perhaps the values of quartiles, or we might
choose to give the most common value (the mode), and so on. Technical definitions of these
and other summary statistics are given in Section 5.2.
Our model is now becoming more sophisticated. In place of the single mean value we have
several values which summarise the shape of the distribution of values in the database.
Note, however, that we are still simply summarising the values in the database. We are
describing the set of values which actually exist in the database in a convenient (and
approximate) way. This is sufficient if our interest lies in the items in the database and in no
others. In particular, it is of interest if we do not want to make any statements about future
values which may be added to the database, or about the values which may arise in future
databases (for example, the distribution of household incomes next year). Questions like
these new ones are not questions of description, but are questions of inference. We want to
infer likely values (or likely mean values or whatever) for data which might arise. We can
base our inference on the database we have to hand, but such inference goes beyond mere
description. Up until now our model has simply been a description, summarising given data.
To use a model for inference requires interpreting it in a rather different way. Now we think
of it as approximating an underlying structure from which the data arose. When thought of
in this way, it is not much of an intellectual leap to imagine further data arising from the same
underlying structure. And it is this which allows us to make statements about how likely
particular values are to be observed in the future. The process of going from the available
data to an approximating description of an underlying structure, rather than simply a
summarising description of the available data, is inference.
We might also, at this point, draw a further distinction between two different kinds of models
used in inference. Sometimes one has a good idea, based on theory, of the nature of the
model which might have led to the data. For example, one might have a theory of stellar
evolution which one believes should explain the observations on stars, or a theory of
pharmacokinetics explaining the observations on drug concentrations. In this case the model
is a mechanistic (or realistic, substantive, iconic or phenomenological) one. In contrast,
sometimes one has no theoretical basis for the model and simply constructs one based on the
relationships observed in the data. These are empirical (or operational) models. In data
mining, one will usually be concerned with empirical models. In any case, recall from the
discussion in Chapter 2 that, even in the case of mechanistic models, the modern view is that
no models are ‘true’. They are simply approximations to reality - the aim of science being to
obtain better and better approximations.
When making an inferential statement, one can never be totally sure that one is right (if one
can be sure then it is hardly inference!). Thus it is often useful to add a measure of the
uncertainty surrounding one’s conclusions. In fact, sometimes, rather than giving a single
number for an inferred value, one will give an interval estimate - a range of values which one
is fairly confident contains the underlying value. Interpretation of such estimates depends on
the type of inference one is carrying out. We shall say more about interval estimates in
Section 5.3.
Inferential models are often more elaborate than descriptive models. This is hardly
surprising, since their essential use is to make statements about other data which might be
observed. In particular, this means that, where a descriptive model will be happy to stop
with simple numerical summaries, inferential models have to say more. This is typically
achieved by means of a probability model for the data. Thus one might assume that the data
have arisen from a particular form of probability distribution, and that future data will also
arise from this distribution. In this case one will use the available data to infer the
parameters describing that distribution - subsequently permitting statements to be made about
potential new data.
Section 5.2 opens by reminding the reader of the meaning of some of the more elementary
descriptive statistical tools. These are elementary in the sense that they are basic tools, but
also in the sense that they are the elements from which much else is built. Description,
however, is but one kind of data mining aim. Another is inference - to draw some
conclusion, on the basis of the available data, about other data which might have been
collected or which might be collected in the future. Section 5.3 discusses this, in the context
of estimation - determining good values for parameters describing distributions. Properties
of estimators which make them attractive are discussed, and maximum likelihood and
Bayesian estimation techniques are outlined. Hypothesis tests have traditionally played a
large role in data analysis, but the unique problems of data mining require us to examine
them afresh, raising new issues which were not critical when only small data sets were analysed
and limited data exploration was carried out. Hypothesis tests are the focus of Section 5.4. Simple
parameter estimation is all very well, but with large data sets and fast computers there comes
the possibility of fitting very complex models with many (hundreds or even thousands) of
parameters. Although traditional statistics and data analysis have developed understanding
of the problems that will occur in these circumstances, data mining is giving more urgency to
them. Model fitting is seen not to be simply an issue of finding that structure which provides
a good match to the data - as discussed in Section 5.5. Section 5.6 rounds off by providing
brief descriptions of some of the more important statistical distributions which underlie
model building in data mining. Section 5.7 gives some pointers to further reading.
5.2 Summarising data: some simple examples
We have already described the mean as a simple summary of the ‘average’ value of a
collection of values. As noted above, it is the value which is ‘central’ in the sense that it
minimises the sum of squared differences between it and the data values. It is computed by
dividing the sum of the data values by their number (see Display 5.1). Thus, if there are n
data values, the mean is the value such that the sum of n copies of it equals the sum of the
data values.
The mean is a measure of location. Another important measure of location is the median,
which is the value which has an equal number of data points above and below it. (This is
easy to find if n is an odd number; when n is even the median is usually defined as halfway
between the two middle values.)
Another measure of location is the mode - the most common value of the data. Sometimes
distributions have more than one mode - and then they are multimodal.
Other measures of location focus on other parts of the distribution of data values. The first
quartile is that value such that a quarter of the data points lie below it. The third quartile has
three quarters below it. We leave it to you to discover why we have not mentioned the
second quartile. Likewise, deciles and percentiles are sometimes used.
Various measures of dispersion or variability are also common. They include the standard
deviation and its square, the variance. The variance is defined as the average of the squared
differences between the mean and the individual data values. (So we see that, since the
mean minimises the sum of these squared differences, there is a close link between the mean
and the variance.) The interquartile range is common in some applications, defined as the
difference between the third and first quartile. The range itself is the difference between the
largest and smallest data point.
Skewness measures whether or not a distribution has a single long tail. For example, the
distribution of people's incomes typically shows the vast mass of people earning small to
moderate values, and just a few people earning large sums, tailing off to the very very few
who earn astronomically large sums - the Bill Gateses of the world. A distribution is said to
be right-skewed if the long tail is in the direction of increasing values and left-skewed
otherwise. Right-skewed distributions are more common. Symmetric distributions have
zero skewness.
DISPLAY 5.1
Suppose that $x_1, \ldots, x_n$ are a set of n data values.
The mean is
$$\bar{x} = \sum_i x_i / n.$$
The variance is
$$\sum_i (x_i - \bar{x})^2 / n,$$
although, for reasons explained in Display 5.4, it is sometimes calculated as $\sum_i (x_i - \bar{x})^2 / (n-1)$.
The standard deviation is
$$\left(\sum_i (x_i - \bar{x})^2 / n\right)^{1/2}.$$
A common measure of skewness is
$$\sum_i (x_i - \bar{x})^3 \Big/ \left(\sum_i (x_i - \bar{x})^2\right)^{3/2}.$$
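As a sketch of how these summaries might be computed in practice (our own illustration, using the definitions in Display 5.1 and assuming the numpy library), consider:

```python
import numpy as np

def summaries(x):
    """Summary statistics as defined in Display 5.1 for a one-dimensional sample x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean = x.sum() / n
    dev = x - mean
    variance = (dev ** 2).sum() / n          # divide by n - 1 for the unbiased version (Display 5.4)
    std_dev = np.sqrt(variance)
    skewness = (dev ** 3).sum() / ((dev ** 2).sum()) ** 1.5
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return dict(mean=mean, variance=variance, std=std_dev, skew=skewness,
                median=median, q1=q1, q3=q3, iqr=q3 - q1, range=x.max() - x.min())

print(summaries([1, 2, 2, 3, 4, 7, 9, 25]))   # a small right-skewed sample
```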
Of course, data summaries do not have to be single numbers, and they do not have to be
simple. In Section 1.2 we noted the distinction between clustering and segmentation. Both
can be used to provide summaries of a database. In the first case one will be seeking a
concise and accurate description of the data, finding naturally occurring groupings in the data
to yield this. Thus a cluster analysis of automobiles may reveal that there are luxury
vehicles, small family saloons, compact hatchbacks, and so on (depending, of course, on the
variables used to describe the vehicles). The result will be a convenient summary of a large
mass of data into a handful of simple types. Of course, accuracy is lost by this process.
There will doubtless be vehicles which lie on the edges of the divisions. But this is more
than compensated for by the reduction in complexity of the description, from the entire list of
data to a handful of types. Applying a segmentation strategy to the same database will also
yield a summarising description, although now the summary will be in terms which are
convenient for some external reasons and may not reflect natural distinctions between the
types of vehicle. However, the segmentation will also lead to a concise summary of the data.
In these examples, the models are both summarising the data in the database - they are not
inferential, though such models can also be inferential. (Cluster analysis has been widely
used in psychiatry in an effort to characterise different kinds of mental illness. Here the aim
is clearly not simply to summarise the data which happens to be to hand, but rather it is to
make some kind of general statement about the types of disease - a fundamentally inferential
statement.) So, in the automobile example, the models are descriptive because they are
being used to summarise the available data. However, they are also descriptive in the sense
introduced in Chapter 1 - they simply aim to describe the structure implicit in the data,
without any intention to be able subsequently to predict values of some variables from others.
Other kinds of models may be predictive in this sense, while still being descriptive in the
sense that they summarise data and there is no intention of going beyond the summary data.
One might, for example, seek to summarise the automobile data in terms of relationships
between the variables recorded in those data. Perhaps weight and engine size are highly
correlated, so that a good description is achieved by noting only one of these, along with the
correlation between them.
5.3 Estimation
The preceding section described some ways of summarising a given set of data. When we
are concerned with inference, however, we want to make more general statements, statements
about the entire population of values which might have been drawn. That is, we want to
make statements about the probability mass function or probability density function (or,
equivalently, about the cumulative distribution function) from which the data arose. The
probability mass function gives the probabilities that each value of the random variable will
occur. The probability density function gives the probability that a value will occur in each
infinitesimal interval. Informally, one can think of these functions as showing the
distribution of values one would have obtained if one took an infinitely large sample. More
formal descriptions are given in texts on mathematical statistics. Thus the same principles
apply to calculating descriptive measures of probability mass functions and probability
density functions as they do to calculating descriptive measures of finite samples (although,
in the case of density functions, integrals rather than summations are involved). In particular,
we can calculate means, medians, variances, and so on. In this context there is a
conventional notation for the mean or, as it is also known, the expected value of a
distribution: the expected value of a random variable x is denoted E(x).
5.3.1 Desirable properties of estimators
In the following subsections we describe the two most important modern methods of
estimating the parameters of a model. There are also others. Different methods, naturally
enough, behave differently, and it is important to be aware of the properties of the different
methods so that one can adopt a method suited to the problem. Here we briefly describe
some attractive properties of estimators. Let $\tilde{\theta}$ be an estimator of a parameter $\theta$. Since
$\tilde{\theta}$ is a number derived from the data, if we were to draw a different sample of data, we
would obtain a different value for $\tilde{\theta}$. $\tilde{\theta}$ is thus a random variable. That means that it has
a distribution - with different values in the distribution arising as different samples are drawn.
Since it has a distribution of values, we can obtain descriptive summaries of that distribution.
It will, for example, have a mean or expected value, $E(\tilde{\theta})$.
The bias of $\tilde{\theta}$ is $E(\tilde{\theta}) - \theta$, the difference between the expected value of the estimator and
the true value of the parameter. Estimators for which $E(\tilde{\theta}) - \theta = 0$ are said to be unbiased.
Such estimators show no systematic departure from the true parameter value.
Just as the bias of an estimator can be used as a measure of its quality, so also can its
variance: $\mathrm{Var}(\tilde{\theta}) = E\bigl(\tilde{\theta} - E(\tilde{\theta})\bigr)^2$. One can choose between estimators which have the same
bias by using their variance. Estimators which have minimum variance for given bias are
called, unsurprisingly, best unbiased estimators.
The mean squared error of $\tilde{\theta}$ is $E(\tilde{\theta} - \theta)^2$. That is, it is the averaged squared difference
between the value of the estimator and the value of the parameter. Mean squared error has a
natural decomposition as
$$E(\tilde{\theta} - \theta)^2 = \bigl(E(\tilde{\theta}) - \theta\bigr)^2 + E\bigl(\tilde{\theta} - E(\tilde{\theta})\bigr)^2 = \mathrm{Bias}^2(\tilde{\theta}) + \mathrm{Var}(\tilde{\theta}),$$
the sum of the squared bias of $\tilde{\theta}$ and its variance. Mean squared error is an attractive
criterion since it includes both systematic (bias) and random (variance) differences between
the estimated and true values. Unfortunately, we often find that they work in different
directions: modifying an estimator to reduce its bias leads to one with larger variance, and
vice versa, and the trick is to arrive at the best compromise. Sometimes this is achieved by
deliberately introducing bias into unbiased estimators. An example of this decomposition of
mean squared error is given in Display 5.10.
There are more subtle aspects to the use of mean squared error in estimation. For example, it
treats equally large departures from $\theta$ as equally serious, regardless of whether they are
above or below $\theta$. This is appropriate for measures of location, but perhaps not so
appropriate for measures of dispersion, which, by definition, are bounded below by zero, or
for estimates of probabilities or probability densities.
Suppose that we have a sequence of estimators, $\tilde{\theta}_1, \ldots, \tilde{\theta}_n$, based on increasing sample sizes.
Then the sequence is said to be consistent if the probability that the difference between $\tilde{\theta}_i$
and the true value $\theta$ exceeds any given value tends to 0 as the sample size increases.
This is clearly an attractive property (perhaps especially in data mining contexts, with large
samples) since it means that the larger the sample the closer the estimator is likely to be to the
true value.
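The following simulation is a sketch of our own (not from the text, and assuming the numpy library); it estimates the bias, variance and mean squared error of the two variance estimators discussed later in Display 5.4 (dividing by n and by n - 1), and confirms numerically that mean squared error decomposes into squared bias plus variance.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0                      # data drawn from N(0, 4), so sigma^2 = 4
n, n_samples = 10, 100_000

# Draw many samples of size n and compute both variance estimators on each.
data = rng.normal(0.0, np.sqrt(true_var), size=(n_samples, n))
mle      = data.var(axis=1, ddof=0)   # divide by n     (the maximum likelihood estimator)
unbiased = data.var(axis=1, ddof=1)   # divide by n - 1 (the unbiased estimator)

for name, est in [("MLE (1/n)", mle), ("unbiased (1/(n-1))", unbiased)]:
    bias = est.mean() - true_var
    var = est.var()
    mse = ((est - true_var) ** 2).mean()
    print(f"{name:20s} bias={bias:+.3f} variance={var:.3f} "
          f"mse={mse:.3f} bias^2+var={bias**2 + var:.3f}")
```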
5.3.2 Maximum likelihood estimation
Maximum likelihood estimation is the most important method of parameter estimation.
Given a set of n observations, $x_1, \ldots, x_n$, independently sampled from the same distribution
$f(x \mid \theta)$ (independently and identically distributed, or iid, as statisticians say), the likelihood
function is
$$L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta).$$
Recall that this is a function of the unknown parameter $\theta$, and that it shows the probability
that the data would have arisen under different values of this parameter. The value for which
the data has the highest probability of having arisen is the maximum likelihood estimator (or
MLE). This is the value which maximises L. Maximum likelihood estimators are often
denoted by putting a caret over the symbol for the parameter - here $\hat{\theta}$.
DISPLAY 5.2
Suppose we have assumed that our sample of n data points has arisen independently from a normal
distribution with unit variance and unknown mean $\mu$. (We agree that this may appear to be an unlikely
situation, but please bear with us for the sake of keeping the example simple!) Then the likelihood
function for $\mu$ is
$$L(\mu \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(x_i - \mu)^2\right) = (2\pi)^{-n/2} \exp\left(-\frac{1}{2}\sum_i (x_i - \mu)^2\right).$$
Setting the derivative $\frac{d}{d\mu} L(\mu \mid x_1, \ldots, x_n)$ to 0 yields $\sum_i (x_i - \mu) = 0$,
and hence $\hat{\mu} = \sum_i x_i / n$, the sample mean.
Maximum likelihood estimators are intuitively and mathematically attractive - for example,
they are consistent. Moreover, if $\hat{\theta}$ is the MLE of a parameter $\theta$, then $g(\hat{\theta})$ is the MLE
of the function $g(\theta)$. (Some care needs to be exercised if g is not a one-to-one function.)
On the other hand - nothing is perfect - maximum likelihood estimators are often biased - see
Display 5.4.
For simple problems (where ‘simple’ refers to the mathematical structure of the problem, and
not to the number of data points, which can be large) MLEs can be found using differential
calculus. See, for example, Display 5.2. In practice, the log of the likelihood is usually
maximised (e.g. Display 5.3) since this replaces the awkward product in the definition by a
sum; it leads to the same result as maximising L directly because log is a monotonic
transformation. For more complicated problems, especially those involving multiple
parameters, mathematical search methods such as steepest descent, genetic algorithms, or
simulated annealing must be used. Multiple maxima can be an especial problem (which is
precisely why stochastic optimisation methods are often necessary), as can situations where
optima occur at the boundaries of the parameter space.
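As an illustration of the numerical approach (our own sketch, not from the text, assuming the numpy and scipy libraries), the following maximises the log-likelihood for the simple normal-mean problem of Display 5.2 with a general-purpose optimiser; in this case the answer can be checked against the closed-form solution, the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.0, size=500)   # data from N(3, 1), unit variance as in Display 5.2

def neg_log_likelihood(mu):
    # Negative log-likelihood for N(mu, 1); optimisers conventionally minimise.
    # Constant terms not involving mu have been dropped.
    return 0.5 * np.sum((x - mu) ** 2)

result = minimize_scalar(neg_log_likelihood)
print("numerical MLE :", result.x)
print("sample mean   :", x.mean())             # the two should agree closely
```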
DISPLAY 5.3
Customers in a supermarket either purchase or do not purchase milk. Suppose we want an estimate of
the proportion purchasing milk, based on a sample of 1000 randomly drawn from the database. A
simple model here would be that the observations independently follow a Bernoulli distribution with
some unknown parameter p. The likelihood is then
$$L(p) = \prod_i p^{x_i}(1-p)^{1-x_i} = p^r (1-p)^{1000-r},$$
where $x_i$ takes the value 1 if the ith customer in the sample does purchase milk and 0 if he or she does
not, and r is the number amongst the 1000 who do purchase milk. Taking logs of this yields
$$\log L(p) = r \log p + (1000 - r)\log(1-p),$$
which is straightforward to differentiate.
DISPLAY 5.4
The maximum likelihood estimator of the variance $\sigma^2$ of a normal distribution $N(\mu, \sigma^2)$ is
$\frac{1}{n}\sum_i (x_i - \bar{x})^2$. However,
$$E\left[\frac{1}{n}\sum_i (x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2,$$
so that the estimator is biased. It is common to use the unbiased estimator $\frac{1}{n-1}\sum_i (x_i - \bar{x})^2$.
DISPLAY 5.5
Simple linear regression is widely used in data mining. This is discussed in detail in Chapter 9. In its
most simple form it relates two variables: x, a predictor or explanatory variable, and y, a response
variable. The relationship is assumed to take the form $y = a + bx + \varepsilon$, where a and b are parameters and
$\varepsilon$ is a random variable assumed to come from a normal distribution with mean 0 and variance $\sigma^2$.
The likelihood function for such a model is
$$L = \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}\bigl(y_i - (a + bx_i)\bigr)^2\right) = \bigl(2\pi\sigma^2\bigr)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_i \bigl(y_i - (a + bx_i)\bigr)^2\right).$$
To find the maximum likelihood estimators of a and b we can take logs, and discard terms which do not
involve a or b. This yields
$$l = -\sum_i \bigl(y_i - (a + bx_i)\bigr)^2.$$
That is, one can estimate a and b by finding those values which minimise the sum of squared differences
between the predicted values $(a + bx_i)$ and the observed values $(y_i)$. Such a procedure - minimising a
sum of squares - is ubiquitous in data mining, and goes under the name of the least squares method.
The sum of squares criterion is of great historical importance. We have already seen how the mean
results from minimising such a criterion, and it has roots going back to Gauss and beyond. Derived as
above, however, any concern one might feel about the apparent arbitrariness of the choice of a sum of
squares (why not a sum of absolute values, for example?) vanishes (or at least is shifted) - it arises
naturally from the choice of a normal distribution for the error term in the model.
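A minimal sketch (our own, with made-up data and assuming the numpy library) of estimating a and b by least squares, here using the familiar closed-form solution rather than a numerical search:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 1.5 + 0.8 * x + rng.normal(0, 1, size=200)   # y = a + b x + epsilon with a = 1.5, b = 0.8

# Least squares estimates: b = cov(x, y) / var(x), a = mean(y) - b * mean(x).
b_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a_hat = y.mean() - b_hat * x.mean()
print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")   # should be close to 1.5 and 0.8
```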
Up to this point we have been discussing point estimates, single number estimates of the
parameter in question. Such estimates are ‘best’ in some sense, but they convey no sense of
the uncertainty associated with them - perhaps the estimate was merely the best of a large
number of almost equally good estimates, or perhaps it was clearly the best. Interval
estimates provide this sort of information. In place of a single number, they give an estimate
of an interval which one can be confident, to some specified degree, contains the unknown
parameter. Such an interval is a confidence interval, and the upper and lower limits of the
interval are called, naturally enough, confidence limits. Interpretation of confidence
intervals is rather subtle. Since, here, we are assuming that $\theta$ is unknown but fixed, it does
not make sense to say that $\theta$ has a certain probability of lying within a given interval: it
either does or it does not. However, it does make sense to say that an interval calculated by
the given procedure contains $\theta$ with a certain probability: the interval, after all, is calculated
from the sample and is thus a random variable.
As an example (deliberately artificial to keep the explanation simple - in more complicated
and realistic situations the computation will be handled by your computer) suppose the data
consist of 100 independent observations from a normal distribution with unknown mean $\mu$
but known variance $\sigma^2$, and we want a 95% confidence interval for $\mu$. (This situation is
particularly unrealistic since, if the variance were known, the mean would almost certainly
also be known.) That is, we want to find a lower limit $L(x)$ and an upper limit $U(x)$ such that
$$P\bigl(\mu \in (L(x), U(x))\bigr) = 0.95.$$
The distribution of the sample mean $\bar{x}$ in this situation is known to follow a normal
distribution with mean $\mu$ and variance $\sigma^2/100$, and hence standard deviation $\sigma/10$. We
also know, from the properties of the normal distribution (see Section 5.6), that 95% of the
probability lies within 1.96 standard deviations of the mean. Hence
$$P\bigl(\mu - 1.96\,\sigma/10 \le \bar{x} \le \mu + 1.96\,\sigma/10\bigr) = 0.95.$$
This can be rewritten as
$$P\bigl(\bar{x} - 1.96\,\sigma/10 \le \mu \le \bar{x} + 1.96\,\sigma/10\bigr) = 0.95.$$
Thus $L(x) = \bar{x} - 1.96\,\sigma/10$ and $U(x) = \bar{x} + 1.96\,\sigma/10$ define a suitable 95% confidence interval.
The same principle can be followed to derive confidence intervals in situations involving
other distributions: we find a suitable interval for the statistic and invert it to find an interval
for the unknown parameter. As it happens, however, the above example has more
widespread applicability than might at first appear. The central limit theorem (see Section
5.6) tells us that the distribution of many statistics can be well approximated by a normal
distribution, especially if the sample size is large, as in typical data mining applications.
Thus one often sees confidence intervals calculated by the method above, even though the
statistic involved is much more complicated than the mean.
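The calculation above is easily reproduced in code. The sketch below is ours (simulated data, numpy assumed); it constructs the 95% interval for the example of 100 observations with known standard deviation sigma.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, mu_true, n = 5.0, 50.0, 100
x = rng.normal(mu_true, sigma, size=n)

x_bar = x.mean()
half_width = 1.96 * sigma / np.sqrt(n)       # 1.96 * sigma/10 when n = 100
lower, upper = x_bar - half_width, x_bar + half_width
print(f"95% confidence interval for the mean: ({lower:.2f}, {upper:.2f})")
# Repeating this with fresh samples, roughly 95% of such intervals would cover mu_true.
```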
5.3.3 Bayesian estimation
We briefly described the Bayesian perspective on inference in Chapter 2. Whereas in the
frequentist perspective a parameter $\theta$ is regarded as a fixed but unknown quantity,
Bayesians regard $\theta$ as having a distribution of possible values. Before the data are
analysed, the distribution of probabilities that $\theta$ will take different values is known as the
prior distribution, $p(\theta)$ - this is the prior belief the investigator has that the different values
of the parameter are the 'true' value. Analysis of the data leads to modification of this
distribution, to yield the posterior distribution, $p(\theta \mid x)$. This represents an updating of the
prior distribution, taking into account the information in the empirical data. The
modification from prior to posterior is carried out by means of a fundamental theorem named
after the Reverend Thomas Bayes - hence the name. Bayes' theorem states
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta)\, p(\theta)\, d\theta}. \qquad (5.1)$$
Note that this updating procedure has led to a distribution, and not a single value, for $\theta$.
However, the distribution can be used to yield a single value estimate. One could, for
example, take the mean of the posterior distribution, or its mode.
For a given set of data, the denominator in equation (5.1) is a constant, so we can
alternatively write the expression as
$$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta). \qquad (5.2)$$
In words, this says that the posterior distribution of $\theta$ given x (that is, the distribution
conditional on having observed the data x) is proportional to the product of the prior $p(\theta)$ and
the likelihood $p(x \mid \theta)$. Note that the structure of (5.1) and (5.2) means that the distribution
can be updated sequentially if the data arise independently,
$$p(\theta \mid x, y) \propto p(y \mid \theta)\, p(x \mid \theta)\, p(\theta),$$
a property which is attractive in data mining applications where very large data sets may not
fit into the main memory of the computer at any one time. Note also that this result is
independent of the order of the data (provided, of course, that x and y are independent).
The denominator in (5.1), $p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$, is called the predictive distribution of x,
and represents our predictions about the value of x. It includes our uncertainty about $\theta$, via
the prior $p(\theta)$, and our uncertainty about x when $\theta$ is known, via $p(x \mid \theta)$. The predictive
distribution will change as new data are observed and $p(\theta)$ becomes updated. The
predictive distribution can be useful for model checking: if observed data x has only a small
probability according to the predictive distribution, then one may wonder if that distribution
is correct.
DISPLAY 5.6
Suppose that we believe a single data point x comes from a normal distribution with unknown mean $\theta$
and known variance $\sigma^2$. That is, $x \sim N(\theta, \sigma^2)$. Suppose our prior distribution for $\theta$ is
$\theta \sim N(\theta_0, \sigma_0^2)$, with $\theta_0$ and $\sigma_0^2$ known. Then
$$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta) \propto \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \theta)^2\right) \cdot \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left(-\frac{1}{2\sigma_0^2}(\theta - \theta_0)^2\right)$$
$$\propto \exp\left(-\frac{1}{2}\bigl(\sigma^{-2} + \sigma_0^{-2}\bigr)\theta^2 + \theta\bigl(\sigma_0^{-2}\theta_0 + \sigma^{-2}x\bigr)\right).$$
The mathematics here looks horribly complicated (a fairly common occurrence with Bayesian methods)
but consider the following reparameterisation. Let
$$\sigma_1^{-2} = \sigma_0^{-2} + \sigma^{-2} \quad \text{and} \quad \mu_1 = \sigma_1^2\bigl(\sigma_0^{-2}\theta_0 + \sigma^{-2}x\bigr).$$
Then
$$p(\theta \mid x) \propto \exp\left(-\frac{1}{2\sigma_1^2}(\theta - \mu_1)^2\right).$$
Since this is a probability density function for $\theta$, it must integrate to unity. Hence
$$p(\theta \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left(-\frac{1}{2\sigma_1^2}(\theta - \mu_1)^2\right).$$
This is a normal distribution $N(\mu_1, \sigma_1^2)$. This means that the normal prior distribution has been
updated to yield a normal posterior distribution. In particular, this means that the complicated
mathematics can be avoided. Given a normal prior for the mean, and data arising from a normal
distribution as above, then we can obtain the posterior merely by computing the updated parameters.
Moreover the updating of the parameters is not as messy as it might at first seem. Reciprocals of
variances are called precisions. Here, $\sigma_1^{-2}$, the precision of the updated distribution, is simply the sum of
the precisions of the prior and the data distributions. This is perfectly reasonable: adding data to the
prior should decrease the variance, or increase the precision. Likewise, the updated mean, $\mu_1$, is simply
a weighted sum of the prior mean and the datum x, with weights which depend on the precisions of those
two values.
When there are n data points, with the same situation as above, the posterior is again normal, now with
updated parameter values
$$\sigma_1^{-2} = \sigma_0^{-2} + n\sigma^{-2} \quad \text{and} \quad \mu_1 = \sigma_1^2\bigl(\sigma_0^{-2}\theta_0 + n\sigma^{-2}\bar{x}\bigr).$$
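A short sketch of our own (assuming numpy) of the conjugate updating rules at the end of Display 5.6, showing that the posterior parameters follow directly from the precisions:

```python
import numpy as np

def normal_posterior(prior_mean, prior_var, data, data_var):
    """Posterior N(mu1, var1) for a normal mean, given a N(prior_mean, prior_var) prior
    and n observations each with known variance data_var (conjugate update)."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    post_precision = 1.0 / prior_var + n / data_var          # precisions add
    post_var = 1.0 / post_precision
    post_mean = post_var * (prior_mean / prior_var + n * data.mean() / data_var)
    return post_mean, post_var

rng = np.random.default_rng(4)
data = rng.normal(2.0, 1.0, size=20)                          # true mean 2, known variance 1
print(normal_posterior(prior_mean=0.0, prior_var=10.0, data=data, data_var=1.0))
```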
We should say something about the choice of prior distribution. The prior distribution
represents one’s initial belief that the parameter takes different values. The more confident
one is that it takes particular values, the more closely the prior will be bunched about those
values. The less confident one is, the larger will be the dispersion of the prior. In the case
of a normal mean, if one had no idea of the true value, one would want to take a prior which
gave equal probability to each possible value. That is, one would want to adopt a prior
which was perfectly flat or which had infinite variance. This would not correspond to any
proper density function (which has to have some non-zero values and which has to integrate
to unity). Despite this, it is sometimes useful to adopt improper priors, which are uniform
throughout the space of the parameter. One can think of such priors as being essentially flat
in all regions where the parameter might conceivably occur. Even so, there still remains the
difficulty that priors which are uniform for a particular parameter are not uniform for a
nonlinear transformation of that parameter.
Another issue, which might be seen as a difficulty or a strength of Bayesian inference, is that
priors show the prior belief an individual has in the various possible values of a parameter -
and individuals differ. It is entirely possible that your prior will differ from mine. This
means that we will probably obtain different results from an analysis. In some
circumstances this is fine. But in others it is less so. One way to overcome this which is
sometimes applicable is to use a so-called reference prior, a prior which is agreed by
convention. A common form of reference prior is Jeffreys' prior. To define this we need to
define the Fisher information, which is given by
$$I(\theta \mid x) = -E\left[\frac{\partial^2 \log L(\theta \mid x)}{\partial \theta^2}\right],$$
that is, the negative of the expectation of the second derivative of the log-likelihood.
Essentially this measures the curvature of the likelihood function - or its flatness. And the
flatter a likelihood function is, the less the information it is providing about the parameter
values. Jeffreys' prior is then defined as
$$p(\theta) \propto \sqrt{I(\theta \mid x)}.$$
The reason that this is a convenient reference prior is that, if $\phi = \phi(\theta)$ is some function of $\theta$,
then $\phi$ has prior proportional to $\sqrt{I(\phi \mid x)}$. This means that a consistent prior will result
however the parameter is transformed.
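As a worked illustration (ours, not from the text), take a single Bernoulli observation x with parameter p. Then
$$\log L(p \mid x) = x \log p + (1 - x)\log(1 - p), \qquad
I(p \mid x) = E\left[\frac{x}{p^2} + \frac{1 - x}{(1-p)^2}\right] = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)},$$
so Jeffreys' prior is proportional to $p^{-1/2}(1-p)^{-1/2}$, a Beta(1/2, 1/2) distribution.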
The example presented in Display 5.6 began with a normal prior and ended with a normal
posterior. Conjugate families of distributions satisfy this property in general: the prior
distribution and posterior distribution both belong to the same family. The advantage of
using conjugate families is that the complicated updating process can be replaced by a simple
updating of the parameters.
We have already remarked that it is straightforward to obtain single point estimates from the
posterior distribution. Interval estimates are also easy to obtain - integration of the posterior
distribution over a region will give the estimated probability that the parameter lies in that
region. When a single parameter is involved, and the region is an interval, the result is a
credibility interval. A natural such interval would be the interval containing a given
probability (say 90%) and such that the posterior density is highest over the interval. This
gives the shortest possible credibility interval. Given that one is prepared to accept the
fundamental Bayesian notion that the parameter is a random variable, then the interpretation
of such intervals is much more straightforward than the interpretation of frequentist
confidence intervals.
5.4 Hypothesis tests
In many situations one wishes to see if the data support some idea about the value of a
parameter. For example, we might want to know if a new treatment has an effect greater
than the standard treatment, or if the regression slope in a population is different from zero.
Since often, we will be unable to measure these things for the entire population, we must base
our conclusion on a sample. Statistical tools for exploring such hypotheses are called
hypothesis tests.
The basic principle of such tests is as follows. We begin by defining two complementary
hypotheses: the null hypothesis and the alternative hypothesis. Often the null hypothesis is
some point value (e.g. that the effect in question has value zero - that there is no treatment
difference or regression slope) and often the alternative hypothesis is simply the complement
of the null hypothesis. Suppose, for example, that we are concerned with drawing
conclusions about a parameter $\theta$. Then we will denote the null hypothesis by $H_0: \theta = \theta_0$,
and the alternative hypothesis by $H_1: \theta \neq \theta_0$. Using the observed data, we calculate a
statistic (what form of statistic is best depends on the nature of the hypothesis being tested;
examples are given below). If we assume that the null hypothesis is correct, then such a
statistic would have a distribution (it would vary from sample to sample), and the observed
value would be one point from that distribution. If the observed value was way out in the
tail of the distribution, we would have to conclude that either an unlikely event had occurred
(that is, that an unusual sample had been drawn, since it would be unlikely we would obtain
such an extreme value if the null hypothesis were true), or that our assumption was wrong
and that the null hypothesis was not, in fact, true. The more extreme the observed value, the
less confidence we would have in the null hypothesis.
We can put numbers on this procedure. If we take the top tail of the distribution of the
statistic (the distribution based on the assumption that the null hypothesis is true) then we can
find those potential values which, taken together, have a probability of 0.05 of occurring.
Then if the observed value did lie in this region, we could reject the null hypothesis ‘at the
5% level’, meaning that only 5% of the time would we expect to see a result in this region if
the null hypothesis was correct. For obvious reasons, this region is called the rejection
region. Of course, we might not merely be interested in deviations from the null hypothesis
in one direction. That is, we might be interested in the lower tail, as well as the upper tail of
the distribution. In this case we might define the rejection region as the union of the test
statistic values in the lowest 2.5% of the probability distribution of the test statistic and the
test statistic values in the uppermost 2.5% of the probability distribution. This would be a
two-tailed test, whereas the former was a one-tailed test. The size of the rejection region
can, of course, be chosen at will, and this size is known as the significance level of the test.
Common values are 1%, 5%, and 10%.
We can compare different test procedures in terms of their power. The power of a test is the
probability that it will correctly reject a false null hypothesis. To evaluate the power of a
test, we need some specific alternative hypothesis. This will enable us to calculate the
probability that the test statistic will fall in the rejection region if the alternative hypothesis is
true.
The fundamental question now is, how do we find a good test statistic for a particular
problem? One strategy is to use the likelihood ratio. The likelihood ratio test statistic to
test the hypothesis $H_0: \theta = \theta_0$ against the alternative $H_1: \theta \neq \theta_0$ is defined as
$$\lambda(x) = \frac{L(\theta_0 \mid x)}{\sup_\theta L(\theta \mid x)},$$
that is, the ratio of the likelihood when $\theta = \theta_0$ to the largest value of the likelihood when $\theta$
is unconstrained. Clearly the null hypothesis should be rejected when $\lambda(x)$ is small. This
can easily be generalised to situations when the null hypothesis is not a point hypothesis but
includes a set of possible values for $\theta$.
DISPLAY 5.7
Suppose we have a sample of n points independently drawn from a normal distribution with unknown
mean and unit variance, and that we wish to test the hypothesis that the mean has value 0. The
likelihood under this (null hypothesis) assumption is
$$L(0 \mid x) = \prod_i p(x_i \mid 0) = \prod_i \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(x_i - 0)^2\right).$$
The maximum likelihood estimator of the mean of a normal distribution is the sample mean, so that the
unconstrained maximum likelihood is
$$L(\hat{\mu} \mid x) = \prod_i p(x_i \mid \hat{\mu}) = \prod_i \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(x_i - \bar{x})^2\right).$$
The ratio of these simplifies to
$$\lambda(x) = \exp\left(-\frac{n}{2}(\bar{x} - 0)^2\right).$$
Our rejection region is thus $\{x \mid \lambda(x) \le c\}$ for a suitably chosen value of c. This can be rewritten
as $\bar{x}^2 \ge -\frac{2}{n}\ln c$. Thus the test statistic $\bar{x}$ has to be compared with a constant.
Certain situations arise particularly frequently in testing. These include tests of differences
between means, tests to compare variances, and tests to compare an observed distribution
with a hypothesised distribution (so-called goodness-of-fit tests). Display 5.8 outlines the
common t-test of a difference between the means of two independent groups. Descriptions
of other tests can be found in introductory statistics texts.
DISPLAY 5.8
Let $(x_1, \ldots, x_n)$ be a sample of n observations randomly drawn from a normal distribution
$N(\mu_x, \sigma^2)$, and let $(y_1, \ldots, y_m)$ be an independent sample of m observations randomly drawn from
a normal distribution $N(\mu_y, \sigma^2)$. Suppose we wish to test the hypothesis that the means are equal:
$H_0: \mu_x = \mu_y$. The likelihood ratio statistic under these circumstances reduces to
$$t = \frac{\bar{x} - \bar{y}}{s\sqrt{1/n + 1/m}}, \qquad \text{with} \qquad s^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n + m - 2},$$
where $s_x^2 = \sum (x_i - \bar{x})^2 / (n-1)$ is the estimated variance for the x sample, with a similar expression
for $s_y^2$. $s^2$ here is thus seen to be simply a weighted sum of the sample variances of the two samples, and the test
statistic is merely the difference between the two sample means adjusted by the estimated standard
deviation of that difference. Under the null hypothesis t follows a t distribution with n+m-2 degrees of
freedom (see Section 5.6).
Although normality of the two populations being compared here is assumed, this test is fairly robust to
departures from normality, especially if the sample sizes are roughly equal and the variances are roughly
equal. This test is very widely used.
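A sketch (ours, using simulated data and assuming numpy and scipy) of the test in Display 5.8, computed both directly from the formula and with a standard library routine for comparison:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(10.0, 2.0, size=40)     # group 1
y = rng.normal(11.0, 2.0, size=35)     # group 2, slightly larger mean

n, m = len(x), len(y)
s2 = ((n - 1) * x.var(ddof=1) + (m - 1) * y.var(ddof=1)) / (n + m - 2)   # pooled variance
t = (x.mean() - y.mean()) / np.sqrt(s2 * (1.0 / n + 1.0 / m))
p_value = 2 * stats.t.sf(abs(t), df=n + m - 2)                            # two-tailed
print(f"t = {t:.3f}, p = {p_value:.4f}")

# The same result from scipy's built-in routine (equal variances assumed).
print(stats.ttest_ind(x, y, equal_var=True))
```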
The hypothesis testing strategy outlined above is based on the fact that a random sample has
been drawn from some distribution, and the aim is to make some probability statement about
a parameter of that distribution. The ultimate objective is to make an inference from the
sample to the general underlying population of potential values. For obvious reasons, this is
sometimes described as the sampling paradigm. An alternative strategy is sometimes
appropriate, especially when one is not confident that the sample has been obtained by
probability sampling (see Chapter 2). Now inference to the underlying population is not
possible. However, one can still sometimes make a probability statement about how likely
some effect is under a null hypothesis. Consider, for example, a comparison of a treatment
and control group. We might adopt, as our null hypothesis, that there is no treatment effect,
so that the distribution of scores of people who received the treatment should be the same as
that of those who did not. If we take a (possibly not randomly drawn) sample of people and
randomly assign them to the treatment and control group, then we would expect the
difference of (say) mean scores between the groups to be small. Indeed, under fairly general
assumptions, it is not difficult to work out the distribution of the difference between the
sample means of the two groups one would expect if there was no treatment effect, and if
such a difference just arose as a consequence of an imbalance in the random allocation. One
can then explore how unlikely it is that a difference as large or larger than that actually
obtained would be seen - in much the same way as before. Tests based on this principle are
termed randomisation tests or permutation tests. Note that they make no statistical inference
from the sample to the overall population, but they do permit one to make a conditional
probability statement about the treatment effect, conditional on the observed values.
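The following sketch (ours, with made-up data and assuming numpy) implements a simple randomisation test of the difference in group means along the lines just described: the group labels are repeatedly reshuffled, and the observed difference is compared with the distribution of differences obtained under reshuffling.

```python
import numpy as np

rng = np.random.default_rng(6)
treatment = np.array([23.1, 25.4, 27.9, 24.6, 26.3, 28.8])
control   = np.array([21.0, 22.7, 24.1, 23.5, 20.9, 22.2])

observed = treatment.mean() - control.mean()
pooled = np.concatenate([treatment, control])
n_treat = len(treatment)

n_perm = 10_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)                               # random reallocation to the two groups
    diff = pooled[:n_treat].mean() - pooled[n_treat:].mean()
    if abs(diff) >= abs(observed):                    # two-sided comparison
        count += 1

print(f"observed difference = {observed:.2f}, permutation p-value = {count / n_perm:.4f}")
```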
Many statistical tests make certain assumptions about the forms of the population
distributions from which the samples are drawn. The two-sample t-test in Display 5.8
illustrates this; there an assumption of normality was made. Often, however, it is
inconvenient to make such assumptions. Perhaps one has little idea or justification for the
assumption, or perhaps the data are known not to follow the form required by a standard test.
In such circumstances one can adopt distribution free tests. Tests based on ranks fall into
this class. Here the basic data are replaced by the numerical label of the position in which
they occurred. For example, to explore whether two samples had arisen from the same
distribution, one could replace the actual numerical values by their ranks. If they had arisen
from the same distributions, then one would expect that the ranks of the members of the two
samples would be well mixed. If, however, one distribution had a larger mean than the
other, one would expect one sample to tend to have large ranks and the other small ranks. If
the distributions had the same means, but one sample had a larger variance than the other,
then one would expect one sample to show a surfeit of large and small ranks, and the other to
dominate the intermediate ranks. Test statistics can be constructed based on the average
values or some other combinations of the ranks, and their significance level can be evaluated
using randomisation arguments. Such test statistics include the sign test statistic, the rank
sum test statistic, the Kolmogorov-Smirnov test statistic, and the Wilcoxon test statistic.
Sometimes the term nonparametric test is also used to describe such tests - the rationale
being that such tests are not testing the value of some parameter of any assumed distribution.
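A sketch (ours, assuming numpy and scipy) of a rank-based comparison of two samples, using the Wilcoxon rank-sum statistic available in scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample_a = rng.lognormal(mean=0.0, sigma=0.5, size=30)    # skewed data, so normality is doubtful
sample_b = rng.lognormal(mean=0.3, sigma=0.5, size=30)

# Rank-sum test: the values are replaced by their ranks in the combined sample.
statistic, p_value = stats.ranksums(sample_a, sample_b)
print(f"rank-sum statistic = {statistic:.3f}, p = {p_value:.4f}")
```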
This section has described the classical (frequentist) approach to statistical hypothesis testing.
In data mining, however, things can become more complicated.
Firstly, because data mining involves large data sets, one should expect to obtain statistical
significance: even slight departures from the hypothesised model form will show up as real
(unlikely to be merely chance fluctuations). This will be true even though the departures
from the assumed model form may be of no practical significance. (If they are of practical
significance, of course, then well and good.) Worse, the slight departures from the
hypothesised model will show up as significant, even though these departures are due to
contamination or data distortion. We have already remarked on the inevitability of this.
Secondly, sequential model fitting processes are common. In Section 5.5 and Chapter 6 we
describe stepwise model fitting procedures, which gradually refine a model by adding or
deleting terms. Separate tests on each model, as if it was de novo, lead to incorrect
probabilities. Formal sequential testing procedures have been developed, but they can be
quite complex. Moreover, they may be weak - because of the multiple testing going on.
This is discussed below.
Thirdly, data mining is an essentially exploratory process. This has various implications.
One is that many models will be examined. Suppose one tests m true (though we will not
know this) null hypotheses, each at, say, the 5% level. Since they are each true, this means
that, for each hypothesis separately, there is a probability of 0.05 of incorrectly rejecting the
hypothesis. Since the tests are assumed independent, this means that the probability of incorrectly
rejecting at least one is $p = 1 - (1 - 0.05)^m$. When m = 1 we have p = 0.05, which is fine.
But when m = 10 we obtain p = 0.4013 and when m = 100 we obtain p = 0.9941. That is, if
we test even as few as 100 true null hypotheses, we are almost certain to incorrectly reject at
least one. Alternatively, one could control the overall family error rate, setting p = 0.05, so
that the probability of incorrectly rejecting one or more of the m true null hypotheses was
0.05. In this case we use $0.05 = 1 - (1 - \alpha)^m$ for each given m to obtain the level $\alpha$ at which
each of the separate true null hypotheses is being tested. With m = 10 we obtain $\alpha = 0.0051$
and with m = 100 we obtain $\alpha = 0.0005$. This means that we have only a very small
probability of incorrectly rejecting any of the separate component hypotheses.
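These figures are easy to reproduce (a sketch of ours, plain Python):

```python
m_values = [1, 10, 100]

for m in m_values:
    # Probability of at least one false rejection when each of m independent true
    # null hypotheses is tested at the 5% level.
    p_any = 1 - (1 - 0.05) ** m
    # Per-test level alpha needed so that the family-wise error rate is 0.05.
    alpha = 1 - (1 - 0.05) ** (1 / m)
    print(f"m = {m:3d}: P(at least one rejection) = {p_any:.4f}, required alpha = {alpha:.4f}")
```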
Of course, things are much more complicated in practice: the hypotheses are unlikely to be
completely independent (at the other extreme, if they are completely dependent, then
accepting or rejecting one implies the acceptance or rejection of all), with an unknown and
essentially unknowable dependence structure, and there will typically be a mixture of true (or
approximately true) and false null hypotheses.
Various simultaneous test procedures have been developed in attempts to ease these
difficulties, but the problem is not really one of inadequate methods but is rather more
fundamental. A basic approach is based on the Bonferroni inequality. We can expand the
probability $(1-\alpha)^m$ that none of the m true null hypotheses are rejected to yield
$(1-\alpha)^m \ge 1 - m\alpha$. It follows that $1 - (1-\alpha)^m \le m\alpha$. That is, the probability that one or
more true null hypotheses is incorrectly rejected is less than or equal to $m\alpha$. In general, the
probability of incorrectly rejecting one or more of the true null hypotheses is less than or equal to the
sum of the probabilities of incorrectly rejecting each of them. This is a first order Bonferroni inequality. By
including other terms in the expansion, more accurate bounds can also be developed - though
they require knowledge of the dependence relationships between the hypotheses.
With some test procedures difficulties can arise in which a global test of a family of
hypotheses rejects the null hypothesis (so one believes at least one to be false) but no single
component is rejected. Once again strategies have been developed for overcoming this in
particular applications. For example, in multivariate analysis of variance, which is
concerned with comparing several groups of objects which have been measured on multiple
variables, test procedures which overcome these problems based on comparing each test
statistic with a single threshold value have been developed. See Section 5.7 for details of
references to such material.
It will be obvious from the above that attempts to put probabilities on statements of various
kinds, via hypothesis tests, while they do have a place in data mining, are not a universal
panacea. However, they can be regarded as a particular type of a more general procedure
which maps the data and statement to a numerical value or score. Higher (or lower,
depending upon the procedure) scores indicate that one statement, or model, is to be preferred
to another, without attempting any absolute probabilistic interpretation. The penalised
goodness-of-fit criteria described in Section 5.5 can be thought of as scoring methods.
5.5 Fitting complex models
The previous section dealt with estimating single parameters, or, at least, simple models. In
this section we consider more elaborate models, and we shall see that new kinds of problems
arise.
In Section 5.1 we spoke of inferring an underlying structure from the available data. In this
we are regarding the available data as being a sample drawn from some distribution which is
defined in terms of the underlying structure. We can then think of the set of all possible data
points which could have been chosen as the population of values, as described in Chapter 2.
Sometimes this population is finite (as in the population of heights of people who work for a
particular corporation) but often it is infinite - as in the amount people might spend on a
Friday night in a particular supermarket. Note that here the range of possible values is finite,
as is the number of people who will make a purchase (though unknown), but, no matter how long a
list of possible values one makes, one cannot state what the value of the next transaction will
be.
In what follows we shall again assume that the available data have been drawn from the
population by a random sampling process. If this is not the case, and if some of the
distortions outlined in Chapter 2 have occurred, then the following needs to be modified.
Note that in any case in data mining contexts one should be constantly on the alert for
possible sampling distortions. With large datasets such distortions are very likely, and they
can totally invalidate the results of an analysis. In some contexts the data available on which
to construct the model is called the design sample or training sample, and we shall sometimes
use these terms.
In the inferential modelling process the unknown distribution of the (potential) values in the
population is the model whose structure we are seeking to infer. Following the outline in
Section 5.1, an obvious strategy to follow to estimate this distribution is to define some
measure which can serve as a criterion of goodness-of-fit between the model and the
observed data, and then seek to minimise (or maximise, depending upon how it is expressed)
this. However, although obvious, this is not an ideal solution. The available data is, after
all, only a sample from the population of interest. Another sample, drawn by the same
random process, would (almost certainly) have been different from the available sample.
This means that, if one builds a model which very accurately predicts the available data, then
it is likely that this model will not be able to predict another sample very well. This would
mean that one had not modelled the underlying distributions very well. Clearly something
beyond mere goodness-of-fit to the available data is required.
We shall examine this in more detail via an example. Suppose our aim is to construct a
model which will allow us to predict the value of a variable y from the value of a variable x.
That is, we want to use the available data to infer the underlying model of the relationship
between x and y, and then subsequently use this underlying model to predict values of y from
values of x: presented with a case with a known (or measured) value of the variable x, our
model should allow us to say what its y value is likely to be. (Models with this basic
structure are very important, and will be discussed in detail in Chapter 9.)
Perhaps the first thing to note is that we do not expect to be able to obtain perfect predictions.
The values of x and y will vary from sample to sample (partly due to the random nature of the
sample selection process and perhaps also partly due to intrinsic measurement error), so we
must expect that our predictive model will also vary from sample to sample. Of course, we
hope that such variation will be slight (if not then one must question the value of the
model-building process in this case). Moreover, there may be other, unmeasured influences
on y, beyond those included in x, so that perfect prediction of y from x is not possible, even in
principle. However, we do believe that there is some relationship between y and x, some
predictive power for y in x, and it is this we hope to capture. In a sense we are being
pragmatic: we cannot obtain perfection but we can obtain something useful.
Just for the purposes of this initial example (and this is most definitely not a real requirement
in practice) we will suppose that each value of x in the design sample is associated with a
different value of y. Figure 2 shows a plot of the sort of design set we have in mind. We
could now produce a model which permits perfect prediction of the value of y from the value
of x for the points in the design set. Such a model is shown in Figure 3. It is clear that this
predictive model is quite complicated. It is also clear that other predictive models could
be built which would give perfect prediction of the design set - other wobbly lines which
go through every point but which would lead to different predictions for other values of x.
An obvious question then is how we should choose between these models. Moreover, the
models we have proposed yield perfect predictions for the design set elements and we have
already commented that we would not expect to achieve this. If we do achieve it, we are
modelling not only the underlying structure (sometimes called the systematic variation)
which led to the data, but also the peculiarities specific to the design sample which we
happen to have (sometimes called the random variation). This means that, if we were to
observe another data point with an x value equal to one of those in the design set, it would
probably have a y value not equal to the corresponding design set y value, and hence it would
be incorrectly predicted by our model. This suggests that we need a simpler model, one
which models the underlying structure but not the extra variation in the observed y values
arising from other sources. Models which go too far, as in these examples, are said to overfit
the data. Overfitting does not only arise when a model fits the design data perfectly, but
arises whenever it goes beyond the systematic part and begins to fit the random part.
[FIGURE 2: This figure to show a scatterplot of y against x such that each x value is
associated with a unique y value. The data follow a quadratic curve, with variation
about it.]
[FIGURE 3: As figure 2, but with the addition of a wobbly line going through all the
data points.]
 DISPLAY 5.9
The performance of supervised classification rules (Chapter 9) is often assessed in terms of their
misclassification or error rate, the proportion of future objects they are likely to misclassify. For many
years this was estimated by the simple procedure of seeing how many of the design set points a rule
misclassified. However, such an approach is optimistically biased - it underestimates the error rate
which will be obtained on future points from the same distribution. This is because the classification
rule has been optimised to fit the design set very well. (It would be perverse, after all, to pick a rule
deliberately because it did not classify the design set very well.) But this very fact means that it will
overfit the design set to some extent - and the more flexible the rule the more it will overfit. Because of
this, some very sophisticated methods of error rate estimation have been developed. Many of them involve
resampling methods, in which the data are used more than once.
The leaving-one-out method, for example, follows one of the procedures described in the text. A
classifier is built on all but one of the design set points, and is then used to classify this omitted point.
Since it was not included in the design stage, there is no danger of overfitting. This is repeated for each
design set point in turn, leading to each of them being classified by a rule which is based on all the other
points. The overall estimate of error rate of a rule based on all the design set points is then the
proportion of these omitted points which have been misclassified.
Bootstrap methods, of which there are now many, randomly draw a sample from the design set of the same
size as the design set (and hence necessarily with replacement), use this to construct a classifier, and find the error rate
on the points not included in the bootstrap sample. This is repeated for multiple bootstrap samples, the final estimate being an
average of them. Some highly sophisticated (and quite complicated) variants of this have been
developed. 
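To make these two estimators concrete, here is a minimal sketch in Python (the synthetic two-class data and the nearest-class-mean classifier are illustrative assumptions, not part of the text) comparing the optimistic resubstitution estimate with leave-one-out and bootstrap estimates of the error rate.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic two-class data (an assumption, not from the text):
# class 0 centred at (0, 0), class 1 centred at (1.5, 1.5).
n_per_class = 50
X = np.vstack([rng.normal(0.0, 1.0, (n_per_class, 2)),
               rng.normal(1.5, 1.0, (n_per_class, 2))])
y = np.repeat([0, 1], n_per_class)

def fit_nearest_mean(X, y):
    """'Classifier' = the two class means; a point is assigned to the closer mean."""
    return np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(means, X):
    d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    return d.argmin(axis=1)

# Resubstitution (apparent) error rate: optimistically biased.
means = fit_nearest_mean(X, y)
apparent = np.mean(predict(means, X) != y)

# Leave-one-out: refit without point i, then classify the omitted point.
n = len(y)
loo_errors = 0
for i in range(n):
    keep = np.arange(n) != i
    m = fit_nearest_mean(X[keep], y[keep])
    loo_errors += predict(m, X[i:i+1])[0] != y[i]
loo = loo_errors / n

# Bootstrap: sample n points with replacement, train on them,
# and score on the points left out of the bootstrap sample.
B = 200
boot_rates = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)
    out = np.setdiff1d(np.arange(n), idx)
    if len(out) == 0:
        continue
    m = fit_nearest_mean(X[idx], y[idx])
    boot_rates.append(np.mean(predict(m, X[out]) != y[out]))
boot = np.mean(boot_rates)

print(f"apparent {apparent:.3f}  leave-one-out {loo:.3f}  bootstrap {boot:.3f}")

The apparent rate is typically the smallest of the three, illustrating the optimistic bias described above.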
Given that the model in Figure 3 overfitted the data, Figure 4 illustrates a simpler model
which we might consider. This is a very simple model - it can be described in terms of two
parameters: y = ax + b. It is not as good a fit as the previous model, but then it shouldn’t be
- we do not want it to provide perfect prediction for the design set. But we must ask
ourselves if it is too simple. Is it failing to model some aspect of the systematic variation?
Indeed, studying the figure suggests that there is a curved component, that y is quadratically
related to x, and our linear model is failing to capture that. So perhaps something rather more
complicated than the linear model is needed, but not so complicated as the model in Figure 3.
What is needed is something between the models in Figures 3 and 4. The trick in inferential
modelling is to decide just how complex the model should be. We need to find some
compromise, so that the model is flexible enough to fit the (unknown) systematic component,
but not so flexible that it fits the additional random variation. In statistics, this compromise
is known as the bias/variance trade-off (see Display 5.10): an inflexible model will give
biased predictions (at least for some values of x), while a very flexible model will give
predictions at any particular x which may differ dramatically from sample to sample. The
term ‘degrees of freedom’, which occurs at various places in this book, is essentially the
dimension of the vector of parameters which is estimated during the process of fitting a
model. A highly flexible model has many degrees of freedom, while the simple linear model
above has only two.
[FIGURE 4: Same data again, but now with a straight line superimposed as a possible
predictive model.]
 DISPLAY 5.10
Suppose that the values of a variable y are related to the value of a predictor variable x through

    y = f(x) + \varepsilon    (1)

where f is an unknown function and ε is a random variable. Our aim is to formulate a model which
will allow prediction of the value of y corresponding to a given value of x. (Note that implicit in this
formulation is the notion that there is a ‘true’ f.) Our model is to be built using a design set which
consists of a random sample of pairs (x_i, y_i), i = 1, ..., n, from (1). For convenience, we will denote
this design set by X. Using methods described in Chapter 9, we construct a model \hat{f}(x|X). The
notation here means that the model is based on X and provides a prediction at x.
Now, different design sets will lead to different estimators, so that the prediction for a given value of x is
a random variable, varying with the design set. The accuracy of the prediction can be measured by the
mean square error between the true value at x (that is, y = f(x) + ε) and the predicted value
\hat{f}(x|X); that is, by E[(y - \hat{f}(x|X))^2]. Note that the expectation here, and in what follows, is over
the distribution of different design sets and over different values of y.
This measure of predictive accuracy can be decomposed. We have

    E[(y - \hat{f}(x|X))^2]
      = E[(y - f(x))^2] + E[(f(x) - \hat{f}(x|X))^2]
      = E[(y - f(x))^2] + (f(x) - E[\hat{f}(x|X)])^2 + E[(E[\hat{f}(x|X)] - \hat{f}(x|X))^2]

The first term on the right hand side is independent of the training set. No matter how clever we are in
building our model, we can do nothing to reduce this term. The middle term on the right hand side is the
square of the bias. The final term on the right hand side is the variance of the predicted value at x
arising from variation in the design set. Clearly, in order to reduce the overall mean square error as much
as possible we need to reduce both the bias term and the variance term. However, there are difficulties
in reducing both of these simultaneously.
Suppose that we have a very flexible estimator, which can follow the design set very well (the wobbly
line in Figure 3, for example). Then its value at x will tend to be close to the expected value f(x), so
that the bias term above (the middle one) will tend to be small. On the other hand, since it will vary
dramatically from design set to design set, the variance term (the last term above) will tend to be large.
If, on the other hand, we have a rigid estimator, which does not change much as design sets change, the
bias term may be large while the variance term will be small. A very flexible estimator will have low
bias (over design sets) but will tend to overfit the design set, giving a spurious indication of apparent
accuracy. An inflexible estimator may have large bias, and will not model the shape of f(x) very
well. 
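The decomposition in Display 5.10 can be illustrated by simulation. The sketch below (illustrative assumptions: a quadratic 'true' f, Gaussian noise, and polynomial fits of degree 1 and degree 9 standing in for the rigid and the very flexible estimators) repeatedly draws design sets, fits both models, and estimates the squared bias and the variance of the prediction at a single point x0.

import numpy as np

rng = np.random.default_rng(1)

def f(x):                     # assumed 'true' systematic component
    return 1.0 + 2.0 * x - 1.5 * x**2

sigma = 0.3                   # assumed noise standard deviation
n, n_design_sets = 25, 2000
x0 = 0.75                     # point at which predictions are examined

preds = {1: [], 9: []}        # degree 1 = rigid, degree 9 = flexible
for _ in range(n_design_sets):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    for degree in preds:
        coeffs = np.polyfit(x, y, degree)       # least squares polynomial fit
        preds[degree].append(np.polyval(coeffs, x0))

for degree, p in preds.items():
    p = np.array(p)
    bias_sq = (p.mean() - f(x0)) ** 2           # squared bias at x0
    variance = p.var()                          # variance over design sets
    print(f"degree {degree}:  bias^2 = {bias_sq:.4f}   variance = {variance:.4f}")
# Typically the straight line shows the larger squared bias and the
# degree-9 polynomial the larger variance; sigma^2 is the irreducible term.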
In the above example, the complexity of the model lay in the permitted flexibility of the line
used to predict y from x. Formally, one might say that the more complex models were
higher order polynomials in x. Other classes of models introduce flexibility in other ways,
but however they do it they are all subject to the dangers of overfitting. For example,
predictive models with many predictor variables are very flexible by virtue of the many
parameters associated with the many predictors, tree methods (Chapter 9) are very flexible if
they have many nodes, and neural network models are very flexible if they have many hidden
nodes. In general, if the space of possible models is large (if a large number of candidate
models are being considered) then there is a danger of overfitting (although it also depends
on the size of the design set - see below).
Our fundamental aim is to choose a model which will accurately predict future values of y,
but since y has some intrinsic and irreducible random variation (that due to measurement
error or other predictor variables not included in x) the best we can hope to do is to predict
some representative value from the distribution of y corresponding to a given x. That is, our
prediction, for a given value of x, should provide a good estimate of some sort of central
value of the conditional distribution of y given x (we denote this conditional random variable
as y|x). We have already noted that the mean provides a ‘central value’ of a distribution in
the sense that it is the value which minimises the sum of squared differences to other points
in the distribution. Thus, if one were to predict the mean of the y|x distribution one would be
doing the best one could in terms of minimising the sum of squared errors in the future.
This, then, is a sensible aim: to use as the predicted value of y for a given x the mean value of
the y|x distribution.
Various approaches can be used to avoid overfitting the design data. Here we shall briefly
describe three. Clearly, since their aims are the same, there are links between them.
One approach is as follows. In general we measure how well the model fits the design data
using some goodness-of-fit criterion. For example, we could use the sum of squared
deviations between the predicted values of y and the observed values in the above example.
The curve in Figure 3 above would then lead to a value of 0, indicating perfect prediction,
while the straight line in Figure 4 would lead to a larger (and hence poorer) sum of squared
deviations. But the perfect prediction from Figure 3 is spurious, at least as far as future data
go. We can attempt to alleviate the problem by modifying the goodness-of-fit criterion so
that it does not simply reflect performance on the design set. In the above example, an
indication that the model is too flexible is given by the wobbliness of the predictor line.
Thus one might penalise the criterion one uses to measure goodness-of-fit to the design set by
adding a term which grows with increasing model complexity - which works against using
too wobbly a line.
Example 1: We have already described the fundamental role of likelihood in model fitting.
As noted in Chapter 1, the likelihood is a measure of how probable it is that the data could
have arisen from a particular model. Denote the likelihood of a model M based on data
set Y by L(M;Y). Then a measure of how far M is from the model M* which predicts the
data perfectly (more correctly, M* is the model which maximises the likelihood from
amongst those models under consideration) is given by the deviance:
    D(M) = -2 [\ln L(M; Y) - \ln L(M^*; Y)]
The larger this is, the further is the model M from the best which can be achieved. Again,
however, this will improve (in this case, decrease) with increasing model flexibility,
regardless of whether the resulting model better reflects the underlying structure. To
allow for this an extra term is sometimes added. The Akaike information criterion (AIC)
is defined as D(M) + 2p, where p is the number of parameters in the model: again
increasing model complexity - increasing p - is compensated for by the penalisation term.
Another variant on this is the Bayesian information criterion (BIC), defined as D(M) + p \ln n,
where n is the number of data points.
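As a small illustration of how such penalised criteria behave, the following sketch assumes a regression setting with Gaussian errors, for which -2 ln L is, up to an additive constant, n ln(RSS/n); the quadratic data and the candidate polynomial degrees are illustrative assumptions, not part of the text.

import numpy as np

rng = np.random.default_rng(2)

# Illustrative data: quadratic structure plus noise (an assumption).
n = 100
x = rng.uniform(-1, 1, n)
y = 1.0 - 0.5 * x + 2.0 * x**2 + rng.normal(0, 0.4, n)

def fit_rss(degree):
    """Least squares polynomial fit; return the residual sum of squares."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    return np.sum(resid**2)

print("degree   AIC      BIC")
for degree in range(0, 7):
    p = degree + 2                       # polynomial coefficients plus the error variance
    rss = fit_rss(degree)
    minus2loglik = n * np.log(rss / n)   # -2 ln L up to an additive constant
    aic = minus2loglik + 2 * p
    bic = minus2loglik + p * np.log(n)
    print(f"{degree:6d} {aic:8.2f} {bic:8.2f}")
# Both criteria typically reach their minimum at degree 2 here: the
# penalty terms stop the monotonic improvement in fit from winning.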
Example 2: In our example at the beginning of this section, in which we were concerned
with predicting a single y variable from a single x variable, we noted that the model in
Figure 3 was too flexible, fitting the design data perfectly and not generalising very well.
This weakness can be spotted by eye, since it can be seen that the predicted value of y
fluctuates rapidly with changing x. One way to allow for this is to penalise the goodness
of fit by a term which measures irregularity of the prediction. A popular measure is based
on integrating the square of the second derivative, \int (\partial^2 f / \partial x^2)^2 \, dx, over the range of x. Here
f is the model such that y = f(x). This penalty term is a measure of the curvature of the
function - high curvature indicating that the model form is too flexible.
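A minimal sketch of this idea follows, using second differences of the fitted values as a discrete stand-in for the integrated squared second derivative; the sine-curve data and the penalty weights are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)

# Illustrative noisy observations of a smooth curve (an assumption).
n = 60
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.25, n)

def roughness_penalised_fit(y, lam):
    """Minimise ||y - f||^2 + lam * ||D2 f||^2 over the fitted values f,
    where D2 f is the vector of second differences - a discrete
    stand-in for the integrated squared second derivative."""
    n = len(y)
    D2 = np.diff(np.eye(n), n=2, axis=0)     # (n-2) x n second-difference matrix
    A = np.eye(n) + lam * D2.T @ D2          # normal equations matrix
    return np.linalg.solve(A, y)

for lam in (0.1, 10.0, 10000.0):
    f_hat = roughness_penalised_fit(y, lam)
    curvature = np.sum(np.diff(f_hat, 2) ** 2)
    rss = np.sum((y - f_hat) ** 2)
    print(f"lambda={lam:8.1f}  RSS={rss:6.2f}  curvature penalty={curvature:8.4f}")
# A small penalty weight gives a wobbly fit (small RSS, high curvature);
# a large weight forces the fit towards a straight line (zero second differences).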
Example 3: The sum of squared differences between predicted values and target values is
also a common goodness-of-fit criterion when fitting neural networks (Chapter 9). Again,
however, increasing the flexibility of the model, here by including more and more hidden
nodes for example, will improve the apparent predictive power as the model provides
better and better fits to the design set. To overcome this the sum of squared deviations is
often penalised by a term proportional to the sum of squared weights in the network - so-called weight decay.
The minimum message length and minimum description length approaches to model building
are explicitly based on the notion that models which overfit the data are more complex than
models which don’t go too far. They optimise a criterion which has two components. One
is a description of the data in terms of the model, and the other is a description of the
complexity of the model. The two components are represented in commensurate terms
(code length). A more complex model will lead to a simpler description of the data, but at
the cost of requiring a longer expected code to describe the model - and vice versa. The best
estimator is one which provides a suitable compromise by minimising the overall length.
This notion of modifying some measure of fit of a model to the design data, so as to get a
better indication of how well it fits the underlying structure, is ubiquitous. Here are two
further examples from standard multiple regression (Chapter 9):
Example 4: In basic multiple regression a common measure of the predictive power of a
model is the multiple correlation coefficient, denoted R2. Denoting the values of y
predicted from the model by \hat{y}, and the mean of the y values by \bar{y}, this is defined as

    R^2 = \frac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2} = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}

It thus gives the proportion of the variance in the y values which can be explained by the
model (or one minus the ratio of the unexplained variance to the total variance). A large
value (near 1) means that the model explains much of the variation in the observed y
values, and so provides a good fit to the data. However, even if the x variables contained
no predictive power for y, R2 would be non-zero and would increase as further x variables
were included in the model. That is, as the model increases in flexibility (by adding more
variables, and hence more parameters) it begins to fit the design data better and better,
regardless of the fact that our aim is to fit the underlying structure and not the peculiarities
of the design data.
To overcome this, an adjusted (or ‘corrected’) form of R2 is often used. This is defined as
    \bar{R}^2 = \left( R^2 - \frac{k}{n-1} \right) \frac{n-1}{n-k-1}

where k is the number of predictor variables and n is the number of data points. (This is
equivalent to 1 - (1 - R^2)(n-1)/(n-k-1).)
Whereas R^2 is monotonic with increasing k (that is, it never decreases as k increases), \bar{R}^2
may decrease, depending on the predictive power of the new x variables.
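The following sketch illustrates the contrast numerically (the data, in which only the first predictor carries any information about y, are an illustrative assumption): R^2 creeps upwards as useless predictors are added, while the adjusted version does not.

import numpy as np

rng = np.random.default_rng(4)

# Illustrative data: y depends on the first predictor only; the
# remaining columns of X are pure noise (an assumption).
n, k_max = 100, 10
X = rng.normal(size=(n, k_max))
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, n)

def r_squared(X, y):
    """Ordinary least squares fit with an intercept; return R^2."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)

print(" k     R^2   adj R^2")
for k in range(1, k_max + 1):
    r2 = r_squared(X[:, :k], y)
    adj = (r2 - k / (n - 1)) * (n - 1) / (n - k - 1)
    print(f"{k:2d}  {r2:7.4f}  {adj:7.4f}")
# R^2 never decreases as useless predictors are added; the adjusted
# version can fall, signalling that the extra flexibility buys nothing.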
Example 5: Ridge regression modifies the standard estimates used in least squares
regression by shrinking the covariance matrix of the predictor variables towards the
identity matrix. Formally, if the usual estimate of the regression coefficients is given by
(X'X)^{-1}X'Y, where X is the n by k matrix of values of the k predictor variables on the n
records and Y is the n by 1 vector of values of the variable to be predicted, then the ridge
regression estimator is (X'X + \lambda I)^{-1}X'Y, where λ is a shrinkage parameter to be chosen
(written λ here to avoid confusion with the number of predictors k) and I is the identity matrix.
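A minimal sketch of the ridge estimator, written with lam for the shrinkage parameter and using illustrative correlated data (the data-generating step is an assumption, not part of the text):

import numpy as np

rng = np.random.default_rng(5)

# Illustrative data with highly correlated predictors (an assumption),
# the situation in which ridge regression is most useful.
n, k = 60, 5
Z = rng.normal(size=(n, 1))
X = Z + 0.1 * rng.normal(size=(n, k))          # highly correlated columns
beta_true = np.array([1.0, -1.0, 0.5, 0.0, 2.0])
y = X @ beta_true + rng.normal(0, 0.5, n)

def ridge(X, y, lam):
    """(X'X + lam I)^{-1} X'y - the ridge estimator described in the text,
    with lam as the shrinkage parameter."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

for lam in (0.0, 1.0, 100.0):
    b = ridge(X, y, lam)
    print(f"lambda={lam:6.1f}  coefficients={np.round(b, 2)}")
# lam = 0 gives ordinary least squares; as lam grows the coefficients
# are shrunk towards zero, trading a little bias for reduced variance.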
In the above, the goodness-of-fit criterion was modified so that the model would not overfit
the design data but would provide a better fit to the underlying structure. An alternative is to
apply the goodness-of-fit criterion and a measure of overfitting one after the other.
Sometimes these two steps are applied several times in what are called stepwise algorithms
(Chapter 6). Indeed, traditional statistical methods of model fitting can be seen as adopting
such a strategy. In essence, one finds a term which improves the goodness-of-fit of the
model to the data and then carries out a statistical test to see if the improvement due to the
extra term is real or could easily be attributed to chance variation in how the design data
happen to have arisen. These are forward stepwise procedures. Backwards stepwise
procedures work in the opposite direction, beginning with a complicated model and
eliminating those terms for which the improvement can be attributed to chance. This is a
common approach with tree methods, where models with too many leaf nodes are built and
then pruned back.
A third strategy for avoiding overfitting is based on the observation that our fundamental aim
is to predict future values, and to go directly for that, as in the following. Temporarily delete
one data point from the design set. Using this reduced design set find the best fitting models
of the types under consideration (e.g. in the above example, we might try linear, quadratic,
and cubic models to predict y from x). Then, for each of these models, see how accurately
they predict the y value of the observation which has been left out. Repeat this in turn for
each observation, and combine them (for example, in a sum of squared errors between the
true y values and their predicted values). This yields a measure of how well each model
predicts future values - and hence is a more appropriate criterion on which to base the model
selection.
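A sketch of this leave-one-out procedure for the running example follows, comparing linear, quadratic and cubic polynomial fits as suggested in the text (the simulated design set, and the extra high-degree fit added for contrast, are illustrative assumptions).

import numpy as np

rng = np.random.default_rng(6)

# Illustrative design set with quadratic structure (an assumption).
n = 40
x = rng.uniform(-1, 1, n)
y = 1.0 + x - 2.0 * x**2 + rng.normal(0, 0.3, n)

def loo_score(degree):
    """Leave-one-out sum of squared prediction errors for a polynomial fit."""
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coeffs = np.polyfit(x[keep], y[keep], degree)   # fit without point i
        pred = np.polyval(coeffs, x[i])                 # predict the omitted point
        total += (y[i] - pred) ** 2
    return total

for degree in (1, 2, 3, 9):          # degree 9 added purely for contrast
    print(f"degree {degree}: leave-one-out SSE = {loo_score(degree):.2f}")
# The in-sample fit always improves with degree, but the leave-one-out
# score is usually smallest near the true degree (2 here).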
In the opening paragraphs of this section we noted that a model which provided too good a fit
to a given design sample would be unlikely to provide a good fit to a second sample. This is
by virtue of the fact that it modelled the idiosyncratic aspects of the design sample as well as
the systematic aspects. The same applies the other way round: a model which went too far
in providing a good fit to the second sample would be unlikely to provide a good fit to the
first. This suggests that we might be better off using a model which is in some sense
between the models based on the two samples - some sort of average of the two models.
This idea can be generalised. If we had many samples, we could produce one model for
each, each highly predictive for the sample on which it was based, but less predictive for the
other samples. Then we could take an average of these models. For any particular sample,
this average would not be as good as the model built for that sample, but averaged over its
predictive performance on all samples it might well do better. This is the idea underlying the
strategy of bagging (from bootstrap aggregating). Normally, of course, one does not have
multiple samples (if one did, one would be inclined to merge them into a single large
sample). But one can simulate this through a process called bootstrapping. In
bootstrapping one draws a sample with replacement (Chapter 2) from the design set of the
same size as the design set. This serves the role of a single sample in the above - and a
model is built using this sample. This can be repeated many times, each bootstrap sample
yielding a model. Finally an average model can be obtained which smoothes out the
irregularities specific to each individual model. This approach is more effective with more
flexible models since this means that the individual models reflect substantial parts of the
intrinsic variability of their specific samples, and hence will benefit from the smoothing
implicit in bagging.
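The following sketch illustrates bagging for the running regression example, using a deliberately flexible high-degree polynomial as the base model (the simulated data, the degree and the number of bootstrap samples are illustrative assumptions).

import numpy as np

rng = np.random.default_rng(7)

# Illustrative design set (an assumption): quadratic structure plus noise.
n = 50
x = rng.uniform(-1, 1, n)
y = 1.0 - x + 2.0 * x**2 + rng.normal(0, 0.4, n)
x_grid = np.linspace(-1, 1, 200)

degree = 9                        # a deliberately flexible (overfitting) model
B = 200                           # number of bootstrap samples

# Single flexible fit on the full design set.
single = np.polyval(np.polyfit(x, y, degree), x_grid)

# Bagged fit: average the predictions of models built on bootstrap samples.
preds = np.zeros((B, len(x_grid)))
for b in range(B):
    idx = rng.integers(0, n, size=n)            # sample with replacement
    coeffs = np.polyfit(x[idx], y[idx], degree)
    preds[b] = np.polyval(coeffs, x_grid)
bagged = preds.mean(axis=0)

truth = 1.0 - x_grid + 2.0 * x_grid**2
print("mean squared error against the true curve on the grid:")
print(f"  single degree-{degree} fit: {np.mean((single - truth)**2):.3f}")
print(f"  bagged  degree-{degree} fit: {np.mean((bagged - truth)**2):.3f}")
# The averaging smooths out the sample-specific wobbles of the
# individual flexible fits, usually reducing the error.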
The bagging idea is related to the general notion of model averaging where, in general,
multiple models are built and the results averaged. Take a classification tree as an example.
Here the design data is recursively partitioned into smaller and smaller groups. Each
partition is based (in the simpler trees, at least) on splitting some variable at a threshold.
Choosing different variables and different thresholds will lead to different tree structures - different predictive models. Elaborate trees correspond to overfitted models, and benefit can
be obtained by averaging them. Of course, it is not immediately clear how to weight the
different trees when computing the average, and many ways have been proposed. Some are
outlined in Chapter 9.
Sometimes data sets distinct from the design set, termed validation sets, are used to choose
between models (since selection of the final model depends on the validation set, technically
these are really being used as part of the design process, in a general sense).
Although our example above was a predictive model, model simplification strategies can also
be applied to descriptive models, those where the aim is simply to find good estimates of the
underlying structure.
In the above we have characterised the problem of choosing a model as one of a compromise
between inflexibility and overfitting. Improve one of these by adopting a model of different
complexity and the other gets worse. However, there is one way in which one can improve
both simultaneously - or, at least, improve one without causing the other to deteriorate, which
comes down to the same thing. This is to increase the size of the design sample on which
the model is based. For a model of a given complexity, the larger the design sample the
more accurate will be the estimates of the parameters - the smaller will be their variances. In
the limit, if the population is finite (albeit large), as it is in many data mining problems, then,
of course, the variance of the estimates will be zero when the entire population is taken.
Broadly speaking, the standard deviation of parameter estimates is inversely related to the
square root of the sample size. (This needs to be modified if the population is finite, but it
can be taken as a rule of thumb - see Section 2.5.) This means that there is a law of
diminishing returns but it also means that one can choose one’s sample sufficiently large that
the uncertainty in the parameter estimates is small enough to be irrelevant.
Perhaps it is worth adding here that in many data mining problems the sheer size of the data
set means that overfitting is much less of a problem. Even so, general ad hoc rules of this kind are
dangerous to give, because they may well break down for your particular problem.
5.6 Some common probability distributions
Chapter 2 described how the notion of uncertainty was fundamental in data mining exercises,
and how crucial it was to have methods for coping with it. That chapter also introduced the
idea of a probability distribution. Here we describe some of the more important probability
distributions which arise in data mining.
1. Bernoulli distribution
The Bernoulli distribution has just two possible outcomes. Situations which might be
described by such a distribution include the outcome of a coin toss (heads or tails) or success
or failure in some situation. Denoting the outcomes by 0 and 1, let p be the probability of
observing a 1 and (1-p) the probability of observing a 0. Then the probability mass function
can be written as p^x (1-p)^{1-x}, with x taking the value 0 or 1. The mean of the distribution
is p and its variance is p(1-p).
2. Binomial distribution
This is a generalisation of the Bernoulli distribution, and describes the number of ‘type 1
outcomes’ (e.g. successes) in n independent Bernoulli trials, each with parameter p. The
probability mass function has the form \binom{n}{x} p^x (1-p)^{n-x}. The mean is np and the variance is
np(1-p).
3. Multinomial distribution
The multinomial distribution is a generalisation of the binomial distribution to the case where
there are more than two potential outcomes; for example, there may be k possible outcomes,
the ith having probability p_i of occurring.
Suppose that n observations have been independently drawn from a multinomial distribution.
Then the mean number of observations yielding the ith outcome is n p_i and its variance is
n p_i (1 - p_i). Note that, since the occurrence of one outcome means the others cannot occur,
the individual outcome counts must be negatively correlated. In fact, the covariance between the
ith and jth (i \neq j) counts is -n p_i p_j.
4. Poisson distribution
If random events are observed independently, with underlying rate λ, then we would expect
to observe λt events in a time interval of length t. Sometimes, of course, we would observe none
in time t, at other times we would observe 1, and so on. If the rate is low, we would rarely
expect to observe a large number of events (unless t was large). A distribution which
describes this state of affairs is the Poisson distribution. It has probability mass function
(\lambda t)^x e^{-\lambda t} / x!. The mean and variance of the Poisson distribution are the same, both being λt.
Given a binomial distribution with large n and small p such that np is constant, this
may be well approximated by a Poisson distribution with probability mass function (np)^x e^{-np} / x!.
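This approximation is easy to check numerically; the sketch below uses the illustrative values n = 1000 and p = 0.003, so that np = 3.

import math

# Numerical check of the Poisson approximation to the binomial
# (illustrative values: n = 1000, p = 0.003, so np = 3).
n, p = 1000, 0.003
mu = n * p

def binom_pmf(x):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x):
    return mu**x * math.exp(-mu) / math.factorial(x)

print(" x   binomial    Poisson")
for x in range(8):
    print(f"{x:2d}   {binom_pmf(x):.6f}   {poisson_pmf(x):.6f}")
# The two columns agree closely, as suggested above for large n and
# small p with np held fixed.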
5. Normal (or Gaussian) distribution
The probability density function takes the form

    f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right)

where μ is the mean of the distribution and σ^2 is the variance. The standard normal
distribution is the special case with zero mean and unit variance. The normal distribution is
very important, partly as a consequence of the central limit theorem. Roughly speaking, this
says that the distribution of the mean of a sample of n observations becomes more and more
like the normal distribution as n increases, regardless of the form of the population
distribution from which the data are drawn. (Of course, mathematical niceties require this to
be qualified for full rigor.) This is why many statistical procedures are based on an
assumption that various distributions are normal - if the sample size is large enough, this is a
reasonable assumption.
The normal distribution is symmetric about its mean, and 95% of its probability lies within
±1.96 standard deviations of the mean.
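A small simulation illustrates the central limit theorem claim: sample means from a strongly skewed exponential distribution (an illustrative choice, not from the text) fall within 1.96 standard errors of the population mean with probability approaching the 95% figure quoted above as n grows.

import numpy as np

rng = np.random.default_rng(8)

# The exponential distribution is strongly skewed, yet means of samples
# from it look increasingly normal as the sample size n grows.
n_reps = 20000
for n in (2, 10, 50):
    samples = rng.exponential(scale=1.0, size=(n_reps, n))
    means = samples.mean(axis=1)
    # For an exponential(1) population, the mean is 1 and the standard
    # deviation of the sample mean is 1/sqrt(n).
    within = np.mean(np.abs(means - 1.0) <= 1.96 / np.sqrt(n))
    print(f"n={n:3d}: proportion of sample means within 1.96 SE = {within:.3f}")
# The proportion approaches 0.95, the normal-distribution figure quoted
# above, as n increases.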
6. Student’s t-distribution
Consider a sample from a normal distribution with known standard deviation σ. An
appropriate test statistic to use to make inferences about the mean μ would be the ratio

    \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}

Using this, for example, one can see how far the sample mean deviates from a hypothesised
value of the unknown mean. This ratio will be normally distributed by the central limit
theorem (see the normal distribution above). Note that here the denominator is a constant. Of
course, in real life, one is seldom in the situation of making inferences about a mean when the
standard deviation is known. This means that one would usually want to replace the
above ratio by

    \frac{\bar{x} - \mu}{s / \sqrt{n}}

where s is the sample estimate of the standard deviation. As soon as one does this the ratio
ceases to be normally distributed - extra random variation has been introduced by the fact that
the denominator now varies from sample to sample. The distribution of this new ratio will
have a larger spread than that of the corresponding normal distribution - it will have fatter
tails. This distribution is called the t-distribution. Note that there are many - they differ
according to how large is the sample size, since this affects the variability in s. They are
indexed by (n-1), known as the degrees of freedom of the distribution.
We can also describe this situation by saying that the ratio of a standard normal random
variable to the square root of an independent chi-squared random variable divided by its
degrees of freedom (see below) follows a t-distribution.
The probability density function is quite complicated and it is unnecessary to reproduce it
here (it is available in introductory texts on mathematical statistics). The mean is 0 and
the variance is (n-1)/(n-3).
7. Chi-squared distribution
The distribution of the sum of the squares of n values, each following the standard normal
distribution, is called the chi-squared distribution with n degrees of freedom. Such a
distribution has mean n and variance 2n. Again it seems unnecessary to reproduce the
probability density function here - it can be readily found in introductory mathematical
statistics texts if needed. The chi-squared distribution is particularly widely used in tests of
goodness-of-fit.
8. F distribution
If u and v are independent chi-squared random variables with n_1 and n_2 degrees of freedom, respectively,
then the ratio

    F = \frac{u / n_1}{v / n_2}

is said to follow an F distribution with n_1 and n_2 degrees of freedom. This is widely used
in tests to compare variances, such as arise in analysis of variance applications.
9. The multivariate normal distribution
This is an extension of the univariate normal distribution to multiple random variables. Let
x = (x_1, \ldots, x_p)' denote a p-component random vector. Then the probability density function
of the multivariate normal distribution has the form

    f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)' \Sigma^{-1} (x - \mu) \right)

where μ is the mean vector of the distribution and Σ is the covariance matrix.
Just as the univariate normal distribution plays a unique role, so does the multivariate normal
distribution. It has the property that its marginal distributions are normal, as also are its
conditional distributions (the joint distribution of a subset of variables, given fixed values of
the others). Note, however, that the converse is not true: just because the p marginals of a
distribution are normal, this does not mean the overall distribution is multivariate normal.
5.7 Further reading
The material in this chapter is covered in more detail in statistics texts - introductory ones,
such as Daly et al (1995), for the more basic material and more advanced texts, such as Cox
and Hinkley (1974) and Lindsey (1996), for a deeper discussion of inferential concepts.
Bayesian methods now have their own books. A comprehensive one is Bernardo and Smith
(1994) and a lighter introduction is Lee (1989). Miller (1980) describes simultaneous test
procedures. Nonparametric methods are described in Randles and Wolfe (1979) and Maritz
(1981).
References:
Bernardo J.M. and Smith A.F.M. (1994) Bayesian Theory. Chichester: Wiley.
Cox D.R. and Hinkley D.V. (1974) Theoretical Statistics. London: Chapman and Hall.
Daly F., Hand D.J., Jones M.C., Lunn A.D., and McConway K. (1995) Elements of Statistics.
Wokingham: Addison-Wesley.
Lee P.M. (1989) Bayesian Statistics: an introduction. London: Edward Arnold.
Lindsey J.K. (1996) Parametric Statistical Inference. Oxford: Clarendon Press.
Maritz J.S. (1981) Distribution-Free Statistical Methods. London: Chapman and Hall.
Randles R.H. and Wolfe D.A. (1979) Introduction to the Theory of Nonparametric Statistics.
New York: Wiley.