CHAPTER 5: PRINCIPLES OF MODEL FITTING

5.1 Introduction

The aim of this chapter is to outline the basic principles of model fitting. In doing so we will introduce some important elementary model forms, leaving more advanced models which are important in data mining to be described in Chapters 8 and 9.

We begin by reminding you of the distinction between a model and a pattern. A model is a high level, global description of a data set. It takes a large sample perspective. It may be descriptive - summarising the data in a database in a convenient and concise way - or it may be inferential, allowing one to make some statement about the population from which the data were drawn or about likely future data values. Examples of models are Box-Jenkins time series models, Bayesian belief networks, and regression models. In contrast, a pattern is a local feature of the data, perhaps holding for only a few records or for a few variables (or both). Patterns are of interest because they represent departures from the general run of the data: a pair of variables which have a particularly high correlation, a set of items which have exceptionally high values on some variables, a group of records which always score the same on some variables, and so on. As with models, one may want to find patterns for descriptive reasons or for inferential reasons: one may want to identify members of the existing database which have unusual properties, or one may want to predict which future records are likely to have unusual properties. Examples of patterns are transient waveforms in an EEG trace, a peculiarity of some types of customers in a database, and outliers. Patterns are the primary focus of Chapter 7.

What precisely constitutes an interesting model or pattern depends entirely on the context. Something striking, unusual, or worth knowing about in one context may be quite run of the mill in another - and vice versa. Although, of course, there is much overlap, methods of model building in data mining mainly have their roots in statistics, while methods of pattern detection mainly have their roots in machine learning. The word 'roots' here is important, since the use of the methods in data mining contexts has introduced differences in emphasis - a key one arising from the large sizes of the data sets typically dealt with in data mining contexts.

A very basic form of model is a single value summarising a set of data values. Take a database in which one of the fields holds household income. If the database has a million records (or, indeed, even a hundred values), we need to summarise these values in some way, so that we can get an intellectual grasp on them. If the values are all identical, then summarising them is easy. But what if they are all different, or, at least, come from a very large set of possible values (as will typically be the case in real life for such a variable) so that there are many different values in the database? What we need to do is replace the list of a million values by one, or perhaps a few, numbers which describe the general shape of the distribution of values. A popular summarising number is the average or mean. This is the single number which is closest to those in the database in the sense that the sum of the squared differences between this number and all the others is minimised. Note what is going on here.
To find our single summarising value we chose a criterion (we used the sum of squared differences between the number which we would use as our descriptive summary and the given data) and then found the number which optimised (here minimised) this criterion. This is the essence of the model-fitting process (although, as we shall see, this basic idea will need to be modified). We decide on the basic form of the model (here a single summarising number), decide on the criterion by which we measure the adequacy of the model (here a sum of squared differences between the single summarising number and the data points), and find that model (here the single summarising number) which optimises the criterion. Even in this simple example other criteria can be used. We could, for example, have minimised the sum of absolute differences between the data and the single summarising number (this would lead to the median of the data). Similarly (although hardly relevant in so simple an example as this one) different algorithms might be used to find that model which optimises the criterion. In general, numbers calculated from the data are called statistics. It is obvious that a single summary statistic is of rather limited value. We can make comparative statements about the mean values of datasets - the mean value of the income this year is greater than it was last year, the mean value of income for this group of households is greater than for that group, and so on - but it leaves a lot of questions unanswered. Certainly, on average this group has a lower income than that group, but what about at the extremes? Is the spread of incomes greater in this group than in that? Do the high earners in this group earn more than in that? What is the most common value of household income? And so on. In view of this, it would be nice to supplement the mean value with some further descriptive statistics. We might choose to describe the variability of the data, and perhaps its skewness. We might also choose to give the smallest and largest values, or perhaps the values of quartiles, or we might choose to give the most common value (the mode), and so on. Technical definitions of these and other summary statistics are given in Section 5.2. Our model is now becoming more sophisticated. In place of the single mean value we have several values which summarise the shape of the distribution of values in the database. Note, however, that we are still simply summarising the values in the database. We are describing the set of values which actually exist in the database in a convenient (and approximate) way. This is sufficient if our interest lies in the items in the database and in no others. In particular, it is of interest if we do not want to make any statements about future values which may be added to the database, or about the values which may arise in future databases (for example, the distribution of household incomes next year). Questions like these new ones are not questions of description, but are questions of inference. We want to infer likely values (or likely mean values or whatever) for data which might arise. We can base our inference on the database we have to hand, but such inference goes beyond mere description. Up until now our model has simply been a description, summarising given data. To use a model for inference requires interpreting it in a rather different way. Now we think of it as approximating an underlying structure from which the data arose. 
When thought of in this way, it is not much of an intellectual leap to imagine further data arising from the same underlying structure. And it is this which allows us to make statements about how likely particular values are to be observed in the future. The process of going from the available data to an approximating description of an underlying structure, rather than simply a summarising description of the available data, is inference. We might also, at this point, draw a further distinction between two different kinds of models used in inference. Sometimes one has a good idea, based on theory, of the nature of the model which might have led to the data. For example, one might have a theory of stellar evolution which one believes should explain the observations on stars, or a theory of pharmacokinetics explaining the observations on drug concentrations. In this case the model is a mechanistic (or realistic, substantive, iconic or phenomenological) one. In contrast, sometimes one has no theoretical basis for the model and simply constructs one based on the relationships observed in the data. These are empirical (or operational) models. In data mining, one will usually be concerned with empirical models. In any case, recall from the discussion in Chapter 2 that, even in the case of mechanistic models, the modern view is that no models are ‘true’. They are simply approximations to reality - the aim of science being to obtain better and better approximations. When making an inferential statement, one can never be totally sure that one is right (if one can be sure then it is hardly inference!). Thus it is often useful to add a measure of the uncertainty surrounding one’s conclusions. In fact, sometimes, rather than giving a single number for an inferred value, one will give an interval estimate - a range of values which one is fairly confident contains the underlying value. Interpretation of such estimates depends on the type of inference one is carrying out. We shall say more about interval estimates in Section 5.3. Inferential models are often more elaborate than descriptive models. This is hardly surprising, since their essential use is to make statements about other data which might be observed. In particular, this means that, where a descriptive model will be happy to stop with simple numerical summaries, inferential models have to say more. This is typically achieved by means of a probability model for the data. Thus one might assume that the data have arisen from a particular form of probability distribution, and that future data will also arise from this distribution. In this case one will use the available data to infer the parameters describing that distribution - subsequently permitting statements to be made about potential new data. Section 5.2 opens by reminding the reader of the meaning of some of the more elementary descriptive statistical tools. These are elementary in the sense that they are basic tools, but also in the sense that they are the elements from which much else is built. Description, however, is but one kind of data mining aim. Another is inference - to draw some conclusion, on the basis of the available data, about other data which might have been collected or which might be collected in the future. Section 5.3 discusses this, in the context of estimation - determining good values for parameters describing distributions. Properties of estimators which make them attractive are discussed, and maximum likelihood and Bayesian estimation techniques are outlined. 
Hypothesis tests have traditionally played a large role in data analysis, but the unique problems of data mining require us to examine them afresh, raising new issues which were not critical when only small data sets were analysed and only limited data exploration was carried out. Hypothesis tests are the focus of Section 5.4. Simple parameter estimation is all very well, but with large data sets and fast computers there comes the possibility of fitting very complex models with many (hundreds or even thousands) of parameters. Although traditional statistics and data analysis have developed an understanding of the problems that occur in these circumstances, data mining is giving more urgency to them. Model fitting is seen not to be simply an issue of finding that structure which provides a good match to the data - as discussed in Section 5.5. Section 5.6 rounds off by providing brief descriptions of some of the more important statistical distributions which underlie model building in data mining. Section 5.7 gives some pointers to further reading.

5.2 Summarising data: some simple examples

We have already described the mean as a simple summary of the 'average' value of a collection of values. As noted above, it is the value which is 'central' in the sense that it minimises the sum of squared differences between it and the data values. It is computed by dividing the sum of the data values by their number (see Display 5.1). Thus, if there are n data values, the mean is the value such that the sum of n copies of it equals the sum of the data values.

The mean is a measure of location. Another important measure of location is the median, which is the value which has an equal number of data points above and below it. (Easy if n is an odd number. When there is an even number it is usually defined as halfway between the two middle values.) A further measure of location is the mode - the most common value of the data. Sometimes distributions have more than one mode - and then they are multimodal. Other measures of location focus on other parts of the distribution of data values. The first quartile is that value such that a quarter of the data points lie below it. The third quartile has three quarters below it. We leave it to you to discover why we have not mentioned the second quartile. Likewise, deciles and percentiles are sometimes used.

Various measures of dispersion or variability are also common. They include the standard deviation and its square, the variance. The variance is defined as the average of the squared differences between the mean and the individual data values. (So we see that, since the mean minimises the sum of these squared differences, there is a close link between the mean and the variance.) The interquartile range is common in some applications, defined as the difference between the third and first quartiles. The range itself is the difference between the largest and smallest data point.

Skewness measures whether or not a distribution has a single long tail. For example, the distribution of people's incomes typically shows the vast mass of people earning small to moderate values, and just a few people earning large sums, tailing off to the very, very few who earn astronomically large sums - the Bill Gateses of the world. A distribution is said to be right-skewed if the long tail is in the direction of increasing values and left-skewed otherwise. Right-skewed distributions are more common. Symmetric distributions have zero skewness.
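To make these definitions concrete, here is a minimal sketch in Python (using numpy, with a small set of invented income values; none of the numbers come from the text). It also illustrates the point made above that the mean minimises a sum of squared differences while the median minimises a sum of absolute differences.

```python
# Invented household incomes (in thousands), for illustration only.
import numpy as np

incomes = np.array([12.0, 15.5, 18.0, 21.0, 23.5, 26.0, 31.0, 44.0, 95.0])

mean = incomes.mean()
median = np.median(incomes)
q1, q3 = np.percentile(incomes, [25, 75])     # first and third quartiles
iqr = q3 - q1                                 # interquartile range
variance = incomes.var()                      # average squared deviation from the mean
std_dev = incomes.std()
data_range = incomes.max() - incomes.min()

dev = incomes - mean
skewness = (dev**3).sum() / ((dev**2).sum()) ** 1.5   # positive: a long right tail

# The mean minimises the sum of squared differences and the median the sum of
# absolute differences; a coarse grid search over candidate summary values shows this.
grid = np.linspace(incomes.min(), incomes.max(), 20001)
sum_sq = ((incomes[:, None] - grid) ** 2).sum(axis=0)
sum_abs = np.abs(incomes[:, None] - grid).sum(axis=0)
print(grid[sum_sq.argmin()], mean)      # essentially equal
print(grid[sum_abs.argmin()], median)   # essentially equal
```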
DISPLAY 5.1
Suppose that $x_1, \ldots, x_n$ are a set of n data values. The mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. The variance is $\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, although, for reasons explained in Display 5.4, it is sometimes calculated as $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. The standard deviation is $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$. A common measure of skewness is $\sum_{i=1}^{n}(x_i - \bar{x})^3 \Big/ \left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}$.

Of course, data summaries do not have to be single numbers, and they do not have to be simple. In Section 1.2 we noted the distinction between clustering and segmentation. Both can be used to provide summaries of a database. In the first case one will be seeking a concise and accurate description of the data, finding naturally occurring groupings in the data to yield this. Thus a cluster analysis of automobiles may reveal that there are luxury vehicles, small family saloons, compact hatchbacks, and so on (depending, of course, on the variables used to describe the vehicles). The result will be a convenient summary of a large mass of data into a handful of simple types. Of course, accuracy is lost by this process. There will doubtless be vehicles which lie on the edges of the divisions. But this is more than compensated for by the reduction in complexity of the description, from the entire list of data to a handful of types. Applying a segmentation strategy to the same database will also yield a summarising description, although now the summary will be in terms which are convenient for some external reasons and may not reflect natural distinctions between the types of vehicle. However, the segmentation will also lead to a concise summary of the data.

In these examples, the models are both summarising the data in the database - they are not inferential, though such models can also be inferential. (Cluster analysis has been widely used in psychiatry in an effort to characterise different kinds of mental illness. Here the aim is clearly not simply to summarise the data which happens to be to hand, but rather it is to make some kind of general statement about the types of disease - a fundamentally inferential statement.) So, in the automobile example, the models are descriptive because they are being used to summarise the available data. However, they are also descriptive in the sense introduced in Chapter 1 - they simply aim to describe the structure implicit in the data, without any intention to be able subsequently to predict values of some variables from others. Other kinds of models may be predictive in this sense, while still being descriptive in the sense that they summarise data and there is no intention of going beyond the summary data. One might, for example, seek to summarise the automobile data in terms of relationships between the variables recorded in those data. Perhaps weight and engine size are highly correlated, so that a good description is achieved by noting only one of these, along with the correlation between them.

5.3 Estimation

The preceding section described some ways of summarising a given set of data. When we are concerned with inference, however, we want to make more general statements, statements about the entire population of values which might have been drawn. That is, we want to make statements about the probability mass function or probability density function (or, equivalently, about the cumulative distribution function) from which the data arose. The probability mass function gives the probabilities that each value of the random variable will occur. The probability density function gives the probability that a value will occur in each infinitesimal interval.
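As a rough illustration (a hedged Python sketch; the binomial and normal choices are arbitrary and not from the text), the following contrasts a probability mass function, which attaches a probability to each possible value, with a probability density function, which yields probabilities only by integration over intervals.

```python
import math
import numpy as np

# pmf of a binomial(10, 0.3) random variable: P(X = k) for each possible k.
n, p = 10, 0.3
pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
print(sum(pmf))  # the probabilities over all possible values sum to 1

# pdf of a standard normal: the probability attached to a small interval around x
# is approximately f(x) dx; probabilities come from integrating the density.
x = np.linspace(-4, 4, 2001)
pdf = np.exp(-x**2 / 2) / math.sqrt(2 * math.pi)
print(np.trapz(pdf, x))  # numerical integral over the range: very close to 1
```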
Informally, one can think of these functions as showing the distribution of values one would have obtained if one took an infinitely large sample. More formal descriptions are given in texts on mathematical statistics. Thus the same principles apply to calculating descriptive measures of probability mass functions and probability density functions as they do to calculating descriptive measures of finite samples (although, in the case of density functions, integrals rather than summations are involved). In particular, we can calculate means, medians, variances, and so on. In this context there is a conventional notation for the mean or, as it is also known, the expected value of a distribution: the expected value of a random variable x is denoted E(x).

5.3.1 Desirable properties of estimators

In the following subsections we describe the two most important modern methods of estimating the parameters of a model. There are also others. Different methods, naturally enough, behave differently, and it is important to be aware of the properties of the different methods so that one can adopt a method suited to the problem. Here we briefly describe some attractive properties of estimators.

Let $\tilde{\theta}$ be an estimator of a parameter $\theta$. Since $\tilde{\theta}$ is a number derived from the data, if we were to draw a different sample of data, we would obtain a different value for $\tilde{\theta}$. $\tilde{\theta}$ is thus a random variable. That means that it has a distribution - with different values in the distribution arising as different samples are drawn. Since it has a distribution of values, we can obtain descriptive summaries of that distribution. It will, for example, have a mean or expected value, $E(\tilde{\theta})$.

The bias of $\tilde{\theta}$ is $E(\tilde{\theta}) - \theta$, the difference between the expected value of the estimator and the true value of the parameter. Estimators for which $E(\tilde{\theta}) - \theta = 0$ are said to be unbiased. Such estimators show no systematic departure from the true parameter value. Just as the bias of an estimator can be used as a measure of its quality, so also can its variance: $\mathrm{Var}(\tilde{\theta}) = E\left[(\tilde{\theta} - E(\tilde{\theta}))^2\right]$. One can choose between estimators which have the same bias by using their variance. Unbiased estimators which have minimum variance are called, unsurprisingly, best unbiased estimators.

The mean squared error of $\tilde{\theta}$ is $E\left[(\tilde{\theta} - \theta)^2\right]$. That is, it is the averaged squared difference between the value of the estimator and the value of the parameter. Mean squared error has a natural decomposition as
$E\left[(\tilde{\theta} - \theta)^2\right] = \left[E(\tilde{\theta}) - \theta\right]^2 + E\left[(\tilde{\theta} - E(\tilde{\theta}))^2\right] = \mathrm{Bias}^2(\tilde{\theta}) + \mathrm{Var}(\tilde{\theta}),$
the sum of the squared bias of $\tilde{\theta}$ and its variance. Mean squared error is an attractive criterion since it includes both systematic (bias) and random (variance) differences between the estimated and true values. Unfortunately, we often find that they work in different directions: modifying an estimator to reduce its bias leads to one with larger variance, and vice versa, and the trick is to arrive at the best compromise. Sometimes this is achieved by deliberately introducing bias into unbiased estimators. An example of this decomposition of mean squared error is given in Display 5.10.

There are more subtle aspects to the use of mean squared error in estimation. For example, it treats equally large departures from $\theta$ as equally serious, regardless of whether they are above or below $\theta$. This is appropriate for measures of location, but perhaps not so appropriate for measures of dispersion, which, by definition, are bounded below by zero, or for estimates of probabilities or probability densities.
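These ideas can be checked by simulation. The following hedged Python sketch (invented parameter values, using numpy) estimates the bias, variance and mean squared error of two estimators of a normal variance - the estimator with divisor n and the usual unbiased estimator with divisor n−1 (see Display 5.4 below) - and confirms numerically that the mean squared error is approximately the squared bias plus the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n, n_sims = 4.0, 10, 100_000          # invented true variance and sample size

samples = rng.normal(0.0, np.sqrt(sigma2), size=(n_sims, n))
est_n = samples.var(axis=1, ddof=0)           # divides by n   (biased)
est_n_minus_1 = samples.var(axis=1, ddof=1)   # divides by n-1 (unbiased)

for name, est in [("1/n", est_n), ("1/(n-1)", est_n_minus_1)]:
    bias = est.mean() - sigma2
    variance = est.var()
    mse = ((est - sigma2) ** 2).mean()
    # mse matches bias**2 + variance, as in the decomposition above
    print(f"{name:8s} bias={bias:+.3f} var={variance:.3f} "
          f"mse={mse:.3f} bias^2+var={bias**2 + variance:.3f}")
```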
Suppose that we have a sequence of estimators, $\tilde{\theta}_1, \tilde{\theta}_2, \ldots$, based on increasing sample sizes. Then the sequence is said to be consistent if the probability that the difference between $\tilde{\theta}_i$ and the true value is greater than any given value tends to 0 as the sample size increases. This is clearly an attractive property (perhaps especially in data mining contexts, with large samples) since it means that the larger the sample the closer the estimator is likely to be to the true value.

5.3.2 Maximum likelihood estimation

Maximum likelihood estimation is the most important method of parameter estimation. Given a set of n observations, $x_1, \ldots, x_n$, independently sampled from the same distribution $f(x|\theta)$ (independently and identically distributed, or iid, as statisticians say), the likelihood function is
$L(\theta | x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i | \theta)$
Recall that this is a function of the unknown parameter $\theta$, and that it shows the probability that the data would have arisen under different values of this parameter. The value for which the data have the highest probability of having arisen is the maximum likelihood estimator (or MLE). This is the value which maximises L. Maximum likelihood estimators are often denoted by putting a caret over the symbol for the parameter - here $\hat{\theta}$.

DISPLAY 5.2
Suppose we have assumed that our sample of n data points has arisen independently from a normal distribution with unit variance and unknown mean $\theta$. (We agree that this may appear to be an unlikely situation, but please bear with us for the sake of keeping the example simple!) Then the likelihood function for $\theta$ is
$L(\theta | x_1, \ldots, x_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x_i - \theta)^2}{2}\right) = (2\pi)^{-n/2} \exp\left(-\frac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^2\right)$
Setting the derivative $\frac{d}{d\theta} L(\theta | x_1, \ldots, x_n)$ to 0 yields $\sum_{i}(x_i - \theta) = 0$ and hence $\hat{\theta} = \frac{1}{n}\sum_{i} x_i$, the sample mean.

Maximum likelihood estimators are intuitively and mathematically attractive - for example, they are consistent. Moreover, if $\hat{\theta}$ is the MLE of a parameter $\theta$, then $g(\hat{\theta})$ is the MLE of the function $g(\theta)$. (Some care needs to be exercised if g is not a one-to-one function.) On the other hand - nothing is perfect - maximum likelihood estimators are often biased - see Display 5.4.

For simple problems (where 'simple' refers to the mathematical structure of the problem, and not to the number of data points, which can be large) MLEs can be found using differential calculus. See, for example, Display 5.2. In practice, the log of the likelihood is usually maximised (e.g. Display 5.3) since this replaces the awkward product in the definition by a sum; it leads to the same result as maximising L directly because log is a monotonic transformation. For more complicated problems, especially those involving multiple parameters, mathematical search methods such as steepest descent, genetic algorithms, or simulated annealing must be used. Multiple maxima can be an especial problem (which is precisely why stochastic optimisation methods are often necessary), as can situations where optima occur at the boundaries of the parameter space.

DISPLAY 5.3
Customers in a supermarket either purchase or do not purchase milk. Suppose we want an estimate of the proportion purchasing milk, based on a sample of 1000 customers randomly drawn from the database. A simple model here would be that the observations independently follow a Bernoulli distribution with some unknown parameter p. The likelihood is then
$L(p) = \prod_{i} p^{x_i} (1 - p)^{1 - x_i} = p^{r} (1 - p)^{1000 - r}$
where $x_i$ takes the value 1 if the ith customer in the sample does purchase milk and 0 if he or she does not, and r is the number amongst the 1000 who do purchase milk.
Taking logs of this yields $\log L(p) = r \log p + (1000 - r)\log(1 - p)$, which is straightforward to differentiate.

DISPLAY 5.4
The maximum likelihood estimator of the variance $\sigma^2$ of a normal distribution $N(\mu, \sigma^2)$ is $\frac{1}{n}\sum_{i}(x_i - \bar{x})^2$. However,
$E\left[\frac{1}{n}\sum_{i}(x_i - \bar{x})^2\right] = \frac{n-1}{n}\sigma^2$
so that the estimator is biased. It is common to use the unbiased estimator $\frac{1}{n-1}\sum_{i}(x_i - \bar{x})^2$.

DISPLAY 5.5
Simple linear regression is widely used in data mining. This is discussed in detail in Chapter 9. In its most simple form it relates two variables: x, a predictor or explanatory variable, and y, a response variable. The relationship is assumed to take the form $y = a + bx + \varepsilon$, where a and b are parameters and $\varepsilon$ is a random variable assumed to come from a normal distribution with mean 0 and variance $\sigma^2$. The likelihood function for such a model is
$L = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{(y_i - a - bx_i)^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - a - bx_i)^2\right)$
To find the maximum likelihood estimators of a and b we can take logs, and discard terms which do not involve a or b. This yields
$l = -\sum_{i=1}^{n}(y_i - a - bx_i)^2$
That is, one can estimate a and b by finding those values which minimise the sum of squared differences between the predicted values $(a + bx_i)$ and the observed values $y_i$. Such a procedure - minimising a sum of squares - is ubiquitous in data mining, and goes under the name of the least squares method.

The sum of squares criterion is of great historical importance. We have already seen how the mean results from minimising such a criterion, and it has roots going back to Gauss and beyond. Derived as above, however, any concern one might feel about the apparent arbitrariness of the choice of a sum of squares (why not a sum of absolute values, for example?) vanishes (or at least is shifted) - it arises naturally from the choice of a normal distribution for the error term in the model.

Up to this point we have been discussing point estimates, single number estimates of the parameter in question. Such estimates are 'best' in some sense, but they convey no sense of the uncertainty associated with them - perhaps the estimate was merely the best of a large number of almost equally good estimates, or perhaps it was clearly the best. Interval estimates provide this sort of information. In place of a single number, they give an estimate of an interval which one can be confident, to some specified degree, contains the unknown parameter. Such an interval is a confidence interval, and the upper and lower limits of the interval are called, naturally enough, confidence limits.

Interpretation of confidence intervals is rather subtle. Since, here, we are assuming that $\theta$ is unknown but fixed, it does not make sense to say that $\theta$ has a certain probability of lying within a given interval: it either does or it does not. However, it does make sense to say that an interval calculated by the given procedure contains $\theta$ with a certain probability: the interval, after all, is calculated from the sample and is thus a random variable.

As an example (deliberately artificial to keep the explanation simple - in more complicated and realistic situations the computation will be handled by your computer) suppose the data consist of 100 independent observations from a normal distribution with unknown mean $\theta$ but known variance $\sigma^2$, and we want a 95% confidence interval for $\theta$. (This situation is particularly unrealistic since, if the variance were known, the mean would almost certainly also be known.) That is, we want to find a lower limit L(x) and an upper limit U(x) such that $P(L(x) \le \theta \le U(x)) = 0.95$. The distribution of the sample mean $\bar{x}$ in this situation is known to follow a normal distribution with mean $\theta$ and variance $\sigma^2/100$, and hence standard deviation $\sigma/10$. We also know, from the properties of the normal distribution (see Section 5.6), that 95% of the probability lies within 1.96 standard deviations of the mean. Hence
$P(\theta - 1.96\sigma/10 \le \bar{x} \le \theta + 1.96\sigma/10) = 0.95.$
This can be rewritten as
$P(\bar{x} - 1.96\sigma/10 \le \theta \le \bar{x} + 1.96\sigma/10) = 0.95.$
Thus $L(x) = \bar{x} - 1.96\sigma/10$ and $U(x) = \bar{x} + 1.96\sigma/10$ define a suitable 95% confidence interval.
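The frequentist interpretation just described can be checked by simulation. The following is a minimal Python sketch (invented values of $\theta$ and $\sigma$, using numpy): the interval $\bar{x} \pm 1.96\sigma/10$ is recomputed for many independent samples of size 100, and the proportion of intervals containing the true mean is close to 0.95.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n, n_sims = 5.0, 2.0, 100, 50_000   # invented truth, known sigma

samples = rng.normal(theta, sigma, size=(n_sims, n))
xbar = samples.mean(axis=1)
half_width = 1.96 * sigma / np.sqrt(n)            # here sqrt(n) = 10
lower, upper = xbar - half_width, xbar + half_width

# It is the interval that is random: the procedure traps theta about 95% of the time.
coverage = np.mean((lower <= theta) & (theta <= upper))
print(coverage)
```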
The same principle can be followed to derive confidence intervals in situations involving other distributions: we find a suitable interval for the statistic and invert it to find an interval for the unknown parameter. As it happens, however, the above example has more widespread applicability than might at first appear. The central limit theorem (see Section 5.6) tells us that the distribution of many statistics can be well approximated by a normal distribution, especially if the sample size is large, as in typical data mining applications. Thus one often sees confidence intervals calculated by the method above, even though the statistic involved is much more complicated than the mean.

5.3.3 Bayesian estimation

We briefly described the Bayesian perspective on inference in Chapter 2. Whereas in the frequentist perspective a parameter $\theta$ is regarded as a fixed but unknown quantity, Bayesians regard $\theta$ as having a distribution of possible values. Before the data are analysed, the distribution of probabilities that $\theta$ will take different values is known as the prior distribution, $p(\theta)$ - this is the prior belief the investigator has that the different values of the parameter are the 'true' value. Analysis of the data leads to modification of this distribution, to yield the posterior distribution, $p(\theta|x)$. This represents an updating of the prior distribution, taking into account the information in the empirical data. The modification from prior to posterior is carried out by means of a fundamental theorem named after Thomas Bayes - hence the name. Bayes' theorem states
$p(\theta|x) = \frac{p(x|\theta)\,p(\theta)}{\int p(x|\theta)\,p(\theta)\,d\theta}$    (5.1)
Note that this updating procedure has led to a distribution, and not a single value, for $\theta$. However, the distribution can be used to yield a single value estimate. One could, for example, take the mean of the posterior distribution, or its mode. For a given set of data, the denominator in equation (5.1) is a constant, so we can alternatively write the expression as
$p(\theta|x) \propto p(x|\theta)\,p(\theta)$    (5.2)
In words, this says that the posterior distribution of $\theta$ given x (that is, the distribution conditional on having observed the data x) is proportional to the product of the prior $p(\theta)$ and the likelihood $p(x|\theta)$. Note that the structure of (5.1) and (5.2) means that the distribution can be updated sequentially if the data arise independently, $p(\theta|x, y) \propto p(y|\theta)\,p(x|\theta)\,p(\theta)$, a property which is attractive in data mining applications, where very large data sets may not fit into the main memory of the computer at any one time. Note also that this result is independent of the order of the data (provided, of course, that x and y are independent).

The denominator in (5.1), $p(x) = \int p(x|\theta)\,p(\theta)\,d\theta$, is called the predictive distribution of x, and represents our predictions about the value of x. It includes our uncertainty about $\theta$, via the prior $p(\theta)$, and our uncertainty about x when $\theta$ is known, via $p(x|\theta)$. The predictive distribution will change as new data are observed and $p(\theta)$ becomes updated. The predictive distribution can be useful for model checking: if observed data x have only a small probability according to the predictive distribution, then one may wonder if that distribution is correct.
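As a small illustration of the updating rule (5.2) and of the sequential property noted above, the following hedged Python sketch (invented prior and data, with the parameter handled on a discrete grid purely for convenience) updates a prior on two observations in either order and obtains the same posterior.

```python
import numpy as np

theta = np.linspace(-5, 5, 2001)                 # grid of parameter values
prior = np.exp(-theta**2 / (2 * 2.0**2))         # an (unnormalised) N(0, 4) prior
prior /= np.trapz(prior, theta)

def likelihood(datum, theta, sigma=1.0):
    """p(datum | theta) for a normal observation with known sigma."""
    return np.exp(-(datum - theta) ** 2 / (2 * sigma**2))

def update(current, datum):
    """Posterior proportional to likelihood times prior, renormalised on the grid."""
    post = likelihood(datum, theta) * current
    return post / np.trapz(post, theta)

x, y = 1.3, 2.1                                  # two invented observations
post_xy = update(update(prior, x), y)            # update on x first, then y
post_yx = update(update(prior, y), x)            # update on y first, then x
print(np.allclose(post_xy, post_yx))             # True: the order does not matter
```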
DISPLAY 5.6
Suppose that we believe a single data point x comes from a normal distribution with unknown mean $\theta$ and known variance $\sigma^2$. That is, $x \sim N(\theta, \sigma^2)$. Suppose our prior distribution for $\theta$ is $\theta \sim N(\theta_0, \sigma_0^2)$, with $\theta_0$ and $\sigma_0^2$ known. Then
$p(\theta|x) \propto p(x|\theta)\,p(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \theta)^2}{2\sigma^2}\right) \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left(-\frac{(\theta - \theta_0)^2}{2\sigma_0^2}\right) \propto \exp\left(-\frac{1}{2}\left[\frac{(x - \theta)^2}{\sigma^2} + \frac{(\theta - \theta_0)^2}{\sigma_0^2}\right]\right)$
The mathematics here looks horribly complicated (a fairly common occurrence with Bayesian methods) but consider the following reparameterisation. Let
$\frac{1}{\sigma_1^2} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma^2}$ and $\theta_1 = \sigma_1^2 \left(\frac{\theta_0}{\sigma_0^2} + \frac{x}{\sigma^2}\right)$
Then
$p(\theta|x) \propto \exp\left(-\frac{(\theta - \theta_1)^2}{2\sigma_1^2}\right)$
Since this is a probability density function for $\theta$, it must integrate to unity. Hence
$p(\theta|x) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left(-\frac{(\theta - \theta_1)^2}{2\sigma_1^2}\right)$
This is a normal distribution, $N(\theta_1, \sigma_1^2)$.

This means that the normal prior distribution has been updated to yield a normal posterior distribution. In particular, this means that the complicated mathematics can be avoided. Given a normal prior for the mean, and data arising from a normal distribution as above, we can obtain the posterior merely by computing the updated parameters. Moreover the updating of the parameters is not as messy as it might at first seem. Reciprocals of variances are called precisions. Here $1/\sigma_1^2$, the precision of the updated distribution, is simply the sum of the precisions of the prior and the data distributions. This is perfectly reasonable: adding data to the prior should decrease the variance, or increase the precision. Likewise, the updated mean, $\theta_1$, is simply a weighted sum of the prior mean and the datum x, with weights which depend on the precisions of those two values. When there are n data points, in the same situation as above, the posterior is again normal, now with updated parameter values
$\frac{1}{\sigma_1^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}$ and $\theta_1 = \sigma_1^2 \left(\frac{\theta_0}{\sigma_0^2} + \frac{n\bar{x}}{\sigma^2}\right)$

We should say something about the choice of prior distribution. The prior distribution represents one's initial belief that the parameter takes different values. The more confident one is that it takes particular values, the more closely the prior will be bunched about those values. The less confident one is, the larger will be the dispersion of the prior. In the case of a normal mean, if one had no idea of the true value, one would want to take a prior which gave equal probability to each possible value. That is, one would want to adopt a prior which was perfectly flat or which had infinite variance. This would not correspond to any proper density function (which has to have some non-zero values and which has to integrate to unity). Despite this, it is sometimes useful to adopt such improper priors, which are uniform throughout the space of the parameter. One can think of such priors as being essentially flat in all regions where the parameter might conceivably occur. Even so, there still remains the difficulty that priors which are uniform for a particular parameter are not uniform for a nonlinear transformation of that parameter.
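Returning to Display 5.6, the updating formulas are simple enough to verify directly. The following is a minimal Python sketch (invented prior parameters and data): the posterior precision is the sum of the prior precision and the data precision, and the posterior mean is the corresponding precision-weighted combination of the prior mean and the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, sigma0 = 0.0, 3.0                  # prior mean and standard deviation (invented)
sigma = 1.5                                # known data standard deviation (invented)
data = rng.normal(2.0, sigma, size=25)
n, xbar = data.size, data.mean()

prior_precision = 1.0 / sigma0**2
data_precision = n / sigma**2
posterior_precision = prior_precision + data_precision          # precisions add
sigma1 = np.sqrt(1.0 / posterior_precision)
theta1 = (prior_precision * theta0 + data_precision * xbar) / posterior_precision

# The posterior mean is pulled from theta0 towards xbar, and the posterior
# standard deviation sigma1 is smaller than the prior's sigma0.
print(theta1, sigma1)
```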
Another issue, which might be seen as a difficulty or a strength of Bayesian inference, is that priors show the prior belief an individual has in the various possible values of a parameter - and individuals differ. It is entirely possible that your prior will differ from mine. This means that we will probably obtain different results from an analysis. In some circumstances this is fine. But in others it is less so. One way to overcome this, which is sometimes applicable, is to use a so-called reference prior, a prior which is agreed by convention. A common form of reference prior is the Jeffreys prior. To define this we need to define the Fisher information. This is defined as
$I(\theta|x) = -E\left[\frac{\partial^2 \log L(\theta|x)}{\partial \theta^2}\right]$
That is, the negative of the expectation of the second derivative of the log-likelihood. Essentially this measures the curvature of the likelihood function - or its flatness. And the flatter a likelihood function is, the less information it is providing about the parameter values. The Jeffreys prior is then defined as
$p(\theta) \propto \sqrt{I(\theta|x)}$
The reason that this is a convenient reference prior is that, if $\phi$ is some function of $\theta$, then $\phi$ has prior proportional to $\sqrt{I(\phi|x)}$. This means that a consistent prior will result however the parameter is transformed.

The example presented in Display 5.6 began with a normal prior and ended with a normal posterior. Conjugate families of distributions satisfy this property in general: the prior distribution and the posterior distribution both belong to the same family. The advantage of using conjugate families is that the complicated updating process can be replaced by a simple updating of the parameters.

We have already remarked that it is straightforward to obtain single point estimates from the posterior distribution. Interval estimates are also easy to obtain - integration of the posterior distribution over a region will give the estimated probability that the parameter lies in that region. When a single parameter is involved, and the region is an interval, the result is a credibility interval. A natural such interval would be the interval containing a given probability (say 90%) and such that the posterior density is highest over the interval. This gives the shortest possible credibility interval. Given that one is prepared to accept the fundamental Bayesian notion that the parameter is a random variable, the interpretation of such intervals is much more straightforward than the interpretation of frequentist confidence intervals.

5.4 Hypothesis tests

In many situations one wishes to see if the data support some idea about the value of a parameter. For example, we might want to know if a new treatment has an effect greater than that of the standard treatment, or if the regression slope in a population is different from zero. Since we will often be unable to measure these things for the entire population, we must base our conclusions on a sample. Statistical tools for exploring such hypotheses are called hypothesis tests.

The basic principle of such tests is as follows. We begin by defining two complementary hypotheses: the null hypothesis and the alternative hypothesis. Often the null hypothesis is some point value (e.g. that the effect in question has value zero - that there is no treatment difference or regression slope) and often the alternative hypothesis is simply the complement of the null hypothesis. Suppose, for example, that we are concerned with drawing conclusions about a parameter $\theta$. Then we will denote the null hypothesis by $H_0: \theta = \theta_0$, and the alternative hypothesis by $H_1: \theta \ne \theta_0$. Using the observed data, we calculate a statistic (what form of statistic is best depends on the nature of the hypothesis being tested; examples are given below). If we assume that the null hypothesis is correct, then such a statistic would have a distribution (it would vary from sample to sample), and the observed value would be one point from that distribution.
If the observed value was way out in the tail of the distribution, we would have to conclude either that an unlikely event had occurred (that is, that an unusual sample had been drawn, since it would be unlikely we would obtain such an extreme value if the null hypothesis were true), or that our assumption was wrong and that the null hypothesis was not, in fact, true. The more extreme the observed value, the less confidence we would have in the null hypothesis.

We can put numbers on this procedure. If we take the top tail of the distribution of the statistic (the distribution based on the assumption that the null hypothesis is true) then we can find those potential values which, taken together, have a probability of 0.05 of occurring. Then if the observed value did lie in this region, we could reject the null hypothesis 'at the 5% level', meaning that only 5% of the time would we expect to see a result in this region if the null hypothesis was correct. For obvious reasons, this region is called the rejection region. Of course, we might not merely be interested in deviations from the null hypothesis in one direction. That is, we might be interested in the lower tail, as well as the upper tail of the distribution. In this case we might define the rejection region as the union of the test statistic values in the lowest 2.5% of the probability distribution of the test statistic and the test statistic values in the uppermost 2.5% of the probability distribution. This would be a two-tailed test, whereas the former was a one-tailed test. The size of the rejection region can, of course, be chosen at will, and this size is known as the significance level of the test. Common values are 1%, 5%, and 10%.

We can compare different test procedures in terms of their power. The power of a test is the probability that it will correctly reject a false null hypothesis. To evaluate the power of a test, we need some specific alternative hypothesis. This will enable us to calculate the probability that the test statistic will fall in the rejection region if the alternative hypothesis is true.

The fundamental question now is, how do we find a good test statistic for a particular problem? One strategy is to use the likelihood ratio. The likelihood ratio test statistic to test the hypothesis $H_0: \theta = \theta_0$ against the alternative $H_1: \theta \ne \theta_0$ is defined as
$\lambda(x) = \frac{L(\theta_0 | x)}{\sup_{\theta} L(\theta | x)}$
That is, the ratio of the likelihood when $\theta = \theta_0$ to the largest value of the likelihood when $\theta$ is unconstrained. Clearly the null hypothesis should be rejected when $\lambda(x)$ is small. This can easily be generalised to situations when the null hypothesis is not a point hypothesis but includes a set of possible values for $\theta$.

DISPLAY 5.7
Suppose we have a sample of n points independently drawn from a normal distribution with unknown mean and unit variance, and that we wish to test the hypothesis that the mean has value $\theta_0$. The likelihood under this (null hypothesis) assumption is
$L(\theta_0 | x) = \prod_{i=1}^{n} p(x_i | \theta_0) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x_i - \theta_0)^2}{2}\right)$
The maximum likelihood estimator of the mean of a normal distribution is the sample mean, so that the unconstrained maximum likelihood is
$L(\hat{\theta} | x) = \prod_{i=1}^{n} p(x_i | \hat{\theta}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x_i - \bar{x})^2}{2}\right)$
The ratio of these simplifies to
$\lambda(x) = \exp\left(-\frac{n(\bar{x} - \theta_0)^2}{2}\right)$
Our rejection region is thus $\{x \mid \lambda(x) \le c\}$ for a suitably chosen value of c. This can be rewritten as $(\bar{x} - \theta_0)^2 \ge -\frac{2}{n}\ln c$. Thus the test statistic $|\bar{x} - \theta_0|$ has to be compared with a constant.
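A small numerical check of Display 5.7 may help. In the hedged Python sketch below (invented data from a unit-variance normal), the likelihood ratio computed directly from the two likelihoods agrees with the simplified form $\exp(-n(\bar{x}-\theta_0)^2/2)$, and the decision based on $\lambda(x) \le c$ coincides with the decision based on comparing $(\bar{x}-\theta_0)^2$ with a constant. The cut-off c is arbitrary, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
theta0 = 0.0
x = rng.normal(0.4, 1.0, size=30)          # invented unit-variance normal sample
n, xbar = x.size, x.mean()

def log_lik(theta):
    # log-likelihood of a N(theta, 1) sample, constants included
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((x - theta) ** 2)

lam_direct = np.exp(log_lik(theta0) - log_lik(xbar))   # sup over theta is attained at xbar
lam_formula = np.exp(-n * (xbar - theta0) ** 2 / 2)
print(np.isclose(lam_direct, lam_formula))             # True

c = 0.15                                               # arbitrary illustrative cut-off
print(lam_formula <= c, (xbar - theta0) ** 2 >= -2 * np.log(c) / n)  # same decision
```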
Certain situations arise particularly frequently in testing. These include tests of differences between means, tests to compare variances, and tests to compare an observed distribution with a hypothesised distribution (so-called goodness-of-fit tests). Display 5.8 outlines the common t-test of a difference between the means of two independent groups. Descriptions of other tests can be found in introductory statistics texts.

DISPLAY 5.8
Let $x_1, \ldots, x_n$ be a sample of n observations randomly drawn from a normal distribution $N(\mu_x, \sigma^2)$, and let $y_1, \ldots, y_m$ be an independent sample of m observations randomly drawn from a normal distribution $N(\mu_y, \sigma^2)$. Suppose we wish to test the hypothesis that the means are equal: $H_0: \mu_x = \mu_y$. The likelihood ratio statistic under these circumstances reduces to
$t = \frac{\bar{x} - \bar{y}}{s\sqrt{\frac{1}{n} + \frac{1}{m}}}$
with
$s^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n + m - 2}$,
where $s_x^2 = \frac{1}{n-1}\sum_{i}(x_i - \bar{x})^2$ is the estimated variance for the x sample, with a similar expression for $s_y^2$. $s^2$ here is thus seen to be simply a weighted sum of the sample variances of the two samples, and the test statistic is merely the difference between the two sample means adjusted by the estimated standard deviation of that difference. Under the null hypothesis t follows a t distribution with n+m−2 degrees of freedom (see Section 5.6). Although normality of the two populations being compared here is assumed, this test is fairly robust to departures from normality, especially if the sample sizes are roughly equal and the variances are roughly equal. This test is very widely used.

The hypothesis testing strategy outlined above is based on the fact that a random sample has been drawn from some distribution, and the aim is to make some probability statement about a parameter of that distribution. The ultimate objective is to make an inference from the sample to the general underlying population of potential values. For obvious reasons, this is sometimes described as the sampling paradigm. An alternative strategy is sometimes appropriate, especially when one is not confident that the sample has been obtained by probability sampling (see Chapter 2). Now inference to the underlying population is not possible. However, one can still sometimes make a probability statement about how likely some effect is under a null hypothesis. Consider, for example, a comparison of a treatment group and a control group. We might adopt, as our null hypothesis, that there is no treatment effect, so that the distribution of scores of people who received the treatment should be the same as that of those who did not. If we take a (possibly not randomly drawn) sample of people and randomly assign them to the treatment and control groups, then we would expect the difference in (say) mean scores between the groups to be small. Indeed, under fairly general assumptions, it is not difficult to work out the distribution of the difference between the sample means of the two groups one would expect if there were no treatment effect, and any such difference arose simply as a consequence of an imbalance in the random allocation. One can then explore how unlikely it is that a difference as large as or larger than that actually obtained would be seen - in much the same way as before. Tests based on this principle are termed randomisation tests or permutation tests. Note that they make no statistical inference from the sample to the overall population, but they do permit one to make a conditional probability statement about the treatment effect, conditional on the observed values.
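A minimal Python sketch of such a randomisation test is given below (invented treatment and control scores; the number of random re-allocations is arbitrary). For comparison it also computes the pooled two-sample t statistic of Display 5.8.

```python
import numpy as np

rng = np.random.default_rng(4)
treatment = rng.normal(10.8, 2.0, size=40)     # invented scores
control = rng.normal(10.0, 2.0, size=45)

def t_statistic(x, y):
    # pooled two-sample t statistic, as in Display 5.8
    n, m = x.size, y.size
    s2 = ((n - 1) * x.var(ddof=1) + (m - 1) * y.var(ddof=1)) / (n + m - 2)
    return (x.mean() - y.mean()) / np.sqrt(s2 * (1 / n + 1 / m))

observed = treatment.mean() - control.mean()
pooled = np.concatenate([treatment, control])
n_treat, n_perms = treatment.size, 10_000

null_diffs = np.empty(n_perms)
for i in range(n_perms):
    rng.shuffle(pooled)                        # random re-allocation to the two groups
    null_diffs[i] = pooled[:n_treat].mean() - pooled[n_treat:].mean()

# two-sided p-value: how often a difference at least this large arises by chance alone
p_value = np.mean(np.abs(null_diffs) >= abs(observed))
print(observed, t_statistic(treatment, control), p_value)
```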
Many statistical tests make certain assumptions about the forms of the population distributions from which the samples are drawn. The two-sample t-test in Display 5.8 illustrates this; there an assumption of normality was made. Often, however, it is inconvenient to make such assumptions. Perhaps one has little idea or justification for the assumption, or perhaps the data are known not to follow the form required by a standard test. In such circumstances one can adopt distribution free tests. Tests based on ranks fall into this class. Here the basic data are replaced by the numerical label of the position in which they occurred. For example, to explore whether two samples had arisen from the same distribution, one could replace the actual numerical values by their ranks. If they had arisen from the same distributions, then one would expect that the ranks of the members of the two samples would be well mixed. If, however, one distribution had a larger mean than the other, one would expect one sample to tend to have large ranks and the other small ranks. If the distributions had the same means, but one sample had a larger variance than the other, then one would expect one sample to show a surfeit of large and small ranks, and the other to dominate the intermediate ranks. Test statistics can be constructed based on the average values or some other combinations of the ranks, and their significance level can be evaluated using randomisation arguments. Such test statistics include the sign test statistic, the rank sum test statistic, the Kolmogorov-Smirnov test statistic, and the Wilcoxon test statistic. Sometimes the term nonparametric test is also used to describe such tests - the rationale being that such tests are not testing the value of some parameter of any assumed distribution. This section has described the classical (frequentist) approach to statistical hypothesis testing. In data mining, however, things can become more complicated. Firstly, because data mining involves large data sets, one should expect to obtain statistical significance: even slight departures from the hypothesised model form will show up as real (unlikely to be merely chance fluctuations). This will be true even though the departures from the assumed model form may be of no practical significance. (If they are of practical significance, of course, then well and good.) Worse, the slight departures from the hypothesised model will show up as significant, even though these departures are due to contamination or data distortion. We have already remarked on the inevitability of this. Secondly, sequential model fitting processes are common. In Section 5.5 and Chapter 6 we describe stepwise model fitting procedures, which gradually refine a model by adding or deleting terms. Separate tests on each model, as if it was de novo, lead to incorrect probabilities. Formal sequential testing procedures have been developed, but they can be quite complex. Moreover, they may be weak - because of the multiple testing going on. This is discussed below. Thirdly, data mining is an essentially exploratory process. This has various implications. One is that many models will be examined. Suppose one tests m true (though we will not know this) null hypotheses, each at, say, the 5% level. Since they are each true, this means that, for each hypothesis separately, there is a probability of 0.05 of incorrectly rejecting the hypothesis. 
Since they are independent, this means that the probability of incorrectly rejecting at least one is $p = 1 - (1 - 0.05)^m$. When m = 1 we have p = 0.05, which is fine. But when m = 10 we obtain p = 0.4013 and when m = 100 we obtain p = 0.9941. That is, if we test as few as 100 true null hypotheses, we are almost certain to incorrectly reject at least one. Alternatively, one could control the overall family error rate, setting p = 0.05, so that the probability of incorrectly rejecting one or more of the m true null hypotheses is 0.05. In this case we use the level $\alpha = 1 - (1 - 0.05)^{1/m}$ for each given m to obtain the level at which each of the separate true null hypotheses is being tested. With m = 10 we obtain $\alpha = 0.0051$ and with m = 100 we obtain $\alpha = 0.0005$. This means that we have only a very small probability of incorrectly rejecting any of the separate component hypotheses.

Of course, things are much more complicated in practice: the hypotheses are unlikely to be completely independent (at the other extreme, if they are completely dependent, then accepting or rejecting one implies the acceptance or rejection of all), with an unknown and essentially unknowable dependence structure, and there will typically be a mixture of true (or approximately true) and false null hypotheses. Various simultaneous test procedures have been developed in attempts to ease these difficulties, but the problem is not really one of inadequate methods; it is rather more fundamental.

A basic approach is based on the Bonferroni inequality. We can expand the probability $(1 - \alpha)^m$ that none of the m true null hypotheses is rejected to yield $(1 - \alpha)^m \ge 1 - m\alpha$. It follows that $1 - (1 - \alpha)^m \le m\alpha$. That is, the probability that one or more true null hypotheses is incorrectly rejected is less than or equal to $m\alpha$. In general, the probability of incorrectly rejecting one or more of the true null hypotheses is less than or equal to the sum of the probabilities of incorrectly rejecting each of them. This is a first order Bonferroni inequality. By including other terms in the expansion, more accurate bounds can also be developed - though they require knowledge of the dependence relationships between the hypotheses.

With some test procedures difficulties can arise in which a global test of a family of hypotheses rejects the null hypothesis (so one believes at least one component to be false) but no single component is rejected. Once again strategies have been developed for overcoming this in particular applications. For example, in multivariate analysis of variance, which is concerned with comparing several groups of objects which have been measured on multiple variables, test procedures which overcome these problems, based on comparing each test statistic with a single threshold value, have been developed. See Section 5.7 for details of references to such material.

It will be obvious from the above that attempts to put probabilities on statements of various kinds, via hypothesis tests, while they do have a place in data mining, are not a universal panacea. However, they can be regarded as a particular type of a more general procedure which maps the data and a statement to a numerical value or score. Higher (or lower, depending upon the procedure) scores indicate that one statement, or model, is to be preferred to another, without attempting any absolute probabilistic interpretation. The penalised goodness-of-fit criteria described in Section 5.5 can be thought of as scoring methods.

5.5 Fitting complex models

The previous section dealt with estimating single parameters, or, at least, simple models.
In this section we consider more elaborate models, and we shall see that new kinds of problems arise. In Section 5.1 we spoke of inferring an underlying structure from the available data. In this we are regarding the available data as being a sample drawn from some distribution which is defined in terms of the underlying structure. We can then think of the set of all possible data points which could have been chosen as the population of values, as described in Chapter 2. Sometimes this population is finite (as in the population of heights of people who work for a particular corporation) but often it is infinite - as in the amount people might spend on a Friday night in a particular supermarket. Note that here the range of possible values is finite, as is the number of people who will make a purchase (if unknown) but, no matter how long a list of possible values one makes, one cannot state what the value of the next transaction will be. In what follows we shall again assume that the available data have been drawn from the population by a random sampling process. If this is not the case, and if some of the distortions outlined in Chapter 2 have occurred, then the following needs to be modified. Note that in any case in data mining contexts one should be constantly on the alert for possible sampling distortions. With large datasets such distortions are very likely, and they can totally invalidate the results of an analysis. In some contexts the data available on which to construct the model is called the design sample or training sample, and we shall sometimes use these terms. In the inferential modelling process the unknown distribution of the (potential) values in the population is the model whose structure we are seeking to infer. Following the outline in Section 5.1, an obvious strategy to follow to estimate this distribution is to define some measure which can serve as a criterion of goodness-of-fit between the model and the observed data, and then seek to minimise (or maximise, depending upon how it is expressed) this. However, although obvious, this is not an ideal solution. The available data is, after all, only a sample from the population of interest. Another sample, drawn by the same random process, would (almost certainly) have been different from the available sample. This means that, if one builds a model which very accurately predicts the available data, then it is likely that this model will not be able to predict another sample very well. This would mean that one had not modelled the underlying distributions very well. Clearly something beyond mere goodness-of-fit to the available data is required. We shall examine this in more detail via an example. Suppose our aim is to construct a model which will allow us to predict the value of a variable y from the value of a variable x. That is, we want to use the available data to infer the underlying model of the relationship between x and y, and then subsequently use this underlying model to predict values of y from values of x: presented with a case with a known (or measured) value of the variable x, our model should allow us to say what its y value is likely to be. (Models with this basic structure are very important, and will be discussed in detail in Chapter 9.) Perhaps the first thing to note is that we do not expect to be able to obtain perfect predictions. 
The values of x and y will vary from sample to sample (partly due to the random nature of the sample selection process and perhaps also partly due to intrinsic measurement error), so we must expect that our predictive model will also vary from sample to sample. Of course, we hope that such variation will be slight (if not then one must question the value of the model-building process in this case). Moreover, there may be other, unmeasured influences on y, beyond those included in x, so that perfect prediction of y from x is not possible, even in principle. However, we do believe that there is some relationship between y and x, some predictive power for y in x, and it is this we hope to capture. In a sense we are being pragmatic: we cannot obtain perfection but we can obtain something useful. Just for the purposes of this initial example (and this is most definitely not a real requirement in practice) we will suppose that each value of x in the design sample is associated with a different value of y. Figure 2 shows a plot of the sort of design set we have in mind. We could now produce a model which permits perfect prediction of the value of y from the value of x for the points in the design set. Such a model is shown in Figure 3. It is clear that this predictive model is quite complicated. It is also clear that other predictive models could also be built which would give perfect prediction of the design set - other wobbly lines which went through every point but which would lead to different predictions for other values of x. An obvious question then is how we should choose between these models. Moreover, the models we have proposed yield perfect predictions for the design set elements and we have already commented that we would not expect to achieve this. If we do achieve it, we are modelling not only the underlying structure (sometimes called the systematic variation) which led to the data, but also the peculiarities specific to the design sample which we happen to have (sometimes called the random variation). This means that, if we were to observe another data point with an x value equal to one of those in the design set, it would probably have a y value not equal to the corresponding design set y value, and hence it would be incorrectly predicted by our model. This suggests that we need a simpler model, one which models the underlying structure but not the extra variation in the observed y values arising from other sources. Models which go too far, as in these examples, are said to overfit the data. Overfitting does not only arise when a model fits the design data perfectly, but arises whenever it goes beyond the systematic part and begins to fit the random part. [FIGURE 2: This figure to show a scatterplot of y against x such that each x value is associated with a unique y value. The data follow a quadratic curve, with variation about it.] [FIGURE 3: As figure 2, but with the addition of a wobbly line going through all the data points.] DISPLAY 5.9 The performance of supervised classification rules (Chapter 9) is often assessed in terms of their misclassification or error rate, the proportion of future objects they are likely to misclassify. For many years this was estimated by the simple procedure of seeing how many of the design set points a rule misclassified. However, such an approach is optimistically biased - it underestimates the error rate which will be obtained on future points from the same distribution. 
This is because the classification rule has been optimised to fit the design set very well. (It would be perverse, after all, to pick a rule deliberately because it did not classify the design set very well.) But this very fact means that it will overfit the design set to some extent - and the more flexible the rule the more it will overfit. Because of this some very sophisticated methods of error rate estimation have been developed. Many of them involve resampling methods, in which the data are used more than once. The leaving-one-out method, for example, follows one of the procedures described in the text. A classifier is built on all but one of the design set points, and is then used to classify this omitted point. Since it was not included in the design stage, there is no danger of overfitting. This is repeated for each design set point in turn, leading to each of them being classified by a rule which is based on all the other points. The overall estimate of the error rate of a rule based on all the design set points is then the proportion of these omitted points which have been misclassified. Bootstrap methods, of which there are now many, randomly draw a sample from the design set of the same size as the design set (and hence necessarily with replacement), use this to construct a classifier, and find the error rate on those points not included in the bootstrap sample. This is repeated for multiple bootstrap samples, the final estimate being an average of them. Some highly sophisticated (and quite complicated) variants of this have been developed.

Given that the model in Figure 3 overfitted the data, Figure 4 illustrates a simpler model which we might consider. This is a very simple model - it can be described in terms of two parameters: y = ax + b. It is not as good a fit as the previous model, but then it shouldn't be - we do not want it to provide perfect prediction for the design set. But we must ask ourselves if it is too simple. Is it failing to model some aspect of the systematic variation? Indeed, studying the figure suggests that there is a curved component, that y is quadratically related to x, and our linear model is failing to model that. So perhaps something rather more complicated than the linear model is needed, but not so complicated as the model in Figure 3. What is needed is something between the models in Figures 3 and 4. The trick in inferential modelling is to decide just how complex the model should be. We need to find some compromise, so that the model is flexible enough to fit the (unknown) systematic component, but not so flexible that it fits the additional random variation. In statistics, this compromise is known as the bias/variance trade-off (see Display 5.10): an inflexible model will give biased predictions (at least for some values of x), while a very flexible model will give predictions at any particular x which may differ dramatically from sample to sample. The term 'degrees of freedom', which occurs at various places in this book, is essentially the dimension of the vector of parameters which is estimated during the process of fitting a model. A highly flexible model has many degrees of freedom, while the simple linear model above has only two.

[FIGURE 4: Same data again, but now with a straight line superimposed as a possible predictive model.]

DISPLAY 5.10
Suppose that the values of a variable y are related to predictor variables x through
$$y = f(x) + \varepsilon \qquad (1)$$
where f is an unknown function and $\varepsilon$ is a random variable.
Our aim is to formulate a model which will allow prediction of the value of y corresponding to a given value of x. (Note that implicit in this formulation is the notion that there is a 'true' f.) Our model is to be built using a design set which consists of a random sample of pairs $(x_i, y_i)$, $i = 1, \dots, n$, from (1). For convenience, we will denote this design set by X. Using methods described in Chapter 9, we construct a model $\hat{f}(x|X)$. The notation here means that the model is based on X and provides a prediction at x. Now, different design sets will lead to different estimators, so that the prediction for a given value of x is a random variable, varying with the design set. The accuracy of the prediction can be measured by the mean square error between the observed value y (that is, $f(x) + \varepsilon$) and the predicted value $\hat{f}(x|X)$; that is, by $E\left[\left(y - \hat{f}(x|X)\right)^2\right]$. Note that the expectation here, and in what follows, is over the distribution of different design sets and over different values of y. This measure of predictive accuracy can be decomposed. We have
$$E\left[\left(y - \hat{f}(x|X)\right)^2\right] = E\left[\left(y - f(x)\right)^2\right] + E\left[\left(f(x) - \hat{f}(x|X)\right)^2\right]$$
$$= E\left[\left(y - f(x)\right)^2\right] + \left\{f(x) - E\left[\hat{f}(x|X)\right]\right\}^2 + E\left[\left(\hat{f}(x|X) - E\left[\hat{f}(x|X)\right]\right)^2\right]$$
The first term on the right hand side is independent of the training set. No matter how clever we are in building our model, we can do nothing to reduce this term. The middle term on the right hand side is the square of the bias. The final term on the right hand side is the variance of the predicted value at x arising from variation in the design set. Clearly, in order to reduce the overall mean square error as much as possible we need to reduce both the bias term and the variance term. However, there are difficulties in reducing both of these simultaneously. Suppose that we have a very flexible estimator, which can follow the design set very well (the wobbly line in Figure 3, for example). Then its expected value at x will tend to be close to $f(x)$, so that the bias term above (the middle one) will tend to be small. On the other hand, since it will vary dramatically from design set to design set, the variance term (the last term above) will tend to be large. If, on the other hand, we have a rigid estimator, which does not change much as design sets change, the bias term may be large while the variance term will be small. A very flexible estimator will have low bias (over design sets) but will tend to overfit the design set, giving a spurious indication of apparent accuracy. An inflexible estimator may have large bias, and will not model the shape of $f(x)$ very well.

In the above example, the complexity of the model lay in the permitted flexibility of the line used to predict y from x. Formally, one might say that the more complex models were higher order polynomials in x. Other classes of models introduce flexibility in other ways, but however they do it they are all subject to the dangers of overfitting. For example, predictive models with many predictor variables are very flexible by virtue of the many parameters associated with the many predictors, tree methods (Chapter 9) are very flexible if they have many nodes, and neural network models are very flexible if they have many hidden nodes. In general, if the space of possible models is large (if a large number of candidate models are being considered) then there is a danger of overfitting (although it also depends on the size of the design set - see below).
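The decomposition in Display 5.10 can be explored by simulation. The following minimal sketch in Python compares a rigid (linear) model with a very flexible (degree 9 polynomial) model; the assumed 'true' function, the noise level, the design set size, and the point at which the prediction is examined are all illustrative assumptions, not values taken from the text.

    import numpy as np

    # Illustrative simulation of the bias/variance decomposition in Display 5.10.
    # Assumptions made purely for illustration: the 'true' f(x) is x**2, the random
    # variation is Gaussian, design sets contain 20 points, and the models are
    # least squares polynomial fits obtained with numpy.polyfit.
    rng = np.random.default_rng(0)

    def true_f(x):
        return x ** 2                                      # assumed systematic component

    x0 = 0.5                                               # value of x at which predictions are examined
    n, sigma, n_design_sets = 20, 0.3, 2000

    for degree in (1, 9):                                  # rigid (linear) versus very flexible model
        preds = []
        for _ in range(n_design_sets):
            x = rng.uniform(-1.0, 1.0, n)                  # a fresh design set
            y = true_f(x) + rng.normal(0.0, sigma, n)      # y = f(x) + random variation
            coeffs = np.polyfit(x, y, degree)              # the fitted model, based on this design set
            preds.append(np.polyval(coeffs, x0))           # its prediction at x0
        preds = np.array(preds)
        bias_sq = (preds.mean() - true_f(x0)) ** 2         # squared bias term
        variance = preds.var()                             # variance term
        print(f"degree {degree}: squared bias = {bias_sq:.4f}, variance = {variance:.4f}")

Typically the flexible model shows a squared bias near zero but a much larger variance than the linear model, while the linear model shows an appreciable bias; which of the two effects dominates the overall mean square error depends on the noise level and the size of the design set.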
Our fundamental aim is to choose a model which will accurately predict future values of y, but since y has some intrinsic and irreducible random variation (that due to measurement error or to other predictor variables not included in x) the best we can hope to do is to predict some representative value from the distribution of y corresponding to a given x. That is, our prediction, for a given value of x, should provide a good estimate of some sort of central value of the conditional distribution of y given x (we denote this conditional random variable by y|x). We have already noted that the mean provides a 'central value' of a distribution in the sense that it is the value which minimises the sum of squared differences to other points in the distribution. Thus, if one were to predict the mean of the y|x distribution one would be doing the best one could in terms of minimising the sum of squared errors in the future. This, then, is a sensible aim: to use as the predicted value of y for a given x the mean value of the y|x distribution.

Various approaches can be used to avoid overfitting the design data. Here we shall briefly describe three. Clearly, since their aims are the same, there are links between them. One approach is as follows. In general we measure how well the model fits the design data using some goodness-of-fit criterion. For example, we could use the sum of squared deviations between the predicted values of y and the observed values in the above example. The curve in Figure 3 above would then lead to a value of 0, indicating perfect prediction, while the straight line in Figure 4 would lead to a larger (and hence poorer) sum of squared deviations. But the perfect prediction from Figure 3 is spurious, at least as far as future data go. We can attempt to alleviate the problem by modifying the goodness-of-fit criterion so that it does not simply measure performance on the design set. In the above example, an indication that the model is too flexible is given by the wobbliness of the predictor line. Thus one might penalise the criterion one uses to measure goodness-of-fit to the design set by adding a term which grows with increasing model complexity - which works against using too wobbly a line.

Example 1: We have already described the fundamental role of likelihood in model fitting. As noted in Chapter 1, the likelihood is a measure of how probable it is that the data could have arisen from a particular model. Denote the likelihood of a model M based on data set Y by L(M;Y). Then a measure of how far M is from the model M* which predicts the data perfectly (more correctly, M* is the model which maximises the likelihood from amongst those models under consideration) is given by the deviance:
$$D(M) = -2\left[\ln L(M;Y) - \ln L(M^*;Y)\right]$$
The larger this is, the further is the model M from the best which can be achieved. Again, however, this will improve (in this case, decrease) with increasing model flexibility, regardless of whether the resulting model better reflects the underlying structure. To allow for this an extra term is sometimes added. The Akaike information criterion (AIC) is defined as $D(M) + 2p$, where p is the number of parameters in the model: again increasing model complexity - increasing p - is compensated for by the penalisation term. Another variant on this is the Bayesian information criterion, defined as $D(M) + p\ln n$, where n is the number of data points.
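The behaviour of such penalised criteria as flexibility grows can be illustrated with a minimal sketch in Python. It assumes Gaussian errors, so that $-2\ln L$ for a least squares fit can be written, up to an additive constant, as $n\ln(\mathrm{RSS}/n)$; since the term involving M* is the same for every model under consideration, it can be dropped when comparing models. The polynomial family, the parameter count, and the simulated data are all illustrative assumptions.

    import numpy as np

    # Sketch: comparing models of increasing flexibility with AIC- and BIC-style penalties.
    # Assumptions for illustration only: Gaussian errors (so -2 ln L = n*log(RSS/n) up to
    # a constant), polynomial models in x, and p counted as the number of fitted coefficients.
    rng = np.random.default_rng(1)
    n = 50
    x = rng.uniform(-1.0, 1.0, n)
    y = 1.0 + 2.0 * x - 1.5 * x ** 2 + rng.normal(0.0, 0.2, n)   # assumed quadratic structure

    for degree in range(1, 7):
        coeffs = np.polyfit(x, y, degree)
        rss = np.sum((y - np.polyval(coeffs, x)) ** 2)           # residual sum of squares
        p = degree + 1                                           # number of fitted coefficients
        neg2loglik = n * np.log(rss / n)                         # -2 ln L, up to an additive constant
        aic = neg2loglik + 2 * p                                 # penalise by 2p
        bic = neg2loglik + p * np.log(n)                         # penalise by p ln n
        print(f"degree {degree}: AIC = {aic:6.1f}, BIC = {bic:6.1f}")

The unpenalised term always falls as the degree increases, while the penalised criteria typically level off or rise again once the extra flexibility is only fitting the random variation.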
Example 2: In our example at the beginning of this section, in which we were concerned with predicting a single y variable from a single x variable, we noted that the model in Figure 3 was too flexible, fitting the design data perfectly and not generalising very well. This weakness can be spotted by eye, since it can be seen that the predicted value of y fluctuates rapidly with changing x. One way to allow for this is to penalise the goodness of fit by a term which measures the irregularity of the prediction. A popular measure is based on integrating the square of the second derivative, $\int \left(\frac{d^2 \hat{f}(x)}{dx^2}\right)^2 dx$, over the range of x. Here $\hat{f}$ is the fitted model, so that $\hat{y} = \hat{f}(x)$. This penalty term is a measure of the curvature of the function - high curvature indicating that the model form is too flexible.

Example 3: The sum of squared differences between predicted values and target values is also a common goodness-of-fit criterion when fitting neural networks (Chapter 9). Again, however, increasing the flexibility of the model, here by including more and more hidden nodes for example, will improve the apparent predictive power as the model provides better and better fits to the design set. To overcome this the sum of squared deviations is often penalised by a term proportional to the sum of squared weights in the network - so-called weight decay.

The minimum message length and minimum description length approaches to model building are explicitly based on the notion that models which overfit the data are more complex than models which don't go too far. They optimise a criterion which has two components. One is a description of the data in terms of the model, and the other is a description of the complexity of the model. The two components are represented in commensurate terms (code length). A more complex model will lead to a simpler description of the data, but at the cost of requiring a longer expected code to describe the model - and vice versa. The best estimator is one which provides a suitable compromise by minimising the overall length. This notion of modifying some measure of fit of a model to the design data, so as to get a better indication of how well it fits the underlying structure, is ubiquitous. Here are two further examples from standard multiple regression (Chapter 9):

Example 4: In basic multiple regression a common measure of the predictive power of a model is the squared multiple correlation coefficient, denoted $R^2$. Denoting the values of y predicted from the model by $\hat{y}_i$, and the mean of the observed y values by $\bar{y}$, this is defined as
$$R^2 = \frac{\sum_i \left(\hat{y}_i - \bar{y}\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2} = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}$$
It thus gives the proportion of the variance in the y values which can be explained by the model (or one minus the ratio of the unexplained variance to the total variance). A large value (near 1) means that the model explains much of the variation in the observed y values, and so provides a good fit to the data. However, even if the x variables contained no predictive power for y, $R^2$ would be non-zero and would increase as further x variables were included in the model. That is, as the model increases in flexibility (by adding more variables, and hence more parameters) it begins to fit the design data better and better, regardless of the fact that our aim is to fit the underlying structure and not the peculiarities of the design data. To overcome this, an adjusted (or 'corrected') form of $R^2$ is often used. This is defined as
$$\bar{R}^2 = \frac{(n-1)R^2 - k}{n - k - 1}$$
where k is the number of predictor variables and n is the number of data points.
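As a check on these formulae, the following minimal sketch in Python computes both quantities as predictors are added. The data-generating setup - one informative predictor followed by successively added pure-noise predictors - is an assumption chosen purely for illustration.

    import numpy as np

    # Sketch: R^2 never decreases as predictors are added, while the adjusted form need not rise.
    # The data (one informative predictor plus pure-noise predictors) are assumed for illustration.
    rng = np.random.default_rng(2)
    n = 40
    x1 = rng.normal(size=n)
    y = 2.0 * x1 + rng.normal(size=n)

    def r_squared(X, y):
        X1 = np.column_stack([np.ones(len(y)), X])               # design matrix with intercept
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)            # least squares fit
        resid = y - X1 @ beta
        return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

    X = x1.reshape(-1, 1)
    for _ in range(4):
        k = X.shape[1]                                           # current number of predictors
        r2 = r_squared(X, y)
        adj = ((n - 1) * r2 - k) / (n - k - 1)                   # adjusted R^2, as defined above
        print(f"k = {k}: R^2 = {r2:.4f}, adjusted R^2 = {adj:.4f}")
        X = np.column_stack([X, rng.normal(size=n)])             # add a predictor with no real power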
Whereas $R^2$ is monotonic with increasing k (that is, it never decreases as k increases), $\bar{R}^2$ may decrease, depending on the predictive power of the new x variables.

Example 5: Ridge regression modifies the standard estimates used in least squares regression by shrinking the covariance matrix of the predictor variables towards the identity matrix. Formally, if the usual estimate of the regression coefficients is given by $(X'X)^{-1}X'Y$, where X is the n by k matrix of values of the k predictor variables on the n records and Y is the n by 1 vector of values of the variable to be predicted, then the ridge regression estimator is $(X'X + \lambda I)^{-1}X'Y$, where $\lambda$ is a parameter to be chosen and I is the identity matrix.

In the above, the goodness-of-fit criterion was modified so that the model would not overfit the design data but would provide a better fit to the underlying structure. An alternative is to apply the goodness-of-fit criterion and a measure of overfitting one after the other. Sometimes these two steps are applied several times, in what are called stepwise algorithms (Chapter 6). Indeed, traditional statistical methods of model fitting can be seen as adopting such a strategy. In essence, one finds a term which improves the goodness-of-fit of the model to the data and then carries out a statistical test to see if the improvement due to the extra term is real or could easily be attributed to chance variation in how the design data happen to have arisen. These are forward stepwise procedures. Backwards stepwise procedures work in the opposite direction, beginning with a complicated model and eliminating those terms whose contribution can be attributed to chance. This is a common approach with tree methods, where models with too many leaf nodes are built and then pruned back. A third strategy for avoiding overfitting is based on the observation that our fundamental aim is to predict future values, and goes directly for that, as in the following. Temporarily delete one data point from the design set. Using this reduced design set, find the best fitting models of the types under consideration (e.g. in the above example, we might try linear, quadratic, and cubic models to predict y from x). Then, for each of these models, see how accurately they predict the y value of the observation which has been left out. Repeat this in turn for each observation, and combine the results (for example, in a sum of squared errors between the true y values and their predicted values). This yields a measure of how well each model predicts future values - and hence is a more appropriate criterion on which to base the model selection.

In the opening paragraphs of this section we noted that a model which provided too good a fit to a given design sample would be unlikely to provide a good fit to a second sample. This is by virtue of the fact that it modelled the idiosyncratic aspects of the design sample as well as the systematic aspects. The same applies the other way round: a model which went too far in providing a good fit to the second sample would be unlikely to provide a good fit to the first. This suggests that we might be better off using a model which is in some sense between the models based on the two samples - some sort of average of the two models. This idea can be generalised. If we had many samples, we could produce one model for each, each highly predictive for the sample on which it was based, but less predictive for the other samples. Then we could take an average of these models.
For any particular sample, this average would not be as good as the model built for that sample, but averaged over its predictive performance on all samples it might well do better. This is the idea underlying the strategy of bagging (from bootstrap aggregating). Normally, of course, one does not have multiple samples (if one did, one would be inclined to merge them into a single large sample). But one can simulate this through a process called bootstrapping. In bootstrapping one draws a sample with replacement (Chapter 2) from the design set, of the same size as the design set. This serves the role of a single sample in the above - and a model is built using this sample. This can be repeated many times, each bootstrap sample yielding a model. Finally an average model can be obtained which smoothes out the irregularities specific to each individual model. This approach is more effective with more flexible models, since this means that the individual models reflect substantial parts of the intrinsic variability of their specific samples, and hence will benefit from the smoothing implicit in bagging. The bagging idea is related to the general notion of model averaging where, in general, multiple models are built and the results averaged. Take a classification tree as an example. Here the design data are recursively partitioned into smaller and smaller groups. Each partition is based (in the simpler trees, at least) on splitting some variable at a threshold. Choosing different variables and different thresholds will lead to different tree structures - different predictive models. Elaborate trees correspond to overfitted models, and benefit can be obtained by averaging them. Of course, it is not immediately clear how to weight the different trees when computing the average, and many ways have been proposed. Some are outlined in Chapter 9. Sometimes data sets distinct from the design set, termed validation sets, are used to choose between models (since selection of the final model depends on the validation set, technically these are really being used as part of the design process, in a general sense). Although our example above was a predictive model, model simplification strategies can also be applied to descriptive models, those where the aim is simply to find good estimates of the underlying structure.

In the above we have characterised the problem of choosing a model as one of a compromise between inflexibility and overfitting. Improve one of these by adopting a model of different complexity and the other gets worse. However, there is one way in which one can improve both simultaneously - or, at least, improve one without causing the other to deteriorate, which comes down to the same thing. This is to increase the size of the design sample on which the model is based. For a model of a given complexity, the larger the design sample the more accurate will be the estimates of the parameters - the smaller will be their variances. In the limit, if the population is finite (albeit large), as it is in many data mining problems, then, of course, the variance of the estimates will be zero when the entire population is taken. Broadly speaking, the standard deviation of parameter estimates is inversely related to the square root of the sample size, as the sketch below illustrates. (This needs to be modified if the population is finite, but it can be taken as a rule of thumb - see Section 2.5.)
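A minimal sketch in Python of this inverse square root behaviour; the standard normal population and the particular sample sizes are assumptions made purely for illustration.

    import numpy as np

    # Sketch: the standard deviation of the sample mean shrinks roughly as 1/sqrt(n).
    # Repeatedly sample from an assumed standard normal population and record the
    # spread of the resulting sample means.
    rng = np.random.default_rng(3)
    for n in (25, 100, 400, 1600):
        means = rng.standard_normal((2000, n)).mean(axis=1)      # 2000 samples of size n
        print(f"n = {n:4d}: sd of sample mean = {means.std():.4f}, 1/sqrt(n) = {1 / np.sqrt(n):.4f}")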
This inverse square root relationship means that there is a law of diminishing returns, but it also means that one can choose one's sample sufficiently large that the uncertainty in the parameter estimates is small enough to be irrelevant. Perhaps it is worth adding here that in many data mining problems the large size of the data set means that overfitting is not a problem. General ad hoc rules are dangerous to give because they may well break down for your particular problem.

5.6 Some common probability distributions
Chapter 2 described how the notion of uncertainty was fundamental in data mining exercises, and how crucial it was to have methods for coping with it. That chapter also introduced the idea of a probability distribution. Here we describe some of the more important probability distributions which arise in data mining.

1. Bernoulli distribution
The Bernoulli distribution has just two possible outcomes. Situations which might be described by such a distribution include the outcome of a coin toss (heads or tails) or success or failure in some situation. Denoting the outcomes by 0 and 1, let p be the probability of observing a 1 and (1-p) the probability of observing a 0. Then the probability mass function can be written as $p^x (1-p)^{1-x}$, with x taking the value 0 or 1. The mean of the distribution is p and its variance is p(1-p).

2. Binomial distribution
This is a generalisation of the Bernoulli distribution, and describes the number of 'type 1 outcomes' (e.g. successes) in n independent Bernoulli trials, each with parameter p. The probability mass function has the form $\binom{n}{x} p^x (1-p)^{n-x}$. The mean is np and the variance is np(1-p).

3. Multinomial distribution
The multinomial distribution is a generalisation of the binomial distribution to the case where there are more than two potential outcomes; for example, there may be k possible outcomes, the ith having probability $p_i$ of occurring. Suppose that n observations have been independently drawn from a multinomial distribution. Then the mean number of observations yielding the ith outcome is $np_i$ and its variance is $np_i(1-p_i)$. Note that, since the occurrence of one outcome means the others cannot occur, the individual outcomes must be negatively correlated. In fact, the covariance between the ith and jth ($i \neq j$) outcomes is $-np_ip_j$.

4. Poisson distribution
If random events are observed independently, with underlying rate $\lambda$, then we would expect to observe $\lambda t$ events in a time interval of length t. Sometimes, of course, we would observe none in time t, at other times we would observe 1, and so on. If the rate is low, we would rarely expect to observe a large number of events (unless t was large). A distribution which describes this state of affairs is the Poisson distribution. It has probability mass function $\frac{(\lambda t)^x e^{-\lambda t}}{x!}$. The mean and variance of the Poisson distribution are the same, both being $\lambda t$. Given a binomial distribution with large n and small p such that np is a constant, this may be well approximated by a Poisson distribution with probability mass function $\frac{(np)^x e^{-np}}{x!}$.

5. Normal (or Gaussian) distribution
The probability density function takes the form $\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, where $\mu$ is the mean of the distribution and $\sigma^2$ is the variance. The standard normal distribution is the special case with zero mean and unit variance. The normal distribution is very important, partly as a consequence of the central limit theorem.
Roughly speaking, this says that the distribution of the mean of a sample of n observations becomes more and more like the normal distribution as n increases, regardless of the form of the population distribution from which the data are drawn. (Of course, mathematical niceties require this to be qualified for full rigour.) This is why many statistical procedures are based on an assumption that various distributions are normal - if the sample size is large enough, this is a reasonable assumption. The normal distribution is symmetric about its mean, and 95% of its probability lies within 1.96 standard deviations of the mean.

6. Student's t-distribution
Consider a sample from a normal distribution with known standard deviation $\sigma$. An appropriate test statistic to use to make inferences about the mean would be the ratio $\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$. Using this, for example, one can see how far the sample mean deviates from a hypothesised value $\mu$ of the unknown mean. This ratio will be normally distributed by the central limit theorem (see Normal distribution above). Note that here the denominator is a constant. Of course, in real life, one is seldom in a situation of making inferences about a mean when the standard deviation is known. This means that one would usually want to replace the above ratio by $\frac{\bar{x} - \mu}{s/\sqrt{n}}$, where s is the sample estimate of the standard deviation. As soon as one does this the ratio ceases to be normally distributed - extra random variation has been introduced by the fact that the denominator now varies from sample to sample. The distribution of this new ratio will have a larger spread than that of the corresponding normal distribution - it will have fatter tails. This distribution is called the t-distribution. Note that there are many t-distributions - they differ according to how large the sample size is, since this affects the variability in s. They are indexed by (n-1), known as the degrees of freedom of the distribution. We can also describe this situation by saying that the ratio of two random variables, the numerator following a normal distribution and the square of the denominator following a chi-squared distribution (see below), follows a t-distribution. The probability density function is quite complicated and it is unnecessary to reproduce it here (it is available in introductory texts on mathematical statistics). The mean is 0 and the variance is $(n-1)/(n-3)$.

7. Chi-squared distribution
The distribution of the sum of the squares of n values, each following the standard normal distribution, is called the chi-squared distribution with n degrees of freedom. Such a distribution has mean n and variance 2n. Again it seems unnecessary to reproduce the probability density function here - it can be readily found in introductory mathematical statistics texts if needed. The chi-squared distribution is particularly widely used in tests of goodness-of-fit.

8. F distribution
If u and v are independently distributed as chi-squared random variables with $n_1$ and $n_2$ degrees of freedom respectively, then the ratio $F = \frac{u/n_1}{v/n_2}$ is said to follow an F distribution with $n_1$ and $n_2$ degrees of freedom. This is widely used in tests to compare variances, such as arise in analysis of variance applications.

9. The multivariate normal distribution
This is an extension of the univariate normal distribution to multiple random variables. Let $x = (x_1, \dots, x_p)'$ denote a p-component random vector.
Then the probability density function of the multivariate normal distribution has the form $\frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)\right)$, where $\mu$ is the mean vector of the distribution and $\Sigma$ is the covariance matrix. Just as the univariate normal distribution plays a unique role, so does the multivariate normal distribution. It has the property that its marginal distributions are normal, as also are its conditional distributions (the joint distribution of a subset of variables, given fixed values of the others). Note, however, that the converse is not true: just because the p marginals of a distribution are normal, this does not mean the overall distribution is multivariate normal.

5.7 Further reading
The material in this chapter is covered in more detail in statistics texts - introductory ones, such as Daly et al (1995), for the more basic material, and more advanced texts, such as Cox and Hinkley (1974) and Lindsey (1996), for a deeper discussion of inferential concepts. Bayesian methods now have their own books. A comprehensive one is Bernardo and Smith (1994) and a lighter introduction is Lee (1989). Miller (1980) describes simultaneous test procedures. Nonparametric methods are described in Randles and Wolfe (1979) and Maritz (1981).

References:
Bernardo J.M. and Smith A.F.M. (1994) Bayesian Theory. Chichester: Wiley.
Cox D.R. and Hinkley D.V. (1974) Theoretical Statistics. London: Chapman and Hall.
Daly F., Hand D.J., Jones M.C., Lunn A.D. and McConway K. (1995) Elements of Statistics. Wokingham: Addison-Wesley.
Lee P.M. (1989) Bayesian Statistics: An Introduction. London: Edward Arnold.
Lindsey J.K. (1996) Parametric Statistical Inference. Oxford: Clarendon Press.
Maritz J.S. (1981) Distribution-Free Statistical Methods. London: Chapman and Hall.
Randles R.H. and Wolfe D.A. (1979) Introduction to the Theory of Nonparametric Statistics. New York: Wiley.