Why data mining is more than statistics writ large

David J. Hand
Imperial College of Science, Technology, and Medicine, Department of Mathematics
Huxley Building, 180 Queen’s Gate, London SW7 2BZ, UK
[email protected]

1. Introduction

Data mining is the discovery of interesting, unexpected, or valuable structures in large data sets. Clearly this is an activity which overlaps substantially with statistics and exploratory data analysis. It is also an activity which overlaps with other disciplines, notably database theory, machine learning, pattern recognition, and artificial intelligence. Each of these disciplines brings its own flavour to the enterprise. To a greater or lesser extent, each of these disciplines also feels a certain intellectual proprietorship over the fledgling domain of data mining, and this can cause tensions. With luck, these tensions will be productive, as researchers from different areas learn to appreciate each other’s perspectives and viewpoints.

The reason that the discipline has appeared now, rather than twenty years ago, is the realisation that modern technology has led to the accumulation of huge collections of data. Hand (1998) gives examples. Such data sets may contain many millions, or even billions, of records. Typically they will have been processed to answer the question for which they were originally collected, and then stored - because storage, nowadays, is so cheap. This means that there are vast bodies of data lying around. It is clear, or at least this is the promise held out by protagonists of data mining, that these data mountains contain information which may be valuable. To take advantage of this, all one has to do is extract that information. ‘Data mining’, then, is a term for the heterogeneous collection of tools for extracting the potentially valuable information in data mountains.

2. The aims of data mining

It is useful to distinguish between two types of data mining exercise.
The first is data modelling, in which the aim is to produce some overall summary of a given data set, characterising its main features. Thus, for example, we may produce a Bayesian belief network, a regression model, a neural network, a tree model, and so on. Clearly this aim is very similar to the aim of standard statistical modelling. Having said that, the large sizes of the data sets often analysed in data mining can mean that there are differences. In particular, standard algorithms may be too slow, and standard statistical model-building procedures may lead to over-complex models, since even small features will be highly significant. We return to these points below.

It is probably true to say that most statistical work is concerned with inference in one form or another. That is, the aim is to use the available data to make statements about the population from which they were drawn, the values of future observations, and so on. Much data mining work is also of this kind. In such situations one conceptualises the available data as a sample from some population of values which could have been chosen. However, in many data mining situations all possible data are available, and the aim is not to make an inference beyond these data, but rather to describe them. In this case, it is inappropriate to use inferential procedures such as hypothesis tests to decide whether or not some feature of the describing model should be retained. Other criteria must be used.

The second type of data mining exercise is pattern detection. Here the aim is not to build an overall global descriptive model, but rather to detect peculiarities, anomalies, or simply unusual or interesting patterns in the data. Pattern detection has not been a central focus of activity for statisticians, whose (inferential) aim has rather been assessing the ‘reality’ of a pattern once detected.
In data mining the aim is to locate the patterns in the first place, typically leaving the establishment of their reality, interest, or value to the database owner or a domain expert. Thus a data miner might locate clusters of people suffering from a particular disease, while an epidemiologist will assess whether the cluster would be expected to arise simply from random variation. Of course, most problems occur in data spaces of more than two variables (and with many points), which is why we must use formal analytic approaches. I shall assume that most readers of this paper have a broad familiarity with modelling, and focus most of my discussion on pattern detection.

3. Patterns, patterns, everywhere

We will always find patterns in large data sets. Firstly, it is the very nature of human perception that we try to interpret visual or other stimuli in terms of known objects. Thus, for example, we see faces in clouds, we see archers and crabs in star patterns, and we see all sorts of things in Rorschach ink blots. The point is that we are matching an observed structure in the data against a vast collection of known patterns (objects) until we find a good match. Without some prior restriction on what we mean by ‘pattern’, we can hardly fail to identify a structure as a pattern.

Secondly, it is inevitable that certain patterns will occur in data sets. Given 100 possible values that the elements of a data set can take, and 101 objects measured, it is certain that at least two of them will have the same value. If we order 10,001 objects according to the values they take on one variable, then it is certain that we can find 101 objects which are in either increasing or decreasing order on any other variable.

Thirdly, if we have a large enough data set, then the probability of any given small pattern occurring may be very large. Here, in contrast to the first situation above, we are matching known patterns (objects) against a vast collection of potential occurrences in the data.
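The second of these claims is an instance of the Erdős–Szekeres theorem: any ordering of n² + 1 distinct values contains a monotone subsequence of length at least n + 1 (here n = 100). Since sorting the objects by one variable and then scanning a second reduces the problem to a single permutation, the bound can be checked directly; the following sketch uses patience sorting to compute the longest monotone subsequence:

```python
import bisect
import random

def longest_increasing(seq):
    """Length of the longest strictly increasing subsequence,
    computed by patience sorting in O(n log n)."""
    # tails[i] holds the smallest possible final value of an
    # increasing subsequence of length i + 1 seen so far.
    tails = []
    for x in seq:
        i = bisect.bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

# Model the second variable as a random permutation of 10,001 distinct values.
values = random.sample(range(1_000_000), 10_001)
inc = longest_increasing(values)
dec = longest_increasing([-x for x in values])  # decreasing run = increasing run of negated values
assert max(inc, dec) >= 101  # guaranteed by the Erdos-Szekeres theorem
```

However the permutation is drawn, the final assertion never fails: with 10,001 distinct values, a monotone subsequence of length at least 101 is certain to exist.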
The probability that any particular record shows some pattern may be only 1 in a million, but if there are 100 million records we should not be too surprised to see the pattern.

If patterns are so likely, or even inevitable, how can we decide whether an observed pattern represents something real, or something worth knowing? These two questions are different, of course, but in data mining they have the same answer: one asks an expert. The job of the data miner is to find the patterns and to draw them to the attention of someone who understands the potential substantive significance of the data and the patterns. I have found in my own work that such domain experts can often provide retrospective justifications for patterns. Moreover, it seems to me that this can also be taken as an indicator of ‘reality’; it can certainly be taken as an indicator of the faith someone should put in the discovered pattern. If one cannot think of an explanation for how such a structure could arise, then one should be suspicious of adopting it as a basis for future decisions or plans.

Data mining exercises can throw up many patterns. For example, market basket analysis, concerned with finding groups of items which supermarket shoppers tend to buy together, can identify many thousands of such groups. We have developed a tool for finding local clusters in data, and it can locate large numbers of such clusters. While, in principle, one could pass these large numbers on to a domain expert (the supermarket manager, for example), in practice this is not feasible. Some way has to be found to choose between them first, passing on only those (perhaps a hundred) thought to be most promising in some sense. Choosing on the basis of a statistical significance test will not work. If very large numbers of related potential patterns are being considered, the probabilistic interpretation becomes doubtful.
If one adopted some overall ‘experimentwise’ error rate, then it is likely that no patterns would pass. There is no adequate answer to this. The strategy generally adopted is to use a score function - a measure of the interestingness, unexpectedness, or unusualness of the patterns - and pass on those which score most highly. Sometimes, of course, the score function will be a familiar statistical measure, but without the probabilistic interpretation.

4. Data quality

Poor quality data always means poor quality results, but the problem is exacerbated if the data set is large. As we point out in Section 5, if a data set is large, one is necessarily distanced from it: there are many things which can go on within it, many ways it can go wrong, without one being aware of them. Furthermore, large data sets have more opportunities to go wrong than small ones. Indeed, if one is presented with an apparently clean data set, one might legitimately ask if it has been cleaned up in some way. Have incomplete observations been deleted, or missing values imputed? Have outliers been removed? These and other data cleaning exercises can affect the results, and it is important to know whether the data represent what they purport to.

Even data sets one might hope would be accurate can be riddled with errors. In just one data set we analysed, describing the repayment of bank loans, we found tiny unpaid amounts (e.g. 1p or 2p) which meant customers were classified as ‘bad debts’, negative values in the amount owed, 12-month loans still active after 24 months (technically not possible under the bank’s rules), outstanding balances dropping to zero and then becoming positive again, balances which were always zero, and numbers of months in arrears increasing by more than one in a single month. Once we identified these problems, the bank could explain some of them, but not all. And these were only the ones we found. Our experience with several banks suggests this is not at all unusual.
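Checks of this kind can be automated and run over every record before any mining begins. The sketch below encodes three of the consistency rules just described; the record layout and field names are hypothetical illustrations, not the bank’s actual format:

```python
def audit_loan(record):
    """Return a list of consistency-rule violations for one loan record."""
    problems = []
    if record["balance"] < 0:
        problems.append("negative outstanding balance")
    if record["months_active"] > record["term_months"]:
        problems.append("loan still active beyond its contractual term")
    # Months in arrears should never jump by more than one from one month to the next.
    arrears = record["arrears_history"]
    if any(later - earlier > 1 for earlier, later in zip(arrears, arrears[1:])):
        problems.append("arrears increased by more than one in a single month")
    return problems

# A record exhibiting three of the anomalies described above.
suspect = {"balance": -12.50, "term_months": 12, "months_active": 24,
           "arrears_history": [0, 1, 4]}
print(audit_loan(suspect))  # all three rules are violated
```

Rules like these catch only the anomalies one has thought to look for, which is precisely the point made above: the errors we found were only the ones we found.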
Data quality is a fundamental issue in data mining, not only because distorted data mean distorted results, but also because many of the ‘interesting’ or ‘unusual’ patterns discovered may well be directly due to corrupt data. I have come across examples where high correlations were an artefact of missing data, where patterns were induced by the way the data were grouped, and many others - not to mention the many interesting ‘patterns’ my team and I have discovered which turned out to be artefacts of the measurement procedure. Indeed, we have found so many patterns attributable to problems with the data that I suggest that perhaps the majority of ‘unexpected’ patterns can be attributed to this cause. This has obvious implications for the future of data mining as a discipline.

So far this discussion of data quality has concentrated on individual records. Perhaps even more serious, because they are insidious, are problems arising from selection bias. Which entire records are missing from the database? Are they missing differentially across the population? Have the chosen records been included because they were easy to obtain? Road accident statistics provide a nice example of the dangers. The more serious accidents, those resulting in fatalities, are recorded with great accuracy, but the less serious ones, those resulting in minor or no injury, are not recorded so rigorously. Indeed, a high proportion are not recorded at all. This gives a distorted impression - and could lead to mistaken conclusions.

In many cases, things are confounded by the difficulty of accessing the data (for example, if they are distributed across many machines) and by the fact that the data may be dynamic. Real-time analysis may be necessary.

5. Algorithms

Large data sets mean that one cannot ‘familiarise oneself’ with the data. The investigation has to be via the intermediary of sophisticated computer programs.
While such programs provide power - without them we could not progress at all - they also mean that we risk failing to notice something which should stop us from proceeding.

Since these programs will be applied to large data sets, they must be fast. Sequential or adaptive methods may be necessary, and a simple but suboptimal solution may be preferable to a method which is theoretically superior but which will take 1000 times as long. For example, we have found linear regression to have significant advantages over logistic regression in many problems where the latter might be thought more appropriate.

The key role of programs has led to an increased emphasis on algorithms in data mining, in contrast to the emphasis on models in statistics. The idea is that one applies the algorithm to data sets, learning how it behaves and what properties it has, regardless of any notion of an underlying model (or pattern) which it might be building.

6. Conclusion

This brief discussion of the differences between data mining and statistics has left untouched many important areas. Graphical display of large data sets (sometimes unfortunately called ‘visualisation’) is perhaps a particularly important one, though one difficult to discuss properly in the medium of the printed page.

Further discussions of data mining, as well as illustrative examples, can be found in Fayyad et al. (1996), Glymour et al. (1996), Elder and Pregibon (1996), Heckerman et al. (1997), Agrawal et al. (1998), Hand (1998), and Hand et al. (forthcoming).

REFERENCES

Agrawal R., Stolorz P., and Piatetsky-Shapiro G. (eds.) (1998) Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining. Menlo Park, California: AAAI Press.

Elder J. IV and Pregibon D. (1996) A statistical perspective on knowledge discovery in databases. In Fayyad U.M., Piatetsky-Shapiro G., Smyth P., and Uthurusamy R. (eds.) Advances in Knowledge Discovery and Data Mining. Menlo Park, California: AAAI Press, 83-113.

Fayyad U.M., Piatetsky-Shapiro G., Smyth P., and Uthurusamy R. (eds.) (1996) Advances in Knowledge Discovery and Data Mining. Menlo Park, California: AAAI Press.

Glymour C., Madigan D., Pregibon D., and Smyth P. (1996) Statistical inference and data mining. Communications of the ACM, 39, 35-41.

Hand D.J. (1998) Data mining: statistics and more? The American Statistician, 52, 112-118.

Hand D.J., Mannila H., and Smyth P. (forthcoming) Principles of Data Mining. MIT Press.

Heckerman D., Mannila H., Pregibon D., and Uthurusamy R. (eds.) (1997) Proceedings of the Third International Conference on Knowledge Discovery and Data Mining. Menlo Park, California: AAAI Press.

RÉSUMÉ

Modern data capture technology and computer storage facilities are leading to the existence of huge data sets. Such data sets clearly contain valuable information - if only it can be extracted. But they also represent novel and challenging problems. Solving those problems requires a synergistic merger of statistics, database technology, machine learning, pattern recognition, and other disciplines. This talk focuses on some of the problems and issues which make data mining more than merely ‘scaled up’ statistics.