Why data mining is more than statistics writ large
David J. Hand
Imperial College of Science, Technology, and Medicine, Department of Mathematics
Huxley Building
180 Queen’s Gate
London SW7 2BZ, UK
[email protected]
1. Introduction
Data mining is the discovery of interesting, unexpected, or valuable structures in large data
sets. Clearly this is an activity which overlaps substantially with statistics and exploratory data
analysis. It is also an activity which overlaps with other disciplines, notably database theory,
machine learning, pattern recognition, and artificial intelligence. Each of these disciplines brings
their own flavour to the enterprise. To a greater or lesser extent, each of these disciplines also feels a
certain intellectual proprietorship over the fledgling domain of data mining, and this can cause
tensions. With luck, these tensions will be productive, as researchers from different areas learn to
appreciate each other’s perspectives and viewpoints.
The reason that the discipline has appeared now, rather than twenty years ago, is the
realisation that modern technology has led to the accumulation of huge collections of data. Hand
(1998) gives examples. Such data sets may contain many millions, or even billions of records.
Typically they will have been processed to answer the question for which they were originally
collected, and then stored - because storage, nowadays, is so cheap. This means that there are vast
bodies of data lying around. It is clear, or at least this is the promise held out by protagonists of data
mining, that these data mountains contain information which may be valuable. To take advantage of
this, all one has to do is extract that information. ‘Data mining’, then, is a term for the
heterogeneous collection of tools for extracting the potentially valuable information in data
mountains.
2. The aims of data mining
It is useful to distinguish between two types of data mining exercises. The first is data
modelling, in which the aim is to produce some overall summary of a given data set, characterising its
main features. Thus, for example, we may produce a Bayesian belief network, a regression model, a
neural network, a tree model, and so on. Clearly this aim is very similar to the aim of standard
statistical modelling. Having said that, the large sizes of the data sets often analysed in data mining
can mean that there are differences. In particular, standard algorithms may be too slow and standard
statistical model-building procedures may lead to over-complex models since even small features
will be highly significant. We return to these points below.
It is probably true to say that most statistical work is concerned with inference in one form or
another. That is, the aim is to use the available data to make statements about the population from
which it was drawn, values of future observations, and so on. Much data mining work is also of this
kind. In such situations one conceptualises the available data as a sample from some population of
values which could have been chosen. However, in many data mining situations all possible data are
available, and the aim is not to make an inference beyond these data, but rather to describe
them. In this case, it is inappropriate to use inferential procedures such as hypothesis tests to decide
whether or not some feature of the describing model should be retained. Other criteria must be used.
The second type of data mining exercise is pattern detection. Here the aim is not to build an
overall global descriptive model, but is rather to detect peculiarities, anomalies, or simply unusual
or interesting patterns in the data. Pattern detection has not been a central focus of activity for
statisticians, whose (inferential) aim has rather been to assess the ‘reality’ of a pattern once
detected. In data mining the aim is to locate the patterns in the first place, typically leaving the
establishment of their reality, interest, or value to the database owner or a domain expert. Thus a data
miner might locate clusters of people suffering from a particular disease, while an epidemiologist
will assess whether the cluster would be expected to arise simply from random variation. Of course,
most problems occur in data spaces of more than two variables (and with many points), which is
why we must use formal analytic approaches.
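As a minimal sketch of how such candidate clusters might be located (no particular algorithm is implied; DBSCAN from scikit-learn and the simulated coordinates below are used purely for illustration):

# Sketch only: density-based clustering used to *locate* candidate clusters
# of disease cases; judging whether they reflect anything real is left to
# the epidemiologist. The data below are simulated.
import numpy as np
from sklearn.cluster import DBSCAN

# cases: one row per patient, here two spatial coordinates plus age,
# but the same call works in higher-dimensional feature spaces.
rng = np.random.default_rng(1)
cases = np.vstack([
    rng.normal(loc=[0.0, 0.0, 50.0], scale=1.0, size=(500, 3)),  # background cases
    rng.normal(loc=[5.0, 5.0, 40.0], scale=0.2, size=(30, 3)),   # a tight local cluster
])

labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(cases)
for k in set(labels) - {-1}:          # -1 marks points assigned to no cluster
    print(f"candidate cluster {k}: {np.sum(labels == k)} cases")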
I shall assume that most readers of this paper have a broad familiarity with modelling and
focus most of my discussion on pattern detection.
3. Patterns, patterns, everywhere
We will always find patterns in large data sets. Firstly, it is the very nature of human
perception that we try to interpret visual or other stimuli in terms of known objects. Thus, for
example, we see faces in clouds, we see archers and crabs in star patterns, and we see all sorts of
things in Rorschach ink blots. The point is that we are matching an observed structure in data with a
vast collection of known patterns (objects) until we find a good match. Without some prior
restriction on what we mean by ‘pattern’, we can hardly fail to identify a structure as a pattern.
Secondly, it is inevitable that certain patterns will occur in data sets. Given 100 possible
values that elements of a data set can take, and 101 objects measured, it is certain that at least two
of them will have the same value. If we order 10,001 objects according to the values they take on a
variable, then it is certain that we can find 101 objects which have either increasing order or
decreasing order on any other variable.
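Neither of these is an empirical fact about the data; both are combinatorial certainties. The first is the pigeonhole principle. The second is an instance of the Erdős–Szekeres theorem: any sequence of n^2 + 1 real numbers contains an increasing or a decreasing subsequence of length n + 1, so with n = 100 the 10,001 ordered objects must contain 101 which are monotone on the second variable.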
Thirdly, if we have a large enough data set, then the probability of any given small pattern
may be very large. Here, in contrast to the first situation above, we are matching known patterns
(objects) with a vast collection of potential occurrences in the data. The probability that any
particular record shows some pattern may be only 1 in a million, but if there are 100 million of them
we should not be too surprised to see the pattern.
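To make this concrete, suppose a pattern occurs in any given record with probability 10^-6, independently across records (the independence is assumed here purely for illustration). In a database of 10^8 records the expected number of occurrences is 10^8 x 10^-6 = 100, and the probability of seeing the pattern at least once is 1 - (1 - 10^-6)^(10^8), approximately 1 - e^-100, which is indistinguishable from 1.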
If patterns are so likely, or even inevitable, how can we decide whether an observed pattern
represents something real, or something worth knowing? These two questions are different, of
course, but in data mining they have the same answer: one asks an expert. The job of the data miner
is to find the patterns, to draw them to the attention of someone who understands the potential
substantive significance of the data and the patterns. I have found in my own work that such domain
experts can often provide retrospective justifications for patterns. Moreover, it seems to me that this
can also be taken as an indicator of ‘reality’; it can certainly be taken as an indicator of the faith
someone should put in the discovered pattern. If one cannot think of an explanation for how such a
structure could arise, then one should be suspicious of adopting it as a basis for future decisions or
plans.
Data mining exercises can throw up many patterns. For example, market basket analysis,
concerned with finding groups of items which supermarket shoppers tend to buy together, can
identify many thousands of such groups. We have developed a tool for finding local clusters in data,
and it can locate large numbers of such clusters. While, in principle, one could pass these large
numbers on to a domain expert (the supermarket manager, for example) in practice this is not
feasible. Some way has to be found to choose between them first, only passing on the (perhaps a
hundred) patterns thought to be most promising in some sense. Choosing on the basis of a
statistical significance test will not work. If very large numbers of related potential patterns are
being considered, the probabilistic interpretation becomes doubtful. If one adopted some overall
‘experimentwise’ error rate, then it is likely that no patterns would pass. There is no adequate
answer to this. The strategy generally adopted is to use a score function - a measure of
interestingness, unexpectedness, or unusualness, of the patterns - and pass on those which score
most highly. Sometimes, of course, the score function will be a familiar statistical measure, but
without the probabilistic interpretation.
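As a minimal sketch (the choice of score function, the baskets, and the cut-off below are purely illustrative assumptions), candidate patterns from a market basket analysis might be ranked by a familiar measure such as lift, used here simply as a score rather than as the basis for a test:

# Sketch: rank candidate patterns (item pairs in market baskets) by a score
# function - lift - and pass on only the highest-scoring ones. The baskets,
# the choice of lift, and the cut-off of 100 are illustrative assumptions.
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "crisps"},
    {"bread", "milk"},
    {"beer", "crisps", "bread"},
]
n = len(baskets)

item_freq = Counter(item for b in baskets for item in b)
pair_freq = Counter(frozenset(p) for b in baskets for p in combinations(sorted(b), 2))

def lift(pair):
    a, b = tuple(pair)
    support_pair = pair_freq[pair] / n
    return support_pair / ((item_freq[a] / n) * (item_freq[b] / n))

ranked = sorted(pair_freq, key=lift, reverse=True)
for pair in ranked[:100]:             # pass on only the most promising patterns
    print(set(pair), round(lift(pair), 2))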
4. Data quality
Poor quality data always means poor quality results, but the problem is exacerbated if the data
set is large. As we point out in Section 5, if a data set is large, it means one is necessarily distanced
from it: there are many things which can go on within it, many ways it can go wrong, without one
being aware of them. Furthermore, large data sets have more opportunities to go wrong than small
ones. Indeed, if one is presented with an apparently clean data set, one might legitimately ask if it
has been cleaned up in some way. Have incomplete observations been deleted or missing values
imputed? Have outliers been removed? These and other data cleaning exercises can affect the
results and it is important to know if the data represent what they purport to.
Even data sets one might hope will be accurate can be riddled with errors. In just one data set
we analysed, describing repayment of bank loans, we found tiny unpaid amounts (e.g. 1p or 2p)
that caused customers to be classified as ‘bad debts’, negative values in the amount owed, 12 month
loans still active after 24 months (technically not possible under the bank’s rules), outstanding
balances dropping to zero and then becoming positive again, balances which were always zero, and
the number of months in arrears increasing by more than one in a single month. Once we
identified these problems, the bank could explain some of them, but not all. And these were only the
ones we found. Our experience with several banks suggests this is not at all unusual.
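Consistency checks of the kind just described can be automated. The following is a minimal sketch only, assuming the loan records sit in a pandas DataFrame; the column names and the particular rules are hypothetical:

# Sketch of automated consistency checks of the kind described above.
# The DataFrame and its column names (balance, term_months, months_active,
# arrears) are hypothetical; real checks would come from the bank's rules.
import pandas as pd

loans = pd.DataFrame({
    "balance":       [120.0, -5.0, 0.0, 300.0],
    "term_months":   [12, 12, 24, 12],
    "months_active": [10, 30, 6, 11],
    "arrears":       [[0, 0, 1, 1], [0, 2, 5, 5], [0, 0, 0, 0], [0, 1, 2, 2]],
})

problems = pd.DataFrame({
    "negative_balance": loans["balance"] < 0,
    "overrun_term":     loans["months_active"] > loans["term_months"],
    "arrears_jump":     loans["arrears"].apply(
        lambda a: any(b - c > 1 for b, c in zip(a[1:], a))),  # rose by >1 in a month
})
print(problems.any())                  # which checks fired at all
print(loans[problems.any(axis=1)])     # the suspect records themselves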
Data quality is a fundamental issue in data mining, not only because distorted data means
distorted results but also because many of the ‘interesting’ or ‘unusual’ patterns discovered may
well be directly due to corrupt data. I have come across examples where high correlations were an
artefact of missing data, and patterns were induced by the way the data were grouped, and many
others. Not to mention the many interesting ‘patterns’ I and my team have discovered which turned
out to be artefacts of the measurement procedure. Indeed, we have found so many patterns
attributable to problems with the data that I suspect the majority of ‘unexpected’ patterns
may be attributed to this cause. This has obvious implications for the future of data mining as a
discipline.
So far this discussion of data quality has concentrated on individual records. Perhaps even
more serious, because they are insidious, are problems arising from selection bias. What entire
records are missing from the database, are they missing differentially across the population, have the
chosen records been included because they were easy to obtain, and so on? Road accident statistics
provide a nice example of the dangers. The more serious accidents, those resulting in fatalities, are
recorded with great accuracy, but the less serious ones, those resulting in minor or no injury, are not
recorded so rigorously. Indeed a high proportion are not recorded at all. This gives a distorted
impression - and could lead to mistaken conclusions.
In many cases, things are confounded by the difficulty of accessing data (for example, if it is
distributed across many machines) and by the fact that the data may be dynamic. Real-time analysis
may be necessary.
5. Algorithms
Large data sets mean that one cannot ‘familiarise oneself’ with the data. The investigation has
to be via the intermediary of sophisticated computer programs. While such programs provide power
- without them we could not progress at all - they also mean we risk failing to notice something
which should stop us from proceeding.
Since these programs will be applied to large data sets, they must be fast. Sequential or
adaptive methods may be necessary, and a simple but suboptimal solution may be preferable to a
method which is theoretically superior but which will take 1000 times as long. For example, we
have found linear regression to have significant advantages over logistic regression in many
problems where the latter might be thought more appropriate.
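As a hedged illustration of that trade-off (the data, the libraries, and the timings below are assumptions, not results from the applications referred to above), one can compare the time taken by an ordinary least-squares fit with that of an iteratively fitted logistic regression on the same binary-response problem:

# Sketch: compare the time to fit a one-pass linear (least squares) model
# against an iteratively fitted logistic regression on the same binary
# outcome. The simulated data are illustrative only.
import time
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000_000, 20))
y = (X @ rng.normal(size=20) + rng.normal(size=1_000_000) > 0).astype(int)

for model in (LinearRegression(), LogisticRegression(max_iter=1000)):
    start = time.perf_counter()
    model.fit(X, y)
    print(type(model).__name__, f"{time.perf_counter() - start:.2f} s")
# The two fits typically rank cases very similarly, but the least-squares fit
# is usually much quicker, which can matter more than theoretical optimality.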
The key role of programs has led to an increased emphasis on algorithms in data mining, in
contrast to the emphasis on models in statistics. The idea is that one applies the algorithm to data
sets, learning how it behaves and what properties it has, regardless of any notion of an underlying
model (or pattern) which it might be building.
6. Conclusion
This brief discussion of the differences between data mining and statistics has left untouched
many important areas. Graphical display of large data sets (sometimes unfortunately called
‘visualisation’) is perhaps a particularly important one, though one difficult to discuss properly in
the medium of the printed page.
Further discussions of data mining, as well as illustrative examples, can be found in Fayyad et
al (1996), Glymour et al (1996), Elder and Pregibon (1996), Heckerman et al (1997), Agrawal et al
(1998), Hand (1998), and Hand et al (forthcoming).
REFERENCES
Agrawal R., Stolorz P., and Piatetsky-Shapiro G. (eds.) (1998) Proceedings of the Fourth
International Conference on Knowledge Discovery and Data Mining. Menlo Park: AAAI Press.
Elder J. IV and Pregibon D. (1996) A statistical perspective on knowledge discovery in databases.
In Fayyad U.M., Piatetsky-Shapiro G., Smyth P., and Uthurusamy R. (eds.) Advances in Knowledge
Discovery and Data Mining. Menlo Park, California: AAAI Press, 83-113.
Fayyad U.M., Piatetsky-Shapiro G., Smyth P., and Uthurusamy R. (eds.) (1996) Advances in
Knowledge Discovery and Data Mining. Menlo Park, California: AAAI Press.
Glymour C., Madigan D., Pregibon D., and Smyth P. (1996) Statistical inference and data mining.
Communications of the ACM, 39, 35-41.
Hand D.J. (1998) Data mining: statistics and more? The American Statistician, 52, 112-118.
Hand D.J., Mannila H., and Smyth P. (forthcoming) Principles of Data Mining, MIT Press.
Heckerman D., Mannila H., Pregibon D., and Uthurusamy R. (eds.) (1997) Proceedings of the Third
International Conference on Knowledge Discovery and Data Mining. Menlo Park: AAAI Press.
RÉSUMÉ
Modern data capture technology and computer storage facilities are leading to the existence
of huge data sets. Such data sets clearly contain valuable information - if only it can be extracted.
But they also represent novel and challenging problems. Solving those problems requires a
synergistic merger of statistics, database technology, machine learning, pattern recognition, and
other disciplines. This talk focuses on some of the problems and issues which make data mining
more than merely ‘scaled up’ statistics.