© 2002 WIT Press, Ashurst Lodge, Southampton, SO40 7AA, UK. All rights reserved.
Web: www.witpress.com Email [email protected]
Paper from: Data Mining III, A Zanasi, CA Brebbia, NFF Ebecken & P Melli (Editors).
ISBN 1-85312-925-9
Hard hats for data miners:
Myths and pitfalls of data mining
T. Khabaza
SPSS Advanced Data Mining Group
Abstract
The intrepid data miner runs many risks, such as being buried under mountains
of data or vanishing along with the “mysterious disappearing terabyte”. This
paper debunks some myths and sketches some “hard hats for data miners”.
1 Introduction
Data mining is a business process, finding patterns in your data which you can
use to do your business better. Through data mining we gain insight into a
business problem; this insight may be of use in itself, but it also helps us to gain
the other benefits of data mining, such as a predictive capability.
This paper is about the practice of data mining; it is not a research paper, but
reports lessons learned through solving practical business problems and through
contact with many data mining users and potential users.
There are many myths and misconceptions about data mining, and holding
these misconceptions leads data mining users to run specific risks. The first half
of this paper lists some common misconceptions
about data mining, corrects
them, and describes the risks to which they can lead. The second half of the
paper lists other common problems or pitfalls of data mining, with their
symptoms and cures.
2 Myths and misconceptions
about data mining
2.1 Myth #1: Data mining is all about algorithms
The ordinary business-person, attending a typical data mining conference,
reading its proceedings, or even reading only the contents page of such a
proceedings,
could be forgiven for thinking that data mining is all about
advanced data analysis algorithms. This misconception might be summarised as
“all you need for data mining is good algorithms; the better your algorithms, the
better your data mining”, and its corollary “advancing the state of the art in data
mining means advancing knowledge of algorithms”.
To hold this view is to misunderstand the data mining process completely.
Data mining is a business process, involving many elements such as formulating
business goals, mapping business to data mining goals, acquiring, understanding
and pre-processing the data, evaluating and presenting the results of analysis and
deploying these results to achieve business benefit, as well as the modelling
component.
(A good explanation of this process can be found in the emerging
industry standard process model CRISP-DM [1].)
In their extreme form, the consequences of holding this misconception are
disastrous for a data mining project, and such a project will fail to produce any
useful results. In practice, this occurs only in the narrowest, most academic of
projects, where useful results for the business are not absolutely required. In any
project where there is a requirement for the results to benefit the business, the
data miner who holds this misconception is forced to discard it, at least partially,
and face the need for a broader view of the data mining process.
This is not to denigrate those parts of data mining research which develop or
improve data mining algorithms. Algorithms play a key role in data mining, and
new or improved algorithms are one way in which the art of data mining
advances. The problem occurs when we focus mainly or solely on algorithms
and ignore the other elements of the data mining process.
2.2 Myth #2: Data mining is all about predictive
accuracy
Above I have rejected the notion that data mining is all about modelling
algorithms, but within that part of data mining which is about algorithms, how
can we judge the quality of an algorithm?
Reading data mining research
literature might lead us to suppose that the main criterion for judging an
algorithm is the predictive accuracy of the models it generates.
This view completely misrepresents the role of algorithms in the data mining
process. It is true that in order to be useful a predictive model should have some
degree of accuracy, because this reflects whether the algorithm has really
discovered patterns in the data. However, many other properties of an algorithm
or a model affect its usefulness; examples include whether the model can be
understood by the analyst, and whether it requires technical knowledge to
understand the model or apply the algorithm.
Considering the properties (other than predictive accuracy) which the data
mining process requires of algorithms, we can see the likely consequences of
holding this mistaken view: algorithms will be produced which can be used only
by technology experts. These algorithms will have only the most limited role in
a process which is driven by business expertise.
2.3 Myth #3: Data mining requires a data warehouse
Data mining practitioners often hear statements like “we are not ready for data
mining yet, we need to build our data warehouse first”. Such statements are
based on the view that data warehousing is a pre-requisite for data mining. This
is a subtle misconception about the relationship between data warehousing and
data mining.
It is true that data mining can benefit from the warehoused data being well
organised, relatively clean, and easy to access. These benefits can accrue if the
warehouse has been constructed with data mining specifically in mind, and with
knowledge of the requirements of the data mining envisaged. If it has not, the
warehoused data may be less useful for data mining than the source operational
data, or in the worst case completely useless (for example in cases where only
summary data is warehoused).
To avoid this risk, it is useful to perform pilot data mining projects using
operational data in order to determine the correct content and organisation for
the warehouse.
It is misleading to state that data mining requires a data
warehouse; a more accurate summary of the relation would be that data mining
can benefit from a data warehouse, but that to construct such a warehouse often
requires data mining.
2.4 Myth #4: Data mining is all about vast quantities of data
Early explanations of data mining in the computing press often start with
statements like “We now collect more data than ever, yet how are we to gain
benefit from these vast data stores?”. To focus on the size of data stores
provides a convenient introduction to the topic of data mining, but subtly
misrepresents its nature.
Data mining becomes useful when data becomes too large or too complex to
analyse “by eye”, that is anything larger than a few tens of examples and a
handful of attributes. Many useful data mining projects are performed on small
or medium-sized datasets, for example containing only hundreds or thousands of
records.
Apart from its convenience in popular explanations, an association of data
mining with vast datasets is also connected with the recent emphasis on
performance and scalability of data mining tools. This drive to extend the reach
of data mining tools to large data is perfectly justified – there are many large
datasets which it benefits us to mine. However it would be a mistake to believe
that these large datasets are the sole focus of data mining.
Holding this erroneous belief would lead us to produce tools which sacrifice
usability for scalability, whereas in fact both aspects are essential. To quote a
customer of a leading data mining tool: “other data mining tools optimise
machine time, but this tool optimises my time”. Whether the datasets are large or
small we must strive to optimise the user’s time, and this may be assisted by
scalability and performance.
2.5 Myth #5: Data mining should be done by a technology
expert
Data mining technology, particularly modelling techniques, is of an advanced
sort, and its workings are unlikely to be understood by the wider IT community.
Some would claim that this means they should be applied only by technology
experts who understand their workings.
(This claim may be influenced by a
historical association with statistical modelling algorithms, which are more open
to misinterpretation than most data mining algorithms.)
In fact, the very reverse is true, because of the paramount importance of
business knowledge in data mining. When performed without business
knowledge, data mining usually produces nonsensical or useless results (see
pitfall #3 below).
It is therefore essential that data mining is performed by
someone with extensive knowledge of the business problem, which is very
seldom combined with knowledge of the technology.
It is the responsibility of
data mining tool providers to ensure that tools are accessible to business rather
than technology experts. It behoves the data mining community at large to make
clear to potential users that data mining provides insight and useful suggestions,
rather than mathematical certainty.
2.6 Myth #6: Neural networks are opaque and consequently useless – an over-simplistic view of data mining
Myth #6 is a relatively specific misconception about one family of modelling
techniques (neural networks) which arises from a broader misunderstanding
about the data mining process.
One sometimes encounters the view that neural networks are not very useful
in data mining because one cannot discover why they make the predictions that
they do, or the “rules” that they use. This means that their predictions cannot be
justified, and that they will not contribute much insight. While this argument
reflects a correct emphasis on understandability
of models and the insight
produced by data mining, the conclusion about the disutility of neural networks
is erroneous, and the argument reflects a mistaken view of the data mining
process, possibly related to myth #1.
This mistaken view regards data mining as a rather simple process: “take the
data, apply a modelling technique, use the results”.
This omits the iterative
nature of the data mining process, and the way in which many techniques are
used together to produce a result.
Neural networks are used in a variety of ways in data mining projects, uses
which are not impacted by the opacity of the models. Here are some examples:
● Neural networks can be used for attribute selection, either by training them
repeatedly with different combinations of attributes, or by using techniques
of “sensitivity analysis” to rank the attributes by their impact on predictions.
● Neural networks can be used for “pattern confirmation” – because they are
particularly powerful “pattern finders” for many applications, neural
networks can be used to confirm that a pattern exists, before spending effort
on tuning other techniques to find it.
● Neural networks can be used in conjunction with other techniques, for
example to improve the confidence of predictions by discarding those where
the neural network disagrees with the predictions of another technique, or by
using other techniques to analyse the behaviour of the neural networks.
All of these uses of neural networks reflect the fact that the data mining
process cannot be summarised as “apply a modelling technique and use the
results”. Data mining facilities form a “toolbox”, whose contents are used in
varied and sometimes surprising ways to solve a problem.
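The third use above – improving confidence by discarding cases on which the techniques disagree – can be sketched in a few lines. This is an illustrative sketch, not part of the original paper: the two “models” are hypothetical stand-in functions with made-up fields and thresholds, and only the agreement logic is the point.

```python
# Hypothetical stand-ins for two trained models; in a real project these
# would be a trained neural network and, say, a rule-induction model.
def neural_net_predicts_churn(case):
    # illustrative threshold on a made-up attribute
    return case["spend_drop"] > 0.5

def rule_model_predicts_churn(case):
    # illustrative rule on made-up attributes
    return case["complaints"] >= 2 and case["spend_drop"] > 0.3

cases = [
    {"id": 1, "spend_drop": 0.9, "complaints": 3},  # both predict churn
    {"id": 2, "spend_drop": 0.6, "complaints": 0},  # the models disagree
    {"id": 3, "spend_drop": 0.2, "complaints": 5},  # both predict no churn
]

# Keep only the cases on which the two techniques agree.
confident = [c for c in cases
             if neural_net_predicts_churn(c) == rule_model_predicts_churn(c)]

print([c["id"] for c in confident])  # case 2 is discarded
```

The design choice is deliberate: the opaque model is never asked to explain itself, it is only asked to vote, so its opacity does not matter.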
3 Pitfalls of data mining and their cures
3.1 Pitfall #1: Buried under mountains of data
Data mining should be an interactive, iterative process where the analyst applies
substantial business knowledge and is “engaged” with the data. However, those
who hold myth #4 (that data mining is about vast quantities of data) often
suppose that this process must be applied to all of the available data.
This can lead to attempts to mine volumes of data for which the available
hardware and software cannot provide an acceptable interactive response (for
example, building a model within a few minutes).
The data mining process
becomes sluggish, and by the time a question is answered, the analyst cannot
remember why it was asked. It is hard to feel that this process is generating
insight.
The cure for this malaise is usually some form of sampling. For example, if
we have a million customers and a 20% annual attrition (or “churn”) rate, we
need not plot our graphs or build our models using the full million examples, or
even half a million (leaving, say, half for independent results validation).
Consider the following questions and answers:
Q: How many churn profiles do we expect to find?
A: Maybe ten.
Q: How many examples of each profile do we need?
A: Maybe a thousand.
Conclusion: A sample of ten or twenty thousand churners, and an equivalent
number of non-churners, will be sufficient for this analysis.
Note that this does not mean that we will never encounter the need to build
models from millions of examples, only that we should not assume that we must
do so if this data is available.
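The balanced-sampling arithmetic above can be sketched as follows. This is an illustrative sketch only: the customer data is synthetic, scaled down to 100,000 records with a 20% churn rate for convenience, and the class sizes are scaled down in proportion.

```python
import random

random.seed(0)

# Synthetic customer base, scaled down for illustration:
# 100,000 customers, of whom every fifth one churns (a 20% churn rate).
customers = [(cust_id, cust_id % 5 == 0) for cust_id in range(100_000)]

churners = [c for c in customers if c[1]]
non_churners = [c for c in customers if not c[1]]

# Ten profiles x a thousand examples each suggests the order of 10,000
# churners at full scale; here we take 2,000 of each class, keeping the
# same balanced-sample idea at the reduced scale.
n = 2_000
sample = random.sample(churners, n) + random.sample(non_churners, n)

print(len(sample))  # 4,000 rows to mine instead of 100,000
```

The point is not the exact numbers but the reasoning: the sample size is driven by how many profiles we expect and how many examples each needs, not by how much data happens to be available.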
One interesting class of cases is those where we wish to find a “rare” profile.
Suppose that we wish to find a specific phenomenon which causes only 1% of
churn. It might be thought that we must build models against the whole dataset
in order to find it. However there are other approaches. For example we might
find the common churn profiles first, using a relatively small sample to build the
models, use these initial profiles to score the entire database, and then focus
subsequent analysis on the relatively small number who churn but were
predicted not to do so.
3.2 Pitfall #2: The Mysterious Disappearing Terabyte
This is a common phenomenon, but not always a pitfall. The phrase “mysterious
disappearing terabyte” refers to the fact that for a given data mining problem, the
amount of available and relevant data may be much less than initially supposed.
Consider the following scenario: You are a data mining consultant, and your
client is a large bank, holding terabytes of data on its customers. There is some
concern that the available computing resources will be inadequate for mining this
volume of data. The bank wishes to mine information on credit risk. Different
types of credit (for example personal loans, business loans, overdrafts) would
present different patterns of credit risk, so each data mining project will
concentrate on one type of borrower. A number of factors are judged (by the
bank’s domain experts) to be relevant. Are these factors collected by the bank?
Yes, they have looked ahead and started collecting the relevant factors, eighteen
months ago. Lots of borrowing has taken place in the intervening time so there
should be no problem about data! How many bad debts of the relevant kind
have occurred in that time? Plenty - almost a thousand! Thus the relevant data
consists of less than a thousand cases of bad debt plus a sample from a plentiful
supply of cases of good debt - say 3,000 records in all. Somehow, terabytes of
data have ‘softly and silently vanished away’, fortunately not (quite) taking the
data miner with them (this time).
3.3 Pitfall #3: Insufficient business knowledge
I have emphasised previously the crucial role played by business knowledge in
data mining. Without it, we can neither recognise useful results nor guide the
data mining process towards them.
It is sometimes supposed that the end user of data mining can reasonably take
the attitude: “here is the data, please go away and mine it, and come back with
the answers”. When a data mining project is organised in this way, at best the
project will take many long and costly iterations to produce useful results, and at
worst the results will be gibberish and the project will fail.
This pitfall can only be avoided by involving the end user, and more
specifically someone with a detailed knowledge of the business, at every stage of
the data mining process. Ideally the data miner should be part of that business,
but if a data mining consultant is used then the consultant should literally sit next
to someone with the required business knowledge who understands the question
under consideration.
For this to work, a highly interactive data mining
environment with good response time is required. (A data mining consultant
with general knowledge of the relevant industry is not sufficient – detailed
knowledge of the specific business is needed.)
3.4 Pitfall #4: Insufficient data knowledge
In order to perform data mining we must be able to answer questions like “what
do the codes in this field mean?”, and “can there be more than one record per
customer in this table?”. In some cases this information is surprisingly hard to
come by – for example because the data expert has left the organisation or
moved to another department, or in the case of legacy systems there may be no
data expert at all. This problem is exacerbated when the database or data
warehouse management is outsourced – the external supplier is even less
motivated than the user organisation to maintain the information “in case it is
needed in future”.
There is no simple cure for this problem.
IT departments should be made
aware of the need to maintain information about the organisation’s databases,
and when a data mining project is proposed we should consider how much data
knowledge is available, and any risks caused by its absence or scarcity.
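Questions like “can there be more than one record per customer in this table?” can at least be answered empirically when no data expert is available. A minimal sketch, assuming the table can be read into memory; the rows and field names below are illustrative, not from the paper.

```python
from collections import Counter

# Illustrative extract; in practice these rows would be read from the
# organisation's database or data warehouse.
rows = [
    {"customer_id": "A1", "product": "loan"},
    {"customer_id": "A1", "product": "overdraft"},
    {"customer_id": "B2", "product": "loan"},
]

# Count records per customer and list any customer appearing more than once.
counts = Counter(r["customer_id"] for r in rows)
duplicated = sorted(cust for cust, n in counts.items() if n > 1)

# A non-empty result settles the question empirically: yes, a customer
# can hold more than one record in this table.
print(duplicated)
```

Such a check does not replace a data expert – it cannot explain what the codes *mean* – but it can rule out wrong assumptions before they corrupt the analysis.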
3.5 Pitfall #5: Erroneous assumptions, courtesy of the experts
Business and data expertise are crucial resources for data mining, but that does
not mean that the data miner should accept unquestioningly every statement of
the experts. One benefit from data mining is that organisations discover
surprising facts about their data and about their business. The data miner should
seek to confirm the truth of experts’ statements so far as they relate to the data.
Typical examples of erroneous or misleading statements would include:
● No customer can hold accounts of both these types.
● No case will include more than one event of this type.
● Only the following codes will be present in this field.
Statements like this should be verified by examining the data. Data mining
tools should make this easy. It is particularly important to check these issues
when processing of the data will depend on them, so that mistakes in these
assumptions can be spotted before they lead to errors in the treatment of data.
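Verifying the third kind of statement – “only the following codes will be present in this field” – amounts to a set comparison. A sketch with illustrative codes and field values:

```python
# Codes the expert claims are the only ones present (illustrative).
claimed_codes = {"A", "B", "C"}

# Values actually observed in the field (illustrative).
observed = ["A", "B", "A", "X", "C"]

# Any observed value outside the claimed set contradicts the expert.
unexpected = sorted(set(observed) - claimed_codes)
if unexpected:
    print("Expert assumption violated; unexpected codes:", unexpected)
```

Running such checks before building any processing on top of the assumption is exactly the point of the pitfall: the mistake is caught while it is still cheap.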
3.6 Pitfall #6: Incompatibility of data mining tools
The data mining process requires a wide range of facilities, so it might be
supposed that a wide variety of tools will be used. This can lead to a high
overhead in switching contexts and converting data between different formats.
At its worst this can lead to the omission of necessary steps, and even mild cases
can seriously interfere with the exploratory character of data mining.
The most readily available solution is to use a data mining toolkit in which
all the required facilities are present in an integrated form. However, no toolkit
will provide every possible facility, especially when the individual preferences
of analysts are taken into account, so toolkits should also be “open”, and
interface easily with other available tools and third-party options.
3.7 Pitfall #7: Locked in the data jail-house
In addition to openness with regard to tools, data mining systems should be open
with regard to data. Some data mining tools require the data to be held in a
proprietary format which is not compatible with commonly used database
systems. (This is sometimes referred to as the “data jail-house”.)
This can result
in large overheads to transfer data into the format required, and difficulty in
deploying the results into an organisation’s systems. A good data mining tool
will interface to your data via common standards.
3.8 Pitfall #8: Disorganized data mining
This common pitfall is often a consequence of the “apply the algorithm, use the
results” misconception (see myth #6). The data mining takes place in an ad-hoc
manner, with no clear goals and no idea of how the results will be used. The
consequences can be unusable results.
To produce useful results, it is necessary to have clearly defined business and
data mining goals, formulated early in the project, along with deployment plans.
A simple way of ensuring this is to use a standard process such as CRISP-DM
[1]; this ensures the correct preparation for data mining, and provides a common
language for communication of methods and results. Data mining tools should
support standard process models.
4 Conclusions
Data mining is a business process, requiring extensive business knowledge and
best practiced by, or in very close collaboration with, business experts.
Data mining uses a variety of different kinds of techniques, and should not be
focussed mainly or exclusively on modelling algorithms and their predictive
accuracy. Each technique can play a variety of roles.
Data miners should make intelligent decisions about the amount of data
required, assuming neither that all of an organisation’s data will be relevant, nor
that all the available data will be required.
Effective data mining requires flexible and interoperable techniques; this
requirement is best met by integrated, open toolkits, which can interface to data
via open standards.
The data mining process can be characterised by interaction and engagement
with the data in an iterative fashion. A standard data mining process model such
as CRISP-DM helps to ensure the correct preparation for and use of data mining,
and should be supported by data mining tools.
References
[1] Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C.
and Wirth, R. CRISP-DM 1.0 Step-by-step data mining guide, CRISP-DM
Consortium, 2000, available at http://www.crisp-dm.org.