© 2002 WIT Press, Ashurst Lodge, Southampton, SO40 7AA, UK. All rights reserved. Web: www.witpress.com Email [email protected] Paper from: Data Mining III, A Zanasi, CA Brebbia, NFF Ebecken & P Melli (Editors). ISBN 1-85312-925-9

Hard hats for data miners: Myths and pitfalls of data mining

T. Khabaza
SPSS Advanced Data Mining Group

Abstract

The intrepid data miner runs many risks, such as being buried under mountains of data or vanishing along with the "mysterious disappearing terabyte". This paper debunks some myths and sketches some "hard hats for data miners".

1 Introduction

Data mining is a business process: finding patterns in your data which you can use to do your business better. Through data mining we gain insight into a business problem; this insight may be of use in itself, but it also helps us to gain the other benefits of data mining, such as a predictive capability. This paper is about the practice of data mining; it is not a research paper, but reports lessons learned through solving practical business problems and through contact with many data mining users and potential users. There are many myths and misconceptions about data mining, and holding these misconceptions leads data mining users to run specific risks. The first half of this paper lists some common misconceptions about data mining, corrects them, and describes the risks to which they can lead. The second half lists other common problems or pitfalls of data mining, with their symptoms and cures.

2 Myths and misconceptions about data mining

2.1 Myth #1: Data mining is all about algorithms

The ordinary business-person, attending a typical data mining conference, reading its proceedings, or even reading only the contents page of such a proceedings, could be forgiven for thinking that data mining is all about advanced data analysis algorithms. This misconception might be summarised as "all you need for data mining is good algorithms; the better your algorithms, the better your data mining", and its corollary "advancing the state of the art in data mining means advancing knowledge of algorithms". To hold this view is to misunderstand the data mining process completely. Data mining is a business process, involving many elements such as formulating business goals, mapping business goals to data mining goals, acquiring, understanding and pre-processing the data, evaluating and presenting the results of analysis, and deploying these results to achieve business benefit, as well as the modelling component. (A good explanation of this process can be found in the emerging industry standard process model CRISP-DM [1].) In their extreme form, the consequences of holding this misconception are disastrous for a data mining project, and such a project will fail to produce any useful results. In practice, this occurs only in the narrowest, most academic of projects, where useful results for the business are not absolutely required. In any project where there is a requirement for the results to benefit the business, the data miner who holds this misconception is forced to discard it, at least partially, and face the need for a broader view of the data mining process. This is not to denigrate those parts of data mining research which develop or improve data mining algorithms. Algorithms play a key role in data mining, and new or improved algorithms are one way in which the art of data mining advances. The problem occurs when we focus mainly or solely on algorithms and ignore the rest of the data mining process.
2.2 Myth #2: Data mining is all about predictive accuracy

Above I have rejected the notion that data mining is all about modelling algorithms, but within that part of data mining which is about algorithms, how can we judge the quality of an algorithm? Reading the data mining research literature might lead us to suppose that the main criterion for judging an algorithm is the predictive accuracy of the models it generates. This view completely misrepresents the role of algorithms in the data mining process. It is true that in order to be useful a predictive model should have some degree of accuracy, because this reflects whether the algorithm has really discovered patterns in the data. However, many other properties of an algorithm or a model affect its usefulness; examples include whether the model can be understood by the analyst, and whether technical knowledge is required to understand the model or apply the algorithm. Considering the properties (other than predictive accuracy) which the data mining process requires of algorithms, we can see the likely consequences of holding this mistaken view: algorithms will be produced which can be used only by technology experts. Such algorithms can have only the most limited role in a process which is driven by business expertise.

2.3 Myth #3: Data mining requires a data warehouse

Data mining practitioners often hear statements like "we are not ready for data mining yet; we need to build our data warehouse first". Such statements are based on the view that data warehousing is a prerequisite for data mining. This is a subtle misconception about the relationship between data warehousing and data mining.
It is true that data mining can benefit from the warehoused data being well organised, relatively clean, and easy to access. These benefits can accrue if the warehouse has been constructed with data mining specifically in mind, and with knowledge of the requirements of the data mining envisaged. If it has not, the warehoused data may be less useful for data mining than the source operational data, or in the worst case completely useless (for example in cases where only summary data is warehoused). To avoid this risk, it is useful to perform pilot data mining projects using operational data in order to determine the correct content and organisation for the warehouse. It is misleading to state that data mining requires a data warehouse; a more accurate summary of the relation would be that data mining can benefit from a data warehouse, but that to construct such a warehouse often requires data mining.

2.4 Myth #4: Data mining is all about vast quantities of data

Early explanations of data mining in the computing press often start with statements like "We now collect more data than ever, yet how are we to gain benefit from these vast data stores?". To focus on the size of data stores provides a convenient introduction to the topic of data mining, but subtly misrepresents its nature. Data mining becomes useful when data becomes too large or too complex to analyse "by eye", that is, anything larger than a few tens of examples and a handful of attributes. Many useful data mining projects are performed on small or medium-sized datasets, for example containing only hundreds or thousands of records. Apart from its convenience in popular explanations, the association of data mining with vast datasets is also connected with the recent emphasis on performance and scalability of data mining tools. This drive to extend the reach of data mining tools to large data is perfectly justified – there are many large datasets which it benefits us to mine.
However, it would be a mistake to believe that these large datasets are the sole focus of data mining. Holding this erroneous belief would lead us to produce tools which sacrifice usability for scalability, whereas in fact both aspects are essential. To quote a customer of a leading data mining tool: "other data mining tools optimise machine time, but this tool optimises my time". Whether the datasets are large or small, we must strive to optimise the user's time, and this may be assisted by scalability and performance.

2.5 Myth #5: Data mining should be done by a technology expert

Data mining technology, particularly its modelling techniques, is of an advanced sort, and its workings are unlikely to be understood by the wider IT community. Some would claim that this means it should be applied only by technology experts who understand its workings. (This claim may be influenced by a historical association with statistical modelling algorithms, which are more open to misinterpretation than most data mining algorithms.) In fact, the very reverse is true, because of the paramount importance of business knowledge in data mining. When performed without business knowledge, data mining usually produces nonsensical or useless results (see pitfall #3 below). It is therefore essential that data mining is performed by someone with extensive knowledge of the business problem, which is very seldom combined with knowledge of the technology. It is the responsibility of data mining tool providers to ensure that tools are accessible to business rather than technology experts.
It behoves the data mining community at large to make clear to potential users that data mining provides insight and useful suggestions, rather than mathematical certainty.

2.6 Myth #6: Neural networks are opaque and consequently useless – an over-simplistic view of data mining

Myth #6 is a relatively specific misconception about one family of modelling techniques (neural networks) which arises from a broader misunderstanding about the data mining process. One sometimes encounters the view that neural networks are not very useful in data mining because one cannot discover why they make the predictions that they do, or the "rules" that they use. This means that their predictions cannot be justified, and that they will not contribute much insight. While this argument reflects a correct emphasis on the understandability of models and the insight produced by data mining, the conclusion about the disutility of neural networks is erroneous, and the argument reflects a mistaken view of the data mining process, possibly related to myth #1. This mistaken view regards data mining as a rather simple process: "take the data, apply a modelling technique, use the results". This omits the iterative nature of the data mining process, and the way in which many techniques are used together to produce a result. Neural networks are used in a variety of ways in data mining projects, uses which are not impacted by the opacity of the models. Here are some examples:

● Neural networks can be used for attribute selection, either by training them repeatedly with different combinations of attributes, or by using techniques of "sensitivity analysis" to rank the attributes by their impact on predictions.

● Neural networks can be used for "pattern confirmation" – because they are particularly powerful "pattern finders" for many applications, neural networks can be used to confirm that a pattern exists, before spending effort on tuning other techniques to find it.
● Neural networks can be used in conjunction with other techniques, for example to improve the confidence of predictions by discarding those where the neural network disagrees with the predictions of another technique, or by using other techniques to analyse the behaviour of the neural networks.

All of these uses of neural networks reflect the fact that the data mining process cannot be summarised as "apply a modelling technique and use the results". Data mining facilities form a "toolbox", whose contents are used in varied and sometimes surprising ways to solve a problem.

3 Pitfalls of data mining and their cures

3.1 Pitfall #1: Buried under mountains of data

Data mining should be an interactive, iterative process in which the analyst applies substantial business knowledge and is "engaged" with the data. However, those who hold myth #4 (that data mining is about vast quantities of data) often suppose that this process must be applied to all of the available data. This can lead to attempts to mine volumes of data for which the available hardware and software cannot provide an acceptable interactive response (for example, building a model within a few minutes). The data mining process becomes sluggish, and by the time a question is answered, the analyst cannot remember why it was asked. It is hard to feel that this process is generating insight. The cure for this malaise is usually some form of sampling. For example, if we have a million customers and a 20% annual attrition (or "churn") rate, we need not plot our graphs or build our models using the full million examples, or even half a million (leaving, say, half for independent results validation).
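The sampling cure lends itself to a very simple implementation. The sketch below is illustrative only – the record layout, field names and `balanced_sample` helper are hypothetical, echoing the million-customer, 20%-churn example above – and draws an equal-sized random sample of churners and non-churners to give a manageable interactive working set.

```python
import random

def balanced_sample(records, is_churner, n_per_class, seed=0):
    """Draw an equal-sized random sample of churners and non-churners.

    A hypothetical sketch of the sampling cure for pitfall #1: rather than
    modelling all million customers, work interactively on a balanced
    sample of a few tens of thousands.
    """
    rng = random.Random(seed)  # seeded, so the sample is reproducible
    churners = [r for r in records if is_churner(r)]
    others = [r for r in records if not is_churner(r)]
    return (rng.sample(churners, min(n_per_class, len(churners))) +
            rng.sample(others, min(n_per_class, len(others))))

# Illustrative figures from the text: a million customers, 20% churn.
customers = [{"id": i, "churned": i % 5 == 0} for i in range(1_000_000)]
sample = balanced_sample(customers, lambda r: r["churned"], 10_000)
# The interactive working set is now 20,000 records instead of a million.
```

Half of the remaining records can still be held back for independent validation of the results, as suggested above.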
Consider the following questions and answers:

Q: How many churn profiles do we expect to find?
A: Maybe ten.
Q: How many examples of each profile do we need?
A: Maybe a thousand.

Conclusion: a sample of ten or twenty thousand churners, and an equivalent number of non-churners, will be sufficient for this analysis.

Note that this does not mean that we will never encounter the need to build models from millions of examples, only that we should not assume that we must do so just because the data is available. One interesting class of cases is those where we wish to find a "rare" profile. Suppose that we wish to find a specific phenomenon which causes only 1% of churn. It might be thought that we must build models against the whole dataset in order to find it. However, there are other approaches. For example, we might find the common churn profiles first, using a relatively small sample to build the models, use these initial profiles to score the entire database, and then focus subsequent analysis on the relatively small number who churn but were predicted not to do so.

3.2 Pitfall #2: The Mysterious Disappearing Terabyte

This is a common phenomenon, but not always a pitfall. The phrase "mysterious disappearing terabyte" refers to the fact that for a given data mining problem, the amount of available and relevant data may be much less than initially supposed. Consider the following scenario: You are a data mining consultant, and your client is a large bank, holding terabytes of data on its customers. There is some concern that the available computing resources will be inadequate for mining this volume of data. The bank wishes to mine information on credit risk.
Different types of credit (for example personal loans, business loans, overdrafts) would present different patterns of credit risk, so each data mining project will concentrate on one type of borrower. A number of factors are judged (by the bank's domain experts) to be relevant. Are these factors collected by the bank? Yes, they have looked ahead and started collecting the relevant factors – eighteen months ago. Lots of borrowing has taken place in the intervening time, so there should be no problem about data! How many bad debts of the relevant kind have occurred in that time? Plenty – almost a thousand! Thus the relevant data consists of less than a thousand cases of bad debt, plus a sample from a plentiful supply of cases of good debt – say 3,000 records in all. Somehow, terabytes of data have 'softly and silently vanished away', fortunately not (quite) taking the data miner with them (this time).

3.3 Pitfall #3: Insufficient business knowledge

I have emphasised previously the crucial role played by business knowledge in data mining. Without it, we can neither recognise useful results nor guide the data mining process towards them. It is sometimes supposed that the end user of data mining can reasonably take the attitude: "here is the data, please go away and mine it, and come back with the answers". When a data mining project is organised in this way, at best the project will take many long and costly iterations to produce useful results, and at worst the results will be gibberish and the project will fail. This pitfall can only be avoided by involving the end user, and more specifically someone with a detailed knowledge of the business, at every stage of the data mining process. Ideally the data miner should be part of that business, but if a data mining consultant is used then the consultant should literally sit next to someone with the required business knowledge who understands the question under consideration.
For this to work, a highly interactive data mining environment with good response time is required. (A data mining consultant with general knowledge of the relevant industry is not sufficient – detailed knowledge of the specific business is needed.)

3.4 Pitfall #4: Insufficient data knowledge

In order to perform data mining we must be able to answer questions like "what do the codes in this field mean?" and "can there be more than one record per customer in this table?". In some cases this information is surprisingly hard to come by – for example because the data expert has left the organisation or moved to another department; in the case of legacy systems there may be no data expert at all. This problem is exacerbated when the database or data warehouse management is outsourced – the external supplier is even less motivated than the user organisation to maintain the information "in case it is needed in future". There is no simple cure for this problem. IT departments should be made aware of the need to maintain information about the organisation's databases, and when a data mining project is proposed we should consider how much data knowledge is available, and any risks caused by its absence or scarcity.

3.5 Pitfall #5: Erroneous assumptions, courtesy of the experts

Business and data expertise are crucial resources for data mining, but that does not mean that the data miner should accept unquestioningly every statement of the experts. One benefit of data mining is that organisations discover surprising facts about their data and about their business. The data miner should seek to confirm the truth of experts' statements so far as they relate to the data.
Typical examples of erroneous or misleading statements would include:

● No customer can hold accounts of both these types.
● No case will include more than one event of this type.
● Only the following codes will be present in this field.

Statements like these should be verified by examining the data, and data mining tools should make this easy. It is particularly important to check these issues when processing of the data will depend on them, so that mistakes in these assumptions can be spotted before they lead to errors in the treatment of data.

3.6 Pitfall #6: Incompatibility of data mining tools

The data mining process requires a wide range of facilities, so it might be supposed that a wide variety of tools will be used. This can lead to a high overhead in switching contexts and converting data between different formats. At its worst this can lead to the omission of necessary steps, and even mild cases can seriously interfere with the exploratory character of data mining. The most readily available solution is to use a data mining toolkit in which all the required facilities are present in an integrated form. However, no toolkit will provide every possible facility, especially when the individual preferences of analysts are taken into account, so toolkits should also be "open", interfacing easily with other available tools and third-party options.

3.7 Pitfall #7: Locked in the data jail-house

In addition to openness with regard to tools, data mining systems should be open with regard to data. Some data mining tools require the data to be held in a proprietary format which is not compatible with commonly used database systems. (This is sometimes referred to as the "data jail-house".)
This can result in large overheads to transfer data into the required format, and difficulty in deploying the results into an organisation's systems. A good data mining tool will interface to your data via common standards.

3.8 Pitfall #8: Disorganised data mining

This common pitfall is often a consequence of the "apply the algorithm, use the results" misconception (see myth #6). The data mining takes place in an ad-hoc manner, with no clear goals and no idea of how the results will be used. The consequence can be unusable results. To produce useful results, it is necessary to have clearly defined business and data mining goals, formulated early in the project, along with deployment plans. A simple way of ensuring this is to use a standard process such as CRISP-DM [1]; this ensures the correct preparation for data mining, and provides a common language for the communication of methods and results. Data mining tools should support standard process models.

4 Conclusions

Data mining is a business process, requiring extensive business knowledge and best practised by, or in very close collaboration with, business experts. Data mining uses a variety of different kinds of techniques, and should not be focussed mainly or exclusively on modelling algorithms and their predictive accuracy. Each technique can play a variety of roles. Data miners should make intelligent decisions about the amount of data required, assuming neither that all of an organisation's data will be relevant, nor that all the available data will be required. Effective data mining requires flexible and interoperable techniques; this requirement is best met by integrated, open toolkits which can interface to data via open standards. The data mining process is characterised by interaction and engagement with the data in an iterative fashion.
A standard data mining process model such as CRISP-DM helps to ensure the correct preparation for and use of data mining, and should be supported by data mining tools.

References

[1] Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C. and Wirth, R. CRISP-DM 1.0 Step-by-step data mining guide, CRISP-DM Consortium, 2000, available at http://www.crisp-dm.org.