Data Mining

Data mining has become a valuable technique that businesses all over the world rely on to better understand their markets and, consequently, gain competitive advantage. It is possible only because of the rapid development of computer hardware: fast processors, abundant RAM options, and the evolution of data storage devices. Businesses all over the planet are learning to take advantage of this combination of factors. Many data mining software programs are available, but, as is usually the case in MIS, the best system for you depends on what you want to accomplish and on your current situation.

What Is Data Mining?

Data mining can be defined as “the process of searching and analyzing data in order to find implicit, but potentially useful, information. It involves selecting, exploring, and modelling large amounts of data to uncover previously unknown patterns, and ultimately comprehensible information, from large databases” (Shaw, Subramanian, Tan, and Welge 2001:128). This technique allows organizations to use the data residing in enormous databases to their own advantage. It provides statistical tools for uncovering hidden patterns that may help explain why a certain group of customers behaves in a certain way. Your textbook provides a few cases in which data mining techniques have helped organizations succeed.

The statistical tools available in data mining software applications are broad in range, from regression analyses to genetic algorithms. Some examples of these tools are association discovery, Bayesian statistics, Bayesian networks, classification trees, regression trees, conceptual clustering, decision trees, fuzzy logic, genetic algorithms, and neural networks. Every software package provides a different combination of tools.
Therefore, when selecting data mining software, one should carefully analyze the needs of the organization. For the information to be useful, knowledge workers who perform such analyses must avoid common mistakes that not only invalidate the results but could also give the organization the wrong answer to the initial problem.

What Are Some Key Considerations Before Mining Data?

As in any research or statistical analysis, one must start with a correct understanding of what needs to be done and why. This requires understanding the problem, selecting the right set of actions to address it, and analyzing the results in light of the initial question(s). If data mining (or any other data analysis) is undertaken without this initial consideration, chances are that the final product will not solve any problems. On the contrary, it might create more serious ones.

For example, if a business would like to know its customers’ buying behaviour at Thanksgiving, what would be a sensible approach? First, the analyst would have to restrict the data to a certain period of time (e.g., from a month before the Thanksgiving holiday to a few days after, since many people shop after the holiday to save money). It would be nonsense to include year-round data. Also, if the business has operations in Canada and the United States, a different time period would apply to each population, since Thanksgiving falls earlier in Canada (October) than in the U.S. (November). This is an example of correctly analyzing and understanding the task before moving on to data preparation.

There have been many discussions on how to correctly address the data mining process. One result of such discussions among professionals in the field is the CRISP-DM Model. CRISP-DM stands for CRoss Industry Standard Process for Data Mining.
You can find more detailed information at the following Web site: http://www.crisp-dm.org. The model calls for a six-phase approach to data mining:

(1) Business Understanding: understand the project objectives and requirements.
(2) Data Understanding: collect the relevant data, become familiar with it, and look critically at its quality.
(3) Data Preparation: clean, organize, and prepare the data for analysis by one or more of the statistical tools available. Good data preparation leads to data that can be trusted and that will yield valid results.
(4) Modeling: the phase in which data analysis takes place. It is important to notice that some modelling techniques require data to be prepared in specific ways, hence the importance of the previous phase.
(5) Evaluation: this phase requires a serious and unbiased analysis of your results. Here you need to find out whether the model or models are really valid and useful. You need to make sure that the analysis was done right, that the data was prepared right, and that the results make business and research sense.
(6) Deployment: put the model(s) into action.

Please refer to the above-mentioned Web site to gain a more in-depth understanding of this critical process. If you follow such a model and perform the tasks seriously and to high standards, you will avoid many serious pitfalls that would invalidate your findings.

Some of the software available provides a step-by-step approach to performing some of the above-mentioned phases. For example, SAS Enterprise Miner makes use of what is called the SEMMA approach: Sample, Explore, Modify, Model, and Assess. The goal of this approach is to provide the user with a logical, organized framework for conducting data mining. It should be noted that, in the case of the example presented, phases 1 and 6 are not part of SEMMA.
It is assumed that you and your organization have done your "homework" (phase 1) before using the tools provided with a data mining package. Phase 6 is how you put the model into practice in your organization; it is not part of a data mining package but follows from the overall business understanding and the purpose of the model. You can find more information on the SEMMA approach and SAS Miner at http://www.sas.com.

Problems Occurring When Data Is Not Correctly Cleaned and Prepared

Not spending enough time cleaning and preparing data will raise questions about the validity and reliability of your analysis. Textbooks on research design and statistical techniques give the analyst many reasons for carefully cleaning and preparing data before any analysis is run. Here are some examples:

a) Missing values. When we survey people (customers, clients, employees, colleagues, etc.), many questions tend not to be answered. In this case, the analyst should understand (via descriptive statistics) how important the missing values are to his/her analysis. Let’s say that what the analyst is looking at requires information on income, family size, and age. If respondents did not provide enough income information (people in higher income brackets do not report income as often as people in lower income brackets), a comparison between different levels of income may reflect a bias. The results will not reliably demonstrate the difference in behaviour between higher-income and lower-income customers. Age is another big issue when it comes to missing values. If the sample selected does not adequately represent the age groups (due to lack of responses), any analysis dependent on age differences is biased.

b) Correct sample selection. Not selecting the correct sample also brings problems. Let’s say that the question the analyst needs to answer requires enough subjects from the teenage and young adult groups of customers.
If the analyst fails to understand that his/her analysis should zoom in on these two groups, the results will not provide enough statistical evidence for correct decision making, and the analysis (or data collection) will have to be redone.

c) Outliers. Outliers are values that lie far from the other answers to the same question and can significantly distort its distribution. Consider this example: The analyst has a nice breakdown of subjects in income brackets. Let’s also say the analyst didn’t notice that, due to an external factor (market behaviour), for a certain period of time (say March to April) a sub-sample of subjects in the higher-income bracket bought substantially more goods than they usually do. If the analyst had plotted the data, s/he would have been able to spot the phenomenon described. (The sub-sample above will show as dots that are far away to the right of the "money spent in purchases" normal curve.) S/he would then make a decision on what to do with those data points. If the analyst did not realize what was happening and ran the analysis, s/he would be tricked into believing that clients are spending more money purchasing their goods than is true. Good data preparation catches outliers, and good decision making (based on knowledge of the data, the problem, and research/statistics) allows the analyst to deal correctly with such extreme cases so that they do not bias the distribution and the final results of the analysis.

d) Normality. Many statistical techniques require that variables be normally distributed (the bell-shaped curve). Failure to test for normality, and to transform variables so that their distributions come closer to the normal distribution, will invalidate the results of the analysis.

There are many other considerations that a knowledge worker (analyst) should tackle when collecting, cleaning, preparing, and analyzing data. Good knowledge of research and statistical techniques is required to perform a correct analysis.
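The checks described in items a), c), and d) above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of any particular data mining package: the records, field names, and figures are invented, the outlier rule (Tukey's fences at 1.5 times the interquartile range) is one common convention standing in for the visual inspection the text describes, and skewness is used as a rough indicator of departure from normality rather than a formal test.

```python
import math
import statistics

# Hypothetical survey records; None marks an unanswered question.
records = [
    {"income": 32000, "age": 24, "spend": 410.0},
    {"income": 41000, "age": 31, "spend": 520.0},
    {"income": None, "age": 45, "spend": 480.0},
    {"income": 58000, "age": None, "spend": 610.0},
    {"income": 36000, "age": 29, "spend": 450.0},
    {"income": None, "age": 52, "spend": 5900.0},  # unusually large purchase
    {"income": 47000, "age": 38, "spend": 560.0},
    {"income": 39000, "age": 27, "spend": 430.0},
]

# Hypothetical right-skewed variable (each value doubles the previous one).
skewed = [10000, 20000, 40000, 80000, 160000]


def missing_rate(rows, field):
    """Share of rows in which `field` was left unanswered."""
    return sum(1 for r in rows if r[field] is None) / len(rows)


def iqr_outliers(values, k=1.5):
    """Values outside Tukey's fences: below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def skewness(values):
    """Population skewness; values near 0 suggest a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


spend = [r["spend"] for r in records]

# a) Missing values: how much of each variable is unanswered?
print("missing income:", missing_rate(records, "income"))
print("missing age:", missing_rate(records, "age"))

# c) Outliers: flag extreme purchases for the analyst to review.
print("spend outliers:", iqr_outliers(spend))

# d) Normality: compare skewness before and after a log transform.
print("skew (raw):", skewness(skewed))
print("skew (log):", skewness([math.log(v) for v in skewed]))
```

The functions only flag potential problems. Whether to drop, impute, or keep a flagged value remains the analyst's decision, based on knowledge of the data and the business question.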
SAS Enterprise Miner

Parts of the information below can be found in "Enterprise Miner: Applying Data Mining Techniques Course Notes" by SAS.

SAS Miner is a user-friendly, graphics-oriented tool that allows the user to define what needs to be done during the data mining process by selecting “nodes” (clicking on them with a mouse) and placing them (by dragging) in the “workspace” area. These nodes can then be easily connected by graphically inserted lines, much as in drawing or other visualization software. That is, once the knowledge worker/analyst has decided what needs to be done, s/he can tell the software what to do through this graphical interface instead of having to write lines of syntax.

The SEMMA approach that SAS Miner uses guides the knowledge worker/analyst through the process of preparing the data and running the desired analysis. The first step is Sample. In this step, the analyst identifies the input data sets. The task here is to identify the data to be input into the software, select a sample from the larger data set, and partition the data set into training, validation, and test data. The second step is Explore. In this step the analyst should explore the data statistically and graphically by means of plots and descriptive statistics. This step allows one to know the data better and to select the appropriate and important variables. The third step is Modify. Data is prepared for the subsequent analysis.
Here the analyst will transform variables (e.g., to compensate for skew); re-group answers (e.g., marital status from five categories, namely single, married, divorced, widowed, and living with partner, to two categories, namely having a partner/spouse and not having a partner/spouse); deal with missing values (replace them or not); identify outliers; select control, dependent, and independent variables; and perform any other analyses required by the investigation the analyst plans to perform. These first steps allow for what, in research methodology, is referred to as "getting to know and cleaning your data." They are of extreme importance, since the correct preparation of data for analysis is what underpins the reliability of the results (see the previous section). The fourth step is Model. A model or models will be created by using a regression, decision tree, neural network, or user-defined model. The fifth step is Assess. The analyst compares competing predictive models.

One very important point in all of this is that the knowledge worker can make sound decisions only if s/he has understood the business needs and understands research and statistical analysis methodology. This is why it is important to delegate such tasks to workers who have been trained in research methods and statistics.

Conclusion

Data mining does, indeed, provide organizations with very powerful analytical tools for extracting relevant and useful information from their ever-growing data pools. There is no doubt that a wealth of relevant and crucial information that businesses can learn from lies dormant in databases. But, as this textbook correctly emphasizes, one needs to understand the organization and its needs in order to ask the right questions, choose the right systems, and implement the right solutions. Data mining without careful consideration of factors such as the ones presented in the CRISP-DM methodology is doomed to sub-optimal results.
References

1. CRISP-DM: CRoss Industry Standard Process for Data Mining Methodology. Downloaded from web site: http://www.crisp-dm.org.
2. Shaw, M.J., Subramanian, C., Tan, G.W. and Welge, M.E. (2001). Knowledge Management and Data Mining for Marketing. Decision Support Systems, 31, 127-137.
3. Enterprise Miner: Applying Data Mining Techniques, Course Notes. Edited by SAS. 1999.