The Data Mining Process
With CRM becoming more of a business philosophy for most organizations today, data
mining is often viewed as the analytical technology required to achieve a given
solution. But what does this really mean? For many people, the fact that data mining is
viewed as the technological component implies that purchasing the right software and
hardware is the key to effective data mining. Other schools of thought regard the use of
statistics and/or computer programming, along with machine-learning algorithms, as the
data mining component. Yet these preconceived perceptions miss the mark, in the sense
that these notions represent specific components within the overall data mining process.
Data mining is a step-by-step process that requires the human element interacting with
technology in order to produce the best business solution for a given problem. This is
best understood by explaining the process and what is involved within each step.
In the many articles and books that have been written about data mining, authors will
have differing opinions on the number of steps or stages within a given data mining
project. However, as one reads through a number of these books or articles, common
themes emerge concerning the critical junctures within a data mining project. In this
article, we look at the data mining process as comprising four major steps or stages:
1) Identification of the Business Problem or Challenge
2) Creation of the Analytical File
3) Application of The Appropriate Tools and Technology
4) Implementation and Tracking
Identification of the Business Problem or Challenge
You will not get much argument from any of the pundits in recognizing that this is
probably the most important stage within the entire process. In fact, it is rather ironic
that, given the plethora of data mining technology available in the marketplace, this is
the part of the process where the human element matters most. This capability relies on
the expertise to critically assess a given business situation, both quantitatively and
qualitatively, and to determine its importance given the overall business strategy. An
example best illustrates this argument:
An organization’s overall sales had significantly decreased within the last year. It was
also found that 75% of the company’s overall sales resulted from their best 20% of
customers. The analyst and marketer identified that customer attrition had significantly
increased within this last year.
With this information, it was decided that an attrition model to identify high-risk attritors
would enable marketers to allocate more resources to this vulnerable group in the hope of
retaining them. Yet this thinking was flawed on two business fronts. First, analysis should
have been conducted on the high-value segment to determine the extent of the attrition
problem within this group. Given the preliminary information in this specific business
case, it is extremely likely that the attrition problem is highly prevalent within this
segment. The other consideration, before embarking on developing a model, is whether
CRM or data mining has any relevance in this situation at all. For instance, if the reason
for losing sales is that the competition has created new products or services whose price
points and benefits are far superior to the company's, being superior CRM practitioners
through data mining is not going to stop this sales erosion. As you can see in this case,
some upfront data analysis complemented with some market research is the preferred
route to take at this point.
Another important consideration in being able to identify a problem is to understand the
current data environment and its implications for resolving a certain business problem. A
company may want to launch a new product and develop a cross-sell model to target
those customers who are most likely to purchase this product. In building a model, the
analyst needs to understand that there is no prior information or history on this specific
product with which to develop a model. The analyst might then consider whether there
are other similar products with previous history which could be used to develop a broad
profile. If this were not the case, the analyst might want to think of ways to identify
customers who might be early adopters (i.e. the early pioneer purchasers of new
products).
As one can surmise from the above examples, some creative thinking needs to be
employed to ensure that we are identifying the right problem or challenge given the
current business circumstances. Once again, strong collaboration between the marketing
and data mining areas is the key to really harnessing the creative thinking of both
functional areas.
Creation of the Analytical File
Once the business challenge or problem is identified, the analyst needs to understand the
data and information requirements which will enable him/her to conduct the necessary
analytics. This does not mean that the analyst needs to undergo a rigorous data needs
analysis as one would undertake in building or designing a database. The analyst is only
concerned with what is already there and not what should be there. Once the analyst has
identified a file and some of its contents as being potentially relevant for an analysis, the
analyst in most cases would request the entire file as one source of data in the project.
Other files would also be requested depending on their relevance to the project and
whether or not customer-related information can be extracted.
With the source files determined, the analyst then needs to understand the quality of the
data. In other words, are there certain fields that have a large proportion of missing
values?
Tenure       # of Customers    % of Customers
1998                 49,000               14%
1999                 49,000               14%
2000                 59,500               17%
2001                 42,000               12%
Missing             150,500               43%
Total               350,000              100%
In the above case, we can see that 43% of the customer base has no start date as a
customer. A number of techniques can be used to handle these missing values. For
instance, imputing the average or median of the non-missing values is a popular way of
dealing with missing values. A more robust but much more time-consuming approach is
to build a model or algorithm that predicts the value of the variable based on the other
fields or characteristics within the database.
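A minimal sketch of both approaches in Python with pandas and scikit-learn is shown
below; the file and field names are hypothetical and serve only to illustrate the idea.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical customer file with a partially missing tenure start year.
customers = pd.read_csv("customer_file.csv")

# Simple approach: impute the median of the non-missing start years.
median_year = customers["tenure_start_year"].median()
customers["tenure_imputed"] = customers["tenure_start_year"].fillna(median_year)

# Model-based approach: predict the missing start year from other fields.
predictors = ["total_spend", "num_purchases", "num_products"]   # hypothetical fields
known = customers[customers["tenure_start_year"].notna()]
missing = customers[customers["tenure_start_year"].isna()]

model = LinearRegression().fit(known[predictors], known["tenure_start_year"])
customers.loc[missing.index, "tenure_start_year"] = model.predict(missing[predictors])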
Another problem concerning the quality of data pertains to values within a field that don’t
make sense.
Product Category Code    # of Customers    % of Customers
ABC                             103,810            29.66%
DEF                             118,650            33.90%
GHI                              74,165            21.19%
999                              49,875            14.25%
Total                           350,000               99%
In the above case, we see that most of the product codes consist of letters and in all
likelihood relate to specific product categories. The code ‘999’, however, suggests that
some investigation is required to better understand what it relates to.
After the data quality issues are ironed out, consideration needs to be given as to how
certain fields should be summarized or grouped. This is particularly relevant for purchase
history. For instance, we may want to summarize the spend into yearly buckets with one
bucket looking at the overall lifetime spend of the customer. At the same time, we may
have hundreds of different product purchase codes. The challenge in meaningfully using
this data is to group these product codes into broad categories such that the grouped
information has enough data for any future statistical exercise within a given data mining
project.
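As a rough illustration of this kind of summarization and grouping, the sketch below uses
Python with pandas; the transaction file, field names and category mapping are all
hypothetical.

import pandas as pd

# Hypothetical transaction-level file: one row per purchase.
trans = pd.read_csv("transactions.csv", parse_dates=["purchase_date"])

# Summarize spend into yearly buckets plus an overall lifetime-spend bucket.
trans["year"] = trans["purchase_date"].dt.year
yearly = trans.pivot_table(index="customer_id", columns="year",
                           values="amount", aggfunc="sum", fill_value=0)
yearly["lifetime_spend"] = yearly.sum(axis=1)

# Collapse hundreds of detailed product codes into a few broad categories so
# that each grouped category has enough volume for statistical work.
code_to_category = {"ABC": "apparel", "DEF": "electronics", "GHI": "home"}
trans["category"] = trans["product_code"].map(code_to_category).fillna("other")
category_spend = trans.pivot_table(index="customer_id", columns="category",
                                   values="amount", aggfunc="sum", fill_value=0)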
Once it has been determined how to handle data quality issues as well as how to group or
summarize the data, algorithms are then written to organize the data and information into
one overall analytical file. This stage can in many cases represent the area where the data
miner or analyst best demonstrates his or her worth to the organization. It is at this point
that the analyst utilizes knowledge of the information environment to create meaningful
variables or fields of information which will be most relevant for a given analysis. For
instance, trend variables (growth and decline) as well as purchase variables related to the
time and type of purchase are derived from the analyst's work and are not obtained
directly from the source database files. In fact, the breakdown of sourced variables (taken
directly from the source files) vs. derived variables (created by the analyst) is about
10% vs. 90%.
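For example, a growth or decline trend variable might be derived from the yearly spend
buckets; the sketch below is hypothetical, with illustrative column names.

import pandas as pd

# Hypothetical customer-level summary with yearly spend buckets already built.
yearly = pd.DataFrame({"spend_2000": [120.0, 300.0, 0.0],
                       "spend_2001": [250.0, 150.0, 40.0],
                       "lifetime_spend": [900.0, 1200.0, 40.0]})

# Derived trend variable: year-over-year growth (positive) or decline (negative).
yearly["spend_trend"] = (yearly["spend_2001"] - yearly["spend_2000"]) / (yearly["spend_2000"] + 1)

# Other illustrative derived fields not present on the source files.
yearly["recent_spend_share"] = yearly["spend_2001"] / (yearly["lifetime_spend"] + 1)
yearly["declining_flag"] = (yearly["spend_trend"] < 0).astype(int)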
Application of The Appropriate Tools and Technology
With the completion of the analytical file, the analyst is now in a position to deploy the
appropriate tools, technologies and techniques in order to obtain the right solution. Keep
in mind that not all data mining solutions require tools with statistical analysis. The
notion of using indexes, in particular RFM (recency, frequency, and monetary value),
represents one non-statistical method of targeting customers for a given business
initiative. Yet the use of statistics represents an unbiased and objective means of letting
the science determine the most appropriate characteristics or variables. For solutions that
require some statistical analysis, there are a large number of vendors that provide
products in this area. The two most common vendors in this field are SAS (www.sas.com)
and SPSS (www.spss.com). Both vendors offer a variety of statistical techniques which
can be used depending on the specific solution and tactics that are required.
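As an illustration of the non-statistical RFM approach mentioned above, each customer
can be ranked into quintiles on recency, frequency and monetary value; the sketch below
is a minimal Python version with hypothetical fields.

import pandas as pd

# Hypothetical customer summary: recency, frequency and monetary value.
cust = pd.DataFrame({"days_since_last": [12, 300, 45, 90, 500],
                     "num_purchases":   [14, 1, 6, 3, 2],
                     "total_spend":     [1200.0, 50.0, 400.0, 220.0, 90.0]})

# Score each dimension 1-5 using quintiles (recency is reversed: recent is better).
cust["R"] = pd.qcut(-cust["days_since_last"].rank(method="first"), 5, labels=False) + 1
cust["F"] = pd.qcut(cust["num_purchases"].rank(method="first"), 5, labels=False) + 1
cust["M"] = pd.qcut(cust["total_spend"].rank(method="first"), 5, labels=False) + 1

# A simple combined index; higher scores identify the better targets.
cust["RFM"] = cust["R"] + cust["F"] + cust["M"]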
In applying the right tools and technologies, we need to consider the types of analytics
that we require. For instance, we will need reports that demonstrate the key trends and
behaviours against a given business metric (response rate, retention rate, ROI, etc.).
Therefore, our tools need to provide the capability of creating these reports. These are
often referred to as EDA (exploratory data analysis) reports. Listed below is one example
of an EDA report.
[EDA report: response rate by customer age band (30 to 36, 37 to 41, 42 to 46, 47 to 52,
53 to 60) compared with the overall average]
As you can see from the above report, the analyst can determine that as the age of a
customer increases, the likelihood that they will respond also increases.
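A report of this kind can be produced directly from the analytical file; the sketch below
is a hypothetical Python version, with age bands chosen to mirror the example.

import pandas as pd

# Hypothetical analytical file containing customer age and a response flag (0/1).
df = pd.read_csv("analytical_file.csv")

# Band customer age and report the response rate within each band vs. the average.
bands = [30, 36, 41, 46, 52, 60]
df["age_band"] = pd.cut(df["age"], bins=bands, include_lowest=True)
eda_report = df.groupby("age_band")["responded"].mean().mul(100).round(2)

print(eda_report)                       # response rate (%) by age band
print(df["responded"].mean() * 100)     # overall average response rate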
The statistical tools are required when we need to scientifically determine the key
triggers and behaviours which relate to a given business outcome that we are trying to
optimize. For instance, we may want to identify the top four characteristics associated
with customer fraud within a given credit card database, or the top five characteristics
associated with a customer upgrading to a higher-premium credit card. A variety
of statistical techniques can be employed. Correlation analysis can help determine these
variables if we are unconcerned about the interaction between the variables themselves
which is often referred to as multicollinearity. In other words, we are only looking at the
variables in a univariate way i.e. one variable at a time vs. the desired business metric.
This can be useful if the outcome is to merely rank variables or characteristics against the
desired business metric. However, if the desired outcome is to build a model, then we
need to consider the interactions between variables. A variety of multivariate statistical
techniques such as discriminant analysis ,regression analysis,CHAID can then be used to
build the model. Some of the more advanced modeling technologies have started to
capitalize on the learning and research from those disciplines relating to artificial
intelligence. Techniques such as neural nets and genetic algorithms have adopted this
machine-rule learning in order to optimize the performance of these models. More
discussion about these tools and technologies will be the subject of a subsequent article.
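A rough sketch of the two approaches in Python is shown below; the variables are
hypothetical, and logistic regression simply stands in for whichever multivariate
technique is chosen.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical analytical file: candidate characteristics plus the business
# outcome we are trying to optimize (here, response to a prior campaign).
df = pd.DataFrame({"tenure_years": [1, 5, 3, 8, 2, 7, 4, 6],
                   "num_products": [1, 3, 2, 4, 1, 3, 2, 4],
                   "total_spend":  [80, 600, 250, 900, 120, 700, 300, 650],
                   "responded":    [0, 1, 0, 1, 0, 1, 0, 1]})

# Univariate view: rank each characteristic by its correlation with the outcome,
# ignoring any interaction (multicollinearity) among the characteristics.
ranking = df.drop(columns="responded").corrwith(df["responded"]).abs().sort_values(ascending=False)
print(ranking)

# Multivariate view: a model that estimates the variables jointly.
X, y = df.drop(columns="responded"), df["responded"]
model = LogisticRegression().fit(X, y)
df["score"] = model.predict_proba(X)[:, 1]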
The other tools that we need provide the ability to demonstrate the business impact of
the desired solution. This is best demonstrated by observing how well the model will
perform in a given business situation. In the chart below, we are assessing the business
impact of a cross-sell model that was developed across the customer base. Given a
revenue per order of $125 and a cost per promotion effort of $0.85, we obtain the
following business results.
% of Names    Number of Prospects    Average Response
Mailed        Mailed in Interval     Rate in Interval         ROI
0-10%                  24,344               2.90%         326.14%
10%-20%                24,344               1.44%         111.68%
20%-30%                24,344               1.31%          92.18%
30%-40%                24,344               0.97%          42.05%
40%-50%                24,344               0.97%          42.04%
50%-60%                24,344               0.64%          -5.30%
60%-70%                24,344               0.49%         -27.59%
70%-80%                24,344               0.42%         -38.73%
80%-90%                24,344               0.47%         -30.37%
90%-100%               24,344               0.40%         -41.51%
Total                 243,437               1.00%          47.06%
In the above chart, the model is applied against 243,437 customers, whereby these names
are ranked by descending model score into ten deciles. Since the model was a response
rate model, we want to observe how well it classifies observed or actual response rates.
For this group of 243,437 customers, actual response rates are available since there was
a prior cross-sell campaign to this group. In terms of assessing the model, you can see
that it rank-orders observed response rate quite well, achieving a ratio of 7.25 to 1 when
comparing the response rate of the top decile to that of the bottom decile. From the
response rate performance, we can then translate these numbers into ROI since we know
the cost per promotion as well as the revenue per order. The ROI numbers can then be
used by the decision maker to determine the appropriate quantity of persons who should
be promoted.
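The decile and ROI figures in a chart like the one above can be reproduced along the
following lines; this is a hypothetical sketch, with the $125 revenue per order and $0.85
cost per piece taken from the example.

import pandas as pd

# Hypothetical scored file: model score plus the observed response flag (0/1)
# from the prior cross-sell campaign.
scored = pd.read_csv("scored_customers.csv")

# Rank the names by descending model score into ten deciles.
scored = scored.sort_values("score", ascending=False).reset_index(drop=True)
scored["decile"] = pd.qcut(scored.index, 10, labels=False)

# Translate observed response rates into ROI per decile.
revenue_per_order, cost_per_piece = 125.0, 0.85
summary = scored.groupby("decile").agg(mailed=("responded", "size"),
                                       response_rate=("responded", "mean"))
summary["roi"] = (summary["response_rate"] * revenue_per_order - cost_per_piece) / cost_per_piece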
Implementation and Tracking
With the solution completed, the next step is to action it within some business initiative.
In some cases, these initiatives could be non-marketing related. For example, credit-risk
models could be applied within the operations area, whereby the output delivered to the
account representative consists of risk segments along with their specified courses of
action.
In applying solutions, the most important consideration is to ensure that the solution is
being applied correctly. Initially, this implies that we conduct some data quality checks
on several records to ensure the integrity of the solution on these records.
Another consideration is to ensure that the information environment has not changed
substantially between the time the solution was developed and the time it is applied.
This can be done by creating frequency distributions of key elements within the solution.
For example, a given model could be examined by comparing the score distribution
ranges between time of development and time of implementation. See the example
below:
% of List    Minimum Score    Minimum Score
             (development)    (application)
0-10%             0.08             0.04
10-20%            0.07             0.03
20-30%            0.06             0.02
30-40%            0.05             0.01
40-50%            0.04             0.004
etc...
In the above example, the score ranges have changed quite drastically between the time
of development and the most current application. Forging ahead with this solution
without understanding these score discrepancies is an invitation for failure. Investigation
and analysis need to be conducted on the database to clearly understand why these score
discrepancies exist.
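One rough way to produce such a comparison is to save the score distribution at
development time and recompute it on the file about to be promoted; the sketch below is
hypothetical, using simulated score vectors.

import numpy as np
import pandas as pd

# Hypothetical score vectors: one saved at development time, one computed on
# the current file just before implementation.
rng = np.random.default_rng(0)
dev_scores = pd.Series(rng.uniform(0.0, 0.10, 10_000))
app_scores = pd.Series(rng.uniform(0.0, 0.05, 10_000))

def decile_minimums(scores: pd.Series) -> pd.Series:
    """Minimum score within each decile of the list ranked by descending score."""
    ranked = scores.sort_values(ascending=False).reset_index(drop=True)
    deciles = pd.qcut(ranked.index, 10, labels=False)
    return ranked.groupby(deciles).min()

comparison = pd.DataFrame({"development": decile_minimums(dev_scores),
                           "application": decile_minimums(app_scores)})
print(comparison)   # large gaps between the columns signal that the database has shifted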
Once we are comfortable with the application of the solution, we then need to create a
testing and tracking environment in order to evaluate the impact of this solution within a
live business initiative. Listed below is an example of one test/tracking matrix which
would be used within a marketing campaign.
% of File (Ranked    # of Names
by Model Score)      Mailed       Cell Breakdown
0-5%                 50,000       Test Cell 1: 45,000    Control Cell 2: 5,000
5-10%                50,000       Test Cell 3: 45,000    Control Cell 4: 5,000
10-15%               50,000       Test Cell 5: 45,000    Control Cell 6: 5,000
15%-100%             50,000       Test Cell 7: 45,000    Control Cell 8: 5,000
As you can observe from the matrix, the intent in creating it was to evaluate both the
performance of the model and the effectiveness of a particular new communication piece.
Comparing the performance of the control cells across each model-ranked interval will
indicate how well the model is performing. Comparing the performance of all test cells
vs. the performance of the control cells will indicate the effectiveness of the new
communication piece. At a finer level, we can determine where the sensitivities of
modeling and communication exert their greatest impact within the overall list.
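A hypothetical sketch of setting up such a matrix is shown below, ranking the file by
model score and randomly holding out a control cell within each interval; the interval
boundaries and the roughly 10% holdout follow the example above.

import numpy as np
import pandas as pd

# Hypothetical mailing file already scored by the model.
rng = np.random.default_rng(42)
mail = pd.DataFrame({"customer_id": range(200_000),
                     "score": rng.uniform(size=200_000)})

# Rank by descending score and express each customer's position as a percentile.
mail = mail.sort_values("score", ascending=False).reset_index(drop=True)
mail["pctile"] = mail.index / len(mail) * 100

# Assign model-ranked intervals, then randomly hold out 10% of each interval as
# a control cell that does not receive the new communication piece.
labels = ["0-5%", "5-10%", "10-15%", "15-100%"]
mail["interval"] = pd.cut(mail["pctile"], bins=[0, 5, 10, 15, 100],
                          labels=labels, include_lowest=True)
mail["cell"] = np.where(rng.random(len(mail)) < 0.10, "control", "test")
print(mail.groupby(["interval", "cell"]).size())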
As the above examples show, the implementation process is highly detailed. Time needs
to be devoted to ensuring that the solution is correct and makes sense in today's
information environment. This time commitment also needs to be allotted to setting up
the proper testing and tracking environment for evaluating performance.
Conclusion
As you can see from the above discussion, data mining is much more than just
technology. It is a step-by-step process with the end result being the development and
application of a solution. At the same time, this process should always yield learning
which can potentially be utilized for future campaigns. The development and application
of solutions, together with the acquisition of new learning, provide the ingredients for
continuous business improvement, which is really the long-term goal.
By Richard Boire – Partner, Boire Filler Group