Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques
2.2 Data Difficulties
2.3 Honest Assessment
2.4 Methodology
2.5 Recommended Reading

Objectives
- Name two major types of data mining analyses.
- List techniques for supervised and unsupervised analyses.

Analytical Methodology
A methodology clarifies the purpose and implementation of analytics. Its steps form a cycle: define/refine the business objective, select data, explore input data, prepare and repair data, transform input data, apply analysis, deploy models, and assess results.

Business Analytics and Data Mining
Data mining is a key part of effective business analytics. Components of data mining:
- data management
- customer segmentation
- predictive modeling
- forecasting
- standard and nonstandard statistical modeling practices

What Is Data Mining?
- Information Technology – complicated database queries
- Machine Learning – inductive learning from examples
- Statistics – what we were taught not to do

Translation for This Course
- Segmentation (unsupervised classification): cluster analysis, association rules, other techniques
- Predictive modeling (supervised classification): linear regression, logistic regression, decision trees, other techniques

Customer Segmentation
Segmentation is a vague term with many meanings. Segments can be based on the following:
- A priori judgment – alike based on business rules, not based on data analysis
- Unsupervised classification – alike with respect to several attributes
- Supervised classification – alike with respect to a target, defined by a set of inputs

Segmentation: Unsupervised Classification
[Figure: training data enters with unknown segment labels (case 1: inputs, ?; case 2: inputs, ?; and so on).]
[Figure, continued: after clustering, each training case receives a label (case 1: cluster 1; case 2: cluster 3; case 3: cluster 2; case 4: cluster 1; case 5: cluster 2), and each new case is assigned to a cluster.]

Segmentation: A Selection of Methods
- k-means clustering
- Association rules (market basket analysis) – for example, discovering that item pairs such as Barbie dolls and candy, beer and diapers, or peanut butter and meat tend to appear in the same basket

Predictive Modeling: Supervised Classification
[Figure: each training case has inputs, a predicted probability, and a known class; the fitted model assigns a probability and a class to each new case. The data are arranged as a cases-by-inputs matrix with a target column.]

Types of Targets
- Logistic regression – event/no event (binary target) or class label (multiclass problem)
- Regression – continuous outcome
- Survival analysis – time-to-event (possibly censored)

Discrete Targets
- Healthcare – target = favorable/unfavorable outcome
- Credit scoring – target = defaulted/did not default on a loan
- Marketing – target = purchased product A, B, C, or none

Continuous Targets
- Healthcare outcomes – target = hospital length of stay, hospital cost
- Liquidity management – target = amount of money at an ATM or in a branch vault
- Merchandise returns – target = time between purchase and return (censored)

Application: Target Marketing
- Cases = customers, prospects, suspects, households
- Inputs = geo/demographics, psychometrics, RFM variables
- Target = response to a past or test solicitation
- Action = target high-responding segments of customers in future campaigns

Application: Attrition Prediction/Defection Detection
- Cases = existing customers
- Inputs = payment history, product/service usage, demographics
- Target = churn, brand-switching, cancellation, defection
- Action = customer loyalty promotion

Application: Fraud Detection
- Cases = past transactions or claims
- Inputs = particulars and circumstances
- Target = fraud, abuse, deception
- Action = impede or investigate suspicious cases
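The k-means clustering listed among the segmentation methods can be sketched in a few lines of pure Python. This is a minimal illustration, not the SAS Enterprise Miner implementation; the two-attribute customer points, the starting centers, and the iteration count are all made-up assumptions.

```python
# Minimal k-means sketch in pure Python (illustrative assumptions only).

def kmeans(points, centers, iters=10):
    """Alternate between assigning points to the nearest center and
    recomputing each center as the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            # index of the nearest center by squared Euclidean distance
            nearest = min(range(len(centers)),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centers[j])))
            clusters[nearest].append(p)
        # recompute each center; keep the old center if its cluster is empty
        centers = [tuple(sum(vals) / len(vals) for vals in zip(*grp)) if grp
                   else centers[j]
                   for j, grp in enumerate(clusters)]
    return centers, clusters

# two visibly separate groups of (visits, spend) customers
points = [(1, 2), (1, 1), (2, 2), (9, 9), (10, 8), (9, 10)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
```

With these starting centers the two natural groups are recovered after the first pass; real segmentation would also standardize the attributes first.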
Application: Credit Scoring
- Cases = past applicants
- Inputs = application information, credit bureau reports
- Target = default, charge-off, serious delinquency, repossession, foreclosure
- Action = accept or reject future applicants for credit

The Fallacy of Univariate Thinking
What is the most important cause of churn? No single input answers this question: Prob(churn) depends jointly on several inputs, such as international usage and daytime usage.

A Selection of Modeling Methods
- Linear regression, logistic regression
- Decision trees

Hard Target Search
Rare targets such as fraud are needles in a haystack of transactions, and the cases you manage to find understate the true extent of the problem (undercoverage). In credit, outcomes are observed only for accepted applicants (accepted bad, accepted good); rejected applicants get no follow-up, so the next generation of models inherits the same undercoverage.

Chapter 2: Basics of Business Analytics – 2.2 Data Difficulties

Objectives
- Discuss several of the challenges of data mining and ways to address these challenges.

Initial Challenges in Data Mining
1. What do I want to predict? A transaction, an individual, a household, a store, or a sales team?
2. What level of granularity is needed to obtain data about the customer? Transactional, regional, daily, monthly, or other?

Typical Data Mining Time Line
[Figure: the projected time line allocates most of the allotted time to data analysis, but in actual projects data preparation (and the dreaded data acquisition) consumes most of it, leaving analysis whatever time remains.]

Data Challenges: Cracking the Code
What identifies a unit?
[Table: a raw extract with cryptic fields that must be decoded before analysis. Columns include ID1, ID2, DATE (values such as 941106), JOB (codes such as DEC, ETS, PBB, RVC), SEX, FIN, PRO3, CR_T, and ERA; many values are missing (shown as periods) and none of the codes are self-explanatory.]

Data Challenges: Data Arrangement
What should the data look like to perform an analysis?

Long-narrow arrangement (one row per account-product combination):

  Acct  Type
  2133  MTG
  2133  SVG
  2133  CK
  2653  CK
  2653  SVG
  3544  MTG
  3544  CK
  3544  MMF
  3544  CD
  3544  LOC

Short-wide arrangement (one row per account, one indicator per product):

  Acct  CK  SVG  MMF  CD  LOC  MTG
  2133   1    1    0   0    0    1
  2653   1    1    0   0    0    0
  3544   1    0    1   1    1    1

Data Challenges: Derived Inputs
What variables do I need? Often the most useful inputs are derived from raw fields, as in this insurance claims example:

  Claim    Accident Date/Time  Delay  Season  Dark
  11nov96  102396/12:38           19  fall       0
  22dec95  012395/01:42          333  winter     1
  26apr95  042395/03:05            3  spring     1
  02jul94  070294/06:25            0  summer     0
  08mar96  123095/18:33           69  winter     0
  15dec96  061296/18:12          186  summer     0
  09nov94  110594/22:14            4  fall       1

Data Challenges: Roll-Up
How do I convert my data to the proper level of granularity? Account-level rows must be rolled up to one row per household, with the household sales (?) to be computed:

  HH    Acct  Sales        HH    Sales
  4461  2133    160        4461      ?
  4461  2244     42        4911      ?
  4461  2773    212        5630      ?
  4461  2653    250        6225      ?
  4461  2801    122
  4911  3544    786
  5630  2496    458
  5630  2635    328
  6225  4244     27
  6225  4165    759

Rolling Up Longitudinal Data
Monthly records must likewise be rolled up to one row per frequent flier:

  Flier  Month  Mileage  VIP Member
  10621  Jan        650  No
  10621  Feb          0  No
  10621  Mar          0  No
  10621  Apr        250  No
  33855  Jan        350  No
  33855  Feb        300  No
  33855  Mar       1200  Yes
  33855  Apr        850  Yes

Data Challenges
What sorts of raw data quality problems can I expect?
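The household roll-up can be sketched as a simple aggregation. The numbers below are taken from the slide's HH/Acct/Sales table; summing sales within each household fills in the "?" column.

```python
# Roll-up sketch: aggregate account-level rows to one row per household.
from collections import defaultdict

rows = [  # (household, account, sales) values from the slide's table
    (4461, 2133, 160), (4461, 2244, 42), (4461, 2773, 212),
    (4461, 2653, 250), (4461, 2801, 122),
    (4911, 3544, 786),
    (5630, 2496, 458), (5630, 2635, 328),
    (6225, 4244, 27), (6225, 4165, 759),
]

household_sales = defaultdict(int)
for hh, acct, sales in rows:
    household_sales[hh] += sales   # sum sales across a household's accounts
```

The same pattern (group by the coarser key, then aggregate) handles the longitudinal roll-up of monthly mileage per flier.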
Errors, Outliers, and Missings
[Table: bank account data illustrating typical raw-data defects: inconsistent coding (Y versus y in the checking flag), missing values recorded as periods, an average daily balance of 0.00 alongside nonzero activity, and an extreme outlier (a savings balance of 89,981.12 among values mostly under 8,000).]

Missing Value Imputation
[Figure: a cases-by-inputs matrix with scattered missing cells (?) that must be filled before many modeling techniques can be applied.]

Data Challenges
Can I (more importantly, should I) analyze all the data that I have? All the observations? All the variables?

Massive Data

  Unit      Bytes  Equivalent in paper
  Kilobyte  2^10   half a sheet
  Megabyte  2^20   1 ream
  Gigabyte  2^30   a 167-foot stack
  Terabyte  2^40   a 32-mile stack
  Petabyte  2^50   a 32,000-mile stack

Sampling
[Figure: drawing a representative sample from the full data set.]

Oversampling
[Figure: the rare fraud cases are all kept while the plentiful OK cases are sampled, so the modeling data set contains a workable proportion of the rare target.]

The Curse of Dimensionality
[Figure: as the input space grows from 1-D to 2-D to 3-D, the data needed to cover it grows explosively.]

Dimension Reduction
Reduce the inputs by removing redundancy (inputs that duplicate each other's information) and irrelevancy (inputs unrelated to the target).

Catalog Case Study
Analysis goal: A mail-order catalog retailer wants to save money on mailing and increase revenue by targeting mailed catalogs to customers who are most likely to purchase in the future.
- Data set: CATALOG
- Number of rows: 48,356
- Number of columns: 98
- Contents: sales figures summarized across departments and quarterly totals for 5.5 years of sales
- Targets: RESPOND (binary), ORDERSIZE (continuous)

Catalog Case Study: Basics
Throughout this chapter, you work with data in SAS Enterprise Miner to perform exploratory analysis.
1. Import the CATALOG data.
2. Identify the target variables.
3. Define and transform the variables for use in RFM analysis.
4. Perform graphical RFM analysis in SAS Enterprise Miner.
Later, you use the CATALOG data for predictive modeling and scoring.
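The oversampling idea pictured earlier (keep every rare fraud case, sample the plentiful OK cases) can be sketched as follows. The case counts and the 50/50 mix are illustrative assumptions, not a recommendation.

```python
# Oversampling sketch: build a balanced training sample for a rare target.
import random

cases = [("fraud", i) for i in range(50)] + [("ok", i) for i in range(5000)]

rare   = [c for c in cases if c[0] == "fraud"]
common = [c for c in cases if c[0] == "ok"]

random.seed(1)  # fixed seed so the sketch is repeatable
balanced = rare + random.sample(common, len(rare))   # 50/50 mix
```

A model fit to the balanced sample must later have its predicted probabilities adjusted back to the true population proportions before deployment.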
Accessing and Importing Data for Modeling
First, get familiar with the data! The data file is a SAS data set.
1. Create a project in SAS Enterprise Miner.
2. Create a diagram.
3. Locate and import the CATALOG data.
4. Define characteristics of the data set, such as the variable roles and measurement levels.
5. Perform a basic exploratory analysis of the data.

Defining a Data Source
[Figure: the catalog data flows from the SAS Foundation Server libraries (library ABA1) into the project through a metadata definition.]

Metadata Definition
Select a table, then set the metadata information. Three purposes for metadata:
- Define variable roles (input, target, ID, etc.).
- Define measurement levels (binary, interval, nominal, etc.).
- Define table role (raw data, transactional data, scoring data, etc.).

Creating Projects and Diagrams in SAS Enterprise Miner
Catalog Case Study Task: Create a project and a diagram in SAS Enterprise Miner.

Defining a Data Source
Catalog Case Study Task: Define the CATALOG data source in SAS Enterprise Miner.

Defining Column Metadata
Catalog Case Study Task: Define column metadata.

Changing the Explore Window Sampling Defaults and Exploring a Data Source
Catalog Case Study Tasks: Change preference settings in the Explore window and explore variable associations.

IDEA EXCHANGE
Consider an academic retention example. Freshmen enter a university in the fall term, and some of them drop out before the second term begins. Your job is to try to predict whether a student is likely to drop out after the first term.
- What kinds of variables would you consider using to assess this question?
- As an administrator, do you have this information? Could you obtain it?
- What kinds of data quality issues do you anticipate?
- Are there any ethical considerations in accessing the information in your study?
- How does time factor into your data collection? Do inferences about students five years ago apply to students today?
- How do changes in technology, university policies, and teaching trends affect your conclusions?

Chapter 2: Basics of Business Analytics – 2.3 Honest Assessment

Objectives
- Explain what is meant by a model giving the best prediction.
- Describe data splitting.
- Discuss the advantages of using honest assessment to evaluate a model and obtain the model with the best prediction.

Predictive Modeling Implementation
- Model selection and comparison – Which model gives the best prediction?
- Decision/allocation rule – What actions should be taken on new cases?
- Deployment – How can the predictions be applied to new cases?

Getting the "Best" Prediction: Fool's Gold
"My model fits the training data perfectly... I've struck it rich!"

Model Complexity
[Figure sequence: an inflexible model underfits the data; a too-flexible model chases noise; a model of intermediate complexity is just right.]

Data Splitting and Honest Assessment
[Figure: an overfit model performs well on the training set but poorly on the test set; a better-fitting model performs comparably on both.]

Decisions, Decisions
Choosing a probability cutoff for a binary decision trades accuracy against sensitivity and lift. For 1,000 cases containing 100 events:

  Cutoff  Actual 0 (pred 0 / pred 1)  Actual 1 (pred 0 / pred 1)  Accuracy  Sensitivity  Lift
  .08     360 / 540                   20 / 80                     44%       80%          1.3
  .10     540 / 360                   40 / 60                     60%       60%          1.4
  .12     720 / 180                   60 / 40                     76%       40%          1.8

Misclassification Costs
Each prediction is a true negative, false positive, false negative, or true positive, and the costs are asymmetric. For fraud screening:

  Actual \ Action  Accept  Deny
  OK                    0     1
  Fraud                 9     0
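A decision rule can be read straight off a cost matrix like the fraud-screening one above: take the action with the lower expected cost. This is a minimal sketch; with costs of 9 for accepting fraud and 1 for denying a good customer, the break-even probability works out to 0.1.

```python
# Expected-cost decision sketch using the slide's fraud-screening costs.
COST_ACCEPT_FRAUD = 9   # cost of accepting a fraudulent case
COST_DENY_OK = 1        # cost of denying a good customer

def decide(p_fraud):
    """Choose the action with the lower expected cost for one case."""
    expected_cost_accept = p_fraud * COST_ACCEPT_FRAUD
    expected_cost_deny = (1 - p_fraud) * COST_DENY_OK
    return "deny" if expected_cost_deny < expected_cost_accept else "accept"
```

Note that the cost-minimizing cutoff (0.1 here) need not match any of the accuracy-oriented cutoffs in the table above; the business costs, not raw accuracy, should drive the rule.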
Predictive Modeling Implementation
- Decision/allocation rule – What actions should be taken on new cases?
- Deployment – How can the predictions be applied to new cases?

Scoring: Model Deployment
[Figure: model development produces a model, which is then used to score new data.]

Scoring Recipe
- The model results in a formula or rules.
- The data requires the same modifications used in training: derived inputs, transformations, missing value imputation.
- The scoring code is deployed. To score, you do not re-run the algorithm; you apply the score code (equations) obtained from the final model to the scoring data.

Scorability
[Figure: a decision tree fit to training data partitions the (x1, x2) plane into rectangular regions; a new case is scored by the region it falls in.]
Scoring code: If x1 < .47 and x2 < .18, or x1 > .47 and x2 > .29, then red.

Scoring Pitfalls: Population Drift
[Figure: a time line running from data generated, through data acquired, cleaned, and analyzed, to model deployed.] By the time the model is deployed, the population may have drifted away from the one that generated the data.

The Secret to Better Predictions
[Figure sequence: fraud and OK cases plotted by transaction amount; a boundary drawn to fit the observed cases perfectly is cheating ("Cheatin' Heart") and will not hold up on new cases.]

IDEA EXCHANGE
Think of everything you have done in the past week. What transactions or actions created data? For example, point-of-sale transactions, internet activity, surveillance, and questionnaires are all data collection avenues that many people encounter daily. How do you think that the data about you will be used? How could models be deployed that use data about you?

Chapter 2: Basics of Business Analytics – 2.4 Methodology

Objectives
- Describe a methodology for implementing business analytics through data mining.
- Discuss each of the steps, with examples, in the methodology.
- Create a project and diagram in SAS Enterprise Miner.

Methodology
Data mining is not a linear process but a cycle, where later results can lead back to previous steps.
The cycle: define/refine business objective, select data, explore input data, prepare and repair data, transform input data, apply analysis, deploy models, assess results, and back to the objective.

Why Have a Methodology?
- To avoid learning things that are not true
- To avoid learning things that are not useful: results that arise from past marketing decisions, results that you already know (or should already know), results that you are not allowed to use
- To create stable models
- To avoid repeating the mistakes that you made in the past
- To develop useful tips from what you learned

Methodology Step 1: Define the business objective and state it as a data mining task.

1) Define the Business Objective
Examples:
- Improve the response rate for a direct marketing campaign.
- Increase the average order size.
- Determine what drives customer acquisition.
- Forecast the size of the customer base in the future.
- Choose the right message for the right groups of customers.
- Target a marketing campaign to maximize incremental value.
- Recommend the next, best product for existing customers.
- Segment customers by behavior.
A lot of good statistical analysis is directed at solving the wrong business problem.

Define the Business Goal
Example: Who is the yogurt lover? What counts as a yogurt lover depends on the intended action: one answer prints coupons at the cash register, another mails coupons to people's homes, another drives advertising.

Big Challenge: Defining a Yogurt Lover
[Figure: customers plotted by dollars spent on yogurt (low/medium/high) against yogurt as a percentage of all purchases (low/medium/high).]
"Yogurt lover" is not in the data. You can impute it using business rules:
- Yogurt lovers spend a lot of money on yogurt.
- Yogurt lovers spend a relatively large share of their shopping dollars on yogurt.

Next Challenge: Profile the Yogurt Lover
You have identified a segment of customers that you believe are yogurt lovers. But who are they?
- How would I know them in the store?
- Identify them by demographic data.
- Identify them by other things that they purchase (for example, yogurt lovers are people who buy nutrition bars and sports drinks).
- What action can I take? Set up "yogurt-lover-attracting" displays.

IDEA EXCHANGE
If a customer is identified as a yogurt lover, what action should be taken? Should you give yogurt coupons, even though these individuals will buy yogurt anyway? Is there a cross-sell opportunity? Is there an opportunity to identify potential yogurt lovers? What would you do?

Profiling in the Extreme: Best Buy
Using analytical methodology, electronics retailer Best Buy discovered that a small percentage of customers accounted for a large percentage of revenue. Over the past several years, the company has adopted a customer-centric approach to store design and flow, staffing, and even corporate acquisitions such as the Geek Squad support team. The company's largest competitor has gone bankrupt while Best Buy has seen growth in market share. See Gulati (2010).

Define the Business Objective
What is the business objective? Example: telco churn.
- Initial problem: Assign a churn score to all customers.
- Complications: recent customers have little call history; is the unit telephones, individuals, or families?; voluntary churn differs from involuntary churn; how will the results be used?
- Better objective: By September 24, provide a list of the 10,000 elite customers who are most likely to churn in October. The new objective is actionable.

Define the Business Objective
Example: credit churn. How do you define the target? When did a customer leave?
- When she has not made a new charge in six months?
- When she had a zero balance for three months?
- When the balance does not support the cost of carrying the customer?
- When she cancels her card?
- When the contract ends?
[Figure: percentage of customers leaving, by tenure in months (0 through 15); each candidate definition of "leaving" implies a different curve.]

Translate Business Objectives into Data Mining Tasks
Do you already know the answer?
- In supervised data mining, the data has examples of what you are looking for, such as customers who responded in the past, customers who stopped, or transactions identified as fraud.
- In unsupervised data mining, you are looking for new patterns, associations, and ideas.

Data Mining Tasks Lead to Specific Techniques

  Objectives            Tasks                       Techniques
  Customer acquisition  Exploratory data analysis   Decision trees
  Credit risk           Binary response modeling    Neural networks
  Pricing               Multiple response modeling  Regression
  Customer churn        Estimation                  Survival analysis
  Fraud detection       Forecasting                 Association rules
  Discovery             Detecting outliers          Link analysis
  Customer value        Pattern detection           Hypothesis testing
                                                    Visualization
                                                    Clustering

Data Analysis Is Pattern Detection
Patterns might not represent any underlying rule.
- Some patterns reflect an underlying reality: the party that holds the White House tends to lose seats in Congress during off-year elections.
- Others do not: when the American League wins the World Series in Major League Baseball, Republicans take the White House; stars cluster in constellations.
- Sometimes it is difficult to tell without analysis: in U.S. presidential contests, the taller candidate usually wins.
Example: Maximizing Donations
From the KDD Cup, a data mining competition associated with the KDD Conference (www.sigkdd.org):
- Purpose: maximize profit for a charity fundraising campaign.
- Entries were tested on actual results from the mailing (using data withheld from competitors).
- Competitors took multiple approaches to the modeling: modeling who will respond, modeling how much people will give, and perhaps more esoteric approaches.
- However, the top three winners all took the same approach (although they used different techniques, methods, and software).

The Winning Approach: Expected Revenue
- Task: Estimate response(person), the probability that a person responds to the mailing (using all customers).
- Task: Estimate the value of a response, dollars(person) (using only customers who respond).
- Choose prospects with the highest expected value, response(person) * dollars(person).

An Unexpected Pattern
An unexpected pattern suggests an approach. When people give money frequently, they tend to donate less money each time. In most business applications, as people take an action more often, they spend more money. Donors to a charity are different. This suggests that potential donors go through a two-step process:
1. Shall I respond to this mailing?
2. How much money should I give this time?
Modeling can follow the same logic.

Methodology Step 2: Select or collect the appropriate data to address the problem. Identify the customer signature.

2) Select Appropriate Data
- What is available?
- What is the right level of granularity?
- How much data is needed? How much history is required?
- How many variables should be used?
- What must the data contain?
- Assemble results into customer signatures.

Representativeness of the Training Sample
The model set might not reflect the relevant population:
- Customers differ from prospects.
- Survey responders differ from non-responders.
- People who read e-mail differ from people who do not read e-mail.
- Customers who started three years ago might differ from customers who started three months ago.
- People with land lines differ from those without.

Availability of Relevant Data
- Elevated printing defect rates might be due to humidity, but that information is not in press run records.
- Poor coverage might be the number one reason for wireless subscribers canceling their subscriptions, but data about dropped calls is not in billing data.
- Customers might already have potential cross-sell products from other companies, but that information is not available internally.

Types of Attributes in Data
- Readily supported: binary, categorical (nominal), numeric (interval), date and time
- Require more work: text, image, video, links

IDEA EXCHANGE
Suppose that you were in charge of a charity similar to the KDD example above. What kind of data are you likely to have available before beginning the project? Is there additional data that you would need? Do you have to purchase the data, or is it publicly available for free? How could you make the best use of a limited budget to acquire high-quality data about individual donation patterns?

The Customer Signature
- Each row generally corresponds to a customer.
- The primary key uniquely identifies each row, often corresponding to customer ID.
- A foreign key gives access to data in another table, such as ZIP code demographics.
- The target columns are what you are looking for; sometimes the information is in multiple columns, such as a churn flag and a churn date.
- Some columns are ignored because their values are not predictive, because they contain future information, or for other reasons.

Data Assembly Operations
- Copying
- Pivoting
- Table lookup
- Derivation of new variables
- Summarization of values from data
- Aggregation

Methodology Step 3: Explore the data. Look for anomalies. Consider time-dependent variables.
Identify key relationships among variables.

3) Explore the Data
- Examine distributions. Study histograms.
- Think about extreme values.
- Notice the prevalence of missing values.
- Compare values with descriptions.
- Validate assumptions.
- Ask many questions.

Ask Many Questions
- Why were some customers active for 31 days in February, but none were active for more than 28 days in January?
- How do some retail card holders spend more than $100,000 in a week in a grocery store?
- Why were so many customers born in 1911? Are they really that old?
- Why do Safari users never make second purchases?
- What does it mean when the contract begin date is after the contract end date?
- Why are there negative numbers in the sale price field?
- How can active customers have a non-null value in the cancellation reason code field?

Be Wary of Changes over Time
- Does the same code have the same meaning in historical data?
- Did different data elements start being loaded at different points in time?
- Did something happen at a particular point in time?
[Figure: price-related cancellations by month, May through the following June; cancellations spike after a price increase, and the price complaints later stop.]

Methodology Step 4: Prepare and repair the data. Define metadata correctly. Partition the data and create balanced samples, if necessary.

4) Prepare and Repair the Data
- Set up a proper temporal relationship between the target variable and the inputs.
- Create a balanced sample, if possible.
- Include multiple time frames if necessary.
- Split the data into training, validation, and (optionally) test data sets.

Temporal Relationship: Prediction or Profiling?
The same techniques work for both.
- In a predictive model, the values of explanatory variables come from an earlier time frame than the target variable.
- In a profiling model, the explanatory variables and the target variable might all be from the same time frame.

Balancing the Input Data Set
A very accurate model simply predicts that no one wants a brokerage account: it is 98.8% accurate (a 1.2% error rate) yet useless for differentiating among customers.

  Distribution of the brokerage target variable:
  Brokerage = "Y":   2,355
  Brokerage = "N": 228,926

Two Ways to Create Balanced Data
[Figure: balance either by oversampling the rare class or by undersampling the common class.]

Data Splitting and Validation
Improving the model causes the error rate to decline on the data used to build it; at the same time, the model becomes more complex.

Validation Data Prevents Overfitting
[Figure: as models grow more complex, error on the training data keeps falling (fitting noise), while error on the validation data falls and then rises; the minimum of the validation curve is the sweet spot separating signal from noise.]

Partitioning the Input Data Set
- Training: use the training set to find patterns and create an initial set of candidate models.
- Validation: use the validation set to select the best model from the candidate set.
- Test: use the test set to measure performance of the selected model on unseen data. The test set can be an out-of-time sample, if necessary.
Partitioning data is an allowable luxury because data mining assumes a large amount of data. Test sets do not help select the final model; they only provide an estimate of the model's effectiveness in the population. Test sets are not always used.

Fix Problems with the Data
Data imperfectly describes the features of the real world:
- Data might be missing or empty.
- Samples might not be representative.
- Categorical variables might have too many values.
- Numeric variables might have unusual distributions and outliers.
- Meanings can change over time.
- Data might be coded inconsistently.

No Easy Fix for Missing Values
- Throw out the records with missing values? No.
  This creates a bias in the sample.
- Replace missing values with a "special" value such as -99? No. To a data mining algorithm, it resembles any other value.
- Replace with some "typical" value? Maybe. Replacement with the mean, median, or mode changes the distribution, but predictions might be fine.
- Impute a value? Maybe. Use the distribution of values to choose one at random, or model the imputed value using some technique. (Imputed values should be flagged.)
- Use data mining techniques that can handle missing values? Yes. One of these, decision trees, is discussed later.
- Partition records and build multiple models? Yes. This is possible when data is missing for a canonical reason, such as insufficient history.

Methodology Step 5: Transform data. Standardize, bin, combine, replace, impute, log, etc.

5) Transform Data
- Standardize values into z-scores.
- Turn counts into percentages.
- Remove outliers.
- Capture trends with ratios, differences, or beta values.
- Combine variables to bring information to the surface.
- Replace categorical variables with some numeric function of the categorical values.
- Impute missing values.
- Transform using mathematical functions, such as logs.
- Translate dates to durations.
Example: body mass index (kg/m^2) is a better predictor of diabetes than either weight or height separately.

A Selection of Transformations
Standardize numeric values. All numeric values are replaced by the notion of "how far is this value from the average," so conceptually all numeric values are on the same scale. (The actual range differs, but the meaning is the same.) Although standardization sometimes has no effect on the results (such as for decision trees and regression), it never produces worse results. It is so useful that it is often built into SAS Enterprise Miner modeling nodes.
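The "replace with a typical value, and flag it" strategy can be sketched as below. The balance values are illustrative, and taking the upper middle value as the median of an even-length list is a simplifying assumption.

```python
# Median imputation sketch with a flag column, so a downstream model
# can see which values were imputed (flagging is recommended above).
values = [468.11, 68.75, None, 585.05, None, 47.69]   # None = missing

known = sorted(v for v in values if v is not None)
median = known[len(known) // 2]   # upper middle for an even count

imputed = [(v if v is None else v) and (v if v is not None else median)
           for v in values]
imputed = [(v if v is not None else median) for v in values]
was_missing = [int(v is None) for v in values]
```

The flag column lets the model treat "was imputed" as information in its own right, which matters when missingness itself is predictive.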
A Selection of Transformations
- "Stretching" and "squishing" transformations: log, reciprocal, and square root are examples.
- Replace categorical values with appropriate numeric values. Many techniques work better with numeric values than with categorical values. Historical projections (such as handset churn rate or penetration by ZIP code) are particularly useful.

IDEA EXCHANGE
What are some other warning signs you can think of in modeling? Have you experienced any pitfalls that were memorable, or that changed the way you approach data analysis objectives?

Methodology Step 6: Apply analysis. Fit many candidate models, try different solutions, try different sets of input variables, and select the best model.

6) Apply Analysis
- Regression
- Decision trees
- Cluster detection
- Association rules
- Neural networks
- Memory-based reasoning
- Survival analysis
- Link analysis
- Genetic algorithms

Train Models
Build candidate models by applying a data mining technique (or techniques) to the training data.

Assess Models
Assess models by applying them to the validation data set: score the validation data using the candidate models and compare the results. Select the model with the best performance on the validation data set. Communicate model assessments through quantitative measures and graphs.

Look for Warnings in Models
Trailing indicators: learning things that are not true.
[Figure: minutes of use by tenure, months 1 through 11; usage drops sharply in month 8.] What happens in month 8? Does declining usage in month 8 predict attrition in month 9?
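The z-score standardization described in the transformation slides replaces each value by its distance from the mean in standard-deviation units. A small sketch with made-up values, using the population standard deviation for simplicity:

```python
# Standardization sketch: convert raw values to z-scores.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = sum(values) / len(values)
var = sum((v - mean) ** 2 for v in values) / len(values)  # population variance
std = var ** 0.5

z = [(v - mean) / std for v in values]   # "how far from average," in std units
```

After this transform, every numeric input is on a comparable scale, which is why tools often build it into their modeling nodes.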
Look for Warnings in Models
Perfect models: things that are too good to be true.
“100% of customers who spoke to a customer support representative cancelled a contract. Eureka! It’s all I need to know!”
If a customer cancels, that customer is automatically flagged to get a call from customer support, so the information is useless in predicting cancellation. Models that seem too good usually are.

Methodology
7. Deploy models. Score new observations and make model-based decisions. Gather the results of model deployment.

7) Deploy Models and Score New Data

Methodology
8. Assess the usefulness of the model. If the model has gone stale, revise it.

8) Assess Results
Compare actual results against expectations.
Compare the challenger’s results against the champion’s.
Did the model find the right people?
Did the action affect their behavior?
What are the characteristics of the customers most affected by the intervention?

Good Test Design Measures the Impact of Both the Message and the Model
Target Group – chosen by the model; receives the message. Response measures the message with the model.
Control Group – chosen at random; receives the message. Response measures the message without the model.
Modeled Holdout – chosen by the model; receives no message. Response measures the model without the message.
Holdout Group – chosen at random; receives no message. Response measures the background response.
Comparing the target group to the control group shows the impact of the model on the group getting the message. Comparing the target group to the modeled holdout shows the impact of the message on the group with good model scores.

Test Mailing Results
[Chart: e-mail campaign test response rates for the Target, Control, and Holdout groups; the target group shows a lift of 3.5]

Methodology
9.
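Given the four test-design cells above, the lift numbers fall out of simple response-rate ratios. The counts below are hypothetical, chosen so that the model lift matches the 3.5 shown in the e-mail test chart:

```python
# Hypothetical responder counts for the four test-design cells.
groups = {
    "target":          (70, 1000),  # chosen by model, got message
    "control":         (20, 1000),  # chosen at random, got message
    "modeled_holdout": (30, 1000),  # chosen by model, no message
    "holdout":         (10, 1000),  # chosen at random, no message (background)
}

def rate(name):
    """Response rate for one cell: responders / group size."""
    responders, size = groups[name]
    return responders / size

# Impact of the model on the group getting the message:
model_lift = rate("target") / rate("control")            # 7% vs. 2% -> 3.5
# Impact of the message on the group with good model scores:
message_lift = rate("target") / rate("modeled_holdout")  # 7% vs. 3%
```

The holdout group's rate is the background response, the floor against which every other cell should be judged.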
As you learn from earlier model results, refine the business objectives to gain more from the data.

9) Begin Again
Revisit business objectives.
Define new objectives.
Gather and evaluate new data: model scores, cluster assignments, responses.
Example: A model discovers that geography is a good predictor of churn. What do the high-churn geographies have in common? Is the pattern your model discovered stable over time?

Lessons Learned
Data miners must be careful to avoid pitfalls:
learning things that are not true or not useful
confusing signal and noise
creating unstable models
A methodology is a way of being careful.

IDEA EXCHANGE
Outline a business objective of your own in terms of the methodology described here. What is your business objective? Can you frame it in terms of a data mining problem? How will you select the data? What are the inputs? What do you want to look at to become familiar with the data?
Anticipate any data quality problems you might encounter and how you could go about fixing them. Do any variables require transformation? Proceed through the remaining steps of the methodology as you consider your example.

Basic Data Modeling
A common approach to modeling customer value is RFM analysis, so named because it uses three key variables:
Recency – how long it has been since the customer’s last purchase
Frequency – how many times the customer has purchased something
Monetary value – how much money the customer has spent
RFM variables tend to predict responses to marketing campaigns effectively.

RFM Cell Approach
[Diagram: cube with Recency, Frequency, and Monetary value axes]
A typical approach to RFM analysis is to bin customers into (roughly) equal-sized groups on each of the rank-ordered R, F, and M variables.
For example:
Bin five groups on R (highest bin = most recent).
Bin five groups on F (highest bin = most frequent).
Bin five groups on M (highest bin = highest value).
The combination of the bins gives an RFM “score” that can be compared to some target or outcome variable. A customer scored 555 is in the most recent quintile, the most frequent quintile, and the highest-spending quintile.

Computing Profitability in RFM
Break-even response rate = (cost of promotion to an individual) / (average net profit per sale).
Example: It costs $2.00 to print and mail each catalog, and the average net profit per transaction is $30. Then 2.00/30.00 = 0.067, so the profitable RFM cells are those with a response rate greater than 6.7%.

Performing RFM Analysis of the Catalog Data
Catalog Case Study Task: Perform RFM analysis on the catalog data.
Recode recency so that the highest values are the most recent.
Bin the R, F, and M variables into five groups each, numbered 1-5, so that 1 is the least valuable and 5 is the most valuable bin.
Concatenate the R, F, and M bins to obtain a single RFM “score.”
Graphically investigate the response rates for the different groups.

Performing Graphical RFM Analysis
Catalog Case Study Task: Perform graphical RFM analysis.

Limitations of RFM
Uses only three variables. Modern data collection processes offer rich information about preferences, behaviors, attitudes, and demographics.
Scores are entirely categorical. Scores 515, 551, and 155 are equally good if the RFM variables are of equal importance, yet sorting by the RFM values overemphasizes recency and is not informative.
So many categories. The simple example above results in 125 groups.
Not very useful for finding prospective customers. The statistics are descriptive.

IDEA EXCHANGE
Would RFM analysis apply to a business objective you are considering? If so, what would be your R, F, and M variables?
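The quintile-binning steps above can be sketched in plain Python. The customer data is invented for illustration, and for clarity the sketch ignores ties and the slightly unequal bins that real rank-ordering produces:

```python
def quintile_bins(values):
    """Rank values and assign bins 1-5, with 5 = top (most valuable) quintile."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * 5 // len(values) + 1
    return bins

# Ten hypothetical customers; recency is already recoded so higher = more recent.
recency   = [10, 3, 8, 1, 6, 9, 2, 7, 4, 5]
frequency = [1, 9, 4, 2, 8, 10, 3, 6, 5, 7]
monetary  = [100, 900, 50, 20, 700, 1000, 30, 400, 200, 300]

# Concatenate the three bin numbers into a single RFM "score" per customer.
rfm_scores = [f"{r}{f}{m}" for r, f, m in
              zip(*(quintile_bins(v) for v in (recency, frequency, monetary)))]
# Customer 5 is top quintile on all three dimensions, so its score is "555".

# Break-even response rate from the catalog example: $2.00 cost / $30 profit
break_even = 2.00 / 30.00  # about 6.7%; cells above this rate are profitable
```

In a real analysis each cell's observed response rate would then be compared against break_even to decide which RFM cells to mail.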
What other basic analytical techniques could you use to explore your data and get preliminary answers to your questions?

Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques
2.2 Data Difficulties
2.3 Honest Assessment
2.4 Methodology
2.5 Recommended Reading

Recommended Reading
Davenport, Thomas H., Jeanne G. Harris, and Robert Morison. 2010. Analytics at Work: Smarter Decisions, Better Results. Boston: Harvard Business Press. Chapters 2 through 6 (DELTA method) are optional.