Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques
2.2 Data Management
2.3 Data Difficulties
2.4 SAS Enterprise Miner: A Primer
2.5 Honest Assessment
2.6 Methodology
2.7 Recommended Reading

Objectives
- Name two major types of data mining analyses.
- List techniques for supervised and unsupervised analyses.

Analytical Methodology
A methodology clarifies the purpose and implementation of analytics. The cycle:
- Define or refine the business objective
- Select data
- Explore input data
- Prepare and repair data
- Transform input data
- Apply analysis
- Assess results
- Deploy models

Business Analytics and Data Mining
Data mining is a key part of effective business analytics. Components of data mining:
- data management
- customer segmentation
- predictive modeling
- forecasting
- standard and nonstandard statistical modeling practices

What Is Data Mining? Three perspectives:
- Information Technology – complicated database queries
- Machine Learning – inductive learning from examples
- Statistics – what we were taught not to do

Translation for This Course
- Segmentation = unsupervised classification: cluster analysis, association rules, other techniques
- Predictive modeling = supervised classification: linear regression, logistic regression, decision trees, other techniques

Customer Segmentation
Segmentation is a vague term with many meanings. Segments can be based on the following:
- a priori judgment – alike based on business rules, not based on data analysis
- unsupervised classification – alike with respect to several attributes
- supervised classification – alike with respect to a target, defined by a set of inputs

Segmentation: Unsupervised Classification
[Figure: training data in which every case has inputs but no label – case 1: inputs, ?; case 2: inputs, ?; case 3: inputs, ?; case 4: inputs, ?; case 5: inputs, ?]
[Figure: the same training data after clustering – case 1: cluster 1; case 2: cluster 3; case 3: cluster 2; case 4: cluster 1; case 5: cluster 2 – and a new case to be assigned to the nearest cluster]

Segmentation: A Selection of Methods
- k-means clustering
- Association rules (market basket analysis) – items that co-occur in baskets, for example Barbie dolls and candy, beer and diapers, peanut butter and meat

Predictive Modeling: Supervised Classification
[Figure: training data in which every case has inputs, a probability, and a known class – the fitted model assigns a probability and a class to a new case]
[Figure: rectangular data layout – one row per case, one column per input, plus a target column]

2.01 Poll
The primary difference between supervised and unsupervised classification is whether a dependent, or target, variable is known.
- Yes
- No

2.01 Poll – Correct Answer: Yes.

Types of Targets
- Logistic regression – event/no event (binary target); class label (multiclass problem)
- Regression – continuous outcome
- Survival analysis – time-to-event (possibly censored)

Discrete Targets
- Health care – target = favorable/unfavorable outcome
- Credit scoring – target = defaulted/did not default on a loan
- Marketing – target = purchased product A, B, C, or none

Continuous Targets
- Health care outcomes – target = hospital length of stay, hospital cost
- Liquidity management – target = amount of money at an ATM or in a branch vault
- Merchandise returns – target = time between purchase and return (censored)

Application: Target Marketing
- Cases = customers, prospects, suspects, households
- Inputs = geographics, demographics, psychometrics, RFM variables
- Target = response to a past or test solicitation
- Action = target high-responding segments of customers in future campaigns
Application: Attrition Prediction/Defection Detection
- Cases = existing customers
- Inputs = payment history, product/service usage, demographics
- Target = churn, brand switching, cancellation, defection
- Action = customer loyalty promotion

Application: Fraud Detection
- Cases = past transactions or claims
- Inputs = particulars and circumstances
- Target = fraud, abuse, deception
- Action = impede or investigate suspicious cases

Application: Credit Scoring
- Cases = past applicants
- Inputs = application information, credit bureau reports
- Target = default, charge-off, serious delinquency, repossession, foreclosure
- Action = accept or reject future applicants for credit

The Fallacy of Univariate Thinking
What is the most important cause of churn?
[Figure: Prob(churn) plotted as a joint function of International Usage and Daytime Usage – churn depends on the combination of inputs, so no single variable is "the" cause]

A Selection of Modeling Methods
- Linear regression, logistic regression
- Decision trees

Hard Target Search
[Figure: a mass of transactions in which a handful of fraud cases must be found – rare targets make supervised modeling harder]

Undercoverage
[Figure: the applicant population split into Accepted Bad, Accepted Good, and Rejected – No Follow-up. Outcomes are known only for accepted applicants, so the next generation of models is built on a biased sample]

2.02 Poll
Impediments to high-quality business data can lie in the very nature of business decision-making: the worst prospects are not marketed to. Therefore, information about the sort of customer that they would be (profitable or unprofitable) is usually unknown, making supervised classification more difficult.
- Yes
- No

2.02 Poll – Correct Answer: Yes.

Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques  2.2 Data Management  2.3 Data Difficulties  2.4 SAS Enterprise Miner: A Primer  2.5 Honest Assessment  2.6 Methodology  2.7 Recommended Reading

Objectives
- Explain the concept of data integration.
- Describe SAS Enterprise Guide and how it fits in with data integration and management for business analytics.

Data Management and Business Analytics
Data management brings together data components that can exist on multiple machines, from different software vendors, throughout the organization. Data management is the foundation for business analytics: without correctly consolidated data, those working in the analytics, reporting, and solutions areas might not be working with the most current, accurate data.
[Figure: a pyramid with Reporting at the base, Basic Analytics in the middle, and Advanced Analytics at the top]

Managing Data for Business Analytics
Business analytics requires data management activities such as data access, movement, transformation, aggregation, and augmentation. These tasks can involve many different types of data (for example, simple flat files, files with comma-separated values, Microsoft Excel files, SAS tables, and Oracle tables). The data likely combines individual transactions, customer summaries, product summaries, or other levels of data granularity – or some combination of those things.

Planning from the Top Down
- What mission-critical questions must be answered?
- What data will help you answer these questions?
- What data do you have that will help you build the needed data?

Implementing from the Bottom Up
1. Identify source data.
2. Define target data.
3. Create reports.

Collaboration Is Key to Business Analytics
The business expert, the IT expert, and the analytical expert must work together.

Data Marts: Tying Questions to Data
Stated simplistically, data marts are implemented at organizations because there are questions that must be answered. Data is typically collected in daily operations but might not be organized in a way that answers the questions. An IT professional can use the questions and the data collected from daily operations to construct the tables for a data warehouse or data mart.

Building a Data Mart
Foundation of a data mart:
1. Identify source tables.
2. Identify target tables.
3. Create target tables.
Building the foundation of the data mart consists of the three basic steps listed above.

Analytic Objective Example
Business: Large financial institution.
Objective: From a population of existing clients with sufficient tenure and other qualifications, identify a subset most likely to have interest in an insurance investment product (INS).

Financial Institution's Data
The financial institution has highly detailed data that is challenging to transform into a structure suitable for predictive modeling. As is the case with most organizations, the financial institution has a large amount of data about its customers, products, and employees. Much of this information is stored in transactional systems in various formats. Using SAS Enterprise Guide, this transactional information is extracted, transformed, and loaded into a data mart for the Marketing Department. You continue to work with this data set for some basic exploratory analysis and reporting.

A Target Star Schema
One goal of creating a data mart is to produce, from the source data, a dimensional data model that is a star schema.
[Figure: a fact table at the center, surrounded by Customer, Organization, Time, and Product dimensions]

Financial Institution Target Star Schema
The analyst can produce, from the financial institution's source data, a dimensional data model that is a star schema.
[Figure: a Checking fact table at the center, surrounded by Customer, Credit Bureau, and Insurance dimensions]

Checking_transactions Table
The checking_transactions table contains the following attributes, one record per fact; each fact contains some measured or observed variables. The fact table contains the data, and the dimensions identify each tuple in the data.
- CHECKING_ID
- CHKING_TRANS_DT
- CHKING_TRANS_AMT
- CHKING_TRANS_CHANNEL_CD
- CHKING_TRANS_METHOD_CD
- CHKING_TRANS_TYPE_CD

Client Table
The client table contains client information. In practice, this data set could also contain address and other information.
For this demonstration, only CLIENT_ID, FST_NM, LST_NM, ORIG_DT, BIRTH_DT, and ZIP_5 are used.

Client_ins_account Table
The client_ins_account table matches client IDs to INS account IDs: CLIENT_ID, CLIENT_INS_ID.

Ins_account Table
The ins_account table contains the insurance account information. In practice, this data set would contain other fields such as rates, maturity dates, and initial deposit amount. For this demonstration, only INS_ACT_ID and INS_ACT_OPEN_DT are used.

Credit_bureau Table
The credit_bureau table contains credit bureau information: CLIENT_ID, TL_CNT, FST_TL_TR, FICO_CR_SCR, CREDIT_YQ. In practice, this data set could contain credit scores from more than one credit bureau and also a history of credit scores.

Advantages of Data Marts
- There is one version of the truth.
- Downstream tables are updated as source data is updated, so analyses are always based on the latest information.
- The problem of a proliferation of spreadsheets is avoided.
- Information is clearly identified by standardized variable names and data types.
- Multiple users can access the same data.

SAS Enterprise Guide Overview
SAS Enterprise Guide can be used for data management, as well as a wide variety of other tasks:
- data exploration
- querying and reporting
- graphical analysis
- statistical analysis
- scoring

Example: Financial Institution Data Management
The head of Marketing wants to know which customers have the highest propensity for buying insurance products from the institution. This could present a cross-selling opportunity. Create part of an analytical data mart by combining information from many tables: checking account data, customer records, insurance data, and credit bureau information.
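Outside of SAS Enterprise Guide, the same consolidation step can be sketched in plain Python. The records below are invented stand-ins shaped like the tables just described (only the join logic is the point, not the values):

```python
# Invented sample rows shaped like the tables described above.
clients = [
    {"CLIENT_ID": 1, "FST_NM": "Ana", "ZIP_5": "27513"},
    {"CLIENT_ID": 2, "FST_NM": "Ben", "ZIP_5": "27601"},
]
client_ins = [{"CLIENT_ID": 1, "CLIENT_INS_ID": 900}]   # client 1 holds INS
ins_accounts = {900: {"INS_ACT_OPEN_DT": "2010-03-15"}}

# Left-join clients to insurance accounts and derive a HAS_INS target flag.
ins_by_client = {r["CLIENT_ID"]: r["CLIENT_INS_ID"] for r in client_ins}
mart = []
for c in clients:
    act = ins_by_client.get(c["CLIENT_ID"])
    row = dict(c)
    row["HAS_INS"] = int(act is not None)
    row["INS_ACT_OPEN_DT"] = ins_accounts[act]["INS_ACT_OPEN_DT"] if act else None
    mart.append(row)
```

A left join keeps every client, flagging insurance holders with HAS_INS = 1 – exactly the one-row-per-client shape a supervised model needs.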
Input Files
- client_ins_account.sas7bdat
- credit_bureau.sas7bdat
- ins_account.sas7bdat
- client.sas7bdat

[Figure: the final joined data set]

A Data Management Process Using SAS Enterprise Guide
Financial Institution Case Study Task: Join several SAS tables and use separate sampling to obtain a training data set.

Exploring the Data and Creating a Report
- Investigate the distribution of credit scores. Create a report of credit scores for customers without insurance and customers with insurance.
- Does age have an influence on credit scores? Which customers have the highest credit scores, young customers or older customers? Create a graph of credit scores by age.

Exploratory Analysis

Exploring the Data and Creating a Basic Report
Financial Institution Case Study Task: Investigate the distribution of credit scores by creating a report of credit scores for customers without insurance and customers with insurance.

Graphical Exploration
Financial Institution Case Study Task: Create a graph of credit scores by age.

Idea Exchange
- What conclusions would you draw from this basic data exploration?
- Are there additional plots or reports that you would like to explore from the orders data to help you better understand your customers and their propensity to buy insurance?
- What additional data would you need to help you make a case to the head of the Marketing Department that marketing dollars should be spent in a particular way?

Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques  2.2 Data Management  2.3 Data Difficulties  2.4 SAS Enterprise Miner: A Primer  2.5 Honest Assessment  2.6 Methodology  2.7 Recommended Reading

Objectives
- Identify several of the challenges of data mining and present ways to address these challenges.

Initial Challenges in Data Mining
1. What do I want to predict?
– a transaction, an individual, a household, a store, a sales team
2. What level of granularity is needed to obtain data about the customer?
– transactional, regional, daily, monthly, other

2.03 Multiple Answer Poll
Which of the following might constitute a case in a predictive model?
a. a household
b. loan amount
c. an individual
d. the number of products purchased
e. a company
f. a ZIP code
g. salary

2.03 Multiple Answer Poll – Correct Answers: a, c, e, and f. A case is a unit of analysis (a household, an individual, a company, a ZIP code); loan amount, the number of products purchased, and salary are variables measured on cases.

Typical Data Mining Time Line
[Figure: bars comparing the projected time line with the actual one – data acquisition is the dreaded bottleneck, and data preparation needs far more of the allotted time than data analysis]

Data Challenges: What identifies a unit?

Cracking the Code
[Table: a raw extract with columns ID1, ID2, DATE, JOB, SEX, FIN, PRO3, CR_T, ERA. The values are cryptic codes (for example, DATE = 941106, CR_T = 612), some IDs repeat (2618, 2620), and several fields are missing – without a codebook, it is not even clear what identifies a unit]

Data Challenges: What should the data look like to perform an analysis?

Data Arrangement
The same account-by-product facts can be stored long-narrow (one row per account-product pair) or short-wide (one row per account, with a 0/1 indicator per product):

Long-narrow:          Short-wide:
Acct  Type            Acct  CK  SVG  MMF  CD  LOC  MTG
2133  MTG             2133   1    1    0   0    0    1
2133  SVG             2653   1    1    0   0    0    0
2133  CK              3544   1    0    1   1    1    1
2653  CK
2653  SVG
3544  MTG
3544  CK
3544  MMF
3544  CD
3544  LOC
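The long-narrow to short-wide rearrangement above is a pivot. A minimal illustrative sketch in plain Python, using the same account rows (a real project would do this in the data management tool):

```python
# Long-narrow rows: one (account, product type) pair per row.
long_rows = [(2133, "MTG"), (2133, "SVG"), (2133, "CK"),
             (2653, "CK"), (2653, "SVG"),
             (3544, "MTG"), (3544, "CK"), (3544, "MMF"),
             (3544, "CD"), (3544, "LOC")]
types = ["CK", "SVG", "MMF", "CD", "LOC", "MTG"]

# Pivot to short-wide: one row per account, a 0/1 indicator per product.
wide = {}
for acct, t in long_rows:
    wide.setdefault(acct, dict.fromkeys(types, 0))[t] = 1
```

Most modeling tools expect the short-wide layout: one row per case, one column per attribute.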
Data Challenges: What variables do I need?

Derived Inputs
New inputs can be derived from raw fields – here, the delay between accident and claim, the season, and whether it was dark:

Claim    Accident date/time   Delay  Season  Dark
11nov96  102396/12:38            19  fall       0
22dec95  012395/01:42           333  winter     1
26apr95  042395/03:05             3  spring     1
02jul94  070294/06:25             0  summer     0
08mar96  123095/18:33            69  winter     0
15dec96  061296/18:12           186  summer     0
09nov94  110594/22:14             4  fall       1

Data Challenges: How do I convert my data to the proper level of granularity?

Roll-Up
Account-level rows are rolled up to one row per household (HH):

HH    Acct  Sales        HH    Sales
4461  2133    160        4461      ?
4461  2244     42        4911      ?
4461  2773    212        5630      ?
4461  2653    250        6225      ?
4461  2801    122
4911  3544    786
5630  2496    458
5630  2635    328
6225  4244     27
6225  4165    759

Rolling Up Longitudinal Data

Flier  Month  Mileage  VIP Member
10621  Jan        650  No
10621  Feb          0  No
10621  Mar          0  No
10621  Apr        250  No
33855  Jan        350  No
33855  Feb        300  No
33855  Mar       1200  Yes
33855  Apr        850  Yes

Data Challenges: What sorts of raw data quality problems can I expect?

Errors, Outliers, and Missings
[Table: a raw account extract with columns cking, #cking, ADB, NSF, dirdep, SVG, bal, showing typical quality problems – inconsistent coding (Y versus y), missing values recorded as ".", zero placeholders, and outliers such as an average daily balance of 89,981.12]

Missing Value Imputation
[Figure: a case-by-input grid with "?" holes scattered through it – missing values must be imputed or the affected cases and inputs handled explicitly]

Data Challenges: Can I (more importantly, should I) analyze all the data that I have? All the observations? All the variables?

Massive Data

Unit       Bytes  Equivalent in paper
Kilobyte   2^10   half a sheet
Megabyte   2^20   1 ream
Gigabyte   2^30   167 feet
Terabyte   2^40   32 miles
Petabyte   2^50   32,000 miles

Sampling
[Figure: a small representative sample drawn from the full data – often sufficient, and far faster to analyze]
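The household roll-up shown above is a group-by aggregation. A plain-Python sketch using the same rows (the "?" totals fall out of the sum):

```python
from collections import defaultdict

# (household, account, sales) rows at account granularity.
rows = [(4461, 2133, 160), (4461, 2244, 42), (4461, 2773, 212),
        (4461, 2653, 250), (4461, 2801, 122), (4911, 3544, 786),
        (5630, 2496, 458), (5630, 2635, 328), (6225, 4244, 27),
        (6225, 4165, 759)]

# Roll up to household granularity: account count and total sales per HH.
totals = defaultdict(int)
counts = defaultdict(int)
for hh, acct, sales in rows:
    totals[hh] += sales
    counts[hh] += 1
```

Each household's accounts collapse to one analysis row; in this worked example every household happens to total 786 in sales.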
Oversampling
[Figure: rare Fraud cases are kept while the abundant OK cases are sampled down, so the model set contains a workable proportion of the rare target]

The Curse of Dimensionality
[Figure: the same number of points spread over 1-D, 2-D, and 3-D spaces – as dimensions are added, the data becomes increasingly sparse]

Dimension Reduction
[Figure: redundancy (Input3 carries the same information as Input1) and irrelevancy (an input with no effect on E(Target)) both argue for dropping inputs]

2.04 Multiple Answer Poll
Which of the following statements are true?
a. The more data you can get, the better.
b. Too many variables can make it difficult to detect patterns in data.
c. Too few variables can make it difficult to learn interesting facts about the data.
d. Cases with missing values should generally be deleted from modeling.

2.04 Multiple Answer Poll – Correct Answers: b and c. More data is not automatically better (a well-drawn sample is often preferable), and cases with missing values can usually be repaired by imputation rather than deleted.

Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques  2.2 Data Management  2.3 Data Difficulties  2.4 SAS Enterprise Miner: A Primer  2.5 Honest Assessment  2.6 Methodology  2.7 Recommended Reading

Objectives
- Describe the basic navigation of SAS Enterprise Miner.

SAS Enterprise Miner – Interface Tour
- Menu bar and shortcut buttons
- Project panel
- Properties panel
- Help panel
- Diagram workspace
- Process flow
- Node
- SEMMA tools palette

Catalog Case Study
Analysis goal: A mail-order catalog retailer wants to save money on mailing and increase revenue by targeting mailed catalogs to customers who are most likely to purchase in the future.
Data set: CATALOG2010
- Number of rows: 48,356
- Number of columns: 98
- Contents: sales figures summarized across departments and quarterly totals for 5.5 years of sales
- Targets: RESPOND (binary), ORDERSIZE (continuous)

Catalog Case Study: Basics
Throughout this chapter, you work with data in SAS Enterprise Miner to perform exploratory analysis.
1. Import the CATALOG2010 data.
2. Identify the target variables.
3. Define and transform the variables for use in RFM analysis.
4. Perform graphical RFM analysis in SAS Enterprise Miner.
Later, you use the CATALOG2010 data for predictive modeling and scoring.

Accessing and Importing Data for Modeling
First, get familiar with the data! The data file is a SAS data set.
1. Create a project in SAS Enterprise Miner.
2. Create a diagram.
3. Locate and import the CATALOG2010 data.
4. Define characteristics of the data set, such as the variable roles and measurement levels.
5. Perform a basic exploratory analysis of the data.

Defining a Data Source
[Figure: the CATALOG data registered through the ABA1 library on the SAS Foundation server]

Metadata Definition
Select a table, then set the metadata information. Three purposes for metadata:
- Define variable roles (such as input, target, or ID).
- Define measurement levels (such as binary, interval, or nominal).
- Define table role (such as raw data, transactional data, or scoring data).

Creating Projects and Diagrams in SAS Enterprise Miner
Catalog Case Study Task: Create a project and a diagram in SAS Enterprise Miner.

Defining a Data Source
Catalog Case Study Task: Define the CATALOG data source in SAS Enterprise Miner.

Defining Column Metadata
Catalog Case Study Task: Define column metadata.

Changing the Sampling Defaults in the Explore Window and Exploring a Data Source
Catalog Case Study Tasks: Change preference settings in the Explore window and explore variable associations.

Idea Exchange
Consider an academic retention example.
Freshmen enter a university in the fall term, and some of them drop out before the second term begins. Your job is to try to predict whether a student is likely to drop out after the first term.
- What types of variables would you consider using to assess this question?
- How does time factor into your data collection? Do inferences about students five years ago apply to students today? How do changes in technology, university policies, and teaching trends affect your conclusions?
- As an administrator, do you have this information? Could you obtain it?
- What types of data quality issues do you anticipate?
- Are there any ethical considerations in accessing the information in your study?

Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques  2.2 Data Management  2.3 Data Difficulties  2.4 SAS Enterprise Miner: A Primer  2.5 Honest Assessment  2.6 Methodology  2.7 Recommended Reading

Objectives
- Explain the characteristics of a good predictive model.
- Describe data splitting.
- Discuss the advantages of using honest assessment to evaluate a model and obtain the model with the best prediction.

Predictive Modeling Implementation
- Model selection and comparison – Which model gives the best prediction?
- Decision/allocation rule – What actions should be taken on new cases?
- Deployment – How can the predictions be applied to new cases?

Getting the "Best" Prediction: Fool's Gold
"My model fits the training data perfectly... I've struck it rich!"

2.05 Poll
The best model is a model that does a good job of predicting your modeling data.
- Yes
- No

2.05 Poll – Correct Answer: No. The best model is the one that predicts new data well, not the one that fits the modeling data most closely.
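The poll's point – that fit to the modeling data alone is misleading – is why the data is split before modeling. A minimal, hypothetical sketch of a three-way split in plain Python (fractions and seed are invented defaults):

```python
import random

def split(cases, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle once, then partition into training/validation/test sets."""
    cases = list(cases)
    random.Random(seed).shuffle(cases)        # fixed seed: reproducible split
    n_train = int(len(cases) * train_frac)
    n_valid = int(len(cases) * valid_frac)
    return (cases[:n_train],                  # fit candidate models here
            cases[n_train:n_train + n_valid], # compare and tune models here
            cases[n_train + n_valid:])        # report honest performance here

train, valid, test = split(range(1000))
```

Candidate models are fit on the training set, compared on the validation set, and only the final choice is scored once on the test set for an honest assessment.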
Model Complexity
[Figure: three fits to the same training data – one too rigid, one too flexible (chasing noise), and one just right]

Data Splitting and Honest Assessment

Overfitting
[Figure: a complex model fits the training set almost perfectly but falls apart on the test set]

Better Fitting
[Figure: a model chosen with the help of a validation set generalizes far better]

Predictive Modeling Implementation
- Model selection and comparison – Which model gives the best prediction?
- Decision/allocation rule – What actions should be taken on new cases?
- Deployment – How can the predictions be applied to new cases?

Decisions, Decisions
Different probability cutoffs lead to different confusion matrices. For 1,000 cases (900 actual 0s, 100 actual 1s):

Cutoff  Actual 0: pred 0/1  Actual 1: pred 0/1  Accuracy  Sensitivity  Lift
.08     360 / 540           20 / 80             44%       80%          1.3
.10     540 / 360           40 / 60             60%       60%          1.4
.12     720 / 180           60 / 40             76%       40%          1.8

Misclassification Costs
Predicted class versus actual class yields true negatives, false positives, false negatives, and true positives. Costs need not be symmetric:

               Action: Accept   Action: Deny
Actual OK      0                1
Actual Fraud   9                0

Here a missed fraud costs nine times as much as a wrongly denied good customer.

Predictive Modeling Implementation
- Model selection and comparison – Which model gives the best prediction?
- Decision/allocation rule – What actions should be taken on new cases?
- Deployment – How can the predictions be applied to new cases?

Scoring
Model development is followed by model deployment: scoring new cases.

Scoring Recipe
- The model results in a formula or rules.
- The data requires modifications: derived inputs, transformations, missing value imputation.
- The scoring code is deployed. To score, you do not rerun the algorithm; apply the score code (equations) obtained from the final model to the scoring data.

Scorability
[Figure: a tree classifier partitions the (X1, X2) plane; a new case is scored by the region it falls into]
Scoring code: If x1 < .47 and x2 < .18, or x1 > .47 and x2 > .29, then red.

Scoring Pitfalls: Population Drift
[Figure: time line from data generated, to data acquired, to data cleaned, to data analyzed, to model deployed – by deployment time, the population may have drifted from the one the model was built on]

The Secret to Better Predictions
[Figure: Fraud and OK cases separated ever more cleanly by Transaction Amount across three builds – the punchline, "Cheatin' Heart," is that suspiciously good separation typically means an input has leaked information about the target]
Idea Exchange
Think of everything that you have done in the past week.
- What transactions or actions created data? For example, point-of-sale transactions, Internet activity, surveillance, and questionnaires are all data collection avenues that many people encounter daily.
- How do you think that the data about you will be used?
- How could models be deployed that use data about you?

Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques  2.2 Data Management  2.3 Data Difficulties  2.4 SAS Enterprise Miner: A Primer  2.5 Honest Assessment  2.6 Methodology  2.7 Recommended Reading

Objectives
- Describe a methodology for implementing business analytics through data mining.
- Discuss each of the steps, with examples, in the methodology.
- Create a project and diagram in SAS Enterprise Miner.

Methodology
Data mining is not a linear process. It is a cycle, where later results can lead back to previous steps:
- Define or refine the business objective
- Select data
- Explore input data
- Prepare and repair data
- Transform input data
- Apply analysis
- Assess results
- Deploy models

Why Have a Methodology?
- To avoid learning things that are not true
- To avoid learning things that are not useful
  – results that arise from past marketing decisions
  – results that you already know
  – results that you already should know
  – results that you are not allowed to use
- To create stable models
- To avoid making the mistakes that you made in the past
- To develop useful tips from what you learned

Methodology
1. Define the business objective and state it as a data mining task.

1) Define the Business Objective
- Improve the response rate for a direct marketing campaign.
- Increase the average order size.
- Determine what drives customer acquisition.
- Forecast the size of the customer base in the future.
- Choose the right message for the right groups of customers.
- Target a marketing campaign to maximize incremental value.
- Recommend the next, best product for existing customers.
- Segment customers by behavior.
A lot of good statistical analysis is directed at solving the wrong business problem.

Define the Business Goal
Example: Who is the yogurt lover? What is a yogurt lover? One answer prints coupons at the cash register. Another answer mails coupons to people's homes. Another results in advertising.

Big Challenge: Defining a Yogurt Lover
[Figure: shoppers plotted by dollars spent on yogurt (low/medium/high) against yogurt as a percentage of all purchases (low/medium/high)]
"Yogurt lover" is not in the data. You can impute it using business rules:
- Yogurt lovers spend a lot of money on yogurt.
- Yogurt lovers spend a relatively large amount of their shopping dollars on yogurt.

Next Challenge: Profile the Yogurt Lover
You have identified a segment of customers that you believe are yogurt lovers. But who are they? How would I know them in the store?
- Identify them by demographic data.
- Identify them by other things that they purchase (for example, yogurt lovers are people who buy nutrition bars and sports drinks).
What action can I take? Set up "yogurt-lover-attracting" displays.

Idea Exchange
If a customer is identified as a yogurt lover, what action should you take?
- Should you give yogurt coupons, even though these individuals buy yogurt anyway?
- Is there a cross-sell opportunity?
- Is there an opportunity to identify potential yogurt lovers?
- What would you do?

Profiling in the Extreme: Best Buy
Using analytical methodology, electronics retailer Best Buy discovered that a small percentage of customers accounted for a large percentage of revenue. Over the past several years, the company has adopted a customer-centric approach to store design and flow, staffing, and even corporate acquisitions such as the Geek Squad support team. The company's largest competitor has gone bankrupt while Best Buy has seen growth in market share.
See Gulati (2010).

Define the Business Objective
What is the business objective?
Example: Telco Churn
Initial problem: Assign a churn score to all customers. Complications:
- Recent customers with little call history
- Telephones? Individuals? Families?
- Voluntary churn versus involuntary churn
- How will the results be used?
Better objective: By September 24, provide a list of the 10,000 elite customers who are most likely to churn in October. The new objective is actionable.

Define the Business Objective
Example: Credit Churn
How do you define the target? When did a customer leave?
- When she has not made a new charge in six months?
- When she has had a zero balance for three months?
- When the balance does not support the cost of carrying the customer?
- When she cancels her card?
- When the contract ends?
[Chart: churn rate by tenure, 0 to 15 months; the y-axis runs from 0.0% to 3.0%]

Translate Business Objectives into Data Mining Tasks
Do you already know the answer?
- In supervised data mining, the data has examples of what you are looking for, such as customers who responded in the past, customers who stopped, or transactions identified as fraud.
- In unsupervised data mining, you are looking for new patterns, associations, and ideas.

Data Mining Tasks Lead to Specific Techniques
Objectives: customer acquisition, credit risk, pricing, customer churn, fraud detection, discovery, customer value.
Tasks: exploratory data analysis, binary response modeling, multiple response modeling, estimation, forecasting, detecting outliers, pattern detection.
Data Mining Tasks Lead to Specific Techniques
Objectives: customer acquisition, credit risk, pricing, customer churn, fraud detection, discovery, customer value.
Tasks: exploratory data analysis, binary response modeling, multiple response modeling, estimation, forecasting, detecting outliers, pattern detection.
Techniques: decision trees, regression, neural networks, survival analysis, clustering, association rules, link analysis, hypothesis testing, visualization.

Data Analysis Is Pattern Detection
Patterns might not represent any underlying rule.
- Some patterns reflect an underlying reality: the party that holds the White House tends to lose seats in Congress during off-year elections.
- Others do not: when the American League wins the World Series in Major League Baseball, Republicans take the White House; stars cluster in constellations.
- Sometimes, it is difficult to tell without analysis: in U.S. presidential contests, the taller candidate usually wins.

Example: Maximizing Donations
Example from the KDD Cup, a data mining competition associated with the KDD Conference (www.sigkdd.org):
- Purpose: maximizing profit for a charity fundraising campaign
- Tested on actual results from the mailing (using data withheld from competitors)
Competitors took multiple approaches to the modeling: modeling who will respond, modeling how much people will give, and perhaps more esoteric approaches. However, the top three winners all took the same approach (although they used different techniques, methods, and software).

The Winning Approach: Expected Revenue
- Task: Estimate response(person), the probability that a person responds to the mailing (all customers).
- Task: Estimate the value of a response, dollars(person) (only customers who respond).
- Choose prospects with the highest expected value, response(person) x dollars(person).
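The winning approach combines the two estimates into one ranking. A toy sketch with invented probabilities, gift amounts, and an assumed mailing cost (none of these numbers come from the KDD Cup data):

```python
# Invented prospects: (P(respond), expected gift if responding).
prospects = {"A": (0.05, 50.0), "B": (0.20, 10.0),
             "C": (0.01, 400.0), "D": (0.10, 15.0)}
MAIL_COST = 0.68   # assumed cost of one mail piece

# Expected net revenue per prospect; mail only where it is positive.
ev = {name: p * gift - MAIL_COST for name, (p, gift) in prospects.items()}
mail_list = sorted((n for n, v in ev.items() if v > 0),
                   key=lambda n: ev[n], reverse=True)
```

Prospect C is a rare but large donor: ranking by response(person) x dollars(person) keeps C at the top even though C's response probability alone looks hopeless.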
In most business applications, as people take an action more often, they spend more money. Donors to a charity are different. This suggests that potential donors go through a two-step process:
1. Shall I respond to this mailing?
2. How much money should I give this time?
Modeling can follow the same logic.

Methodology
2. Select or collect the appropriate data to address the problem. Identify the customer signature.
[Diagram of the methodology cycle: define or refine business objective, select data, explore input data, prepare and repair data, transform input data, apply analysis, deploy models, assess results]

2) Select Appropriate Data
What is available? What is the right level of granularity? How much data is needed? How much history is required? How many variables should be used? What must the data contain?
Assemble the results into customer signatures.

Representativeness of the Training Sample
The model set might not reflect the relevant population:
– Customers differ from prospects.
– Survey responders differ from non-responders.
– People who read e-mail differ from people who do not read e-mail.
– Customers who started three years ago might differ from customers who started three months ago.
– People with land lines differ from those without.

Availability of Relevant Data
– Elevated printing defect rates might be due to humidity, but that information is not in press run records.
– Poor coverage might be the number one reason for wireless subscribers canceling their subscriptions, but data about dropped calls is not in billing data.
– Customers might already have potential cross-sell products from other companies, but that information is not available internally.

Types of Attributes in Data
Readily supported: binary, categorical (nominal), numeric (interval), date and time.
Require more work: text, image, video, links.

Idea Exchange
Suppose that you were in charge of a charity similar to the KDD example above. What type of data are you likely to have available before beginning the project?
Is there additional data that you would need? Do you have to purchase the data, or is it publicly available for free? How could you make the best use of a limited budget to acquire high-quality data about individual donation patterns?

The Customer Signature
Each row generally corresponds to a customer, and the primary key uniquely identifies each row, often corresponding to a customer ID. The target columns are what you are looking for; sometimes the information is in multiple columns, such as a churn flag and a churn date. A foreign key gives access to data in another table, such as ZIP code demographics. Some columns are ignored because the values are not predictive, because they contain future information, or for other reasons.

Data Assembly Operations
– Copying
– Pivoting
– Table lookup
– Derivation of new variables
– Summarization of values from data
– Aggregation

Methodology
3. Explore the data. Look for anomalies. Consider time-dependent variables. Identify key relationships among variables.

3) Explore the Data
Examine distributions. Study histograms. Think about extreme values. Notice the prevalence of missing values. Compare values with descriptions. Validate assumptions. Ask many questions.

Ask Many Questions
– Why were some customers active for 31 days in February, but none were active for more than 28 days in January?
– How do some retail card holders spend more than $100,000 in a week in a grocery store?
– Why were so many customers born in 1911? Are they really that old?
– Why do Safari users never make second purchases?
– What does it mean when the contract begin date is after the contract end date?
– Why are there negative numbers in the sale price field?
– How can active customers have a non-null value in the cancellation reason code field?
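Questions like these can be encoded as automated data-quality checks that run before modeling. A minimal sketch; the record layout and field names are hypothetical:

```python
from datetime import date

def sanity_check(record):
    """Return data-quality warnings for one customer record.
    The field names used here are hypothetical."""
    problems = []
    if record["contract_begin"] > record["contract_end"]:
        problems.append("contract begin date is after the end date")
    if record["sale_price"] < 0:
        problems.append("negative sale price")
    if record["status"] == "active" and record["cancel_reason"] is not None:
        problems.append("active customer with a cancellation reason code")
    return problems

suspect = {"contract_begin": date(2024, 5, 1),
           "contract_end": date(2023, 5, 1),
           "sale_price": -10.0,
           "status": "active",
           "cancel_reason": "price"}
issues = sanity_check(suspect)   # all three checks fire for this record
```

Running checks like these over every row turns one-off questions into a repeatable audit of the input data.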
Be Wary of Changes over Time
– Does the same code have the same meaning in historical data?
– Did different data elements start being loaded at different points in time?
– Did something happen at a particular point in time?
[Chart: price-related cancelations by month over two years; annotations mark a price increase and the later point where price complaints stop]

Methodology
4. Prepare and repair the data. Define metadata correctly. Partition the data and create balanced samples, if necessary.

4) Prepare and Repair the Data
– Set up a proper temporal relationship between the target variable and inputs.
– Create a balanced sample, if possible.
– Include multiple time frames if necessary.
– Split the data into training, validation, and (optionally) test data sets.

Temporal Relationship: Prediction or Profiling?
The same techniques work for both. In a predictive model, the values of explanatory variables come from an earlier time frame than the target variable. In a profiling model, the explanatory variables and the target variable might all come from the same time frame.

Balancing the Input Data Set
A very accurate model simply predicts that no one wants a brokerage account: 98.8% accurate, a 1.2% error rate. This is useless for differentiating among customers.
Distribution of the brokerage target variable: Brokerage = "Y": 2,355 customers; Brokerage = "N": 228,926 customers.

Two Ways to Create Balanced Data

Data Splitting and Validation
Improving the model causes the error rate to decline on the data used to build it. At the same time, the model becomes more complex.
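The split into training, validation, and (optionally) test sets can be sketched as a random partition. A minimal illustration; the 60/20/20 fractions are a common choice, not one prescribed by the course:

```python
import random

def partition(rows, seed=12345, frac_train=0.6, frac_valid=0.2):
    """Randomly split rows into training, validation, and test sets.
    Seeding the shuffle makes the partition reproducible."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * frac_train)
    n_valid = int(len(rows) * frac_valid)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])

train, valid, test = partition(range(1000))
# 600 training rows, 200 validation rows, 200 test rows;
# every row lands in exactly one of the three sets
```

The training set builds candidate models, the validation set picks among them, and the test set (when used) only estimates the final model's performance.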
[Chart: training error rate falling as models become more complex]

Validation Data Prevents Overfitting
[Chart: as models become more complex, training error keeps falling by fitting noise, while validation error falls to a sweet spot and then rises once the signal is exhausted]

Partitioning the Input Data Set
– Training: use the training set to find patterns and create an initial set of candidate models.
– Validation: use the validation set to select the best model from the candidate set of models.
– Test: use the test set to measure performance of the selected model on unseen data. The test set can be an out-of-time sample of the data, if necessary.
Partitioning data is an allowable luxury because data mining assumes a large amount of data. Test sets do not help select the final model; they only provide an estimate of the model's effectiveness in the population. Test sets are not always used.

Fix Problems with the Data
Data imperfectly describes the features of the real world:
– Data might be missing or empty.
– Samples might not be representative.
– Categorical variables might have too many values.
– Numeric variables might have unusual distributions and outliers.
– Meanings can change over time.
– Data might be coded inconsistently.

No Easy Fix for Missing Values
– Throw out the records with missing values? No. This creates a bias in the sample.
– Replace missing values with a "special" value such as -99? No. It resembles any other value to a data mining algorithm.
– Replace with some "typical" value? Maybe. Replacement with the mean, median, or mode changes the distribution, but predictions might be fine.
– Impute a value? Maybe, either by using the distribution of values to randomly choose a value or by modeling the imputed value with some technique. (Imputed values should be flagged.)
– Use data mining techniques that can handle missing values? Yes. One of these, decision trees, is discussed later.
– Partition records and build multiple models? Yes. This is possible when data is missing for a canonical reason, such as insufficient history.

Methodology
5. Transform data.
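The "replace with a typical value and flag imputed entries" option from the missing-values list can be sketched as follows; the income figures are invented:

```python
import statistics

def impute_median_with_flag(values):
    """Replace missing (None) entries with the median of the observed
    values; return a parallel 0/1 flag marking which were imputed."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    filled = [med if v is None else v for v in values]
    imputed = [1 if v is None else 0 for v in values]
    return filled, imputed

incomes = [30000, None, 50000, 40000, None]   # invented values
filled, imputed = impute_median_with_flag(incomes)
# filled -> [30000, 40000, 50000, 40000, 40000]
# imputed -> [0, 1, 0, 0, 1]
```

The flag column preserves the information that a value was missing, which can itself be predictive.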
Standardize, bin, combine, replace, impute, log, and so on.

5) Transform Data
– Standardize values into z-scores.
– Change counts into percentages.
– Remove outliers.
– Capture trends with ratios, differences, or beta values.
– Combine variables to bring information to the surface.
– Replace categorical variables with some numeric function of the categorical values.
– Impute missing values.
– Transform using mathematical functions, such as logs.
– Translate dates to durations.
Example: body mass index (kg/m²) is a better predictor of diabetes than either height or weight separately.

A Selection of Transformations
Standardize numeric values. All numeric values are replaced by the notion of "how far is this value from the average?" Conceptually, all numeric values are then in the same range. (The actual range differs, but the meaning is the same.) Although standardization sometimes has no effect on the results (such as for decision trees and regression), it never produces worse results. It is so useful that it is often built into SAS Enterprise Miner modeling nodes.

A Selection of Transformations
"Stretching" and "squishing" transformations: log, reciprocal, and square root are examples.
Replace categorical values with appropriate numeric values: many techniques work better with numeric values than with categorical values. Historical projections (such as handset churn rate or penetration by ZIP code) are particularly useful.

Methodology
6. Apply analysis. Fit many candidate models, try different solutions, try different sets of input variables, and select the best model.
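The first transformation listed, standardizing values into z-scores ("how far is this value from the average?"), can be sketched as:

```python
import statistics

def zscore(values):
    """Replace each value with its distance from the mean, measured
    in (population) standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

minutes = [10.0, 20.0, 30.0]   # invented usage values
z = zscore(minutes)            # roughly [-1.22, 0.0, 1.22]
```

After standardization, variables measured on very different scales (dollars, minutes, counts) carry the same "distance from average" meaning.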
6) Apply Analysis
– Regression
– Decision trees
– Cluster detection
– Association rules
– Neural networks
– Memory-based reasoning
– Survival analysis
– Link analysis
– Genetic algorithms

Train Models
Build candidate models by applying a data mining technique (or techniques) to the training data.
[Diagram: shared inputs feeding Model 1, Model 2, and Model 3, each producing its own output]

Assess Models
Assess models by applying them to the validation data set. Score the validation data using the candidate models and then compare the results. Select the model with the best performance on the validation data set. Communicate model assessments through quantitative measures and graphs.

Look for Warnings in Models
Trailing indicators: learning things that are not true. Does declining usage in month 8 predict attrition in month 9?
[Chart: minutes of use by tenure, months 1-11; usage drops sharply around month 8]

Look for Warnings in Models
Perfect models: things that are too good to be true. For example, 100% of customers who spoke to a customer support representative canceled a contract. Eureka! It's all I need to know! In fact, when a customer cancels, that customer is automatically flagged to get a call from customer support, so the information is useless in predicting cancellation. Models that seem too good usually are.

Idea Exchange
What are some other warning signs that you can think of in modeling? Have you experienced any pitfalls that were memorable or that changed how you approach data analysis objectives?

Methodology
7. Deploy models. Score new observations, make model-based decisions. Gather results of model deployment.
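Selecting the champion model by validation performance, as described under Assess Models, can be sketched as follows; the model names and error rates are invented:

```python
def select_champion(validation_error):
    """Pick the candidate with the lowest error rate measured on the
    validation set (never on the training set)."""
    return min(validation_error, key=validation_error.get)

# Invented validation error rates for three candidate models
errors = {"regression": 0.142, "decision_tree": 0.128, "neural_net": 0.131}
champion = select_champion(errors)   # "decision_tree"
```

Choosing on validation rather than training error is what protects the selection from rewarding overfit candidates.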
7) Deploy Models and Score New Data

Methodology
8. Assess the usefulness of the model. If the model has gone stale, revise it.

8) Assess Results
– Compare actual results against expectations.
– Compare the challenger's results against the champion's.
– Did the model find the right people? Did the action affect their behavior?
– What are the characteristics of the customers most affected by the intervention?

Good Test Design Measures the Impact of Both the Message and the Model
– Target Group: chosen by model, receives the message. Response measures the message with the model.
– Control Group: chosen at random, receives the message. Response measures the message without the model.
– Modeled Holdout: chosen by model, receives no message. Response measures the model without the message.
– Holdout Group: chosen at random, receives no message. Response measures the background response.
Comparing the target group with the control group measures the impact of the model on the group getting the message; comparing the target group with the modeled holdout measures the impact of the message on the group with good model scores.

Test Mailing Results
[Chart: e-mail campaign test response rates for the target, control, and holdout groups; the target group shows a lift of 3.5]

Methodology
9. As you learn from earlier model results, refine the business goals to gain more from the data.

9) Begin Again
Revisit business objectives. Define new objectives. Gather and evaluate new data: model scores, cluster assignments, responses.
Example: a model discovers that geography is a good predictor of churn. What do the high-churn geographies have in common?
Is the pattern your model discovered stable over time?

Lessons Learned
Data miners must be careful to avoid pitfalls, particularly with regard to spurious patterns in the data:
– learning things that are not true or not useful
– confusing signal and noise
– creating unstable models
A methodology is a way of being careful.

Idea Exchange
Outline a business objective of your own in terms of the methodology described here. What is your business objective? Can you frame it in terms of a data mining problem? How will you select the data? What are the inputs? What do you want to look at to get familiar with the data?

Idea Exchange (continued)
Anticipate any data quality problems that you might encounter and how you could go about fixing them. Do any variables require transformation? Proceed through the remaining steps of the methodology as you consider your example.

Basic Data Modeling
A common approach to modeling customer value is RFM analysis, so named because it uses three key variables:
– Recency: how long it has been since the customer's last purchase
– Frequency: how many times the customer has purchased something
– Monetary value: how much money the customer has spent
RFM variables tend to predict responses to marketing campaigns effectively. RFM is a special case of OLAP.

RFM Cell Approach
[Diagram: customers cross-classified into cells along the recency, frequency, and monetary value dimensions]

RFM Cell Approach
A typical approach to RFM analysis is to bin customers into (approximately) equal-sized groups on each of the rank-ordered R, F, and M variables. For example:
– five bins on R (highest bin = most recent)
– five bins on F (highest bin = most frequent)
– five bins on M (highest bin = highest value)
The combination of the bins gives an RFM "score" that can be compared to some target or outcome variable. A customer score of 555 means the most recent quintile, the most frequent quintile, and the highest spending quintile.
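The quintile binning and score concatenation described above can be sketched as follows. The ten customers and their R, F, and M values are invented, with recency already recoded so that a higher value means more recent:

```python
def quintile_bins(values):
    """Assign each value a bin 1-5 by rank, so that bin 5 holds the
    most valuable fifth. Ties are broken by position, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * 5 // len(values) + 1
    return bins

# Ten hypothetical customers (recency recoded: highest = most recent)
recency   = [6, 2, 7, 5, 1, 8, 3, 9, 4, 10]
frequency = [12, 1, 8, 5, 2, 9, 3, 7, 4, 15]
monetary  = [900, 50, 400, 250, 60, 700, 80, 300, 120, 1200]

r, f, m = (quintile_bins(v) for v in (recency, frequency, monetary))
scores = [f"{a}{b}{c}" for a, b, c in zip(r, f, m)]
# The last customer is in the top quintile on all three: score "555"
```

With five bins per variable, the concatenated score can take up to 125 values, which is the source of the "so many categories" limitation discussed below.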
Computing Profitability in RFM
Break-even response rate = cost of promotion to an individual / average net profit per sale.
Example: it costs $2.00 to print and mail each catalog, and the average net profit per transaction is $30. Then 2.00 / 30.00 = 0.067, so profitable RFM cells are those with a response rate greater than 6.7%.

RFM Analysis of the Catalog Data
– Recode recency so that the highest values are the most recent.
– Bin the R, F, and M variables into five groups each, numbered 1-5, so that 1 is the least valuable and 5 is the most valuable bin.
– Concatenate the R, F, and M bins to obtain a single RFM "score."
– Graphically investigate the response rates for the different groups.

Performing RFM Analysis of the Catalog Data
Catalog case study task: perform RFM analysis on the catalog data.

Performing Graphical RFM Analysis
Catalog case study task: perform graphical RFM analysis.

Limitations of RFM
– Only three variables are used, while modern data collection processes offer rich information about preferences, behaviors, attitudes, and demographics.
– Scores are entirely categorical: 515, 551, and 155 are equally good if the RFM variables are of equal importance, yet sorting by the RFM values is not informative and overemphasizes recency.
– There are many categories: the simple example above results in 125 groups.
– RFM is not very useful for finding prospective customers; the statistics are descriptive.

Idea Exchange
Would RFM analysis apply to a business objective that you are considering? If so, what would be your R, F, and M variables? What other basic analytical techniques could you use to explore your data and get preliminary answers to your questions?

Exercise Scenario
Practice with a charity direct mail example.
Analysis goal: a veteran's organization seeks continued contributions from lapsing donors. Use lapsing donor response from an earlier campaign to predict future lapsing donor response.
Exercise Data (PVA97NK)
The data is extracted from the previous year's campaign. The sample is balanced with regard to the response/non-response rate; the actual response rate is approximately 5%.

R, F, M Variables in the Charity Data Set
In the data set PVA97NK, the following variables should be used for RFM analysis:
– GiftTimeLast: time since last gift (Recency)
– GiftCntAll: gift count over all months (Frequency)
– Monetary value must be computed as GiftAvgAll * GiftCntAll (average gift amount over lifetime times total gift count).
Use SAS Enterprise Miner to create the RFM variables and bins, and then perform graphical RFM analysis.

Exercise
This exercise reinforces the concepts discussed previously.

Chapter 2: Basics of Business Analytics
2.1 Overview of Techniques
2.2 Data Management
2.3 Data Difficulties
2.4 SAS Enterprise Miner: A Primer
2.5 Honest Assessment
2.6 Methodology
2.7 Recommended Reading

Recommended Reading
Davenport, Thomas H., Jeanne G. Harris, and Robert Morison. 2010. Analytics at Work: Smarter Decisions, Better Results. Boston: Harvard Business Press. Chapters 2 through 6 (the DELTA method) present a complementary perspective to this chapter on how to integrate analytics at various levels of the organization.