Intelligent Miner for Data Applications Guide

Peter Cabena, Hyun Hee Choi, Il Soo Kim, Shuichi Otsuka, Joerg Reinschmidt, Gary Saarenvirta

International Technical Support Organization
http://www.redbooks.ibm.com
SG24-5252-00
March 1999

Take Note! Before using this information and the product it supports, be sure to read the general information in Appendix A, "Special Notices."

First Edition (March 1999). This edition applies to Version 2, Release 1 of the Intelligent Miner for Data, Program Number 5801-AAR, for use with the AIX Operating System.

Comments may be addressed to: IBM Corporation, International Technical Support Organization, Dept. QXXE, Building 80-E2, 650 Harry Road, San Jose, California 95120-6099. When you send information to IBM, you grant IBM a non-exclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.

Copyright International Business Machines Corporation 1999. All rights reserved. Note to U.S. Government Users: documentation related to restricted rights. Use, duplication or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.

Contents

Figures
Tables
Preface
  The Team That Wrote This Redbook
  Comments Welcome
Chapter 1. Introduction
  1.1 Why Now?
    1.1.1 Changed Business Environment
    1.1.2 Drivers
    1.1.3 Enablers
  1.2 What Is Data Mining?
  1.3 Data Mining and Business Intelligence
    1.3.1 Where to from Here?
  1.4 Data Mining Applications
  1.5 Data Mining Techniques
    1.5.1 Predictive Modeling
    1.5.2 Database Segmentation
    1.5.3 Link Analysis
  1.6 General Approach to Data Mining
    1.6.1 Business Requirements Analysis
    1.6.2 Project Management
    1.6.3 Business Solution Design
    1.6.4 Data Mining Run
    1.6.5 Business Implementation Design
    1.6.6 Business Implementation
    1.6.7 Results Tracking
    1.6.8 Final Business Result Determination
    1.6.9 Business Result Analysis
Chapter 2. Introduction to the Intelligent Miner
  2.1 History
  2.2 Intended Customers
  2.3 What Is the Intelligent Miner?
  2.4 Data Mining with the Intelligent Miner
  2.5 Overview of the Intelligent Miner Components
    2.5.1 Intelligent Miner Architecture
    2.5.2 Intelligent Miner TaskGuides
    2.5.3 Mining and Statistics Functions
    2.5.4 Processing Functions
    2.5.5 Modes
Chapter 3. Case Study Framework
  3.1 Customer Relationship Management
  3.2 Case Studies
  3.3 Strategic Customer Segmentation
  3.4 Case Studies
Chapter 4. Customer Segmentation
  4.1 Executive Summary
  4.2 Business Requirements
  4.3 Data Mining Process
    4.3.1 Data Selection
    4.3.2 Data Preparation
    4.3.3 Data Mining
  4.4 Data Mining Results
    4.4.1 Cluster Details Analysis
    4.4.2 Cluster Characterization
    4.4.3 Cluster Profiling
    4.4.4 Decision Tree Characterization
  4.5 Business Implementation and Next Steps
Chapter 5. Cross-Selling Opportunity Identification
  5.1 Executive Summary
  5.2 Business Requirement
  5.3 Data Mining Process
    5.3.1 Cluster Selection
    5.3.2 Data Selection
    5.3.3 Data Preparation
    5.3.4 Product Association Analysis
  5.4 Data Mining Results
    5.4.1 Cluster Selection
    5.4.2 Association Rule Discovery
  5.5 Business Implementation and Next Steps
Chapter 6. Target Marketing Model to Support a Cross-Selling Campaign
  6.1 Executive Summary
  6.2 Business Requirements
  6.3 Data Mining Process
    6.3.1 Create Objective Variable
    6.3.2 Data Preparation
    6.3.3 Data Sampling for Training and Test
    6.3.4 Feature Selection
    6.3.5 Train and Test
    6.3.6 Select "Best Model"
    6.3.7 Perform Population Stability Tests on Application Universe
  6.4 Data Mining Results
    6.4.1 Decision Tree
    6.4.2 RBF
    6.4.3 Neural Network
  6.5 Business Implementation
Chapter 7. Attrition Model to Improve Customer Retention
  7.1 Executive Summary
  7.2 Business Requirement
  7.3 Data Mining Process
    7.3.1 Data Definition
    7.3.2 Data Preparation
    7.3.3 Data Mining
    7.3.4 Gains Chart
    7.3.5 Clustering
  7.4 Data Mining Results
    7.4.1 Decision Tree
    7.4.2 RBF Modeling
    7.4.3 Neural Network
    7.4.4 Clustering
    7.4.5 Time-Series Prediction
  7.5 Business Implementation
Chapter 8. Intelligent Miner Advantages
Appendix A. Special Notices
Appendix B. Related Publications
  B.1 International Technical Support Organization Publications
  B.2 Redbooks on CD-ROMs
  B.3 Other Publications
How to Get ITSO Redbooks
  How IBM Employees Can Get ITSO Redbooks
  How Customers Can Get ITSO Redbooks
  IBM Redbook Order Form
Glossary
List of Abbreviations
Index
ITSO Redbook Evaluation

Figures

1. New Customer Relationships Out of Reach
2. Data Mining Positioning
3. Data Mining and Business Intelligence
4. Predictive Modeling
5. Database Segmentation
6. Pattern Matching
7. The Data Mining Process
8. The Intelligent Miner Architecture
9. The Data Task Guide
10. Customer Segmentation Model
11. Data Mining Process: Customer Segmentation
12. Customer Transaction Data Model
13. Original Data Profile
14. Post-Discretized Data Profile
15. Post Logarithm Transformed Data Profile
16. Clustering Process Flow
17. Shareholder Value Demographic Clusters
18. Shareholder Value Neural Network Clusters
19. Shareholder Value Demographic Cluster Details
20. Cluster 6 Detailed View
21. Cluster 3 Detailed View
22. Cluster 5 Detailed View
23. Cluster 5 Tabulated Details
24. Cluster 1 Detailed View
25. Decision Tree Confusion Matrix
26. Decision Tree Model
27. Data Mining Process: Cross-Selling Opportunity
28. Typical Transaction Record
29. Product Association Analysis Workflow
30. Parameter Settings for Associations
31. Associations on Good Customer Set
32. Associations on Good Customer Set Detail
33. Associations for Good Customer Set: LIS Removed
34. Associations for Good Customer Set: LIS Removed, Detail
35. Associations on Okay Customer Set
36. Associations on Okay Customer Set Detail
37. Associations for Okay Customer Set: LIS Removed
38. Associations for Good Customer Set: LIS Removed, Summary
39. Associations for Good Customer Set: LIS Removed, Detail
40. Associations for Good Customer Set: LIS and Certain Products Removed, Summary
41. Associations for Good Customer Set: LIS and Certain Products Removed, Detail
42. Associations for Okay Customer Set: LIS and Certain Products Removed, Summary
43. Associations for Okay Customer Set: LIS and Certain Products Removed, Detail
44. Associations for All Transactions: LIS Removed, Summary
45. Associations for All Transactions: LIS Removed, Detail
46. Data Mining Process: Cross-Selling
47. Creating an Objective Variable
48. Cross Selling: Data Sampling
49. Detailed Predictive Modeling Process
50. Decision Tree Results: Isolating the Key Decision Criteria
51. Gains Chart for Decision Tree Results
52. RBF Results
53. Cross-Selling: Comparison of Three Predictive Models
54. Cross-Selling: ROI Analysis Figures
55. Reducing Defections 5% Boosts Profits 25% to 85%
56. Data Mining Process: Attrition Analysis
57. Attrition Analysis: Data Definition
58. Time Series: Setting the Parameters
59. Attrition Analysis: Decision Tree Structure
60. Decision Tree Gains Chart: Training and Testing
61. RBF: Results Window
62. Attrition Analysis: Predicting Values Result
63. Attrition Analysis: Predicting Values
64. Attrition Analysis: Comparative Gains Charts for All Methods
65. Attrition Analysis: Demographic Clustering of Likely Defectors
66. Profile of Time-Series Prediction
67. Time Profile of Defection Probability for Defectors
68. Time Profile of Defection Probability for Nondefectors

Tables

1. Customer Revenue by Cluster
2. Comparison of Neural and Demographic Clustering Results
3. Demographic Clustering Results: Percentage
4. Cross-selling: Summary - Predictive Modeling More Than Doubles ROI
5. Cross-Selling: Baseline ROI Calculation
6. Cross-Selling: ROI Analysis Figures

Preface

This redbook is a step-by-step guide to data mining with Intelligent Miner Version 2. It will help customers better understand the usability and the business value of the product. The focus is on helping the Intelligent Miner V2 user determine which algorithms to use and how to effectively exploit them. The business utilized as a case study in the book is a retail bank client of Loyalty Consulting, an IBM business partner based in Toronto, Canada. After a short introduction to data mining technology and Intelligent Miner V2, the case study framework is described. The rest of the book covers each data mining technique in detail and provides ideas on how to implement the techniques. Although no in-depth knowledge of the Intelligent Miner V2 is required, a basic understanding of data mining technology is assumed.

The Team That Wrote This Redbook

This redbook was produced by a team of specialists from around the world working at the International Technical Support Organization, San Jose Center.

Peter Cabena is a data warehouse and data mining specialist at IBM's International Technical Support Organization, San Jose Center. He holds a Bachelor of Science degree in computer science from Trinity College, Dublin, Ireland. Peter has been extensively involved in the IBM data warehouse effort since its inception in 1991. In recent years, he has taught and presented internationally on the subjects of data warehousing and data mining. Peter conceived and managed the project that produced this book.

Hyun Hee Choi is a data mining researcher at the Korea Software Development Institute, a branch of IBM in Korea. She holds a Master of Science degree in statistics from Korea University, Seoul, Korea, where she focused her research on time-series analysis. Hyun Hee has several years of experience in data mining and business intelligence consulting projects for airline, banking, insurance, and credit card customer data analysis. She can be reached by e-mail at [email protected].

Il Soo Kim is a Business Intelligence Solution Specialist at IBM Korea. He holds a Master of Science degree in engineering from Seoul National University, Seoul, Korea. Il Soo specializes in content management. Recently he has been involved in constructing an in-house patent data warehouse and designing a patent data analysis program.

Shuichi Otsuka works for the Business Intelligence Solution Center, IBM Japan. He has been engaged for several years in data mining projects, mainly in distribution industries. Shuichi and his colleagues have translated Data Mining with Neural Networks by Joe Bigus into Japanese.

Joerg Reinschmidt is a data management and data mining specialist at IBM's International Technical Support Organization, San Jose Center. He has been engaged for several years in all data-management-related topics such as second-level support and technical marketing support. For the last several years, Joerg has taught several technical classes on DB2, while focusing on DB2 and IMS Internet connectivity.
Gary Saarenvirta is a principal consultant of Loyalty Consulting at The Loyalty Group in Toronto, Canada. He has worked in the business intelligence industry for more than eight years, providing data mining and data warehousing consulting services for Global 2000 companies. Gary joined The Loyalty Group to manage the design, construction, and operation of the company's data warehouse. He played a key role in the development of Loyalty Consulting's Decision Support business over the last few years. Gary was the lead editor of this book and conceived the framework and data mining methodology for each case study.

Thanks to the following people for their invaluable contributions to this project:

Hanspeter Nagel, International Technical Support Organization, San Jose Center
Susan Dahm, IBM Santa Teresa Laboratory
Ingrid Foerster, IBM Santa Teresa Laboratory

Comments Welcome

Your comments are important to us! We want our redbooks to be as helpful as possible. Please send us your comments about this or other redbooks in one of the following ways:

• Fax the evaluation form found in "ITSO Redbook Evaluation" to the fax number shown on the form.
• Use the electronic evaluation form found on the Redbooks Web sites:
  For Internet users: http://www.redbooks.ibm.com
  For IBM Intranet users: http://w3.itso.ibm.com
• Send us a note at the following address: [email protected]

Chapter 1. Introduction

Data mining is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large databases. The genesis of the field came with the realization that traditional decision-support methodologies, which combine simple statistical techniques with executive information systems, do not scale to the point where they can deal with large databases and data warehouses within the time limits imposed by today's business environment. Data mining has captured the imagination of the business and academic worlds, moving very quickly from a niche research discipline in the mid-eighties to a flourishing field today. In fact, 80% of the Fortune 500 companies are currently involved in a data mining pilot project or have already deployed one or more data mining production systems.

1.1 Why Now?

Much of the current upsurge of interest in data mining arises from the confluence of two forces: the need for data mining (drivers) and the means to implement it (enablers). The drivers are primarily the business environment changes that have resulted in an increasingly competitive marketplace. The enablers are mostly recent technical advances in machine learning research and database technologies. This happy coincidence of growing commercial pressures and major advances in research and information technology lends an inevitable push toward a more advanced approach to informing critical business decisions. Before looking at these drivers and enablers in some detail, it is worth reviewing the commercial backdrop against which these two forces are coming together.

1.1.1 Changed Business Environment

Today's business environment is in flux. Fundamental changes are influencing the way organizations view and plan to approach their customers. Among these changes are:

• Customer behavior patterns
  Consumers are becoming more demanding and have access to better information through buyers' guides, catalogs, and the Web. New demographics are emerging: only 15% of U.S. families are now traditional single-earner units, that is, a married couple with or without children where only the husband works outside the home. Many consumers are reportedly confused by too many choices and are starting to limit the number of businesses with which they are prepared to deal. They are starting to put more value on the time they spend shopping for goods and services.

• Market saturation
  Many markets have become saturated. For example, in the United States almost everyone uses a bank account, has at least one credit card, has some form of automobile and property insurance, and has well-established purchasing patterns in basic food items. Thus, in these areas, few options are available to organizations wanting to expand their market share. If a merger or takeover is not possible, such organizations often must resort to effectively stealing customers from competitors, frequently by what is called predatory pricing. Lowering prices is not a sound long-term strategy, however, as only one supplier can be the lowest-cost provider.

• New niche markets
  New, untapped markets are opening up. Examples are the handicapped and ethnic groups or the current U.S. inner-city hip-hop culture. Also, highly specialized stores such as SunGlass Hut are emerging.

• Increased commoditization
  Increased commoditization, where even many leading brand products and services are finding it increasingly difficult to differentiate themselves, has sent many suppliers in search of new distribution channels. Witness the increase in online service outlets, from catalogs to banking and insurance to Internet-based shopping malls.

• Traditional marketing approaches under pressure
  Traditional mass marketing and even database marketing approaches are becoming ineffective, as customers are increasingly turning to more targeted channels. Customers are shopping in fewer stores and are expecting to do more one-stop shopping.

• Time to market
  Time to market has become increasingly important. Witness the recent emergence and spectacular rise of Netscape Communications Corporation in the Web browser marketplace. With only a few months' lead over its rivals, Netscape captured an estimated 80% of the browser market within a year of establishment. This is the exception, of course; most companies operate by making small incremental changes to services or products to capture additional customers.

• Shorter product life cycles
  Today products are brought to market quickly but often have a short life cycle. This phenomenon is currently exemplified by the personal computer and Internet industries, where new products and services are offered at arguably faster rates than at any other time in the history of computing. The result of these shortened life cycles is that providers have less time to turn a profit or to "milk" their products and services.

• Increased competition and business risks
  Many of the above changes tend to combine to create a climate that is significantly competitive and a challenging risk management environment for many organizations. General trends like commoditization, globalization, deregulation, and the Internet make it increasingly difficult to keep track of competitive forces, both traditional and new. Equally, rapidly changing consumer trends inject new risks into doing business.
1.1.2 Drivers

Against this background, many organizations have been forced to reevaluate their traditional approaches to doing business and have started to look for ways to respond to changes in the business environment. The main requirements driving this reevaluation are:

• Focus on the customer
  The requirement here is to rejuvenate customer relationships with an emphasis on greater intimacy, collaboration, and one-to-one partnership. In turn, this requirement has forced organizations to ask new questions about their existing customers and potential customers, for example:
  − Which general classes of customer do I have?
  − How can I sell more to my existing customers?
  − Is there a recognizable pattern whereby my customers acquire products or use services?
  − Which of my customers will prove to be good, long-term valuable customers and which will not?
  − Can I predict which of my customers are more likely to default on their payments or to defraud me?

• Focus on the competition
  Organizations need to focus increasingly on competitive forces with a view to building up a modern armory of business weapons. Some of the approaches to building such an armory are:
  − Prediction of potential strategies or major business plans by leading competitors
  − Prediction of tactical movements by local competitors
  − Discovery of subpopulations of existing customers that are especially vulnerable to competitive offers

• Focus on the data asset
  Business and information technology (IT) managers are becoming increasingly aware that there is an information-driven opportunity to be seized. Many organizations are now beginning to view their accumulated data resources as a critical business asset. Some of the factors contributing to this growing awareness are:
  − Growing evidence of exponential return on investment (ROI) numbers from industry watchers and consultants on the benefits of a modern, corporate, decision-making strategy based on data-driven techniques such as data warehousing. Data mining is a high-leverage business where even small improvements in the accuracy of business decisions can have huge benefits.
  − Growing availability of data warehouses. As the data warehouse approach becomes more pervasive, early adopters are forced to leverage further value from their investments by pushing into new technology areas to maintain their competitive edge.
  − Growing availability of success stories, both anecdotal and otherwise, in the popular trade press.

Figure 1 summarizes the situation. The frustrated business executive is attempting to grasp new opportunities such as better customer relationships and improved services. He fails, however, given the combination of a rapidly changing business environment and poor or outdated in-house technology systems.

Figure 1. New Customer Relationships Out of Reach

1.1.3 Enablers

There is a set of enablers for data mining that, when combined with the driving forces discussed above, substantially increases the momentum toward a revised approach to business decision making:

• Data flood
  Forty years of information technology have led to the storage of enormous amounts of data (measured in gigabytes and terabytes) on computer systems. A typical business trip today generates an automatic electronic audit trail of a traveler's habits and preferences in airline travel, car hire, credit card usage, reading material, mobile phone services, and perhaps Web sites. In addition, the increasing availability of demographic and psychographic data from syndicated providers, such as A.C. Nielsen and Acxiom in the United States, has provided data miners with a useful data source. The availability of such data is particularly important given the focus in data mining on consumer behavior, which is often driven by preferences and choices that are not visible in a single organization's database.

• Growth of data warehousing
  The growth of data warehousing in organizations has led to a ready supply of the basic raw material for data mining: clean and well-documented databases. Early adopters of the warehousing approach are now poised to further capitalize on their investment. See "The Data Warehouse Connection" for a detailed discussion of the integration of data warehouse and data mining approaches.

• New information technology solutions
  More cost-effective IT solutions in terms of storage and processing ability have made large-scale data mining projects possible. This is particularly true of parallel technologies, as many of the data mining algorithms are parallel by nature. Furthermore, increasingly affordable desktop power has enabled the emergence of sophisticated visualization packages, which are a key weapon in the data mining armory.

• New research in machine learning
  New algorithms from research centers and universities are being pressed into commercial service more quickly than ever. Emphasis on commercial applications has focused attention on better and more scalable algorithms, which are beginning to come to market through commercial products. This movement is supported by increasing contact and joint ventures between research centers and commercial industries around the world.

The net effect of the changed business environment is that decision making has become much more complicated, problems have become more complex, and the decision-making process less structured. Decision makers today need a set of strategies and tools to address these fundamental changes.

1.2 What Is Data Mining?

It is difficult to make definitive statements about an evolving area, and surely data mining is an area in very quick evolution. However, we need a framework within which to position and better understand the subject. Figure 2 shows a general positioning of the components in a data mining environment.

Figure 2. Data Mining Positioning

Although there is no one single definition of data mining that would meet with universal approval, the following definition is generally acceptable:

Data mining is the process of extracting previously unknown, valid, and actionable information from large databases and then using the information to make crucial business decisions.

The highlighted words in the definition lend insight into the essential nature of data mining and help to explain the fundamental differences between it and the traditional approaches to data analysis, such as query and reporting and online analytical processing (OLAP). In essence, data mining is distinguished by the fact that it is aimed at the discovery of information, without a previously formulated hypothesis.

First, the information discovered must have been previously unknown. Although this sounds obvious, the real issue here is that it must be unlikely that the information could have been hypothesized in advance; that is, the data miner is looking for something that is not intuitive or, perhaps, even counterintuitive.
The further away the information is from being obvious, potentially the more value it has. Data mining can uncover information that could not even have been hypothesized with other approaches.

Second, the new information must be valid. This element of the definition relates to the problem of overoptimism in data mining; that is, if data miners look hard enough in a large collection of data, they are bound to find something of interest sooner or later. For example, the potential number of associations between items in customers' shopping baskets rises exponentially with the number of items. The possibility of spurious results applies to all data mining and highlights the constant need for post-data-mining validation and sanity checking.

Third, and most critically, the new information must be actionable, that is, it must be possible to translate it into some business advantage. In the case of the classic example of the retail store manager who, using data mining, discovered that there was a strong association between the sales of diapers and beer on Friday evenings, clearly he could leverage the results of the analysis by placing the beer and diapers closer together in the store or by ensuring that the two items were not discounted at the same time. In many cases, however, the actionable criterion is not so simple. The ability to use the mined data to inform crucial business decisions is another critical environmental condition for successful commercial data mining and underpins data mining's strong association with and applicability to business problems. Needless to say, an organization must have the necessary political will to carry out the action implied by the mining.
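To put a rough number on the exponential growth just mentioned, here is a back-of-envelope sketch (our illustration, not from the original text): among n distinct products there are 2^n subsets in total, so the count of item combinations of two or more products that could conceivably be tested for an association is 2^n minus the empty set minus the n single items.

# Back-of-envelope: candidate item combinations among n products.
# 2**n subsets in total, minus the empty set and the n single items.
def candidate_itemsets(n: int) -> int:
    return 2**n - n - 1

for n in (10, 20, 50):
    print(f"{n} products -> {candidate_itemsets(n):,} candidate combinations")
# 10 -> 1,013; 20 -> 1,048,555; 50 -> roughly 1.1e15

With catalogs of thousands of items, some combinations will look interesting purely by chance, which is exactly why the post-mining validation urged above is needed.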
1.3 Data Mining and Business Intelligence

We use business intelligence as a global term for all the processes, techniques, and tools that support business decision-making based on information technology. The approaches can range from a simple spreadsheet to a major competitive intelligence undertaking. Data mining is an important new component of business intelligence. Figure 3 shows the logical positioning of different business intelligence technologies according to their potential value as a basis for tactical and strategic business decisions.

In general, the value of the information to support decision-making increases from the bottom of the pyramid to the top. A decision based on data in the lower layers, where there are typically millions of data records, will typically affect only a single customer transaction. A decision based on the highly summarized data in the upper layers is much more likely to be about company or department initiatives or even major redirection. Therefore we generally also find different types of users on the different layers. A database administrator works primarily with databases on the data source and data warehouse level, whereas business analysts and executives work primarily on the higher levels of the pyramid.

Note that Figure 3 portrays a logical positioning and not a physical interdependence among the various technology layers. For example, data mining can be based on data warehouses or flat files, and the data presentation can be used outside data mining, of course.

Figure 3. Data Mining and Business Intelligence

1.3.1 Where to from Here?

It is probably a little early to ponder the future of data mining, but some trends on the horizon are already becoming clear. Data mining technology trends are becoming established as we see vendors scramble to position their tools and services within the new data mining paradigm. This scramble will be followed by the inevitable technology shakeout, where some vendors will manage to establish leadership positions in the provision of tools and services and others will simply follow. Doubtless, new data mining algorithms will continue to be developed, but, over time, the technology will begin to dissolve into the general backdrop of database and data management technology. Already, we are seeing the merging of OLAP and multidimensional database analysis (MDA) tools and the introduction of structured query language (SQL) extensions for mining data directly from relational databases.
Commonly the conversion of a single bank account to a joint account indicates marriage, which could lead to future opportunities to sell a mortgage, a loan for a honeymoon vacation, life insurance, a home equity loan, or a loan to cover college fees. By understanding these patterns, marketers can advertise just-in-time to these consumers, thus ensuring that the message is focused and likely to draw a response. In the long run, focusing on long-term customer purchasing patterns provides a full appreciation of the lifetime value of customers, where the strategy is to move away from share of market to share of customer. An average supermarket customer is worth $200,000 over his or her lifetime, and General Motors estimates that the lifetime value of an automobile customer is $400,000, 8 Intelligent Miner Applications Guide which includes car, service, and income on loan financing. Clearly, understanding and cultivating long-term relationships bring commercial benefits. Cross-selling campaigns constitute another application area where data mining is widely used. Cross selling is where a retailer or service provider makes it attractive for customers who buy one product or service to buy an associated product or service. 1.5 Data Mining Techniques Data mining techniques are specific implementations of the algorithms that are used to carry out the data mining operations. Predictive modeling, database segmentation, link analysis, and deviation detection are the four major operations for implementing any of the business applications. We deliberately do not show a fixed, one-to-one link between the business applications and data mining layers, to avoid the suggestion that only certain operations are appropriate for certain applications and vice versa. (On the contrary, truly breakthrough results can sometimes come from the use of nonintuitive approaches to problems.) Nevertheless, certain well-established links between the applications and the corresponding operations do exist. For example, modern target marketing strategies are almost always implemented by means of the database segmentation operation. However, fraud detection could be implemented by any of the four operations, depending on the nature of the problem and input data. Furthermore, the operations are not mutually exclusive. For example, a common approach to customer retention is to segment the database first and then apply predictive modeling to the resultant, more homogeneous segments. Typically the data analyst, perhaps in conjunction with the business analyst, selects the data mining operations to use. Not all algorithms to implement a particular data mining operation are equal, and each has its own strengths and weaknesses. The key message is this: There is rarely one, fool-proof technique for any given operation or application, and the success of the data mining exercise relies critically on the experience and intuition of the data analyst. In the sections that follow we discuss in detail the operations associated with data mining. 1.5.1 Predictive Modeling Predictive modelling is akin to the human learning experience, where we use observations to form a model of the essential, underlying characteristics of some phenomenon. For example, in its early years, a young child observes several different examples of dogs and can then later in life use the essential characteristics of dogs to accurately identify (classify) new animals as dogs. 
This predictive ability is critical in that it helps us to make sound generalizations about the world around us and to fit new information into a general framework. In data mining, we use a predictive model to analyze an existing database to determine some essential characteristics about the data. Of course, the data must include complete, valid observations from which the model can learn how to make accurate predictions. The model must be told the correct answer to some already solved cases before it can start to make up its own mind about Chapter 1. Introduction 9 new observations. When an algorithm works in this way, the approach is called supervised learning. Physically, the model can be a set of IF THEN rules in some proprietary format, a block of SQL, or a segment of C source code. Figure 4 illustrates the predictive modeling approach. Here a service company, for example an insurance company, is interested in understanding the increasing rates of customer attrition. A predictive model has determined that only two variables are of interest: the length of time the client has been with the company (Tenure), and the number of the company′s services that the client uses (Services). The decision tree presents the analysis in an intuitive way. Clearly, those customers who have been with the company less than 2.5 years and use only one or two services are the most likely to leave. Figure 4. Predictive Modeling Models are developed in two phases: training and testing. Training refers to building a new model by using historical data, and testing refers to trying out the model on new, previously unseen data to determine its accuracy and physical performance characteristics. Training is typically done on a large proportion of the total data available, whereas testing is done on some small percentage of the data that has been held out exclusively for this purpose. The predictive modeling approach has broad applicability across many industries. Typical business applications that it supports are customer retention management, credit approval, cross selling, and target marketing. There are two specializations of predictive modeling: classification and value prediction. Although both have the same basic objective, namely, to make an educated guess about some variable of interest, they can be distinguished by the nature of the variable being predicted. With classification, a predictive model is used to establish a specific class for each record in a database. The class must be one from a finite set of possible, predetermined class values. The insurance example in Figure 4 is a case in point. The variable of interest is the class of customer, and it has two possible values: STAY and LEAVE. 10 Intelligent Miner Applications Guide With value prediction, a predictive model is used to estimate a continuous numeric value that is associated with a database record. For example, a car retailer may want to predict the lifetime value of a new customer. A mining run on the historical data of present long-standing clients, including some agreed-upon measure of their financial worth to date, produces a model that can estimate the likely lifetime value of new customers. A specialization of value prediction is scoring , where the variable to be predicted is a probability or propensity. Probability and propensity are similar in that they are both indicators of likelihood. Both use an ordinal scale, that is, the higher the number, the more likely it is that the predicted event will occur. 
Typical applications are the prediction of the likelihood of fraud or the probability that a customer will respond to a promotional mailing. 1.5.2 Database Segmentation The goal of database segmentation is to partition a database into segments of similar records, that is, records that share a number of properties and so are considered to be homogeneous. In some literature the words segmentation and clustering are used interchangeably. Here, we use segmentation to describe the data mining operation, and segments or clusters to describe the resulting groups of data records. By definition, two records in different segments are different in some way. The segments should have high internal (within segment) homogeneity and high external (between segment) heterogeneity. Database segmentation is typically done to discover homogeneous subpopulations in a customer database to improve the accuracy of the profiles. A subpopulation, which might be ″wealthy, older, males″ or ″urban, professional females,″ can be targeted for specialized treatment. Equally, as databases grow and are populated with diverse types of data, it is often necessary to partition them into collections of related records to obtain a summary of each database or before performing a data mining operation such as predictive modeling. Figure 5 on page 12 shows a scatterplot of income and age from a sample population. The population has been segmented into clusters (indicated by circles) that represent significant subpopulations within the database. For example, one cluster might be labeled ″young, well-educated professionals″ and another, ″older, highly paid managers.″ The grid lines and shaded sectors on the plot illustrate the comparative inefficiency of the traditional, slice-and-dice approach to the problem of database segmentation. The overlaid areas do not account for the truly homogeneous clusters because they either miss many of the cluster members or take in extraneous cluster members—which will skew the results. In contrast, the segmentation algorithm can segment a database without any prompting from the user about the type of segments or even the number of segments it is expected to find in the database. Thus, any element of human bias or intuition is removed, and the true discovery nature of the mining can be leveraged. When an algorithm works in this way, the approach is called unsupervised learning. Chapter 1. Introduction 11 Figure 5. Database Segmentation Database segmentation can be accomplished by using either demographic or neural clustering methods. The methods are distinguished by: • The data types of the input attributes that are allowed • The way in which they calculate the distance between records (that is, the measure of similarity or difference between the records, which is the essence of the segmentation operation) • The way in which they organize the resulting segments for analysis Demographic clustering methods operate primarily on records with categoric variables. They use a distance measurement technique based on the voting principle called condorect, and the resulting segments are not prearranged on output in any particular hierarchy. Neural clustering methods are built on neural networks, typically by using Kohonen feature maps. Neural networks accept only numeric input, but categorical input is possible by first transforming the input variables into quantitative variables. 
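To make the "set of IF-THEN rules" rendering concrete, here is a minimal sketch (ours, not output generated by the product) of the Figure 4 decision tree as executable logic. The 2.5-year and two-service thresholds come from the example above; the function and field names are invented for illustration.

# Minimal sketch of the Figure 4 decision tree rendered as IF-THEN rules.
# Thresholds (2.5 years tenure, 2 services) come from the text; the names
# and sample records are invented.
def classify_customer(tenure_years: float, num_services: int) -> str:
    """Predict the attrition class for one customer record."""
    if tenure_years < 2.5 and num_services <= 2:
        return "LEAVE"  # short tenure, few services: highest attrition risk
    return "STAY"

# Score a small batch of (tenure, services) records.
for tenure, services in [(1.0, 1), (4.0, 1), (2.0, 3), (0.5, 2)]:
    print(tenure, services, classify_customer(tenure, services))

In practice such rules are learned from the training portion of the data and then validated on the held-out test portion described above.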
1.5.2 Database Segmentation

The goal of database segmentation is to partition a database into segments of similar records, that is, records that share a number of properties and so are considered to be homogeneous. In some literature the words segmentation and clustering are used interchangeably. Here, we use segmentation to describe the data mining operation, and segments or clusters to describe the resulting groups of data records. By definition, two records in different segments are different in some way. The segments should have high internal (within segment) homogeneity and high external (between segment) heterogeneity.

Database segmentation is typically done to discover homogeneous subpopulations in a customer database to improve the accuracy of the profiles. A subpopulation, which might be "wealthy, older males" or "urban, professional females," can be targeted for specialized treatment. Equally, as databases grow and are populated with diverse types of data, it is often necessary to partition them into collections of related records to obtain a summary of each database or before performing a data mining operation such as predictive modeling.

Figure 5 shows a scatterplot of income and age from a sample population. The population has been segmented into clusters (indicated by circles) that represent significant subpopulations within the database. For example, one cluster might be labeled "young, well-educated professionals" and another, "older, highly paid managers." The grid lines and shaded sectors on the plot illustrate the comparative inefficiency of the traditional, slice-and-dice approach to the problem of database segmentation. The overlaid areas do not account for the truly homogeneous clusters, because they either miss many of the cluster members or take in extraneous cluster members, which will skew the results. In contrast, the segmentation algorithm can segment a database without any prompting from the user about the type of segments or even the number of segments it is expected to find in the database. Thus, any element of human bias or intuition is removed, and the true discovery nature of the mining can be leveraged. When an algorithm works in this way, the approach is called unsupervised learning.

Figure 5. Database Segmentation

Database segmentation can be accomplished by using either demographic or neural clustering methods. The methods are distinguished by:

• The data types of the input attributes that are allowed
• The way in which they calculate the distance between records (that is, the measure of similarity or difference between the records, which is the essence of the segmentation operation)
• The way in which they organize the resulting segments for analysis

Demographic clustering methods operate primarily on records with categorical variables. They use a distance measurement technique based on the voting principle called Condorcet, and the resulting segments are not prearranged on output in any particular hierarchy. Neural clustering methods are built on neural networks, typically by using Kohonen feature maps. Neural networks accept only numeric input, but categorical input is possible by first transforming the input variables into quantitative variables. The distance measurement technique is based on Euclidean distance, and the resulting segments are arranged in a hierarchy where the most similar segments are placed closest together.

Segmentation differs from other data mining techniques in that its objective is generally far less precise than the objectives of predictive modeling or link analysis. As a result, segmentation algorithms are sensitive to redundant and irrelevant features. This sensitivity can be alleviated by directing the segmentation algorithm to ignore a subset of the attributes that describe each instance or by assigning a weight factor to each variable. Segmentation supports such business applications as customer profiling or target marketing, cross selling, and customer retention. Clearly, this operation has broad, cross-industry applicability.
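The following toy sketch (our illustration, not the Intelligent Miner's clustering algorithm) shows the Euclidean-distance idea on the two axes of Figure 5: each record is assigned to the nearest cluster center. The centers, records, and scaling factor are invented; a real segmentation run would discover the centers from the data, for example with a k-means-style or Kohonen-map method.

import math

# Toy distance-based segmentation on the two axes of Figure 5 (income, age).
# Cluster centers and records are invented for the sketch.
centers = {
    "young professionals": (55_000, 30),
    "older managers":      (95_000, 52),
}

def nearest_cluster(income: float, age: float) -> str:
    """Assign a record to the cluster center at the smallest Euclidean
    distance. Attributes should be scaled comparably; here income is
    divided by 1,000 so it does not swamp the age axis."""
    def dist(center):
        ci, ca = center
        return math.hypot((income - ci) / 1_000, age - ca)
    return min(centers, key=lambda name: dist(centers[name]))

print(nearest_cluster(60_000, 28))   # -> young professionals
print(nearest_cluster(90_000, 55))   # -> older managers

The scaling step in the comment hints at why data preparation (see 1.6.4) matters so much for segmentation: unscaled or irrelevant variables distort the distance measure.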
1.5.3 Link Analysis

In contrast to the predictive modeling and database segmentation operations, which aim to characterize the contents of the database as a whole, the link analysis operation seeks to establish links (associations) between individual records, or sets of records, in the database. A classic application of this operation is associations discovery, that is, discovering the associations between the products or services that customers tend to purchase together or in a sequence over time. Other examples of business applications that link analysis supports are cross selling, target marketing, and stock price movement.

There are three specializations of link analysis: associations discovery, sequential pattern discovery, and similar time sequence discovery. The differences among the three are best illustrated by some examples.

If we define a transaction as a set of goods purchased in one visit to a shop, associations discovery can be used to analyze the goods purchased within the transaction to reveal hidden affinities among the products, that is, which products tend to sell well together. This type of analysis is called market basket analysis (MBA) or product affinity analysis.

Sequential pattern discovery is used to identify associations across related purchase transactions over time that reveal information about the sequence in which consumers purchase goods and services. It aims to understand long-term customer buying behavior and thus leverage this new information through timely promotions.

Similar time sequence discovery, the discovery of links between two sets of data that are time dependent, is based on the degree of similarity between the patterns that both time series demonstrate. Retailers would use this approach when they want to see whether a product with a particular pattern of sales over time matches the sales curve of other products, even if the pattern match is lagging some time behind. Figure 6 shows an example of three apparently unrelated patterns that could represent sales histories or even stock movements over time. At first glance the graphs appear not to be related in any significant way. However, on closer examination, definite patterns can be identified, which, when translated into business terms, can be exploited for commercial gain.

Figure 6. Pattern Matching
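As a concrete miniature of market basket analysis (our sketch, with invented transactions, not the product's associations function), the following code counts how often product pairs occur together and reports support and confidence, two commonly used measures for ranking discovered associations.

from itertools import combinations
from collections import Counter

# Toy market basket analysis: support and confidence for product pairs.
# Transactions are invented; a real run would read them from a database.
transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "formula"},
    {"beer", "chips"},
]

item_count = Counter()
pair_count = Counter()
for basket in transactions:
    item_count.update(basket)
    pair_count.update(frozenset(p) for p in combinations(sorted(basket), 2))

n = len(transactions)
for pair, cnt in pair_count.most_common(3):
    a, b = sorted(pair)
    support = cnt / n                 # share of all baskets containing both
    confidence = cnt / item_count[a]  # P(b in basket | a in basket)
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")

Real associations algorithms avoid enumerating every pair explicitly and prune the exponential candidate space using minimum support thresholds, which is what makes them practical on large transaction databases.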
• Data mining

Data mining involves the execution of the various data mining algorithms against the prepared data sets. Several (tens to hundreds of) mining runs are completed for each data mining project. The effects of algorithm parameters and data transformations are systematically evaluated.

• Results analysis

Once a data model has been created and tested, its performance is analyzed. The analysis includes a description of all key variables and findings that the model permits. All modeling assumptions are outlined, and implementation issues are presented.

1.6.5 Business Implementation Design

Business implementation design involves designing the implementation of the data mining results, with the goal of meeting the defined business requirements. The design should support quality control, tracking of business results, and the ability to prove the causal effect of the data mining result. The design must also take into account any business implementation issues that are not part of the data mining project. The business implementation design is more experimental than fixed.

1.6.6 Business Implementation

The business implementation is the execution of the experimental design.

1.6.7 Results Tracking

If required by the client, preliminary business results can be tracked against the expected performance to ensure the success of the business implementation. Preliminary results can be used to modify the current business activity if warranted.

1.6.8 Final Business Result Determination

At the conclusion of any business activity, a complete analysis of the profitability of the business implementation is performed. The performance of the model is also analyzed against its expected performance.

1.6.9 Business Result Analysis

The final business result should be analyzed to identify general lessons that can be fed into future projects. Many companies are beginning to create learning warehouses to store corporate knowledge.

Chapter 2. Introduction to the Intelligent Miner

The IBM Intelligent Miner for Data (IM in this book) is leading the way in helping customers identify and extract high-value business intelligence from their data assets. The process is one of discovery. Companies are empowered to leverage information hidden within enterprise data and discover associations, patterns, and trends; detect deviations; group and classify information; and develop predictive models.

2.1 History

IBM's award-winning Intelligent Miner was released in 1996. It enables users to mine structured data stored in conventional databases or flat files. Customers and partners have successfully deployed its mining algorithms to address such business areas as market analysis, fraud and abuse, and customer relationship management.

2.2 Intended Customers

The Intelligent Miner offerings are intended for use by data analysts and business technologists in areas such as marketing, finance, product management, and customer relationship management. In addition, the text mining technologies have applicability to a wide range of users who regularly review or research documents, for example, patent attorneys, corporate librarians, public relations teams, researchers, and students.

2.3 What Is the Intelligent Miner?

The IBM Intelligent Miner is a suite of statistical, processing, and mining functions that you can use to analyze large databases.
It also provides visualization tools for viewing and interpreting mining results. The server software runs on AIX, AS/400, OS/390, and Sun Solaris operating systems. AIX, OS/2, and Windows operating systems can be used for the clients. Some of the features provided by the Intelligent Miner include:

• Extension of the associations, classification, clustering, and prediction functions
• Neural prediction
• Statistical functions
• Export and import of mining bases across operating systems
• Exploitation of DB2 Parallel Edition and DB2 Universal Database Enterprise Extended Edition
• Repeatable sequences
• API for all server platforms

The Intelligent Miner provides a complete graphical user interface with TaskGuides that lead you through the steps of creating the different Intelligent Miner objects. General help for each TaskGuide provides additional information, examples, and valid values for the controls on each page.

In the sections that follow we introduce the data mining technology and the data mining process of the Intelligent Miner. We also explain in general the statistical, processing, and mining functions that the Intelligent Miner provides.

2.4 Data Mining with the Intelligent Miner

Data mining is the process of discovering valid, previously unknown, and ultimately comprehensible information from large stores of data. It can be used to extract information to form a prediction or classification model, or to identify similarities between database records. The resulting information can help you make more informed decisions. The Intelligent Miner helps organizations perform data mining tasks. For example, a retail store might use the Intelligent Miner to identify groups of customers that are most likely to respond to new products and services or to identify new opportunities for cross selling. An insurance company might use the Intelligent Miner with claims data to isolate likely fraud indicators.

2.5 Overview of the Intelligent Miner Components

In this section we provide a high-level overview of the product architecture. See the Intelligent Miner Application Programming Interface and Utility Reference for more detailed information about the architecture and the APIs for the Intelligent Miner. The Intelligent Miner links the mining and processing functions on the server with the administrative and visualization tools on the client. The client component includes a user interface from which you can invoke the mining and processing functions on an Intelligent Miner server. The results of the mining process can be returned to the client, where you can visualize and analyze them. The client components are available for AIX, OS/2, Windows NT, and Windows 95 operating systems. The server components are available for AIX, OS/390, AS/400, and Sun Solaris systems. They are also available for the RS/6000 SP and exploit parallel mining on multiple processing nodes. You can have client and server components on the same machine.

2.5.1 Intelligent Miner Architecture

Figure 8 on page 21 illustrates the client and server components of the Intelligent Miner and the way they are related to one another:

Figure 8. The Intelligent Miner Architecture

User Interface
The user interface is a program that enables you to define data mining functions in a graphical environment. You can define preferences for the user interface that are stored on the client.
Environment Layer API
The environment layer API is a set of API functions that control the execution of mining runs and results. Sequences of functions and mining operations can be defined and executed from the user interface through the environment layer API.

Data Definition
This feature of the Intelligent Miner provides the ability to collect and prepare the data for the data mining process.

Visualizer
The Intelligent Miner provides a rich set of visualization tools. You can also use other visualization tools.

Data Access
The Intelligent Miner provides access to flat files, database tables, and database views.

Databases and Flat Files
The Intelligent Miner components work directly with data stored in a relational database or in flat files. The data is not copied to a special format. You define input and output data objects that are logical descriptions of the physical data. Therefore the physical location of the data can be changed without affecting objects that use the data; only the logical descriptions must be changed. The change might be as simple as changing a database name.

Processing Library
The processing library provides access to database functions such as bulk load of data and data transformation.

Mining Bases
Mining bases are collections of data mining objects used for a mining objective or business problem. Mining bases are stored on the server, which allows access from different clients.

Mining Kernels
Mining kernels provide the data mining and statistical functions.

Mining Results, Result API, and Export Tools
Mining results are the data resulting from running a mining or statistics function. These components allow you to visualize results at the client. Results can be exported for use by visualization tools.

2.5.2 Intelligent Miner TaskGuides

Data mining in the Intelligent Miner is accomplished through the creation of interrelated objects. The objects are displayed as icons and represent the collection of attributes or settings that define the data or function. Working with the Intelligent Miner graphical user interface is fairly simple because the Intelligent Miner offers TaskGuides. In this section we explain how to use a TaskGuide to create a settings object.

To create a settings object, use the Create menu or click on a settings object icon in the task bar. A TaskGuide opens to guide you through the creation of the object. Each TaskGuide starts with a Welcome page that provides an overview of the type of settings object that you are creating. Each TaskGuide page provides step-by-step instructions for filling in the fields and making selections that define the settings for the object. You can click on a highlighted term to see a short definition of the term. Click the Next button to navigate to the next TaskGuide page. The last page of every TaskGuide summarizes the settings object that you created. Click the Finish button to create the object. Figure 9 on page 23 shows the TaskGuide for creating a data settings object.

Figure 9. The Data TaskGuide

You can have more than one TaskGuide open at a time. Thus you can leave a TaskGuide to create another object that is required to complete the first TaskGuide. For example, while you are in the process of defining a mining function, you might have to define or modify an input data object. You can open a Data TaskGuide to define an input data object, then continue with the Mining TaskGuide.
2.5.3 Mining and Statistics Functions

Mining and statistics settings objects are similar in that they represent analytical functions that are run against data. In both cases, you must indicate which data settings object you want to use. Mining and statistics settings objects produce a results object when run. You can view and analyze the results object with visualization tools. You can also indicate in the settings for these functions that you want to create output data in addition to a results object. The Intelligent Miner has many types of mining and statistics functions:

Mining:
• Associations
• Clustering − demographic
• Clustering − neural
• Sequential patterns
• Time sequence
• Classification − tree
• Classification − neural
• Prediction − Radial-Basis-Function
• Prediction − neural

Statistics:
• Cross-correlation
• Correlation matrixes
• Factor analysis
• Linear regression
• Principal component analysis
• Univariate curve fitting
• Bivariate statistics

2.5.4 Processing Functions

Processing functions are used to make data suitable for mining or analysis. Processing settings objects apply only to database tables and views because they take advantage of the processing capability of the database engine. The Intelligent Miner has many processing functions:

• Aggregate values
• Calculate values
• Clean up input data or output data
• Convert to lowercase or uppercase
• Copy records to file
• Discard records with missing values
• Discretization into quantiles
• Discretization using ranges
• Encode missing values
• Encode nonvalid values
• Filter fields
• Filter records
• Filter records using a value set
• Get random sample
• Group records
• Join data sources
• Map values
• Pivot fields to records
• Run SQL

Processing settings objects always read input from a database and create output data in a database. The only exception is the Copy Records to File function, which copies data to a file. When you create a processing settings object or update an existing one, you can use a data settings object to identify input data or output data. In this way the name of a database table or view is copied to the processing settings object. Subsequent changes to the data settings object have no effect on the processing settings object.

2.5.5 Modes

How results objects are used with the Intelligent Miner depends on the mode in which functions are run. The Intelligent Miner provides the following modes under which to perform the mining process:

Training
In training mode, a mining function builds a model on the basis of the selected input data.

Clustering
In clustering mode, the clustering functions build a model on the basis of the selected input data. Clustering mode is similar to training mode for the predictive algorithms. Clustering mode offers the choice of using background statistics from the input data or an input result.

Test
In test mode, a mining function uses new or the same data with known results to verify that the model created in training mode produces consistent results. Results objects are used for input and created as output.

Application
In application mode, a mining function uses a model created in training mode to predict the specified field for every record in the new input data. The data format must be identical to that used to generate the model.

For more information about how to work with the Intelligent Miner, see Using the Intelligent Miner for Data, SH12-6325-01, the documentation shipped with the product.
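The training/test/application split maps naturally onto the train, evaluate, and score phases familiar from other toolkits. The sketch below is a rough analogue in Python, assuming scikit-learn is available; it is not the Intelligent Miner API, and the file and column names are hypothetical.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    customers = pd.read_csv("customers.csv")          # hypothetical input data
    X = customers[["revenue", "tenure", "numprod"]]   # predictor fields
    y = customers["responded"]                        # field to predict

    # "Training mode" and "test mode": build the model, then verify it on
    # held-out data with known results.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

    # "Application mode": score new records whose format matches the training data.
    new_customers = pd.read_csv("new_customers.csv")
    new_customers["predicted"] = model.predict(
        new_customers[["revenue", "tenure", "numprod"]]
    )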
Chapter 3. Case Study Framework

Customer Relationship Management (CRM) is a key focus area today in marketing departments in many different industries, including finance, telecommunications, utilities, and insurance. Businesses in these industries have changed or are changing their marketing focus from a product-centric view to a customer-centric view. There are several reasons for this change in focus: increased competition for nongrowing markets, government deregulation, a technology revolution enabling the consolidation of corporate data and access to new data sources, and a growing awareness that the primary assets of a business are its customers.

3.1 Customer Relationship Management

CRM is a methodology used to market to customers. CRM's key features include customer profitability, customer lifetime value, and customer loyalty. In managing their customers, businesses recognize that all customers are not created equal and that they should focus their marketing efforts on retaining their best customers, increasing the profitability of their high-potential customers, spending fewer marketing dollars on their low-potential customers, and acquiring new high-potential customers at a lower cost. A customer segmentation based on key customer characteristics is central to CRM and is used to derive strategic marketing campaigns. A consolidated customer view, enabled through the process of data warehousing, permits businesses to determine the current and potential value of customers.

A business can associate customer purchase behaviors with those customers' value to the shareholders. By understanding the association between transaction behavior and shareholder value, marketers can influence customers to change their purchase behavior in ways profitable to the organization. By further understanding the complete view of its customers, including demographic, geodemographic, and psychographic profiles, a business can do more than simply influence behavioral change through the use of customer rewards. By understanding the needs of customers, as exhibited through their purchase behavior, marketers can use the customer profile information to better serve these customers by targeting them for products and services that they are likely to purchase. Increased understanding of their customers also allows marketers to communicate relevant messages through customer-preferred channels such as direct mail or phone campaigns. Effectively serving the needs of the customer requires less incentive to change customer behavior. Increased targeting of customers, focusing on meeting strategic campaign initiatives for smaller customer segments, substantially reduces the cost of marketing and can increase its effectiveness.

Strategic campaign initiatives can be derived by creating customer segmentation models. Several different strategic initiatives can be applied to the different customer segments. Businesses have realized that a minority of customers, 10%-25%, contribute the lion's share, 40%-80%, of the bottom line. A retention strategy is the primary initiative for these "best" customers. As many as five average customers are required to replace a "best" customer. With the high cost of customer acquisition, businesses have a strong business case to invest heavily in retaining their "best" customers and best potential customers.
Loyal customers increase in value over time; they spend more over time, consolidate their purchases, and refer new customers. Another important customer segment to consider is the one containing customers with a high potential value. In addition to retention, high-potential customers are candidates for cross-selling and up-selling campaigns. Additional products and services that can be marketed to this segment can be identified by analyzing customer purchase behaviors. By profiling and understanding the characteristics of best customers, a business can effectively target customer lists to acquire more profitable customers.

In addition to changing the way in which organizations market to their customers, a change is occurring in the way marketing campaigns are implemented. The status quo in marketing science is the implementation of marketing campaigns in a series of waves or tactical campaigns. In this type of marketing, groups of customers are targeted for a specific promotion. The customers' buying behavior initiates a promotional period during which customers can respond to the promotional offer. At the end of such a campaign, the results are determined and then fed back into future waves of marketing activity.

A new method of continuous marketing has recently appeared. With multiple customer interaction channels, including the Internet, inbound telephone calls, outbound telephone calls, direct sales, and direct mail, organizations with the capability to provide CRM data to operational customer service applications can continuously market to single customers. For instance, if a customer segment definition and its sensitivity to certain product or service offers are made available to customer service agents during inbound telephone calls, the customer service agent can be directed to deliver the appropriate marketing message to the customer interactively. Furthermore, if organizations had the capability to update a customer segment and other purchase behavior models in real time, they would be able to conduct continuous interactive marketing campaigns. Organizations must track all customer interactions and provide timely and accurate customer behavioral information to the marketer to execute such a campaign. In the example above, customer service agents must have real-time information to know that the customer has not already purchased the products they are marketing. Failure to have real-time information in this instance can have a detrimental effect on customer service.

Continuous marketing is also driven by the technology revolution. The technical challenge in continuous marketing is the ability to access real-time information. To deliver real-time information, an organization must be able to transform its customer purchase behavior into decision support information in real time. With wave marketing campaigns, it can take an organization several weeks or months to provide the decision support information that drives the marketing strategy. Organizations can no longer wait for their knowledge workers to spend weeks creating models and decision support analysis to support marketing campaigns. Automated models and expert systems will create the decision support information required by continuous marketing. Data mining technology will play an ever-increasing role in providing decision support information to continuous marketing campaigns. In summary, technology plays a fundamental role in CRM and continuous interactive marketing (CIM).
Data warehousing permits the consolidation of an organization's operational data. Data mining is used to create customer segments and to identify profitable marketing opportunities. Campaign management tools are used to implement and manage the design, execution, tracking, and postanalysis of marketing campaigns. Technology is the key enabler in the implementation of CIM. This case study guide illustrates the use of data warehousing and the Intelligent Miner to support CRM and CIM.

3.2 Case Studies

The business used for the case studies presented in this book is a retail bank, which is a client of Loyalty Consulting. Throughout the book this retail bank is referred to as the "Bank". Loyalty Consulting, a subsidiary of The Loyalty Group, grew out of the experience of building and outsourcing the data warehouse for the Air Miles Reward Program (AMRP). By maintaining the data warehouse and providing analytical services to the AMRP and sponsor companies, Loyalty Consulting gained substantial experience in the application of technology to real business requirements. It was one of IBM's original partners for the Intelligent Miner data mining product and has been applying the technology for more than two years. Loyalty Consulting offers services that can be broadly categorized as:

• Database and data warehouse consulting
• Data mining or knowledge discovery in databases
• Geographic information systems (GIS)

3.3 Strategic Customer Segmentation

In meeting its database marketing needs, the Bank currently uses standard analytical techniques. The Bank's business analysts use recency frequency monetary (RFM) analysis, OLAP tools, and linear statistical methods to mine the data for marketing opportunities and to analyze the success of the various marketing initiatives undertaken by various lines of business. The Bank recognizes the opportunity to increase the efficiency of its database marketing activities and improve its knowledge of its customers through advanced data mining technology. The case studies presented in this book are driven by the Bank's business requirement to use data mining to identify new business opportunities and/or to reduce the cost of marketing campaigns to existing customers. In this section we describe a framework for customer relationship management. We illustrate the framework by using four data mining case studies, which we present in 3.4, "Case Studies" on page 30.

Customer segmentation is one of the most important data mining methods in marketing or CRM. Segmentation using behavioral data creates strategic business initiatives. The customer purchase data that a company collects forms the basis of the behavioral data. It is important to create customer segments by using the variables that determine customer profitability. These variables typically include current customer profitability, some measure of risk, and/or a measure of the lifetime value of a customer. Creating customer segments based on variables that determine customer profitability will highlight obvious marketing opportunities. For example, a segment of high-profit, high-value, and low-risk customers is the segment a company wants to keep. This segment typically represents the 10% to 20% of customers who create 50% to 80% of a company's profits. The strategic initiative for this group is obviously retention. A company would not want to lose these customers.
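The profit-concentration claim above (a small share of customers producing most of the profit) is easy to check on real data. The sketch below, assuming a pandas DataFrame with hypothetical customer_id and profit columns, computes the share of total profit contributed by the top decile of customers.

    import pandas as pd

    # Hypothetical customer-level profit data.
    df = pd.read_csv("customer_profit.csv")  # columns: customer_id, profit

    ranked = df.sort_values("profit", ascending=False).reset_index(drop=True)
    top_decile = ranked.head(int(len(ranked) * 0.10))

    share = top_decile["profit"].sum() / ranked["profit"].sum()
    print(f"Top 10% of customers contribute {share:.0%} of total profit")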
A low-profit, high-value, and low-risk customer segment is also attractive to a company. The obvious goal of the company for this segment is to increase its profitability. Cross-selling (selling new products) and up-selling (selling more of what customers currently buy) to this segment are the marketing initiatives of choice.

Within the behavioral segments, demographic clusters and/or segments are created. Customer demographic data does not typically correlate with customer profitability, which is why it should not be mixed with the behavioral data used to build the behavioral segments. Creating demographic segments allows the marketer to create relevant advertising, select the appropriate marketing channel, and identify campaigns within the strategic customer segments defined above.

Let us say a bank has both a high-profit and a low-profit behavioral customer segment that have similar demographic subsegments. The profile of the subsegment is young, high-income professionals with families. The marketer would want to ask the following question: Why do these similar demographic segments behave differently, and how do I change the low-profit group into a high-profit group? It is difficult, if not impossible, to answer the why, but data mining provides an answer to the how. Affinity analysis discovers that the high-profit segment of young wealthy professionals has a distinct product pattern: mortgages, mutual funds, and credit cards. Using affinity analysis on the low-profit segment reveals that two of its product patterns are the same as those of the high-profit segment: mutual funds and credit cards. The marketing campaign to increase the profitability of the low-profit segment would thus be to market mortgages to it. (A sketch of this product-pattern comparison appears at the end of this section.)

In summary, behavioral segmentation helps derive strategic marketing initiatives by using the variables that determine customer profitability. Demographic segmentation within the behavioral segments defines tactical marketing campaigns and the appropriate marketing channel and advertising for the campaigns. It is then possible to target those customers most likely to exhibit the desired behavior (in the above example, those customers most likely to purchase a mortgage) by creating predictive models. See Figure 10.

Figure 10. Customer Segmentation Model
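As a rough illustration of the affinity comparison just described, the sketch below contrasts the frequent products of two segments and reports what the high-profit segment buys that the low-profit segment does not. The segment data and the 20% frequency threshold are hypothetical, and this is simple set arithmetic, not the Intelligent Miner associations algorithm.

    from collections import Counter

    def frequent_products(baskets, min_share=0.20):
        """Products appearing in at least min_share of a segment's baskets."""
        counts = Counter(p for basket in baskets for p in set(basket))
        return {p for p, c in counts.items() if c / len(baskets) >= min_share}

    high_profit = [["mortgage", "mutual_fund", "credit_card"],
                   ["mortgage", "credit_card"],
                   ["mutual_fund", "credit_card"]]
    low_profit = [["mutual_fund", "credit_card"],
                  ["credit_card"],
                  ["mutual_fund"]]

    gap = frequent_products(high_profit) - frequent_products(low_profit)
    print("Cross-sell candidates for the low-profit segment:", gap)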
3.4 Case Studies

In this book we present the following four case studies that highlight the role IBM's Intelligent Miner and data mining technology play in supporting a CRM system:

• Customer Segmentation

The first case study creates a customer segmentation that is used in the other case studies. Using shareholder value variables to create the segmentation drives strategic initiatives for the customer segments discovered. Two of the Intelligent Miner's clustering techniques and a decision tree are used to build segmentation models.

• Cross-Selling Opportunity Identification

Identifying a cross-selling opportunity that is actionable and profitable using the Intelligent Miner's product associations algorithms is the topic of this case study. This study is based on the customer segment from the first case study whose strategic initiative is to increase its profitability.

• Target Marketing Model to Support a Cross-Selling Campaign

In this case study, we build a predictive model to target those customers likely to buy the product identified as a cross-selling opportunity in the previous case study. Several algorithms from the Intelligent Miner are used. The models built with the Intelligent Miner decision tree, radial basis function (RBF) regression, and neural network are compared.

• Attrition Model to Improve Customer Retention

In this case study, profitable customer segments are selected from the segmentation model built in the first case study. An attrition model is built to identify those profitable customers likely to defect. Several algorithms from the Intelligent Miner are compared. In addition to the predictive modeling algorithms used in the previous case study, a time-series neural network is utilized.

The four case studies represent four major components of a CRM program that an organization can implement. The strengths of the Intelligent Miner's algorithms and visualization tools and its ability to work on a wide variety of business problems are illustrated through the case study results. Figure 10 on page 30 shows the customer segmentation model used in the case studies in this book.

Chapter 4. Customer Segmentation

This case study creates a customer segmentation that will be used in the other case studies. Using shareholder value variables to create the segmentation will drive strategic initiatives for the customer segments discovered. Two of the Intelligent Miner's clustering techniques and a decision tree are used to build the segmentation models.

4.1 Executive Summary

The Bank wanted to create an advanced segmentation of its customer base in order to further understand customer behavior. The segmentation was to be compared with the existing segmentation that was created through RFM analysis. A segmentation framework as described in 3.3, "Strategic Customer Segmentation" on page 29, was to be created to meet these key business requirements:

• Define "shareholder value" for the corporation
• Define strategic objectives for customer management
• Understand customer behavior in terms of shareholder value
• Understand the interaction between customer transaction behavior and shareholder value

Shareholder value was a well-understood concept for the Bank. However, the specific variables that make up shareholder value had not previously been considered in detail. The selection or creation of these variables was a primary requirement. Having defined the metrics or variables used to approximate shareholder value, the Bank wanted to understand how the customer base was segmented by shareholder value. An analysis of customer segments defined by shareholder value was to be used to derive strategic initiatives for managing the shareholder value of each of the segments. Further segmentation using detailed customer transaction behavior, defined by RFM variables by product over time, would provide insight into which customer behaviors were related to positive and negative shareholder value. Understanding the relationship between customer behavior and shareholder value would drive the creation of tactical marketing initiatives that could be executed to meet the various customer segment strategies.

4.2 Business Requirements

The Bank wanted to create an advanced segmentation of its customer base to further understand customer behavior. This segmentation was to be compared to the existing segmentation that was created with RFM analysis. A segmentation framework as described in 3.3, "Strategic Customer Segmentation" on page 29, was to be created to meet the following key business requirements:
• Define "shareholder value" for the corporation
• Define strategic objectives for customer management
• Understand customer behavior in terms of shareholder value
• Understand the interaction between customer transaction behavior and shareholder value

The Bank's data warehouse was used as the data source for this case study. The Bank had spent considerable effort cleaning and transforming the data prior to loading it into the warehouse. Therefore, some of the data preparation activities that are usually time consuming were not required in this case study. Customer segments were to be determined using the following shareholder value variables, which were identified by the Bank's executives as key drivers of their business:

• Number of products used by the customer over a lifetime
• Number of products used by the customer in the last 12 months
• Revenue contribution of the customer over a lifetime
• Revenue contribution of the customer over the last 12 months
• Most recent customer credit score
• Customer tenure in months
• Ratio of (number of products/tenure)
• Ratio of (revenue/tenure)
• Recency

A review of the clustering process was presented in sufficient detail so that technical analysts could use their own data to reproduce a clustering project. The results showed that the existing segmentation scheme was valid but could use some additional refinement. The key drivers of profitability were verified. A highly profitable customer segment was identified; it represented 35% of the corporate profit with only 9% of customers. Some cross-selling opportunities were quantified; they represented a potential profit increase of 18% over the entire customer base. The Bank executives decided that there was potential value in data mining and started several data mining projects, including target marketing, opportunity identification, and further segmentation work.

4.3 Data Mining Process

In this section we outline the data mining process that was used to meet the business requirements of the Bank (see 4.2, "Business Requirements" on page 33). Figure 11 on page 35 highlights the major steps in the process:

1. Shareholder value definition
2. Data selection
3. Data preparation, including discretization
4. Demographic clustering
5. Neural clustering
6. Cluster result analysis
7. Classification of clusters with a decision tree
8. Comparison of results
9. Selection of clusters and/or segments for further analysis

We describe the first four topics in this section. We discuss topics 5 through 8 in 4.4, "Data Mining Results" on page 50.

Figure 11. Data Mining Process: Customer Segmentation

A high-level tabulated comparison of the demographic clustering algorithm and neural clustering results is made. We chose to present the demographic clustering results in detail because they are more interesting than the neural clustering results. This difference is not a general observation; it is true for this particular case study. In our experience both algorithms produce good results, usually one slightly better than the other, depending on the business problem and, more importantly, the characteristics of the data being mined.

4.3.1 Data Selection

We use the data model in Figure 12 on page 37 as the primary source of data. Approximately 50,000 customers and their associated transaction data for a 12-month period were selected as a representative sample for the study. (We used this data in all of our case studies.)
The transaction data used contained transactions across all possible products. We selected the complete transaction data because we wanted to develop an understanding of the different customer transaction behaviors. All customer transaction behaviors are contained entirely within the transaction and customer tables. The shareholder value variables we defined for this case study included revenue, tenure, number of products purchased over the customer tenure, number of products purchased over the last 12 months, customer credit score, and recency (in months) of the last transaction. These variables form the core of the top layer of the hierarchical clustering model that we develop in this case study (see Figure 10 on page 30). We had to calculate all of these variables from the raw transaction data. The selection of these variables was driven entirely by the business requirement: these are the variables the business had decided to use in managing its customer base.

The profitability data in the data model in Figure 12 on page 37 was contained in the transactions table. Each transaction record contained a revenue figure that could be used to estimate profitability by applying a gross profit margin or interest rate spread. More sophisticated profit models could be developed but were outside the scope of this work.¹ The other shareholder value variables were calculated by using aggregate functions on the transaction data while joining the data to each customer record.

¹ More sophisticated profit models may include the transaction cost as well as the transaction gross revenue. The cost of marketing to the customer can be determined from the promotion history table shown in Figure 12 on page 37. Other costs can be allocated by the customer's transaction intensity or by some other variable relevant to the business problem at hand.

Figure 12. Customer Transaction Data Model

The Bank has divided its products into the 14 categories listed below. The category labels in the results are denoted by "cat #____" to protect the Bank's confidentiality.

• Loans
• Mortgages
• Leases
• Credit Card
• Term Deposits
• ATM Card
• Savings Accounts
• Personal Banking
• Internet Banking
• Telephone Banking
• Business Loans
• Business Mortgages
• Business Deposit Accounts
• Business Credit Cards

We created transaction variables for each of the above product categories. For each customer we calculated the recency in months, the revenue by quarter, and the number of transactions by quarter for two consecutive quarters in 1997.
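The aggregate-and-join step described above translates directly into a grouped aggregation. The sketch below, assuming a pandas transactions table with hypothetical file and column names, derives several of the shareholder value variables per customer.

    import pandas as pd

    # Hypothetical transaction-level data: one row per transaction.
    tx = pd.read_csv("transactions.csv",
                     parse_dates=["tx_date"])  # customer_id, product, revenue, tx_date

    as_of = tx["tx_date"].max()
    per_customer = (
        tx.groupby("customer_id")
          .agg(revenue_lifetime=("revenue", "sum"),
               num_products_lifetime=("product", "nunique"),
               last_tx=("tx_date", "max"),
               first_tx=("tx_date", "min"))
          .reset_index()
    )
    # Recency and tenure in (approximate) months.
    per_customer["recency_months"] = (as_of - per_customer["last_tx"]).dt.days / 30.4
    per_customer["tenure_months"] = (as_of - per_customer["first_tx"]).dt.days / 30.4
    per_customer["revenue_per_month"] = (
        per_customer["revenue_lifetime"] / per_customer["tenure_months"].clip(lower=1)
    )

    # Join back to the customer table, as described in the text.
    customers = pd.read_csv("customers.csv").merge(per_customer, on="customer_id")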
4.3.2 Data Preparation

Once the data required for the data mining process has been selected, it must be put into the appropriate format or distribution. Therefore it has to be cleaned and transformed to meet the requirements of the data mining algorithms.

4.3.2.1 Data Cleaning

Very little data cleaning was required for this case study because the data was extracted from the Bank's data warehouse. During the load process for this warehouse, substantial data cleaning occurs to minimize the data preparation required for all analytical activities, including data mining. After we created all the variables on each customer record, we had to clean the data. We profiled the data to determine how many variables had records with missing values, unknown values, invalid values, or valid values. Following are the definitions for the possible field contents:

• Missing value − A record has no value for a particular field.
• Unknown value − A record has a value for a particular field that has no known meaning.
• Invalid value − A record has a value for a particular field that is invalid but whose meaning is known.
• Valid value − A record has a value for a field that is valid.

Data cleaning is the process of assigning valid values to all records with missing, invalid, and unknown values. In this case study only the transaction variables had missing values. (Transaction data is usually very consistent and has no invalid or unknown values.) The missing values resulted from particular customers having no transaction activity for a particular product. We assigned these missing values a value of zero. We assigned a new value to all categorical variables that had records with missing and unknown values. We corrected the invalid values for these variables to valid values.

4.3.2.2 Data Transformation

After we cleaned the data, handled all missing and invalid values, and made the known valid values consistent, we were ready to transform the data to maximize the information content that can be retrieved. For statistical analysis the data transformation phase is critical, as some statistical methodologies require that the data be linearly related to an objective variable, normally distributed, and free of outliers. Artificial intelligence and machine learning methods do not strictly require the data to be normal or linearized, and some methods, like the decision tree, do not even require outliers to be dealt with. This is a major difference between statistical analysis and data mining. The machine learning algorithms can automatically deal with nonlinearity and nonnormal distributions, although the algorithms work better in many cases if these criteria are met. A good statistician with a lot of time can manually linearize, standardize, and remove outliers better than the artificial intelligence and machine learning methods. The challenge is that with millions of data records and thousands of variables, it is not feasible to do this work manually. Also, most analysts are not qualified statisticians, so using automated methods is the only reasonable solution.

After cleaning the original data variables, we created new variables using ratios, differences, and business intuition. We created total transaction variables, which were the sum of the transaction variables over two quarters. We used these totals to create ratio variables. We created time-series variables to capture the time difference in all transaction variables between quarters. Other variables that we calculated on the basis of our knowledge of the business were:

• Number of products purchased by the customer over a lifetime
• Number of products purchased by the customer in the last 12 months
• Revenue contribution of the customer over a lifetime
• Revenue contribution of the customer over the last 12 months
• Most recent customer credit score
• Customer tenure in months
• Ratio of (number of products/tenure)
• Ratio of (revenue/tenure)
• Recency

This last group of variables was designated as the "shareholder value" variables; they were the variables selected by the business to create strategic customer relationship marketing initiatives.
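A small sketch of the variable derivations described above, assuming pandas and hypothetical quarterly column names: totals over two quarters, a ratio variable built from the totals, and quarter-over-quarter difference variables.

    import pandas as pd

    df = pd.read_csv("customer_variables.csv")  # hypothetical cleaned extract

    # Totals over two consecutive quarters.
    df["rev_total"] = df["rev_q1"] + df["rev_q2"]
    df["ntx_total"] = df["ntx_q1"] + df["ntx_q2"]

    # Ratio variable built from the totals (guarding against divide-by-zero).
    df["rev_per_tx"] = df["rev_total"] / df["ntx_total"].where(df["ntx_total"] > 0)

    # Time-series variables: quarter-over-quarter change.
    df["rev_diff"] = df["rev_q2"] - df["rev_q1"]
    df["ntx_diff"] = df["ntx_q2"] - df["ntx_q1"]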
To use the data in the demographic clustering algorithm, we discretized it. Discretization facilitates interpreting the results, for both the neural clustering and demographic clustering algorithms, and takes care of outliers. The following quantiles were calculated for all numeric variables: 10, 25, 50, 75, and 90. The values of the variables at these breakpoints were determined, and the data was divided into six ordinal values. We chose these quantiles for the discretization arbitrarily and found the selection useful.² The quantile breaks were generated in an automated fashion. We then profiled the resulting distributions and manually adjusted them to be unimodal or at least monotonic. We selected the modality and monotonicity criteria for ease of interpretation; in our experience these criteria provide useful results. To improve the clustering results, advanced analysts removed the correlated variables. Factor analysis can be used to create linearly independent components. For easy interpretation of results, the original data can be clustered against the components, and the variables most representative of the components chosen as input to the clustering algorithm. Refer to Figure 13 on page 41 for a view of the original data and to Figure 14 on page 42 for a post-discretized view of the data.

² We give credit for this discretization scheme to Dr. Messatfa from the IBM ECAM lab in Paris, France.

Figure 13 on page 41 shows the variable names taken from the data source, and Figure 14 on page 42 shows the variable names as follows:

• Unchanged variables have the original variable names.
• Changed variables have an underscore added to the end of the variable name.
• New variables appear like unchanged variables.

Figure 13. Original Data Profile

Figure 14. Post-Discretized Data Profile

Some of the key features to note in Figure 13 on page 41 and Figure 14 on page 42 are:

• The original data has missing values treated. (The Bank data warehouse data is cleaned before loading, and thus less cleaning is required.)
• The original data has continuous variables that are extremely skewed.
• The original data has multimodal variables.
• The discretized data is much easier to interpret than the original data.
• Many of the previously skewed distributions are "normal" in shape, which enables the algorithms to obtain accurate results and/or allows the results to be easily interpreted.
• Some of the data in the discretized set is still skewed, indicating that the data may not be useful.

To prepare the data for clustering with the neural clustering algorithm, we standardized some of the continuous variables using a logarithmic transform. See Figure 15.

Figure 15. Post Logarithm Transformed Data Profile

Some key features of the logarithm transformed data are:

• The data is much less skewed.
• Some of the variables are unimodal (LAVGBAL, LRATIO1, LRATIO2, LRATIO3, LTENURE).
• Some variables (LREV12, LREV3) have two peaks because of a large number of records with zero or small values.
• Some variables (LDIFF3, LDIFF3TX, LDIFF6, LDIFF6TX) have three modes or peaks because of the transformation used.

The data in transformed form is much easier to visualize than in its original pre-prepared state. The algorithms should achieve better results using this data and/or produce results that are much easier to interpret.
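A compact sketch of both transformations described above, assuming pandas/numpy and a hypothetical revenue column: quantile breakpoints at the 10th, 25th, 50th, 75th, and 90th percentiles yield six ordinal bins for the demographic clustering input, and a log transform reduces skew for the neural clustering input.

    import numpy as np
    import pandas as pd

    df = pd.read_csv("customer_variables.csv")  # hypothetical prepared extract

    # Six ordinal values from five quantile breakpoints (10/25/50/75/90).
    # Assumes the breakpoints are distinct; heavily tied data needs deduplication.
    breaks = df["revenue"].quantile([0.10, 0.25, 0.50, 0.75, 0.90]).tolist()
    edges = [-np.inf] + breaks + [np.inf]
    df["revenue_"] = pd.cut(df["revenue"], bins=edges, labels=False)  # values 0..5

    # Log transform for the neural clustering input; log1p handles zero revenue.
    df["lrevenue"] = np.log1p(df["revenue"])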
Once the data has been selected, prepared, and transformed, it is possible to run the data mining algorithms.

4.3.3 Data Mining

Figure 16 on page 45 shows the clustering process flow for this case study. We used demographic clustering so that we could use the results to interpret the output from neural clustering. Neural clustering can be difficult to interpret because of the use of continuous data, which is typically skewed or has been logarithm transformed to remove the skew.

Figure 16. Clustering Process Flow

4.3.3.1 Parameter Selection

Referring to Figure 16, you see that the first step in the clustering process, after selecting the data set (the discretized data in this case) and after selecting an algorithm to be run (demographic clustering in this case), is to choose these basic run parameters for the algorithm:

• Maximum number of clusters

This parameter indicates the maximum number of clusters allowed. The algorithm may find fewer. This feature is unique to the Intelligent Miner; most other clustering algorithms require that the number of clusters be specified.

• Maximum number of passes through the data

This parameter indicates how many times the algorithm can read the data. The higher this number and the lower the accuracy criterion (see below), the longer the algorithm will run and the more accurate the result will be. This parameter is a stopping criterion for the algorithm: if the algorithm has not satisfied the accuracy criterion after the maximum number of passes, it stops anyway.

• Accuracy

This number is a stopping criterion for the algorithm. If the change in the Condorcet criterion between data passes is smaller than the accuracy (as a percentage), the algorithm terminates.

• Similarity threshold

This parameter defines the similarity threshold between two values in distance units. The default distance unit is the absolute number. Therefore two values are considered equal if their absolute difference is less than or equal to 0.5.

The neural clustering algorithm has the following parameters:

• Number of rows and number of columns

Multiply the two numbers together to get the maximum number of clusters. The rectangle defined by the number of rows and columns of neural network nodes changes the resulting clusters. Unless you are an advanced user, we recommend choosing the most "square" output grid shape. For example, if you want 9 clusters, choose 3 rows by 3 columns (the default). If you want 12 clusters, choose 4 rows by 3 columns as opposed to 6 rows by 2 columns.

• Number of passes

This parameter indicates the number of passes through the data the algorithm will make to build the neural network.

For the first clustering run, we selected a maximum number of clusters larger than the number we wanted at the end of the project. By allowing more, we let the algorithm choose fewer if that is all that is in the data; if the algorithm comes back with the maximum, we know that there are likely more clusters. The number of clusters chosen is driven by how many clusters the business can manage. In our experience, this number is less than 10 for most companies. For this case study we chose 9 as the maximum number of clusters. For the maximum number of passes, we chose 5 and specified the accuracy as 0.1. We left the similarity threshold at the default value of 0.5. The parameter settings for the number of passes and accuracy were somewhat arbitrary; we wanted a reasonable number of passes through the data to ensure reasonable convergence of the solution. For the initial neural clustering run, we selected a three-row by three-column grid, which allows a maximum of nine clusters. We left the number of passes at the default. The analysis of the results of each run guides the selection of parameters for follow-on runs. The clustering process is highly iterative, as shown in Figure 16 on page 45.
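These run parameters have close analogues in open-source clustering implementations. The sketch below, assuming scikit-learn, shows a comparable setup: k-means with an iteration cap and a convergence tolerance playing the roles of maximum passes and accuracy. Note that k-means always returns exactly n_clusters, unlike demographic clustering, which may find fewer; this is an analogue, not the Intelligent Miner algorithm.

    import pandas as pd
    from sklearn.cluster import KMeans

    X = pd.read_csv("discretized_variables.csv")  # hypothetical clustering input

    km = KMeans(
        n_clusters=9,   # analogue of the maximum number of clusters
        max_iter=5,     # analogue of the maximum number of passes
        tol=1e-3,       # analogue of the accuracy stopping criterion
        n_init=10,
        random_state=0,
    )
    labels = km.fit_predict(X)
    print(pd.Series(labels).value_counts())  # cluster sizes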
For the first run of the demographic clustering algorithm, we left the advanced parameter settings at the default. Because we discretized the data ahead of time, and all the discretized variables had approximately the same range, many of the advanced parameters were not required. These advanced parameter settings allow continuous data to be effectively clustered with the algorithm:

• Distance measure

− Absolute: one unit of absolute difference in the magnitude of two record values for one variable.
− Range: the range (difference between maximum and minimum) of a variable is considered one distance unit.
− Standard deviation: the standard deviation of a variable is considered one distance unit. This setting is only meaningful if the variable is normally distributed.

• Field weighting

− Probability weighting: uses the probability of the occurrence of a variable value to compensate for its contribution to the overall cluster result.
− Information theoretic weighting: uses manually selected weights to compensate for the contribution of a variable to the overall cluster result.

4.3.3.2 Input Field Selection

We selected these input field variables for the first run:

• Number of products purchased by the customer over a lifetime
• Number of products purchased by the customer in the last 12 months
• Revenue contribution of the customer over a lifetime
• Most recent credit score
• Revenue contribution over the last 12 months
• Customer tenure in months
• Ratio of (revenue/tenure): Ratio 1
• Ratio of (number of products/tenure): Ratio 3
• Region
• Recency
• Tenure (number of months since the customer first activated at the bank)

We used the discretized versions of these variables for demographic clustering and the log-transformed continuous versions for neural clustering. As discussed in 4.2, "Business Requirements" on page 33, the first layer of clusters in the CRM framework is created by using shareholder value variables and any other variables the business would like to use to manage its customers. All other discrete and categorical variables, and some interesting continuous variables, were input as supplementary variables to be profiled with the clusters but not used to define them. These supplementary variables can be used to interpret the clusters as well. The ability to add supplementary variables at the outset of clustering is a very useful feature of the Intelligent Miner; it allows the direct interpretation of clusters using other data very quickly and easily.

4.3.3.3 Output Field Selection

The entire data set was output with the cluster information appended to the end of each record. The entire data set was output so that the results of other clustering runs, using both the demographic clustering and neural clustering algorithms, could be directly compared by cross-tabulating the cluster IDs from the various schemes. This is one advantage of the Intelligent Miner: having multiple algorithms allows the output of one algorithm to be used as the input to another. The algorithms used in combination are more powerful than those applied alone.
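Comparing two clustering schemes by cross-tabulating their cluster IDs, as described above, is a one-liner once both IDs have been appended to the records. A minimal sketch, assuming pandas and hypothetical column names:

    import pandas as pd

    scored = pd.read_csv("scored_customers.csv")  # demog_cluster, neural_cluster, ...

    # Rows: demographic cluster IDs; columns: neural cluster IDs.
    # A near-diagonal table (after reordering) indicates the two schemes agree.
    print(pd.crosstab(scored["demog_cluster"], scored["neural_cluster"]))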
4.3.3.4 Results Visualization

The output of the clustering algorithms is an output data set and a visualization. The visual results display the number of clusters, the size of each cluster, the distribution of each variable in each cluster, and the importance of each variable to the definition of each cluster (based on several metrics, including the chi-square test, entropy, and the Condorcet criterion). The result is completely unsatisfactory if there is only one cluster, or if there is one very large cluster (>90%) and several small clusters. This situation will occur if highly skewed continuous variables are used as input or if the modal frequency of some of the discretized variables is very large (>50%-90%). If this situation occurs, we recommend using probability field weighting for the discrete variables and discretization of the continuous variables. The statistics of the input variables can be viewed in the cluster details.

4.3.3.5 Cluster Details Analysis

The cluster details contain some tabulated statistics for the cluster model. The global measures include the Condorcet criterion for the demographic clustering algorithm and the quality for neural clustering. Realistic "good" values for the Condorcet criterion are in the 0.6-0.75 range. Higher values are usually associated with the case of one very large cluster and a number of smaller clusters. "Good" neural cluster quality values are in the 0.5-0.7 range. For the demographic clustering algorithm the details view also shows the Condorcet criterion for each cluster and for each variable, globally and within each cluster, the similarities among all clusters, and the global statistics and statistics within each cluster for each variable. The neural clustering algorithm also shows global statistics and statistics within each cluster for each variable. The details can be used, for example, to assess the quality of the cluster models, assess the contribution of each variable to the model, and compare different cluster models.

4.3.3.6 Cluster Profiling

The next step in the clustering process is to profile the clusters by executing SQL queries. The purpose of profiling is to quantitatively assess the potential business value of each cluster by profiling the aggregate values of the shareholder value variables by cluster. The scientific quality of the clusters should also be profiled. Some of the variables for profiling include:

• Record scores − The Intelligent Miner provides a score on each record in addition to the cluster ID, which is a measure of how well the record fits the cluster model.
• 2nd choice cluster − The Intelligent Miner provides a cluster ID for the second choice cluster to which the record could have been assigned.
• 2nd choice scores − The Intelligent Miner provides the score for how well the record fits the second choice cluster assignment.
• Comparison of methods considering 2nd choice clusters and scores
• Other measures, including entropy, chi-square, and Euclidean distance

4.3.3.7 Cluster Characterization (Qualitative)

Once the cluster algorithm has been run, the next step is to qualitatively characterize the clusters. Cluster characterization can be completed using the results visualization. Each cluster should be considered variable by variable. The differences and similarities among the clusters, the variable distributions by cluster versus the global distribution, the cluster sizes, and the ordering of variables within the cluster by different metrics should all be noted.
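The SQL profiling step reduces to a grouped aggregation of the shareholder value variables by cluster ID. A sketch in pandas (an equivalent SQL GROUP BY would do the same), with hypothetical column names:

    import pandas as pd

    scored = pd.read_csv("scored_customers.csv")

    profile = scored.groupby("cluster_id").agg(
        customers=("customer_id", "count"),
        avg_revenue=("revenue", "mean"),
        avg_tenure=("tenure_months", "mean"),
        avg_score=("record_score", "mean"),  # how well records fit the model
    )
    profile["pct_of_base"] = 100 * profile["customers"] / profile["customers"].sum()
    print(profile.sort_values("avg_revenue", ascending=False))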
4.3.3.8 Cluster Characterization Using a Decision Tree

One of the disadvantages of cluster models is that there are no explicit rules to define each cluster. The model is thus difficult to implement, and there is no clear understanding of how the model assigns cluster IDs; the cluster model tends to be a black box. You can use a decision tree to classify the cluster IDs using all the input data and supplementary data that was used in the clustering algorithms. The decision tree will define rules that classify the records by cluster ID. In many instances, based on our experience, the decision tree produces a very accurate representation of the cluster model (>90% accuracy). If the tree representation is accurate, it is preferable to implement the tree because it provides explicit, easy-to-understand rules for each cluster. (A sketch of this surrogate-tree approach appears at the end of this section.)

4.3.3.9 Final Result

The final clustering result is selected on the basis of a combination of scientific and business reasons. Cluster models that have good global values of the Condorcet criterion or quality, whose clusters are distinct and different from each other, and that can be accurately modeled with a decision tree are "scientifically" good. Good business models are defined by sensible interpretation of the clusters, good segmentation in shareholder value variables, segmentation that drives obvious business strategies, and segments that are actionable.
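A minimal sketch of the decision-tree characterization in 4.3.3.8, assuming scikit-learn and hypothetical column names: fit a tree to predict the cluster IDs from the clustering inputs, check how faithfully it reproduces the assignments, and export the rules.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    scored = pd.read_csv("scored_customers.csv")  # hypothetical scored output
    features = ["revenue_", "credscore_", "numprod_", "tenure_"]  # clustering inputs

    tree = DecisionTreeClassifier(max_depth=5, random_state=0)
    tree.fit(scored[features], scored["cluster_id"])

    # Fidelity: how often the tree reproduces the cluster assignment.
    fidelity = (tree.predict(scored[features]) == scored["cluster_id"]).mean()
    print(f"tree reproduces cluster IDs on {fidelity:.1%} of records")
    print(export_text(tree, feature_names=features))  # explicit rules per cluster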
Figure 17. Shareholder Value Demographic Clusters

The result shows that there are nine clusters in the model. There are likely more clusters in the data, as we chose nine to be the maximum number of clusters allowed. The clusters are reasonably distributed (not one very large cluster). The variable distributions within the clusters tend to be different from their global distributions. The Best98, Revenue, and CreditScore variables are commonly important to several clusters.

For comparison purposes, a high-level neural clustering is shown in Figure 18 to highlight some similarities and differences between the results from the two methods.

Figure 18. Shareholder Value Neural Network Clusters

The input variables chosen for the neural network were the logarithm-transformed versions of the variables used for the demographic clustering. The discretized variables used for the demographic clustering were input as supplementary variables to aid in the interpretation of the neural clusters. Some key features to note are:
• The neural clusters are not quite as uniformly distributed with respect to cluster size as the demographic cluster results. In our experience, the opposite is usually the case.
• The same variables as in demographic clustering appear as the most important variables (for example, Best98, the other Best variables, REVENUE_, CREDSCORE_, and NUMPROD_).
• The discretized variables are more significant to the cluster definitions than the logarithm-transformed variables used to create the clusters. This illustrates one of the values of discretization: you can use the discrete variables to assist in the interpretation of clusters while using the continuous variables to build the clusters.

Because of the similarity of the neural clustering and demographic clustering results, and in an effort to reduce redundancy in the presentation of results, the discussion below focuses on the demographic clustering results.

4.4.1 Cluster Details Analysis
Figure 19 on page 54 shows the cluster details. From this result we can see that the global Condorcet value is 0.6098. This value is at the low end of a reasonable result. The lower value may be due to several factors, including the fact that we restricted the output to nine clusters when there may have been more, that some variables are not very good, or that the data does not contain distinct clusters. The quality of the clusters ranges from a Condorcet criterion value of 0.42 to 0.72, as shown in the Cluster Characteristics section in Figure 19 on page 54. In the Similarity Between Clusters section of Figure 19, you can see that there is some similarity among clusters, with the similarity measure ranging from < 0.25 to a maximum of 0.42. Figure 19 also shows that the REVENUE12_, CREDSCORE_, and NPRODL12 variables have low Condorcet values. Therefore they could be removed from the cluster model to improve the result (see the Reference Field Characteristics section in Figure 19). These results indicate that further iteration is warranted.

Figure 19. Shareholder Value Demographic Cluster Details
4.4.2 Cluster Characterization
In this section we discuss the characterizations of some of the interesting clusters in Figure 17 on page 51. The Best98 variable is a binary variable that indicates the best customers in the database as determined by other means. The clustering model presented seems to agree very well with this existing definition, as most of the clusters contain almost all Best or no Best customers. As a first pass, this is an exciting result: the status quo Best segment has been confirmed with little effort! To be confident of the data mining results, you should always observe the current business knowledge in the results. Any successful company knows its business well enough that the obvious results should show clearly in any data mining results. Observing the current business knowledge provides confidence that the data selection and data preparation efforts have been valid. If results are observed that were previously unknown, one can have confidence in them as long as they appear alongside currently known facts.

This clustering result does not only validate the existing concept of best customers; it also extends the idea by creating clusters within Best. It can be seen from Figure 17 on page 51 that there are several clusters with varying levels of revenue. Perhaps this builds a case to create a "VeryBest" customer group?

Cluster 6 can be interpreted as almost all Best98 customers, whose credit score, revenue in the last 12 months, revenue per month, and number of products used per month are in the 50th to 75th percentile. (Recall the discretization definition in 4.3.2.2, "Data Transformation" on page 38.) Cluster 6 represents 24% of the population. Refer to Figure 20 on page 56 for a detailed view of cluster 6.

Figure 20. Cluster 6 Detailed View

Cluster 3 can be interpreted as almost no Best98 customers, whose revenue, credit score, revenue in the last 12 months, revenue per month, and number of products per month are all in the 25th to 50th percentile. (Recall the discretization definition given in 4.3.2.2, "Data Transformation" on page 38.) Cluster 3 represents 23% of the population. Refer to Figure 21 on page 58 for a detailed view of cluster 3.

Figure 21. Cluster 3 Detailed View

Cluster 5 represents 9% of the population; these customers' revenue, credit score, and number of products per month are all in the 75th percentile and above, skewed to almost all greater than the 90th percentile. The Best95, Best96, and Best97 variables represent the status of the customers in the calendar years 1995, 1996, and 1997. The fraction of customers who were Best increased year by year! This looks like a very profitable cluster. Refer to Figure 22 on page 60 for a detailed view of cluster 5.

Figure 22. Cluster 5 Detailed View

Figure 23 on page 61 provides the tabulated details for cluster 5.

Figure 23. Cluster 5 Tabulated Details

Cluster 5 contains 8.9% of the customer population. The Condorcet value for cluster 5 is 0.5946, just below the global value. Cluster 5 is most similar to cluster 7 and cluster 0. Notice that REVENUE_ and CREDSCORE_ have Condorcet values of 0.71 and 0.83, respectively.
Recall that globally these variables had low Condorcet values, but for this cluster they have very high values. NPRODL12_ has a low Condorcet value, 0.37, for this cluster and a low value globally. This information can be used to decide whether or not these variables should be included in the model. The details also present the chi-square value and the entropy value, which are measures of the association between the variable and the cluster to which the records have been assigned.

In cluster 1, the supplementary variable NEW is a binary variable that indicates whether or not the customer is new to the Bank. This cluster clearly consists of new customers. The recency is low (which means the customer has not had a recent transaction, that is, they have opened accounts but not transacted yet), and the tenure is low. It would be very interesting to track these customers over time to see how they progress. Refer to Figure 24 on page 63 for a detailed view of cluster 1.

Figure 24. Cluster 1 Detailed View

4.4.3 Cluster Profiling
In this section we present an example of a profile of revenue, number of products purchased, and customer tenure (see Table 1). The Leverage column is the ratio of revenue share to customer share. Table 1 shows that cluster 5 is the most profitable cluster in that it represents 35% of the revenue yet only 9% of the customers. The leverage ratio is the highest for this cluster. From Table 1 you can also see that as profitability increases, so does the average number of products purchased. The product index is the average number of products purchased by the customers in the cluster divided by the average number of products purchased overall. It is also interesting to note that customer profitability increases as customer tenure increases.

Table 1. Customer Revenue by Cluster

  Cluster ID   Revenue   Customer   Product Index   Leverage   Tenure
  5            34.74%     8.82%     1.77            3.94       60.92
  6            26.13%    23.47%     1.41            1.11       57.87
  7            21.25%    10.71%     1.64            1.98       63.52
  3             6.62%    23.32%     0.73            0.28       47.23
  0             4.78%     3.43%     1.45            1.40       31.34
  2             4.40%     2.51%     1.46            1.75       61.38
  4             1.41%     2.96%     0.99            0.48       20.10
  8             0.45%    14.14%     0.36            0.03       30.01
  1             0.22%    10.64%     0.00            0.02        4.66

From this simple result it is possible to derive some high-level business strategies. From Table 1 it is obvious that the best customers (considering only the data in the table) are in clusters 2, 5, and 7. These customers have a higher revenue per person than other clusters, as indicated by the Leverage column.
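A profile like Table 1 reduces to a few aggregate queries against the scored cluster data. The sketch below shows the idea in modern Python with pandas (in practice you would issue the equivalent SQL); the DataFrame and column names are hypothetical.

    import pandas as pd

    def profile_clusters(df: pd.DataFrame) -> pd.DataFrame:
        g = df.groupby("CLUSTER_ID")
        profile = pd.DataFrame({
            # Share of total revenue and of the customer base per cluster
            "revenue_pct": g["REVENUE"].sum() / df["REVENUE"].sum() * 100,
            "customer_pct": g.size() / len(df) * 100,
            # Average products per customer relative to the overall average
            "product_index": g["NUM_PRODUCTS"].mean() / df["NUM_PRODUCTS"].mean(),
            "tenure": g["TENURE"].mean(),
        })
        # Leverage: revenue share divided by customer share
        profile["leverage"] = profile["revenue_pct"] / profile["customer_pct"]
        return profile.sort_values("revenue_pct", ascending=False)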
Some possible high-level business strategies are:
• Retention strategy for the best customers (clusters 2, 5, and 7)
− A business does not want to lose its best customers.
• Cross-sell strategy for clusters 2, 6, and 0, by contrasting them with clusters 5 and 7
− Clusters 2, 6, and 0 have a product index close to that of clusters 5 and 7, which have the highest number of products purchased. Because the clusters are close in number of products purchased, it is not a big stretch to convert customers from clusters 2, 6, and 0. By comparing the products bought by the best customers to those purchased by clusters 2, 6, and 0, you can find missing products, which are candidates for cross selling.
− If you could increase the number of products purchased by 10% of cluster 6 customers by one additional product, you could increase the profitability of cluster 7 by 20% and the entire base by 5%.
− If you could increase the number of products purchased by 10% of cluster 7 customers by two products, you could increase the profitability of cluster 2 by 25% and the entire base by 9%.
• You can similarly cross-sell to clusters 3 and 4 by comparison with clusters 2, 6, and 0, as they are close in value.
• The strategy for cluster 1 would be a wait-and-see plus information strategy.
− Cluster 1 appears to be a group of new customers. As they are new customers, sufficient data has not been collected to determine the behaviors they may exhibit. Informing cluster 1 of the products and services the business offers could help make them profitable more quickly.
• The strategy for cluster 8 may be not to spend any significant marketing dollars.
− Cluster 8 appears to be the worst cluster; it has a very low revenue percentage and purchases very few products, although it has been with the company for about 30 months.

4.4.3.1 Cluster Results Comparison
Intelligent Miner permits the output of one algorithm to be used as the input to another. Table 2 is a cross-tabulation of the cluster IDs created by the neural clustering model and the demographic clustering model. The neural network cluster ID distribution is presented by row, and the demographic clustering distribution by column. The comparison shows the similarity of the two models.

Table 2. Comparison of Neural and Demographic Clustering Results

            0      1      2      3      4      5      6      7      8   Total
  0         3   5306      0      0      7      0      0      0      0    5316
  1         2      7      0    183     89      0      1      0    567     849
  2         1      8      3    665     21      0      0      0   5182    5880
  3      1247      0     37     14    648    533    812    169      0    3460
  4         3      0     11   2163    455      1    355      0     45    3033
  5         2      0     28   5343     32      0      9      2   1277    6693
  6        69      0    744      4      3   3733   4661   4625      0   13839
  7       124      0    400   2461     33     99   4707    490      0    8314
  8       262      0     34    828    193     43   1189     67      0    2616
  Total  1713   5321   1257  11661   1481   4409  11734   5353   7071   50000

The highlighted cells indicate a significant overlap between the two models. From Table 2 it is possible to conclude the following:
• Cluster 0 from the neural model and cluster 1 from the demographic model agree almost 100%. The agreement is not usually this good unless the cluster is very distinct. In this case the cluster contains new Bank customers with very little activity. The fact that both models agree would allow you to apply this particular cluster with confidence.
• The cluster models agree fairly well with each other. The results indicate that there are likely more than nine clusters. Rerunning with a higher maximum number of clusters should result in better agreement between the two models.

4.4.4 Decision Tree Characterization
One disadvantage of clustering methods is that the cluster definitions are not easily extracted. Building a decision tree model with the cluster ID as the field to be classified, using all available input data, allows explicit rules to be extracted for each cluster. The decision tree model built using the demographic clustering result from above showed an accuracy of 95% (see the confusion matrix in Figure 25). The confusion matrix shows the distribution of the classification errors and the global accuracy.

Figure 25. Decision Tree Confusion Matrix

See Figure 26 on page 67 for a view of the decision tree model and a rule for cluster 5. Rules for each of the clusters can be extracted.

Figure 26. Decision Tree Model

As the accuracy of the decision tree is very high (95%), it is preferable to implement the decision tree version of the customer segmentation model rather than the original demographic clustering model.
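The sketch below shows the same characterization step with an open-source decision tree (scikit-learn). It is an illustration only, not the Intelligent Miner implementation; the data file and column names are hypothetical, and numeric inputs are assumed.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text
    from sklearn.metrics import accuracy_score, confusion_matrix

    df = pd.read_csv("scored_customers.csv")   # hypothetical scored cluster data
    X = df.drop(columns=["CLUSTER_ID"])        # input and supplementary variables
    y = df["CLUSTER_ID"]                       # the field to be classified

    tree = DecisionTreeClassifier(min_samples_leaf=50).fit(X, y)
    pred = tree.predict(X)
    print(confusion_matrix(y, pred))           # analogous to Figure 25
    print(accuracy_score(y, pred))             # the case study reached about 95%
    print(export_text(tree, feature_names=list(X.columns)))  # explicit rules per cluster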
4.5 Business Implementation and Next Steps
The results of this case study drew several reactions from the Bank executives:
1. Excellent visualization of results allows for more meaningful and actionable analysis.
2. The original segmentation methodology was validated very quickly.
3. Refinement of the original segmentation is indicated and meaningful.

Based on the results of this case study, several data mining opportunities were identified, and several projects were undertaken. Some of these projects include:
• Several predictive models for direct mail targeting
• Further work on segmentation, using more detailed behavioral data
• Opportunity identification using association algorithms within the segments discovered

Data mining tools can be used to quickly find business opportunity in customer transaction data. The simple example presented herein attempts to highlight a process that can be used to achieve profitable data mining results.

Once a segmentation model is built and the customer is satisfied with the result, the model is ready to be implemented. The first step in the implementation is to integrate the model into the data warehouse and to modify the data warehouse load process to automatically assign customers to the appropriate segments. The variables used in the final segmentation model should be calculated and stored in the data warehouse permanently. A data warehouse table should be created to track each customer over time and record which segment the customer belonged to in each time period. Such a table is very useful for analytical purposes and can be used to measure the overall effectiveness of marketing campaigns by observing their effect on customer behavior over time. The segmentation model should also be rebuilt periodically (in our experience, from monthly to annually, depending on the organization). A comparison of segmentation models over time should reveal changing market dynamics and changing customer behavior due to an organization's marketing efforts, changing products and services, and social, political, and economic changes.

When the segmentation model has been implemented in the data warehouse, it is possible to begin using it to drive actionable business activities. The customer segment information can be used in operational data stores to support continuous marketing and other operational activities, to create standard reports highlighting the shareholder value, demographic profiles, and transaction behavior of each segment, and as a framework to support opportunity identification.

The next case study explores the use of affinity analysis within the segmentation model defined herein, to find profitable cross-selling opportunities.
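Before moving on, here is a minimal sketch of the tracking table just described: one row per customer per period recording segment membership, plus a cross-tabulation of two periods to expose segment migration. The schema and names are hypothetical; in practice this logic would live in the warehouse load process.

    import pandas as pd

    def snapshot_segments(scored: pd.DataFrame, period: str) -> pd.DataFrame:
        # scored: one row per customer with CUSTOMER_ID and CLUSTER_ID
        snap = scored[["CUSTOMER_ID", "CLUSTER_ID"]].copy()
        snap["PERIOD"] = period
        return snap                        # append to the tracking table

    def migration_matrix(history: pd.DataFrame, p1: str, p2: str) -> pd.DataFrame:
        # Cross-tabulate each customer's segment in period p1 against p2;
        # shifts in this matrix reveal changing customer behavior over time.
        wide = history.pivot(index="CUSTOMER_ID", columns="PERIOD", values="CLUSTER_ID")
        return pd.crosstab(wide[p1], wide[p2])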
Chapter 5. Cross-Selling Opportunity Identification

Using Intelligent Miner's product associations algorithms to identify a cross-selling opportunity that is actionable and profitable is the topic of this case study. It is based on the customer segments derived from the first case study, whose strategic initiative is to increase profitability.

5.1 Executive Summary
The business requirements for this case study are to identify cross-selling opportunities for the customer segments defined in Chapter 4 and to ensure that the opportunities discovered adhere to the corporate objectives. Customer purchase transactions or billing data are required to perform product associations. We used the Bank's data warehouse to analyze transaction data, thereby reducing the data preparation requirements for the case study. A review of the cross-selling process using association discovery is presented in sufficient detail for technical analysts to be able to reproduce the project using their own data.

The cluster selected for the cross-selling opportunity represented 7% of revenue for 23% of customers. The target behavior cluster represented 26% of revenue for 23% of customers (see Figure 17 on page 51 or Table 1 on page 64 for the corresponding clusters). Changing the behavior of 10% of the cross-selling cluster to that of the target cluster would represent a 25% increase in the cluster's profitability, or a 3% increase in the overall profitability of the business.

A credit card product category was identified as a cross-selling opportunity. Several specific products within the credit card category were also identified. The selection of these products was driven by the fact that the Bank's executives, based on previous analyses, knew that credit card products were very profitable. Our analyses also revealed that these products have the highest profit potential. Confirmation of the business intuition provided additional confidence in proceeding with a campaign. Although the data mining analysis simply confirmed the business intuition, it provided quantitative results and a specific target group, both of which were previously missing.

The recommended next steps include some demographic profiling of the target customer group to assist the marketer in creating appropriate advertising and marketing messages as well as selecting marketing channels. To further refine the target group, we also recommend the construction of a predictive model, which is the content of a later case study (see Chapter 7, "Attrition Model to Improve Customer Retention" on page 111).

5.2 Business Requirement
The main objective of this case study was to use data mining techniques to find actionable cross-selling opportunities from the analysis of customer transaction data. Any opportunities that are identified should support strategic marketing initiatives for the customer segments used by the organization. The segmentation and the strategic initiatives recommended in the previous case study in Chapter 4, "Customer Segmentation" on page 33 should be used. Finally, the next steps required to implement the cross-selling opportunities as a marketing campaign should be recommended.

5.3 Data Mining Process
Figure 27 on page 71 highlights the data mining process implemented in this case study to meet the business requirements. The major steps in the process are:
1. Cluster (segment) selection
2. Transaction data selection
3. Data preparation
4. Product association mining
5. Results analysis
6. Comparison to identify cross-selling opportunities
7. Comparison of methodologies
8. Selection of a cross-selling opportunity

We cover steps 1 through 4 in this section. We cover the other steps in 5.4, "Data Mining Results" on page 76.

Figure 27. Data Mining Process: Cross-Selling Opportunity

5.3.1 Cluster Selection
The process of finding cross-selling opportunities within a specific customer segment depends on contrasting the purchase behavior of two or more clusters. (The method discussed here is not the only method of finding cross-selling opportunities.) One cluster is selected to be the group of customers whose behavior is to be replicated in the other clusters; this cluster is usually the more profitable one. For cross-selling opportunity identification, the purchase behaviors of interest are represented as product associations derived from purchase transactions. Comparing the patterns of two or more clusters highlights product pattern differences. For instance, if cluster A had a product association (A,B-->C) and cluster B had a product association (A-->B), the cross-selling opportunity would be to market product C to cluster B. The behavior of the clusters being contrasted should not differ significantly. The gap in missing products should not be more than one to three products, because it is difficult to change customer behavior drastically. Furthermore, too large a gap between clusters could indicate fundamentally different customer behaviors that would be difficult or impossible to bridge.
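The (A,B-->C) versus (A-->B) comparison can be sketched as a simple set difference over the items appearing in each cluster's rules. This is an illustration only; the rules here are hypothetical (body, head) pairs of item sets rather than Intelligent Miner output.

    def missing_products(rules_target, rules_candidate):
        # Each rule is a (body, head) pair of frozensets of product codes.
        items_target = set().union(*[body | head for body, head in rules_target])
        items_candidate = set().union(*[body | head for body, head in rules_candidate])
        # Products present in the target cluster's patterns but absent from
        # the candidate cluster's patterns are cross-selling candidates.
        return items_target - items_candidate

    rules_a = [(frozenset("AB"), frozenset("C"))]   # cluster A: (A,B --> C)
    rules_b = [(frozenset("A"), frozenset("B"))]    # cluster B: (A --> B)
    print(missing_products(rules_a, rules_b))       # {'C'}: market C to cluster B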
5.3.2 Data Selection
The data required to create product associations is customer transaction data, market basket data, or any other data that has a similar layout. Figure 28 illustrates a typical transaction record used as input to a data mining project.

Figure 28. Typical Transaction Record

For market basket analysis the customer is not usually known, and product associations are found using transaction or market basket data alone. In this case study the customer is known, and it is therefore possible to link customer transactions over time; this is much more powerful than an analysis of market baskets without the customer ID.

The following considerations are important in the selection of transaction data for association rule mining:
• Time window of transactions
• Level of product aggregation
• Definition of product activity

The selection of a time window for the transactions is driven by the product purchase cycle. We typically choose two to four product cycles, a range that has produced positive results. The average purchase cycle can be determined by query analysis of customer purchase transactions. (If the customer ID is not known, the product cycles must be determined by empirical or survey methods.) For frequently purchased products, a short time window is sufficient. A long time window, and hence more transaction records, is required for low-frequency items. It is typically more difficult to find patterns in low-frequency items because of the amount of data and the prevalence of many product cycles of high-frequency-item transactions. To find patterns among low-frequency items, we recommend removing the transaction line items for all high-frequency products. If the objective is to find patterns between low- and high-frequency purchases, there is no choice but to use the long time window and all transaction details.

For this case study we selected a 12-month window of customer transaction data. This implies that the patterns or associations discovered will be for customer purchases that occur with a frequency of six months or less.

Another important consideration is the level of aggregation chosen for the product definition. If product codes are too specific (that is, they are based on product details like size and flavor groupings), fewer associations will be discovered. The associations discovered will also be less actionable because of the specificity required in a promotional advertisement. A product taxonomy or hierarchy is usually helpful in guiding the selection of the product definition. For this case study we used product categories, which reduced the number of possible product codes from more than 130 to 13.
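Rolling detailed product codes up to a category level is a simple join against the taxonomy. A minimal sketch in Python, with a hypothetical lookup table and product codes:

    import pandas as pd

    # Hypothetical taxonomy mapping detailed codes to categories
    taxonomy = pd.DataFrame({
        "PRODUCT_ID": ["CC01", "CC02", "TD01"],
        "CATEGORY":   ["CREDIT CARD", "CREDIT CARD", "TERM DEPOSIT"],
    })

    def roll_up(transactions: pd.DataFrame) -> pd.DataFrame:
        # Replace detailed product codes with their category before mining;
        # in the case study this step took 130+ codes down to 13 categories.
        rolled = transactions.merge(taxonomy, on="PRODUCT_ID", how="left")
        return rolled.drop(columns="PRODUCT_ID").rename(columns={"CATEGORY": "PRODUCT_ID"})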
A final consideration important to product association analysis is the definition of what constitutes a product purchase. This is more relevant when the customer ID is known. If all products that were purchased only once over time are included, more product patterns will be discovered, some of which may not be very strong (that is, they may have lower confidence). Setting some minimum criterion for inclusion of a particular product for a particular customer should reduce the number of weak rules and thus permit easier analysis. A threshold may be to consider only products that have been purchased more than once by a customer, or products on which the customer has spent some minimum amount of money.

5.3.3 Data Preparation
Transaction data for organizations that generate revenue through customer billing is typically very "clean." These industry sectors include finance, telecommunications, insurance, and utilities, where the transaction data to be analyzed is actually billing data. Very little data preparation is required to perform product association analysis in these industry sectors. The preparation activities would typically include:
• Ensuring that product codes are consistent
• Addition of product hierarchy information
• Creation of new product hierarchy levels

Product IDs that reference the same product should be made consistent. Variations in the product ID can result from the use of different codes in different stores or regions, code changes due to supplier changes, new coding systems being implemented, and errors. If the product IDs are not made consistent, the support for patterns, and hence the number of patterns discovered, will be lower. The product codes in the customer transaction data used for this case study were consistent, with only a few exceptions.

Adding the product hierarchy information (as illustrated in Figure 28 on page 72) allows the product association mining to be easily conducted using different levels of product definition. The Bank's data warehouse already contained the product hierarchy information on each transaction record, obviating this step in the process.

The final data preparation activity that may be required is the manual creation of new product hierarchy levels. This activity is required when too few patterns are discovered as a result of product definitions that are too specific. In such cases, we recommend using a higher level in the product hierarchy. If, however, using the higher level results in rules that are too general and hence difficult to act on, the creation of an intermediate layer is required. This process can be very laborious if the number of possible products is large. (Most industries have several hundred products, except retail, which may have tens of thousands!) The product hierarchy for the Bank's data was based on past analytical experience and was therefore appropriate for analysis without modification.

5.3.4 Product Association Analysis
Figure 29 illustrates the steps required to discover product associations:
1. Parameter settings
2. Association discovery
3. Profile rules and large item sets (LIS)
4. Selectively remove large item sets
5. Iterate back to step 1
6. Rebuild rules

Figure 29. Product Association Analysis Workflow
5.3.4.1 Parameter Settings
The first step in setting up an association run is to select the algorithm parameters. The parameters available include:
• Minimum support
This is the minimum frequency of occurrence of a pattern required for a rule to be considered.
• Minimum confidence
This is the minimum conditional probability of the rule head given the rule body required for a rule to be considered.
• Maximum rule length
This is the maximum number of products allowed in any rule to be considered.
• Item constraints
This is a list of items that all rules must contain in order to be considered.

Starting with values that are too low for support and confidence may cause unnecessary computer load. The association algorithm is very memory and CPU intensive as the number of products and the number of rules considered grow. We recommend choosing very high values for support and confidence (50% for both) and gradually lowering them until the number of patterns becomes unwieldy. We usually leave the confidence level at 50% to eliminate most of the permutations of rules that meet the minimum support criterion. For example, if rule (A-->B) meets the support criterion, so does rule (B-->A); but if the original rule meets the confidence criterion, the permutation usually does not. Reducing the permutations results in fewer rules and permits easier results analysis. We never limit the maximum rule length or constrain the list of items within the algorithm. If certain items are not to be considered, it is more convenient to remove them from the transaction records.
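The start-high-and-lower-gradually strategy can be sketched with an open-source Apriori implementation (mlxtend here, as a stand-in for Intelligent Miner's association algorithm; the one-hot basket matrix and the stopping point of 50 rules are hypothetical):

    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # Hypothetical one-hot customer/product matrix (one row per customer)
    baskets = pd.DataFrame(
        [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 1, 1]],
        columns=["LOAN", "MORTGAGE", "CREDIT CARD"],
    ).astype(bool)

    # Hold confidence at 50% and lower support until the rule set grows large
    for min_support in (0.5, 0.4, 0.3, 0.2):
        frequent = apriori(baskets, min_support=min_support, use_colnames=True)
        rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
        print(min_support, len(rules))
        if len(rules) > 50:     # stop before the rule set becomes unwieldy
            break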
5.3.4.2 Association Discovery
Having determined the parameter bounds, you can discover the association rules. Association discovery is run separately for each of the clusters selected for contrasting. The number of rules generated initially is usually very large and intimidating. Rather than changing the parameter settings at this point, it is possible to begin temporarily removing certain products from the transaction records.

5.3.4.3 Selectively Remove Large Item Sets
The products removed are large item sets (LIS). There are two types of LIS:
1. Large item sets whose frequency in the entire transaction data universe is statistically equivalent to their frequency in the current data set. These items complicate the analysis and should be removed to achieve less complicated rules that are easier to analyze.
2. Large item sets whose frequency in the entire transaction data universe is statistically different from their frequency in the current data set. These are the items that make up the patterns discovered.

After the first type of LIS is removed from the transaction data, the associations are rediscovered. If the number of rules is still unmanageable, begin removing the second type of LIS, noting carefully what is removed. The associations are again rediscovered. This process is repeated until the number of rules is manageable (usually 20 to 50 rules). Removing the LIS allows you to understand the "structure" of the rules. The 20 to 50 rules remaining at this point form the core of the rules. The initial unmanageable set of rules is created by permuting the LIS with the remaining rules and applying the support and confidence criteria.

5.3.4.4 Profile Rules and Large Item Sets
This step is repeated for all the clusters that were selected for contrasting. The minimum sets of rules in the two cases are compared to identify products not present in some of the clusters. The lists of removed LIS are also compared to identify products not present in some clusters. These missing products are the cross-selling candidate opportunities. The acceptance of candidates as actionable opportunities is usually driven by the number of customers who bought the missing product. Too small a group of customers will yield too little return to justify the promotional investment.

5.3.4.5 Rebuild Rules
Once you have determined the candidate opportunities, it is important to reconstruct the original and actual rules present in the data. Removing items from the transaction data affects the statistics of the rules. To get accurate statistics, the LIS must be returned. Adding the LIS back one by one and observing the change in the discovered associations will give useful insight into the rule "structure."

5.4 Data Mining Results
As mentioned before, the result of the demographic clustering process described in our first case study has been used for this case study. Identifying opportunities for cross selling is a two-step process:
1. Select the customer segments created before, choosing those clusters containing valuable customers.
2. Perform product association discovery on the selected cluster data.

5.4.1 Cluster Selection
We created Table 3 on page 77 using the result of the demographic clustering, enhanced by some data we selected through query analysis against the original cluster data.

Table 3. Demographic Clustering Results: Percentage

  Cluster ID   Profit    Customer   RevenueL12   Product Index   Leverage (Profit/Cust)   Tenure
  5            34.74%     8.82%     32.83%       1.77            3.94                     60.92
  6            26.13%    23.47%     28.36%       1.41            1.11                     57.87
  7            21.25%    10.71%     20.10%       1.64            1.98                     63.52
  3             6.62%    23.32%      5.98%       0.73            0.28                     47.23
  0             4.78%     3.43%      6.78%       1.45            1.40                     31.34
  2             4.40%     2.51%      3.00%       1.46            1.75                     61.38
  4             1.41%     2.96%      2.46%       0.99            0.48                     20.10
  8             0.45%    14.14%      0.47%       0.36            0.03                     30.01
  1             0.22%    10.64%      0.01%       0.00            0.02                      4.66
  Total       100.00%   100.00%    100.00%

The two clusters chosen for further study in this case study were clusters 3 and 6. From Table 3 you can see that cluster 6 represents a profitable customer segment, with 26% of revenue coming from 23% of customers. In contrast, cluster 3 represents only 7% of revenue for 23% of customers. The number of products used by cluster 6 customers (indicated by the product index) is greater than that for cluster 3; query analysis reveals that the difference is on average two products. Furthermore, cluster 6 has a slightly longer tenure. These two clusters were chosen because of the sizeable opportunity and the small gap in purchase behavior between them.

5.4.2 Association Rule Discovery
We initially performed product association discovery on the selected cluster data, using the Intelligent Miner parameter settings illustrated in Figure 30 on page 78.

Figure 30. Parameter Settings for Associations
5.4.2.1 Association Results for Cluster 6
Figure 31 shows that association discovery for the entire Good Customer Set returned many (2,218) rules.

Figure 31. Associations on Good Customer Set

Figure 32 on page 79 shows that the frequent item sets include loan (94%), mortgage (90%), and credit card (79%).

Figure 32. Associations on Good Customer Set Detail

Figure 33 shows the associations when loan, mortgage, and credit card are removed. Note that the number of rules has been reduced to 286.

Figure 33. Associations for Good Customer Set: LIS Removed

There are many multiple-item (more than four items) rules in the Good Customer Set (see Figure 34 on page 80).

Figure 34. Associations for Good Customer Set: LIS Removed, Detail

5.4.2.2 Okay Customer Set
Figure 35 shows the associations for the entire Okay Customer Set. Many (212) rules have been generated.

Figure 35. Associations on Okay Customer Set

Figure 36 on page 81 shows that the frequent item sets include loan (70%), mortgage (70%), and credit card (24%). The substantially lower frequency of credit card activity in cluster 3 represents a cross-selling opportunity.

Figure 36. Associations on Okay Customer Set Detail

Figure 37 shows the associations when loan and mortgage are removed. Note that the number of rules has been reduced to 48.

Figure 37. Associations for Okay Customer Set: LIS Removed

No rules in the Okay Customer Set contain more than four items. Further detailed comparison of the association rules will reveal other cross-selling opportunities. The largest cross-selling opportunities are revealed by differences in the large item sets.

5.4.2.3 Association Rules Discovery: Product Detail Level
So far, all of the associations have been processed on product categories that summarize products. A comparison of Figure 38 on page 82 (and Figure 39 on page 82) with Figure 33 on page 79 (and Figure 34 on page 80) shows what happens when associations are run at a more detailed level. With low-level products instead of product categories specified, the number of associations for the Good Customer Set exploded from 286 to 1,521.

5.4.2.4 Good Customer Set

Figure 38. Associations for Good Customer Set: LIS Removed, Summary

Figure 39. Associations for Good Customer Set: LIS Removed, Detail

Figure 40 and Figure 41 on page 83 show the results of association discovery when, in addition to all types of loans, mortgages, and credit cards, we also removed frequent account types from the Good Customer Set product sets. Note that the number of rules has been reduced to 55.

Figure 40. Associations for Good Customer Set: LIS and Certain Products Removed, Summary

Figure 41. Associations for Good Customer Set: LIS and Certain Products Removed, Detail

5.4.2.5 Okay Customer Set
Figure 42 and Figure 43 show the results of association discovery when transactions containing large item sets are removed from the Okay Customer Set. The number of rules has been reduced from 513 to 15.

Figure 42. Associations for Okay Customer Set: LIS and Certain Products Removed, Summary

Figure 43. Associations for Okay Customer Set: LIS and Certain Products Removed, Detail

The difference in the large item sets removed for cluster 6 and cluster 3 reveals the opportunity to cross-sell web banking to cluster 3. Furthermore, the lower frequency of occurrence of term deposits in cluster 3 reveals another substantial opportunity. Further detailed comparison of the differences in the rules generated for cluster 6 and cluster 3 will reveal further cross-selling opportunities.
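Comparing item frequencies between the two clusters (for example, credit card at 79% in the Good Customer Set versus 24% in the Okay Customer Set) can be sketched as follows; the one-hot basket matrices are hypothetical stand-ins for the per-cluster transaction data.

    import pandas as pd

    def item_support(baskets: pd.DataFrame) -> pd.Series:
        # Share of customers in the cluster holding each product
        return baskets.mean()

    def support_gaps(good: pd.DataFrame, okay: pd.DataFrame) -> pd.Series:
        # Large positive gaps (products common in the Good Customer Set but
        # rare in the Okay Customer Set) are cross-selling candidates.
        gap = item_support(good) - item_support(okay)
        return gap.sort_values(ascending=False)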
5.4.2.6 Association Rule Discovery for the Entire Universe
Figure 44 and Figure 45 show the association rule results discovered by mining against all customer transaction records without using segmentation. Product categories were used in this example.

Figure 44. Associations for All Transactions: LIS Removed, Summary

Figure 45. Associations for All Transactions: LIS Removed, Detail

The number of rules in this case is 480, compared to 286 for cluster 6. The increased number of rules results in a more complex analysis. Furthermore, the lack of segment objectives makes it difficult to know what to search for.

5.5 Business Implementation and Next Steps
Several cross-selling opportunities were identified. At the product category level, cross-selling CREDIT CARD to cluster 3 was the best opportunity. The strategic initiative for cluster 3 derived in the segmentation case study was to identify cross-selling opportunities, so this CREDIT CARD opportunity is consistent with the segment objectives. With previous methods and common business experience the Bank had recognized that cross-selling CREDIT CARD to its customer base was an objective. The method presented in this case study provides the additional benefit of targeting CREDIT CARD to customer segments of high shareholder value.

Using more detailed product definitions revealed several specific product cross-selling opportunities. These included cross-selling term deposits and web banking to cluster 3.

Before the cross-selling opportunity can be implemented, several activities must be completed. Some demographic profiling or clustering of the target universe is required to assist the marketer and advertiser in creating the appropriate marketing message and selecting the appropriate marketing channel. It is also not efficient to target the entire group for this cross-selling campaign. Other factors must be considered to target those customers most likely to use a credit card. Building a predictive model to target those customers most likely to use or require a credit card, as well as those most likely to pass a credit check, would further reduce the mailing cost of executing this campaign. The creation of a predictive model to target these customers is the topic of the next case study.
Chapter 6. Target Marketing Model to Support a Cross-Selling Campaign

In this case study, we build three predictive models to target those customers likely to buy the product identified as a cross-selling opportunity in the previous case study. Several algorithms from Intelligent Miner are used, and the models built (decision tree, radial basis function (RBF) regression, and neural network) are compared.

6.1 Executive Summary
In Chapter 5, "Cross-Selling Opportunity Identification" on page 69 we describe how we identified a new business opportunity: cross-selling credit cards to existing customers to make them more profitable. The focus was on customers in the Okay Customer Set. Our strategy was to market the Bank's credit card to these customers. By getting some of the customers in the Okay Customer Set to use a credit card, we would migrate them to the Good Customer Set and increase their profitability. A simple approach would have been to conduct a direct mail campaign to all customers in the Okay Customer Set, but our limited marketing budget did not permit that. Furthermore, an additional goal was to reduce the cost per customer acquisition and thus increase the campaign ROI while maximizing the number of customers cross-sold.

Using the customers from the Okay Customer Set and those of the Good Customer Set, we created a data set that we could use to predict which customers in the Okay Customer Set had a propensity to use a credit card. We built three predictive models, using the three prediction techniques available in Intelligent Miner:
• Decision tree
• Value prediction with RBF
• Neural network

A process for predictive modeling was presented to the analysts, and the results from each algorithm were compared. The neural network model had the best performance. By mailing only 40% of the total Okay Customer Set, we managed to include 76% of the customers with the highest propensity to use a credit card. Furthermore, we expected to get an ROI of 113%, in contrast to the 60% ROI we could have expected by mailing the entire Okay Customer Set. The higher ROI was achieved by reducing the cost per customer acquisition from $167 to $88. Table 4 on page 88 summarizes the financial details.

Table 4. Cross-Selling Summary: Predictive Modeling More Than Doubles ROI

                                           Without Prediction Model   With Prediction Model
  Mailed customers                         25,000                     10,000
  Cost of mailing material and mailing
    per customer                           $5                         $5
  Total cost                               $125,000                   $50,000
  New acquisitions                         750                        570
  Cost per acquisition                     $167                       $88
  Average profit/year per customer         $100                       $100
  Total profit per year                    $75,000                    $57,000
  Return on investment                     60%                        113%

6.2 Business Requirements
The general objective in this case study was to improve company revenue and profitability by attracting more customers to the credit card. We specifically wanted to cross-sell to customers from the Okay Customer Set and, in successfully doing so, increase their profitability and move them to the more profitable Good Customer Set. The average profit from a customer who uses a credit card is $100 per year. We used a direct mailing campaign to target the best prospects for the credit card from the Okay Customer Set.

The first task in designing the campaign was to establish a baseline against which to measure the success of the planned mailing campaign. In other words, we had to calculate the ROI that would be expected from such a mailing campaign without any data mining. We calculated the ROI by looking at the historical trends in the movement of customers from the Okay Customer Set to the Good Customer Set. Table 5 summarizes the calculations.

Table 5. Cross-Selling: Baseline ROI Calculation

  Total number of customers           25,000
  Cost of mailing and creation
    (per piece)                       $5
  Total cost of mailing               $125,000
  Expected takeup rate                734/25,000 = 3% (see Figure 48 on page 94)
  Expected new acquisitions           750 (3% of 25,000)
  Cost per acquisition                $125,000/750 = $167
  Average profit per customer         $100
  Total profit per year               $100 * 750 = $75,000
  Return on investment                $75,000/$125,000 = 60%

The Okay Customer Set and Good Customer Set total approximately 25,000 customers. The expected response to the credit card offer is about 3%, based on the observed movement of customers from the Okay Customer Set to the Good Customer Set (see Figure 48 on page 94 for details). The baseline ROI for a mass marketing campaign is 60%. Thus it was not feasible to use mass marketing methods to implement the direct mailing campaign. The specific campaign goals were to achieve a positive return while moving as many customers as possible from the Okay Customer Set to the Good Customer Set.
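The baseline arithmetic in Table 5 reduces to a one-line calculation, sketched below with the figures from the text:

    def campaign_roi(mailed, cost_per_piece, takeup_rate, profit_per_customer):
        cost = mailed * cost_per_piece                # $125,000 for the baseline
        acquisitions = mailed * takeup_rate           # 750 expected acquisitions
        profit = acquisitions * profit_per_customer   # $75,000 per year
        return profit / cost

    print(campaign_roi(25_000, 5, 0.03, 100))         # 0.60, the 60% baseline ROI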
6.3 Data Mining Process
Our general approach was to build predictive models to help identify those customers in the Okay Customer Set who were the best prospects for using the credit card. In fact, we used several different predictive techniques in Intelligent Miner, both as a means of gaining more insight into the target customer set and as a cross-validation of the different mining algorithms. Figure 46 on page 90 illustrates the overall approach.

Figure 46. Data Mining Process: Cross-Selling

6.3.1 Create Objective Variable
The first step in the predictive modeling process is to determine the objective variable to be modeled. When building models for targeting direct mail campaigns, the objective variable is usually based on the customer's historical response data to similar campaigns. In this particular case, we did not have a historical campaign similar enough to the proposed campaign to use to create a response variable. In practice this situation occurs frequently. An alternative to building a response model is to build a propensity model. A propensity model predicts which customers, who do not currently purchase the product being cross-sold, have a higher likelihood or propensity to purchase it.

The first and critical step in creating the target or objective variable is to select the time period under consideration. Setting the objective variable correctly is critical. The size of the time window selected is driven by the time horizon of the desired prediction. For instance, if the marketing campaign to be executed gives the customer a six-month window to respond, the objective variable should be defined with a six-month window. In this case, the marketing objective was to cross-sell a particular product to a group of customers targeted for a direct mail campaign that would have a six-month window of opportunity for the customer to respond. We thus considered customers who used the credit card product in question in the most recent six-month period. In fact, we only considered customers who had activated in the most recent six months, that is, customers who had never used the Bank's credit card more than six months ago.

This last point is extremely important. To create a predictive model we must be able to predict the future behavior of a customer before that customer exhibits the behavior. If we are to predict the propensity of a customer to use the Bank's credit card in the next six months, we must do so using past data for the customer, that is, data from a period before the customer used the credit card. We assigned the objective variable a value of 1 if a customer had no credit card activity before the third quarter of 1997 but had activity in the third or fourth quarter of 1997. All other customer records were assigned a value of 0 for the objective variable.
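A sketch of this labeling rule in Python (the quarterly activity flags and column names are hypothetical):

    import pandas as pd

    def make_target(df: pd.DataFrame) -> pd.Series:
        # 1 = no credit card activity before 3Q97 but activity in 3Q or 4Q 1997
        no_card_before = (
            (df["CC_ACTIVE_BEFORE_97"] == 0)
            & (df["CC_ACTIVE_Q1_97"] == 0)
            & (df["CC_ACTIVE_Q2_97"] == 0)
        )
        activated = (df["CC_ACTIVE_Q3_97"] == 1) | (df["CC_ACTIVE_Q4_97"] == 1)
        return (no_card_before & activated).astype(int)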
To predict those customers who activated in the third and fourth quarters of 1997, we used the customer transaction records for only the first two quarters of 1997 (see Figure 47). A final consideration in creating data for prediction is to ensure that no data related to the objective variable is used for prediction. For instance, if you are predicting the credit card profitability of customers, do not use credit card data, because they are one and the same. Profitable credit card customers will have more activity on their cards, so using credit card activity to predict credit card profitability is a self-fulfilling prophecy.

Figure 47. Creating an Objective Variable

We also selected customers only from the Okay Customer Set and the Good Customer Set instead of sampling from the entire customer universe. By focusing on these customer clusters, both of which were profitable, we automatically eliminated low-profit customers from the direct mail campaign. Note that some of the customers in the Okay Customer Set and Good Customer Set had no activity in the last 12 months with the credit card we were cross-selling. It is important to have a mix of target and non-target groups in order to develop a model that can distinguish between the two extremes.

We specifically chose three different types of data to use for the predictive modeling process:
1. Customer transaction data
Transaction data includes revenue, the number of transactions, and recency of transactions for each Bank product category by time period (in this case by quarter). The Bank's data warehouse categorizes products into 14 groups, as explained in 4.3.1, "Data Selection" on page 36. We therefore created a total of 84 variables (3 variables * 14 categories * 2 quarters).
2. Customer demographic and summary data
This data consists of demographic data including customer age, gender, income, and household size. The Bank's data warehouse also contains summarized transaction data, including total revenue lifetime to date (LTD), number of products used LTD, total transactions LTD, recency, and first transaction date.
3. Third-party census and tax data
In Canada the government permits the reselling of census and tax data. This data is aggregated to the enumeration area, which contains 300 to 400 households. A variety of data is available, including profession, ethnicity, education, income, and income by source.

Promotion history data is also typically used when building response models for direct mail campaigns. In this case, promotion history was not available for the particular product we were targeting, so the model built is not a response model. (The credit card we identified as a cross-selling opportunity was a card with new features that had not been marketed via direct mail previously.) It is a propensity model, which predicts the customer's propensity to use a credit card.

6.3.2 Data Preparation
We performed two types of data preparation: data cleaning and transformations.

6.3.2.1 Data Cleaning
The data cleaning required for predictive modeling is similar to data cleaning for clustering, as discussed in 4.3.2, "Data Preparation" on page 38. The only difference is that more care is required in assigning values to missing records. In choosing a value to assign, the resulting distribution of the variable in question should not be drastically altered. Ideally you want to assign values that do not change the characteristics of the distribution (for example, the min, max, and mean). If it is not possible to assign values without dramatically altering a variable's distribution, discard that variable to avoid spurious correlations. We assigned all transaction variables that had missing values a value of zero. Such an assignment is appropriate because an absence of transaction activity (null in the database) implies zero activity. We discarded demographic data with missing values if the missing portion was significant. Also, we created binary variables indicating the missing portion of all categorical variables.
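These cleaning rules can be sketched as follows (the column lists are hypothetical):

    import pandas as pd

    def clean(df: pd.DataFrame, transaction_cols, categorical_cols) -> pd.DataFrame:
        # Null transaction activity means zero activity, so zero-fill
        df[transaction_cols] = df[transaction_cols].fillna(0)
        # Flag missing categorical values with binary indicator variables
        for col in categorical_cols:
            df[col + "_MISSING"] = df[col].isna().astype(int)
        return df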
6.3.2.2 Data Transformation
After we cleaned the data, handled all missing and invalid values, and made the known valid values consistent, we transformed the data to maximize the information content that can be retrieved. For statistical analysis the data transformation phase is critical, as some statistical methodologies require that the data be linearly related to the objective variable, normally distributed, and free of outliers. Artificial intelligence and machine learning methods do not strictly require the data to be normal or linearized, and some methods, like the decision tree, do not even require outliers to be dealt with. This is a major difference between statistical analysis and data mining. The machine learning algorithms can automatically deal with nonlinearity and nonnormal distributions, although in many cases the algorithms work better if these criteria are met. A good statistician with a lot of time can manually linearize, standardize, and remove outliers better than the artificial intelligence and machine learning methods. The challenge is that with millions of data records and thousands of variables, it is not feasible to do this work manually. Also, most analysts are not qualified statisticians, so using automated methods is the only reasonable solution.

After cleaning the original data variables, we created new variables using ratios, differences, and business intuition. We created total transaction variables, which were the sums of the transaction variables over the two quarters. We used these totals as normalizing constants to create ratio variables. We created time-series variables to capture the difference in each transaction variable between quarters. The data set for predictive modeling is almost identical to that created for a clustering model, except that more care is taken in the data cleaning and data transformation processes.

In addition, it is important to remove all colinearities from the input variables before you execute any algorithms. Variables that are colinear cause most data mining algorithms difficulty and worsen model performance. Colinearities can be removed by using:
• Correlation analysis
• Principal component analysis
• Regression

Removing colinearities is especially important when you use RBF and the neural network. One of the assumptions made in the back-propagation algorithm is that the input variables are linearly independent. If this is not true, it may take a long time to train the neural network, and the results may be poor, depending on how correlated the inputs are.
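A simple correlation-based version of this step can be sketched as follows (the 0.95 threshold is an assumption, not a value from the text):

    import numpy as np
    import pandas as pd

    def drop_colinear(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
        corr = df.corr().abs()
        # Keep only the upper triangle so each pair is examined once
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        # Drop one variable from every highly correlated pair
        to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
        return df.drop(columns=to_drop)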
6.3.3 Data Sampling for Training and Test
Finally, we took a sample of the data for training and testing the Intelligent Miner prediction algorithms (see Figure 48 on page 94).

Figure 48. Cross-Selling: Data Sampling

It is important to create both a training data set, which is used to build the model, and a test or hold-back data set, which is used to test the model. A model should be tested against data that it has never seen before to ensure that there is no overfitting. In this case we were trying to build a model that would predict a binary outcome (that is, the customer's propensity to use a particular product). From the customer universe of the Okay Customer Set and the Good Customer Set, we sampled approximately 23,000 records. The distribution of a positive event (that is, the customer used a credit card in the second half of 1997 but not before) was 734 records out of the 23,000. The minimum number of positive events required to build a predictive model is approximately 250. (We chose that number on the basis of our experience.)

On the basis of this distribution, we randomly split the entire file into two equal-sized data sets, as shown in Figure 48. One data set was to be used for testing and was left as is. The training portion of the data was further sampled to create a 50/50 distribution of the target variable. This is known as stratified sampling. When the ratio of targets to non-targets is less than 10%, stratified sampling tends to improve the modeling results. The stratified sample data set size is usually driven by the number of positive events, but when the number of records becomes small, as in this case, it is important to consider the sample size of the non-target (negative) events relative to the entire universe. To avoid sample bias, the sample of non-targets should not be too small. If sample bias is a concern, it is possible to distribute the target to non-target events unevenly (for example, 20/80) or to duplicate records with positive events to permit a larger non-target sample. To consider these effects, we recommend creating multiple training data sets with different target and non-target distributions to ensure valid samples and to maximize model performance. In this case study we simply explored the 50/50 case for the sake of brevity, even though the non-target sample is small. (The wavy results in the gains charts in Figure 51 on page 105 could be a symptom of these effects.)
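A sketch of this sampling scheme (hypothetical DataFrame with a TARGET column; the 50/50 split and balancing follow Figure 48):

    import pandas as pd

    def split_and_balance(df: pd.DataFrame, seed: int = 1):
        # Random 50/50 split; the test half is held back and left as is
        train = df.sample(frac=0.5, random_state=seed)
        test = df.drop(train.index)
        # Stratify the training half to a 50/50 target distribution
        pos = train[train["TARGET"] == 1]
        neg = train[train["TARGET"] == 0].sample(n=len(pos), random_state=seed)
        balanced = pd.concat([pos, neg]).sample(frac=1, random_state=seed)
        return balanced, test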
6.3.5 Train and Test
We used several methods to build the predictive model after preparing the data:
• Classification using a decision tree
• Value prediction with RBF regression
• Classification with a back-propagation neural network

Figure 49 outlines the detailed steps in running a predictive modeling algorithm.

Figure 49. Detailed Predictive Modeling Process

6.3.5.1 Algorithm Selection
The first algorithm that is usually used to build a model is the decision tree. There are two reasons for using a decision tree:
1. The tree is very good at finding anomalies in the data. The first half dozen runs typically fine-tune the data preparation, because the decision tree discovers missed details.
2. The tree can also be used as a data reduction tool. It typically reduces the number of variables by an order of magnitude if a few hundred variables are input into the algorithm. The tree algorithm is very scalable, and its performance is not hampered by several hundred variables.

Selecting the variables that appear in the tree model as input to the value prediction and neural classification algorithms improves their accuracy and performance. The tree is used to create a reduced set of variables. The top 10 to 20 variables are selected from the tree according to their position and number of occurrences in the tree (that is, the higher up in the tree a variable occurs and the more times it occurs, the more significant it is).

Value prediction with RBF requires a reduced set of variables because the algorithm does a clustering pass before building the predictive model. The more variables present, the more difficult it is to get good clusters and the worse the results. If you use linearly independent input variables, such as those created by principal component analysis, RBF can handle many more variables. Creating training and test data sets with principal component analysis factors can improve the accuracy of both RBF and the neural network. The neural network also requires a reduced set of input variables. The major concern in using the neural network is the algorithm run time, which is the reason for selecting a reduced variable set.

6.3.5.2 Parameter Selection
After completing the data preparation and selecting the data set to be mined (in this case the training data set first), you have to select the algorithm parameters. Set the basic parameters first; if you are an advanced user, you can set the advanced settings to be different from the defaults.

Decision Tree — For the decision tree the parameters available for selection are (a rough equivalent in scikit-learn follows this list):
• Maximum tree depth
The maximum tree depth sets the maximum number of levels to which the tree can grow. We typically leave this at no limit. When no limit is chosen, the algorithm fits the data and then prunes back the tree, using minimum description length (MDL) pruning. If you want to prevent overfitting or limit the complexity of a tree, set this limit.
• Maximum purity per internal node
The maximum purity per internal node sets a limit for the purity beyond which the tree will no longer split the data. We typically leave this at 100%, which allows the tree to fit the data before pruning. If you are concerned about overfitting, choose a lower value.
• Minimum number of records per internal node
The minimum number of records per internal node sets the minimum number of records required per node. We typically set this parameter to 50. If a node contains at least 50 records, the resulting rule is likely to be statistically significant.
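For readers who want to experiment outside Intelligent Miner, scikit-learn's decision tree exposes roughly comparable controls. This is only an approximation: scikit-learn prunes by cost complexity rather than by MDL, and the ccp_alpha value shown is arbitrary. X_train and y_train stand in for the stratified training set from 6.3.3, continuing the earlier sketches.

    from sklearn.tree import DecisionTreeClassifier

    X_train = train[selected]   # variables chosen in 6.3.4
    y_train = train["TARGET"]   # hypothetical binary objective

    # max_depth=None corresponds to "no limit"; min_samples_split is the
    # minimum number of records per internal node; ccp_alpha > 0 prunes
    # the fitted tree (cost-complexity, not MDL, pruning).
    tree = DecisionTreeClassifier(max_depth=None, min_samples_split=50,
                                  ccp_alpha=0.001)
    tree.fit(X_train, y_train)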
Value Prediction with RBF — For the RBF algorithm the parameters available for selection are:
• In-sample size
In-sample size is the number of consecutive records assigned to the training data set before out-sample records are assigned to the cross-validation data set. The ratio of in-sample size to out-sample size is the same as the ratio of the training to cross-validation data sets. A cross-validation data set is used to test the accuracy of the model between successive passes through the data and model iterations. Cross-validation is used to choose the best model and to minimize the likelihood of overfitting. Although this algorithm has cross-validation, we strongly recommend that the model also be tested against the hold-back test data set. The in-sample to out-sample ratio is driven by the number of positive target events. You would like to have at least 250 positive target events in the training or in-sample data set. If this criterion is met, we usually use an 80/20 split, where the in-sample data set is the larger data set.
• Out-sample size
Out-sample size is the number of consecutive records assigned to the cross-validation data set.
• Maximum number of passes
Maximum number of passes is the maximum number of passes the algorithm makes through the data. This is a stopping criterion for the algorithm (that is, if the algorithm has not achieved its accuracy criterion, it will continue to run until it has made the maximum number of passes through the data). We usually start with 25 passes. If the algorithm uses fewer, the value chosen was good. If the algorithm stops at 25 passes, we recommend doubling the number of passes until the accuracy criterion is achieved before the maximum number of passes, or until it seems that the accuracy criterion will not be achieved no matter how high the number of passes.
• Maximum number of centers
Maximum number of centers is the maximum number of Gaussian regions that will be built by the model. If this value is set to zero, the algorithm chooses the number of centers to maximize the accuracy.
• Minimum region size
Minimum region size is the minimum number of records that the clustering portion of the algorithm will assign to one Gaussian region. Any Gaussian regions with fewer than this number of records are deleted after each pass through the data. We use approximately 50 records so that the Gaussians are assigned to regions that are statistically significant. If there are not sufficient data records to set the minimum region size to 50, choose a minimum region size that yields at least 5 to 10 regions in the output.
• Minimum number of passes
Minimum number of passes is the minimum number of passes the algorithm will take through the data. During these initial passes, the algorithm does not do cross-validation.
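Intelligent Miner's RBF implementation is not publicly documented at the code level, but the general idea described here, a clustering pass that places centers followed by a linear fit over Gaussian region activations, can be sketched with scikit-learn building blocks. The 20 centers and the common region width below are arbitrary assumptions, and this stand-in omits the per-pass region deletion described above.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import Ridge

    # Clustering pass: place the centers (cf. maximum number of centers).
    km = KMeans(n_clusters=20, n_init=10).fit(X_train)
    width = np.mean(km.transform(X_train))   # crude common region width

    def rbf_features(X):
        # Gaussian activation of each record against each center.
        d = km.transform(X)                  # distances to the centers
        return np.exp(-(d / width) ** 2)

    # Linear readout over the Gaussian regions (value prediction).
    rbf_model = Ridge().fit(rbf_features(X_train), y_train)
    rbf_scores = rbf_model.predict(rbf_features(X_train))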
Neural Network — For the neural network algorithm the parameters are:
• In-sample size
The in-sample and out-sample size parameters are used to split the input data set into a training data set and a cross-validation data set, exactly as described above for value prediction with RBF regression. The neural network uses the cross-validation data set to choose a network architecture as well as to find the weights that minimize the model root mean square error. Again it is important to test the model against a hold-back test data set.
• Out-sample size
Out-sample size is the number of consecutive records assigned to the cross-validation data set.
• Maximum number of passes
Maximum number of passes is a stopping criterion for the algorithm. If the accuracy and error criteria are not achieved, the algorithm will stop after taking the maximum number of passes through the data. We use 500 passes as a starting point and test the effect of increasing the number of passes on the accuracy.
• Accuracy
Accuracy is a stopping criterion for the algorithm. It is the percentage of records that the algorithm classified correctly. The accuracy is tested against the out-sample or cross-validation data set.
• Error rate
Error rate is a stopping criterion for the algorithm. It is the percentage of records that the algorithm classified incorrectly. This is different from the accuracy rate, because an unknown class is assigned if the network cannot make a decision. In the predictive modeling case, where you are interested in simply rank ordering the records (which is different from classification), the accuracy and error rate of classification are not necessarily important. The network may have poor accuracy yet still rank order the records correctly. The network outputs a confidence, the actual output of the neural network, which can be used to rank order.
• Network architecture
Using the manual architecture option of Intelligent Miner, it is possible to assign the number of nodes per hidden layer. The neural network can have up to three hidden layers. The number of nodes in each layer can be selected by specifying the number in the hidden layer 1, hidden layer 2, and hidden layer 3 parameters. Selecting the default setting, automatic architecture determination, causes the algorithm to iterate over several architectures and choose the best one based on preliminary cross-validation results. Unless you have some reason to specify an architecture, we recommend using automatic architecture selection. Sometimes the algorithm creates a neural network with no hidden layers. In that case you may want to force some hidden layers to compare results.
• Learning rate
Learning rate can be used to control the rate of gradient descent. The parameter can range from 0 to 1. Too high a value causes the network to diverge, and too low a value causes the neural network to train very slowly. The academic literature recommends a value of 0.1, which seems to work best in most cases and is the default setting. If the algorithm is converging too slowly, you can gradually increase the value of this parameter.
• Momentum
Momentum can be used to control the rate of convergence. It controls the direction of gradient descent: it is the fraction of the previous direction that is maintained in the current descent step. The parameter can range from 0 to 1. Too high a value causes the algorithm to converge very slowly or not at all, as the descent direction is not sufficiently changed. Too low a value also causes very slow convergence, as the descent direction changes too much, "zig-zagging" across the error surface. The academic literature recommends a value of 0.9, which is the default value.
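These back-propagation parameters have close analogs in scikit-learn's MLPClassifier, which can serve as a stand-in for experimentation. The two-hidden-layer architecture shown is an arbitrary manual choice; max_iter plays the role of the maximum number of passes, and early stopping holds out a validation fraction much as the out-sample does.

    from sklearn.neural_network import MLPClassifier

    net = MLPClassifier(hidden_layer_sizes=(20, 10),  # manual architecture
                        solver="sgd",
                        learning_rate_init=0.1,       # learning rate
                        momentum=0.9,                 # momentum
                        max_iter=500,                 # maximum passes
                        early_stopping=True,          # cross-validation split
                        validation_fraction=0.2)      # roughly an 80/20 split
    net.fit(X_train, y_train)

    # The class probability serves as the confidence for rank ordering.
    confidence = net.predict_proba(X_train)[:, 1]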
6.3.5.3 Input Field Selection
In this case we selected the transaction and demographic data that we had created as input to the tree algorithm. Refer to 6.3.1, "Data Definition," for a detailed description. We also included some of the clustering algorithm output in the tree to test its significance in predicting propensity to use a credit card.

Decision Tree — To rank order the records by the customer's propensity to buy a particular product, you must set the objective variable type to discrete numeric or continuous.

Value Prediction with RBF — The RBF algorithm requires that the objective variable be continuous. RBF also allows the use of supplementary variables, which are profiled by model region but not used to build the model. This useful feature of RBF enables you to immediately profile the model scores. We selected the top decision tree variables as input to the RBF algorithm.

Neural Network — The neural network requires that the objective variable be categorical. We selected the top decision tree variables as input to the neural network algorithm.

6.3.5.4 Output Field Selection
To build a gains chart, the minimum output requirements for each algorithm are:
• Customer ID
• Objective variable
• Algorithm prediction

If you want to use the output of one algorithm, including its prediction, as the input to another algorithm, you should output the entire original data set. Having the scores of multiple algorithms in the same file is useful for comparisons. For instance, if the tree places certain records in the top decile and the RBF algorithm assigns them to the middle decile, it may be possible to correct the models by creating new variables or altering the training data set to compensate for this disagreement.

6.3.5.5 Results Visualization
We used three different mining algorithms in this case study. The following gives an idea of how the results may look for each of the algorithms. We also show how the most common problems for each algorithm appear within the results visualization.

Decision Tree — The tree algorithm outputs a summary screen showing the mean and root mean square error. From this screen it is possible to view both the unpruned and pruned trees. As discussed earlier, the tree will find all data anomalies. Symptoms of anomalous trees are (a simple screening sketch follows this list):
• One leaf only
This symptom is caused by using a variable in the input data that is perfectly correlated with the objective variable. This typically occurs with variables such as dates, customer IDs, or other fields that are unique to each customer. These fields produce a one-to-one mapping to the objective.
• Highly unbalanced tree with only one leaf to one side of the root node
This symptom is also caused by a variable that is highly correlated with the objective variable.
• Very shallow tree when many input variables are used
This symptom can also be caused by variables that are highly correlated with the objective.

Reasonable tree visualizations should produce balanced trees with a reasonable number of levels, depending on the number of input variables. The purity of the leaf nodes should range from highly pure, with either value of the target, to leaf nodes with mixed distributions of the target values.
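The one-leaf and shallow-tree symptoms can often be caught before training by screening for identifier-like fields and near-perfect correlations. A minimal sketch, with an arbitrary 0.99 cutoff and assuming numeric inputs:

    # Flag fields that are unique per record (IDs, dates) or almost
    # perfectly correlated with the objective variable.
    suspects = []
    for col in X_train.columns:
        if X_train[col].nunique() == len(X_train):
            suspects.append(col)                       # one-to-one mapping
        elif abs(X_train[col].corr(y_train)) > 0.99:
            suspects.append(col)                       # near-perfect correlation
    print("Review before training:", suspects)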
Value Prediction with RBF — The RBF algorithm outputs a visualization similar to that of the cluster viewer. The main difference between the two visualizations is that RBF presents the records by region, not by cluster. A record is assigned to the region to which it has the highest probability of belonging. The visualization shows the average model score by region and the root mean square error by region as well. Anomalous results are indicated by these symptoms:
• Visualizations with only one or two regions
This symptom is usually indicative of very strong predictor variables that mask the effect of other input. In this situation look at the decision tree results to determine whether there are segments at the top of the tree involving the same variables that are most important to the regions. To correct this situation, remove the strong variables from the chosen input fields and split the data into multiple files based on the segmentation by the strong variables, as indicated by the tree. It is then possible to run RBF against each of the separate files and, after scoring, simply append the results into one file.
• A low ratio of the average score in the top region to the average score in the bottom region
This symptom is caused by either too many input variables or data that is simply not predictive for the problem.

Good results are indicated by:
• A high ratio (greater than 2 to 3) between the top and bottom regions' average prediction scores
• Several (more than 5) regions present

Neural Network — The algorithm outputs a confusion matrix that shows the classification accuracy of the network. The algorithm adds an unknown class to the possible predicted class set. Anomalous output is indicated by too many records being classified as unknown. If too many records are unknown, try increasing the number of passes. The algorithm also outputs a sensitivity matrix that assigns each input variable a percentage. The percentage indicates how sensitive the output is to changes in that variable. Anomalous results may occur if one or a few variables contribute a very large fraction of the sensitivity. These variables may indicate the presence of segments in the data. If this occurs, look at the decision tree results and see whether the high-sensitivity variables occur at the top of the tree. If they do, split the data into segments as indicated by the tree, and train a neural network for each segment. Once each segment is scored, it is possible to append the results together for analysis.

6.3.5.6 Results Analysis/Refinement
The results of a predictive model that rank orders records are typically displayed as a gains chart (see Figure 51). A gains chart contrasts the performance of the model with the results achieved by random chance. Several iterations of the algorithm are executed, varying the parameters. The gains charts for each run should be compared and studied. In training mode the gains curves should be perfectly smooth, and the counts of the positive target event by descending decile should be monotonically decreasing, without wavering.

6.3.5.7 Run Model against Test Data
To ensure that the model has not overfit the data and to assess the model performance against a data set that has the same characteristics as the application universe, the model should be executed against the test data in test mode. Test mode permits using an existing model to score the records. The test mode results should be approximately equal to the training results, except when stratified sampling is used. When stratified sampling is used, the gains chart should be better for the test data set than for the training data set. The performance of the model prediction by descending decile should show a monotonic decrease in the counts of positive target events. Any wavering in the top deciles of the model, which are likely to be mailed, should be studied. The cause of the wavering should be identified and corrected. If the model performs well against the test data set, it should perform similarly against the application universe, provided both populations have the same statistics. (This point is discussed further in 6.3.7, "Perform Population Stability Tests on Application Universe.")
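A gains table of the kind described here is easy to compute from the three required output fields. A minimal sketch, assuming a scored pandas DataFrame with a 0/1 target column and a score column:

    import pandas as pd

    def gains_table(scored, n_buckets=10):
        # Sort by descending score and form equal-sized quantile buckets.
        s = scored.sort_values("score", ascending=False).reset_index(drop=True)
        s["bucket"] = pd.qcut(s.index, n_buckets, labels=False)
        g = s.groupby("bucket")["target"].agg(["sum", "count"])
        g["cum_positives"] = g["sum"].cumsum()
        g["cum_records"] = g["count"].cumsum()
        # Lift: share of positives captured vs. share of records contacted.
        g["lift"] = ((g["cum_positives"] / s["target"].sum())
                     / (g["cum_records"] / len(s)))
        return g

Plotting cum_positives against cum_records, together with the diagonal random line, reproduces the gains charts shown in the figures.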
6.3.6 Select "Best Model"
After using gains charts to analyze the model results, you have to explain why the model is scoring as it is. Perform clustering on the input data, using the score decile (or other quantile) as a supplementary variable, and observe and characterize the clusters that appear. If the model is working properly, the clusters should separate the quantile field. The clusters can be used to explain the difference between records with high scores and records with low scores. Compare the characterizations of the scores from each of the algorithms to determine whether the algorithms are observing the same effect or one algorithm is discovering something the others are not. Use the differences found to iteratively improve the results of each algorithm.

After having iteratively improved the models, you choose the "best model." Typically the best model has the highest performance as measured by the gains chart; that is, it rank orders the input records the best. Sometimes, however, you may choose a model that does not rank order the best, for several reasons:
• The model is easy to explain
Sometimes the best model contains variables that are not easily explained or are not related to the current business problem, and it may be difficult to justify its application. It is just as important to be able to explain why a model works as it is for the model to work well.
• The model agrees with the current business intuition
If the model reflects the current understanding of the factors that affect the business problem, more confidence can be assigned to the result. Furthermore, if new learnings appear alongside factors that confirm the current understanding, more confidence can be assigned to the new learnings. If a model contains unusual factors that cannot be explained, the model should not be implemented.
• The model is simple to implement
A simple model with few variables or little required data processing is preferable to a more complex model. The implementation of complex models could introduce errors, and complex models have a higher tendency to overfit.

6.3.7 Perform Population Stability Tests on Application Universe
After you have selected the "best" model, it is crucial to ensure that the application data set against which the model will be implemented is the same as the test data set against which the model was tested. The similarity can be determined by univariate and multivariate profiles of the data sets. A comparison of statistics from these profiles should show very little difference between the universes. If the statistics are very different, the model will probably not work properly. The statistics could differ for a few reasons:
• Sample bias
The test and training samples created were biased samples. If this is the case, the data should be re-created and the modeling process repeated.
• Incorrect problem setup
The design of the test and training data differs from the design of the data to which the model is intended to be applied. The process used to create the application data set should be identical to the process used to create the test data set, except of course for the difference in time periods.
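One common way to quantify such univariate differences is a population stability index (PSI) per variable. The redbook does not prescribe a specific statistic, so this is just one reasonable choice; a frequent rule of thumb treats values above roughly 0.25 as a sign of instability.

    import numpy as np

    def psi(expected, actual, buckets=10):
        # Bucket boundaries come from the test (expected) distribution.
        edges = np.unique(np.quantile(expected, np.linspace(0, 1, buckets + 1)))
        e = np.histogram(expected, bins=edges)[0] / len(expected)
        a = np.histogram(actual, bins=edges)[0] / len(actual)
        e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
        return float(np.sum((a - e) * np.log(a / e)))

    # Example: score each model variable across the two universes.
    # drift = {c: psi(test_df[c], application_df[c]) for c in selected}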
6.4 Data Mining Results
In this section we explain how to use the Intelligent Miner visualization tools to present the results of the mining algorithms and how to interpret those results.

6.4.1 Decision Tree
Figure 50 shows the visualization results from the decision tree.

Figure 50. Decision Tree Results: Isolating the Key Decision Criteria

This result was achieved after several iterations during which some variables were removed. The variables that appeared in the tree rules included total revenue, total number of transactions in Q1, savings account revenue in Q2, the number of savings account transactions in Q2, Best customer in 1996, and the second-choice cluster ID assigned by Intelligent Miner during the Customer Segmentation case study (see Chapter 4). All of these variables agreed with the current business understanding.

The gains chart for the training data produced a smooth curve, as expected. The training results are typically uninteresting, as in most cases the models achieve good results against training. A more important test is how well the model performs against the test or hold-back data set. A gains chart was created for the test data set (see Figure 51).

Figure 51. Gains Chart for Decision Tree Results

Gains Chart — A gains chart is a graph created from the rank-ordered model scores. For algorithms that create a continuous score, such as RBF and the neural network, the score variable can be quantiled. The gains chart is then created by plotting the number of cumulative positive events by descending quantile versus the cumulative number of records by descending quantile. For an algorithm producing discontinuous scores, such as the decision tree, it is not possible to quantile the scores. The decision tree scores records by assigning the average leaf node score to all records in the leaf node. You can therefore build the gains chart by plotting the number of cumulative positive events by descending leaf node score versus the cumulative number of records by descending leaf node score.

The line labeled random indicates that a random rank ordering of records results in an even number of positive events by quantile. This is the expected result for random ordering. If our model is rank ordering well, there should be more positive target events in the top quantiles, and the slope of the gains curve should be higher in the top quantiles than the random line. This higher slope results in a curve that is lifted above the random line.

The business action taken using the output of a predictive model typically uses 10% to 40% of the possible universe. It is therefore important to note the ratio of the gains curve to random at the implementation cutoff. In Figure 53 we observe a lift of approximately 1.5 at 25% of the universe. This is a modest lift curve. In our experience most gains charts have a lift ratio ranging from 1.5 to 3.5. Too low a lift indicates that the data is not very predictive of what was being modeled. Too high a lift is also suspicious and may indicate sample bias or the use of input data that is too closely related to the target variable.

Another feature to note is the smoothness of the gains curve. If the curve is very smooth, the number of positive events by quantile is distributed monotonically. A monotonic distribution of positive events indicates that the model is rank ordering correctly. A wavy curve indicates that the positive events are not monotonically distributed, which implies that there is a secondary factor in the data that the model did not capture. If the waviness occurs in the top quantiles
or in a range in which you intend to use the model, it should be corrected. If the waviness occurs in the bottom quantiles or out of range, you can ignore it. In this case the tree gains chart had a modest positive lift of approximately 1.5 times random and was a smooth curve.

6.4.2 RBF
Figure 52 presents the RBF visualization results. One immediate advantage of using RBF is apparent: the results of the RBF algorithm present a profile by model region, which can be used to characterize or explain why the model is working.

Figure 52. RBF Results

In Figure 52, observe that the top region, with an average model score of 0.7778, is characterized by customers with higher than average revenue in Q2 and a larger positive revenue difference between Q2 and Q1, indicating growth in activity. The third region from the top is characterized by customers who were in the Best segment in 1996 and who have much higher than average withdrawal amounts from savings accounts. These characterizations are consistent with the current business understanding of customers likely to use a credit card.

The gains chart for RBF is plotted in Figure 53. The RBF model used against the training data set results in a gains curve similar to that of the decision tree, with a modest lift of 1.5 over random. The model is, however, wavy in the top quantiles, which raises some concern and should be resolved before implementation.

6.4.3 Neural Network
The neural network algorithm was run against the top six variables selected from the decision tree. The following sensitivity analysis results were output:

Field Name            Sensitivity
Savings_Revenue_Q2        4.6
Savings_Txns_Q2           4.7
Loan_Revenue_Q1          12.0
Best96_Status             0.2
Total_Revenue            22.2
Total_Txns_Q1            56.0

This result indicates that the output is most affected by changes in the total number of transactions in the first quarter of 1997, which accounts for 56% of the total change observed. Total revenue in the first half of 1997 accounts for 22.2% of the observed change. The large fraction of sensitivity accounted for by two of the variables raises some concern. These two variables are also at the very top of the decision tree. Better results may be achieved with the neural network if the training file is first segmented using the tree rules for these variables and a neural network is then trained on each segment.

Figure 53 shows the gains chart for both training and test for this neural network model. Although the training results outperform both other models, the gains curve is much wavier below the top 20% of the list. In training, the lift of the model is approximately 2 times random. In test mode the inflection point of the lift curve moves to the left; when the model is built against a stratified training data set, this is expected. The lift of the test curve is approximately 3 times random at 20% of the total population. The curve is very wavy at the top, however. This severe waviness indicates that the model has missed a major factor, which should be resolved. The two overpowering variables in the model, as indicated by the sensitivity results above, could be masking other effects that would otherwise be present and could explain the gains curve waviness. Preliminary results indicate that the neural network model will work better in the end for this data set.
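Intelligent Miner reports this sensitivity matrix directly. Where such a report is not available, a comparable percentage can be approximated by perturbing one input at a time on a fitted classifier (such as the MLPClassifier sketched in 6.3.5.2) and measuring the average movement of the output. The one-standard-deviation perturbation below is an arbitrary choice.

    import numpy as np

    def sensitivity(model, X):
        base = model.predict_proba(X)[:, 1]
        effects = {}
        for col in X.columns:
            Xp = X.copy()
            Xp[col] = Xp[col] + Xp[col].std()     # perturb one variable
            moved = model.predict_proba(Xp)[:, 1]
            effects[col] = np.abs(moved - base).mean()
        total = sum(effects.values())
        # Express each variable's effect as a percentage of the total.
        return {c: 100 * v / total for c, v in effects.items()}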
Figure 53. Cross-Selling: Comparison of Three Predictive Models

6.5 Business Implementation
You may recall that the business objective was: given a limited mailing budget, target the most likely prospects for a credit card offer to reduce the cost of customer acquisition and improve the campaign ROI. The optimal number of customers to target can be decided by looking at Table 6 and its graphic representation in Figure 54.

Table 6. Cross-Selling: ROI Analysis Figures

Percentage of  Predicted Response  Number of   Total Cost of     Annual Credit Card  Predicted
Universe (%)   Rate (%)            Responses   Mailing (U.S. $)  Profit (U.S. $)     ROI (%)
 10             55                  416          12,500            41,600             333
 20             58                  435          25,000            43,500             174
 30             65                  488          37,500            48,800             130
 40             76                  570          50,000            57,000             114
 50             83                  623          62,500            62,300              99.7
 60             90                  675          75,000            67,500              90
 70             92                  690          87,500            69,000              79
 80             95                  713         100,000            71,300              71
 90             98                  735         112,500            73,500              65
100            100                  750         125,000            75,000              60

Figure 54. Cross-Selling: ROI Analysis Figures

The ROI analysis table is built from a combination of the comparative gains chart (Figure 53) and the baseline ROI calculation in Table 5. The first two columns in Table 6 are derived by reading the gains chart in Figure 53 from left to right: contacting 20% of the universe will yield 58% of the respondents, contacting 40% of the universe will yield 76% of the respondents, and so on. The number of responses is derived from the neural network test model. The cost and profit figures are taken directly from Table 5. In conclusion, to achieve a positive return and at the same time maximize the migration of customers from the Okay Customer Set to the Good Customer Set, 40% of the potential customer universe should be targeted.
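The arithmetic behind Table 6 is simple enough to verify directly: the 100% row implies a full-universe mailing cost of $125,000, and the profit column implies $100 of annual credit card profit per response (41,600 / 416), both consistent with Table 5. A sketch that reproduces the cost, profit, and ROI columns from the response counts:

    # Response counts by percentage of universe, read from the gains chart.
    responses = {10: 416, 20: 435, 30: 488, 40: 570, 50: 623,
                 60: 675, 70: 690, 80: 713, 90: 735, 100: 750}
    TOTAL_COST, PROFIT_PER_RESPONSE = 125_000, 100

    for pct, resp in responses.items():
        cost = TOTAL_COST * pct / 100
        profit = resp * PROFIT_PER_RESPONSE
        roi = 100 * profit / cost        # for example, 333% at 10%
        print(f"{pct:>3}% of universe: cost ${cost:>9,.0f}, "
              f"profit ${profit:>6,.0f}, ROI {roi:5.1f}%")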
Chapter 7. Attrition Model to Improve Customer Retention

In this chapter we discuss attrition management analysis: how to keep your customers satisfied, how to predict the customers who will leave within six months, and how to turn these expected defectors into loyal customers. In general, it is more profitable to influence nonloyal customers to be loyal to your company than to strive to gather new customers. With many analysts estimating customer attrition rates at almost 50% every five years, the challenge of managing customer attrition is driving companies to gain a more comprehensive understanding of their customers.

Figure 55 illustrates the point well. This chart is taken from the Harvard Business Review and demonstrates the value of good attrition control to the profitability of businesses in several different sectors. The message is clear: by decreasing the rate of attrition, you can increase the profitability of your business.

Figure 55. Reducing Defections 5% Boosts Profits 25% to 85%

Intelligent Miner can develop models that you can use to accurately target customers who might defect. If you take the appropriate business action to stop the potential defection, you can reduce customer attrition.

7.1 Executive Summary
The goal of this case study was to identify which profitable customers were likely to defect. The profitable customers selected for analysis were the Okay Customer Set and the Good Customer Set customers from the Customer Segmentation case study. In addition to being able to predict which customers had a higher likelihood of defection, we wanted to understand the characteristics of the defectors and nondefectors and how we could use that information to increase the company's retention rate.

We used four different methods to solve the business problem. The methodology to implement these techniques, including data preparation and analysis of results, is also presented. For this prediction we used a combination of customer transaction data and demographic data. We contrasted the results from the three standard prediction techniques with a time-series technique. The neural networks, both the standard and time-series versions, were best able to predict which customers were likely to defect. The standard neural network could identify 95% of the defectors in only 20% of the customer population. The time-series neural network could identify 92% of the defectors in 20% of the customer population. In addition, the time-series neural network could narrow the window for predicting the time of defection to one month, instead of the six months of the standard techniques.

We profiled and characterized the output of all the techniques to distinguish between a typical defector and a typical nondefector. The characterizations from all algorithms agreed very well. The defining characteristics of defectors were:
• Mostly from the Okay Customer Set
• No Best customers
• Lower product usage than average
• Shorter tenure
• In general lower usage of all products, especially telebanking, credit card, mortgages, and loans

The defining characteristics of nondefectors were:
• Mostly from the Good Customer Set, although not as skewed in that direction as the defectors
• Higher ratio of Best customers
• Higher product usage than average
• Longer tenure
• In general a higher usage of all products, especially telebanking, credit card, personal banking, mortgages, and loans

The Bank's personal banking service is a bundle of savings, checking, and credit card accounts with lower fees for various services. This bundle was associated only with nondefector customers. Customers with a multifaceted relationship with the Bank are less likely to defect. Selling the personal banking bundle is a good start to building a strong relationship with customers.

7.2 Business Requirement
The goal of this case study was to identify those customers in profitable segments who have a high probability of defection. Once customers likely to defect have been identified, it is possible to take business action to reduce the likelihood of defection by offering the customers incentives to remain loyal. The reason for analyzing only profitable customers is that they provide sufficient margin to permit discounting and rewards and still be profitable. Customers who are likely to defect and are not profitable should be let go. We built an attrition model for the Okay Customer Set and the Good Customer Set, both of which are profitable customer segments.

We defined a defector as a customer who had no activity for at least six months. Our analysis was completed in January 1998, so the most recent defectors were customers who had no activity from July 1997 through December 1997. In addition to identifying those customers most likely to defect, we analyzed how customers could be prevented from defecting. We also profiled the customers likely to defect.
7.3 Data Mining Process
We took two broad approaches to the problem:
• Model 1: A combination of three tried-and-tested Intelligent Miner algorithms
• Model 2: A new Intelligent Miner algorithm called time-series prediction

We combined decision tree, RBF, and neural classification much as we did for the Cross-Selling case study (see Chapter 6, "Target Marketing Model to Support a Cross-Selling Campaign"). On the basis of detailed customer transactions from January through December 1997, we identified those customers with a high probability of defecting sometime in the last six months of 1997. The more sophisticated time-series analysis algorithm used the same input data but provided us with not just a single probability of defection but six different data points corresponding to the probability of defection in each of the last six months of 1997. Figure 56 illustrates the general approach.

Figure 56. Data Mining Process: Attrition Analysis

• The modeling approach shown on the left-hand side of Figure 56 is identical to the approach we used in the Cross-Selling case study (see Chapter 6, "Target Marketing Model to Support a Cross-Selling Campaign").
• The modeling approach shown on the right-hand side of Figure 56 uses time-series prediction.

The only differences are the definition and creation of the objective variable. In fact, because the same time periods were involved for both case studies, we used the exact same initial data set to start the data mining process. Therefore we discuss method 1, the combination of decision tree, RBF, and neural classification, only for the definition of the objective variable. For details of the process refer to Chapter 6, "Target Marketing Model to Support a Cross-Selling Campaign." The data mining process discussion focuses on the second method. We present and discuss the results for both methods.

7.3.1 Data Definition
Refer to Figure 57 for a layout of the data required for each method. Note that for the standard prediction methods there is one record per customer, with one variable being the target. For the time-series approach each customer has one record per month and one value of the objective variable per month; that is, the target variable forms a profile over time, and the time-series approach tries to predict that profile.

7.3.1.1 Method 1
As a first step we created the objective variable to be modeled. We assigned the objective variable a value of 0 or 1. Customers who had activity in the first half of 1997 and no activity in the second half were assigned a value of 1, and all other customers were assigned a value of 0.

7.3.1.2 Method 2
For this method we defined the objective variable in the same way as in method 1 but implemented it differently. Customers who had activity in the first half of 1997 and no activity in the second half were assigned an objective variable value of 1 for each of the six months of the second half of 1997 and a value of 0 for each of the six months of the first half of 1997. All other customers were assigned a 0 for the objective variable for all 12 months. Because of the definition of customer defection in both methods, we actually built a model to identify which customers defected in July 1997.
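Both label definitions are easy to express over a monthly activity table. A minimal sketch, assuming a hypothetical pandas DataFrame named activity with columns cust_id, month (1 through 12 for 1997), and txn_count:

    # Method 1: one 0/1 label per customer.
    h1 = activity[activity.month <= 6].groupby("cust_id")["txn_count"].sum()
    h2 = activity[activity.month > 6].groupby("cust_id")["txn_count"].sum()
    h2 = h2.reindex(h1.index, fill_value=0)
    defector = ((h1 > 0) & (h2 == 0)).astype(int)  # active in H1, silent in H2

    # Method 2: one label per customer per month (1 only in months 7-12
    # for defectors, 0 everywhere else).
    monthly = activity[["cust_id", "month"]].copy()
    monthly["target"] = (monthly.cust_id.map(defector).fillna(0)
                         * (monthly.month > 6)).astype(int)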
Figure 57. Attrition Analysis: Data Definition

7.3.2 Data Preparation
Given the two models we used, shown in Figure 56, we performed the data preparation for each model.

7.3.2.1 Method 1
Refer to Chapter 6, "Target Marketing Model to Support a Cross-Selling Campaign."

7.3.2.2 Method 2
One advantage of using the time-series prediction method is that it does not require pivoting the transaction data, that is, taking the transaction data out of its natural time sequence and creating time variables for each customer record. The time-series algorithm uses data in its natural time sequence, whereas the standard prediction methods use a different variable on the same record for each time period. This fact limits the number of variables that can be analyzed with the standard prediction techniques. To capture time effects, a separate variable must be created for each time period being considered, for the differences between time periods, and for other factors. In method 1 the selection of quarterly time periods was driven by the desire to keep the number of variables small. For 14 product categories, with 3 variables (recency, number of transactions, and revenue) and 2 quarters, there are 84 base variables with 42 differences, 42 totals, and 84 ratios. If we had used months instead of quarters, we would have had 3 times the number of variables. With the time-series method we had only 42 variables (that is, 14 categories with 3 variables per category) and one record per month, or 12 times the number of records. The time-series neural network does not require the creation of time-derivative terms, as it captures those effects using time-delay layers. The algorithms tend to scale logarithmically with the number of records and exponentially with the number of variables. It is therefore much more efficient and elegant to use the time-series method.

The algorithm requires that each customer have a record for each time period. Unfortunately, if a customer has no transaction activity in a period, that customer has no transaction record. Therefore a dummy table must be created containing all customer ID and month-number pairs. The transaction data can then be joined through an outer join, which assigns null values to all missing customer ID and month-number pairs. The null values can then be updated to zeroes.
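The dummy-table construction can be done in SQL or, as sketched below, in pandas, reusing the hypothetical activity table from 7.3.1. Only the join logic is shown; the real table would carry all 42 variables.

    import pandas as pd

    # Dummy table: every customer ID paired with every month number.
    grid = pd.MultiIndex.from_product(
        [activity["cust_id"].unique(), range(1, 13)],
        names=["cust_id", "month"]).to_frame(index=False)

    # Outer join, then replace the resulting nulls with zero activity.
    full = grid.merge(activity, on=["cust_id", "month"], how="left").fillna(0)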
7.3.3 Data Mining
Figure 49 outlines the detailed steps in running a predictive modeling algorithm; we refer to it in our discussion of the time-series prediction method.

7.3.3.1 Parameter Selection
For the two methods described here, we used the following parameter settings:

Method 1 — Refer to Chapter 6, "Target Marketing Model to Support a Cross-Selling Campaign."

Method 2 — The time-series algorithm has the following parameters (see Figure 58; a sketch of how the window and horizon shape the training data follows 7.3.3.4):
• In-sample size
Refer to 6.3.5.2, "Parameter Selection," under Neural Network.
• Out-sample size
Refer to 6.3.5.2, "Parameter Selection," under Neural Network.
• Maximum number of passes
Refer to 6.3.5.2, "Parameter Selection," under Neural Network.
• Forecast horizon
The forecast horizon is the number of periods into the future, from the record in consideration, for which the algorithm is making a prediction. In this case we used a forecast horizon of 1; that is, we were trying to predict one month in advance.
• Window size
The window size is the number of historical records used to make a future prediction. In our case we used six months of historical data to predict one month in advance.
• Average error
Average error is an algorithm stopping criterion. It is the average root mean square (RMS) error limit. If the average RMS error is greater than this limit, the algorithm continues training until the criterion is met or the maximum number of passes is exceeded.
• Neural architecture
Refer to 6.3.5.2, "Parameter Selection," under Neural Network.
• Learning rate
Refer to 6.3.5.2, "Parameter Selection," under Neural Network.
• Momentum
Refer to 6.3.5.2, "Parameter Selection," under Neural Network.

Figure 58. Time Series: Setting the Parameters

7.3.3.2 Input Field Selection
We selected the input fields for each model as described below.

Method 1 — Refer to Chapter 6, "Target Marketing Model to Support a Cross-Selling Campaign."

Method 2 — The prediction variable should be continuous for the time-series algorithm. We selected the variables indicated by the decision tree to be important predictors of defection. The algorithm also profiles the output by quantile break points, which can be user defined. We input as supplementary all data not used in the prediction.

7.3.3.3 Output Field Selection
We selected the following output fields for the two models used in this case study:

Method 1 — Refer to Chapter 6, "Target Marketing Model to Support a Cross-Selling Campaign."

Method 2 — The output of the time-series neural network can be viewed as a series of gains charts, one for each time period prediction. Therefore, at a minimum we required:
• Customer ID
• Objective variable
• Algorithm prediction

7.3.3.4 Results Visualization
As we used two different models in this case study, the following gives you an idea of what the results would look like for each model:

Method 1 — Refer to Chapter 6, "Target Marketing Model to Support a Cross-Selling Campaign."

Method 2 — The algorithm generates a profile of the quantile breaks, using the clustering visualizer. This output also shows the average score by quantile. A reasonable result should have the following characteristics:
• The ratio of the average score for the top quantile to that of the bottom quantile should be at least 2 or 3 to 1. If this criterion is not satisfied, a mistake was made in setting up the data for the model, or the data selected is not predictive of the target.
• The characteristics of the top and bottom quantiles should be different. If the top and bottom quantiles do not have differing characteristics, the model will have poor lift, or the data was not defined correctly.
• The average score should be monotonically decreasing with decreasing quantile bucket. If the average score is not monotonic with decreasing bucket, there is sample bias or an effect that the model is not capturing.
• The order of importance of the variables within each quantile should be different. If the order of importance of the first few variables in each quantile is the same, those variables are likely very powerful indicators, and a separate model should be built for each segment. Compare the results of the prediction with a decision tree: if the powerful variables appear at the top of the tree, this is the case. If not, the variables may be systematically related to the target variable.
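As promised under 7.3.3.1, the following sketch shows how a window size of 6 and a forecast horizon of 1 turn one customer's monthly series into training examples. It illustrates only the data layout, not Intelligent Miner's internal time-delay layers.

    import numpy as np

    WINDOW, HORIZON = 6, 1

    def make_windows(series):
        # series: one customer's monthly values in time order.
        X, y = [], []
        for t in range(len(series) - WINDOW - HORIZON + 1):
            X.append(series[t:t + WINDOW])               # 6 months of history
            y.append(series[t + WINDOW + HORIZON - 1])   # value 1 month ahead
        return np.array(X), np.array(y)

    # Twelve monthly observations yield six (history, next-month) pairs.
    X, y = make_windows(np.arange(12))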
7.3.4 Gains Chart
Refer to Chapter 6, "Target Marketing Model to Support a Cross-Selling Campaign," for a discussion of the role of gains charts in predictive modeling.

7.3.5 Clustering
In addition to building predictive models to identify those customers likely to defect, we performed some clustering on the resulting model output to explain the characteristics of defectors and nondefectors. We extracted the top and bottom decile customer records from the neural prediction model and appended them together into one data set. We used the demographic clustering algorithm to cluster all input data, not just the variables used to create the neural network model. The quantile number was input into the clustering algorithm as a supplementary variable to determine whether the algorithm could distinguish the customer records using other data. We then used the insight into the differences between defectors and nondefectors to design a campaign to reduce the defection rate in the targeted customer segments. In addition to the clustering of the neural network scores, the RBF output visualization automatically provides profile information about the customer records in each region. For the run of the RBF algorithm, we made sure to include all available input data as supplementary data if it was not already used as input to the model.

7.4 Data Mining Results
In this section we describe the results achieved with each of the data mining algorithms used in this case study. Remember that the decision tree, RBF modeling, neural network, and clustering represent the first model, whereas time-series prediction represents the second model used for the attrition management analysis.

7.4.1 Decision Tree
The following primary variables appeared in the decision tree:
• Lifetime-to-date revenue
• Mortgage balance
• Total loan balance
• Customer tenure in months
• Number of products used in 1997

Figure 59 shows a node for customers who will leave with 85% probability. The characteristics of these customers are:
• Total revenue is less than 1091.
• They have lower than average mortgage balances.
• They have no loans.

Figure 59. Attrition Analysis: Decision Tree Structure

All of these variables were meaningful for the business problem. Figure 60 shows the gains chart for both training and test for the decision tree. The training results are very smooth, indicating monotonicity in the distribution of positive target events by descending leaf node score. The training result has a lift ratio of 1.75 times random at 20% of the customer population. The test gains curve has a higher lift ratio of 3.3 at 20%. This improvement is due to the use of stratified sampling in training mode. The problem with the test mode is the severe waviness of the gains curve in its top portion. This is due to either a biased training sample (because of a small sample of negative target events) or some effect that the model was missing. The former explanation is the more probable.

Figure 60. Decision Tree Gains Chart: Training and Testing

7.4.2 RBF Modeling
The RBF results visualization in Figure 61 is positive, as we have multiple regions. The top and bottom regions have a ratio of defection of over 10 to 1, and the characteristics of each region are different and interpretable.
Figure 61. RBF: Results Window

Those customers in the top region of Figure 62 have low activity across all products and tend to live in Ontario.

Figure 62. Attrition Analysis: Predicting Values Result

The second region from the top, which also has a high likelihood of defection, also contains customers skewed to Ontario, on whom the Bank did not have any demographics. The reason for the missing demographics would be that these customers had not applied for any credit and hence had not completed a credit check. This region also had low activity across all products. The first two regions also contained more male customers than average. The regions that had a low probability of defection were mostly from the Good Customer Set and a small segment of the Okay Customer Set. These customers tended to be the Bank's Best customers and to come from Western Canada. They had high product activity and tended to have a longer tenure as well. All these characterizations reflect the current business understanding (see Figure 63).

Figure 63. Attrition Analysis: Predicting Values

Figure 64 shows the gains curves for the RBF models against both the training and test data sets. The RBF training result is similar to the decision tree gains chart in the top quantiles, below 25% of the customer universe. Above 25% of the cumulative customer population, the tree training model performs significantly better. The test result for the RBF model is worse than that of the decision tree because it has a lower lift ratio and is more wavy in the top quantiles. The model is still very promising, showing a lift over random of approximately 2.8 at 20% of the customer population. Some further analysis to account for the source of the waviness may improve the result.

Figure 64. Attrition Analysis: Comparative Gains Charts for All Methods

7.4.3 Neural Network
We built the neural network model using the most significant variables as identified by the decision tree. The following sensitivity analysis resulted from the neural network:

Field Name        Sensitivity
REVENUE                7.4
TENURE                14.3
AVG_LOAN_BAL           1.7
AVG_MTG_BAL            2.2
PRODUCTS_USED         74.3

The PRODUCTS_USED variable seems to be highly related to customer defection. It is concerning that one variable accounts for such a high fraction of the observed output sensitivity. However, exploring this variable in the other algorithms reveals that neither the decision tree nor RBF identifies it as the most sensitive variable. The business interpretation of the result is that the more products a customer uses, the less likely the customer is to defect. This reflects the current business understanding. Referring to Figure 64, you can see that the neural network achieves the best results in training, with a lift over random of 2 to 1 up to 40% of the customer population. The training curve is also smooth. The test results have an exceptional lift over random of approximately 4.75 to 1 at 20% of the customer population. The gains curve, however, is wavy at the top quantiles. The result should be accepted with caution: the high sensitivity of the one variable, the "too good" gains curve, and the waviness should all be explained before the result is accepted as valid.
The initial results are very promising and indicate that the neural network will produce the best model for this business problem.

7.4.4 Clustering
Finally, we clustered the neural network results to:
1. Validate the models that were developed. A clustering algorithm should be able to distinguish between the top decile and bottom decile of the predictive model's scored output if the model is valid. (A sketch of this validation step follows Figure 65.)
2. Identify the distinguishing characteristics of defectors and nondefectors. The distinguishing features could then be used to create retention campaigns.

Figure 65 shows the results of the clustering on the neural network model output. The data set used for the clustering contained a 50/50 split of top decile and bottom decile customer records. As expected, we got two large clusters, of 43% and 38%, respectively. If the model had worked perfectly, we would see just two clusters; in this case we also got five smaller clusters. The largest cluster contains the defectors, as indicated by the quantile variable (the left two bars are the quantile buckets for the bottom two 5-percentile buckets, and the right two are the top two 5-percentile buckets). Notice the slight sliver of low-quantile customers in this cluster. These are mistakes made by either the neural network or the clustering algorithm; further analysis of these records may improve the prediction result. Because the clustering found a larger segment of defectors, there are probably fewer characteristics associated with defectors than with nondefectors.

The characteristics of defectors are:
• Mostly from the Okay Customer Set segment (recall that we used only customers from the Okay Customer Set and the Good Customer Set)
• No Best customers
• Lower product usage than average
• Shorter tenure
• In general lower usage of all products, especially telebanking, credit card, mortgages, and loans

In contrast, the characteristics of nondefectors are:
• Mostly from the Good Customer Set, although not as skewed in that direction as the defectors
• Higher ratio of Best customers
• Higher product usage than average
• Longer tenure
• In general a higher usage of all products, especially telebanking, credit card, personal banking, mortgages, and loans

Figure 65. Attrition Analysis: Demographic Clustering of Likely Defectors
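As noted in step 1 above, this validation can be approximated outside Intelligent Miner, whose demographic clustering additionally handles categorical data directly. A rough k-means sketch on numeric inputs, where scored is a hypothetical DataFrame of input variables plus a decile column derived from the model score:

    import pandas as pd
    from sklearn.cluster import KMeans

    # Keep only the extremes of the score distribution.
    extremes = scored[scored["decile"].isin([1, 10])].copy()

    # Cluster on the input variables; the decile stays supplementary.
    features = extremes.drop(columns=["decile"])
    extremes["cluster"] = KMeans(n_clusters=7, n_init=10).fit_predict(features)

    # A valid model should yield clusters dominated by one decile each.
    print(pd.crosstab(extremes["cluster"], extremes["decile"],
                      normalize="index"))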
7.4.5 Time-Series Prediction
The time-series neural network outputs a profile of the model scores by quantile (see Figure 66). The output is very good, with a ratio of almost 20 to 1 between the top and bottom quantiles. The scores are monotonically distributed by decreasing quantile, although the distribution could be a little smoother, and variables have differing importance in each quantile group. Customers with a high likelihood to defect, the top quantile in Figure 66, have lower than average activity in the different products. Customers with a low likelihood to defect have high levels of activity in all products. These characterizations of defectors and nondefectors agree with the output of all the other algorithms, including those using standard prediction methods.

Figure 66. Profile of Time-Series Prediction

Figure 64 shows both the test and training gains curves for the time-series neural network. The training case is almost identical to the standard back-propagation neural network. The test case is slightly worse than the standard neural network, but the curve is smooth with no wave. The lift over random for the test case is very high, at approximately 4.6 to 1 at 20%. Again you should be skeptical of such good performance and do some analysis to ensure that there is no sample bias and that the input data is not too highly related to the target field.

The time-series neural network can also be used to generate prediction profiles over time by plotting the prediction versus the time period. Figure 67 and Figure 68 show the profiles of a few randomly selected customers from the test data set. These plotted profiles are representative of the many that we analyzed in the case study.

Figure 67. Time Profile of Defection Probability for Defectors

Figure 68. Time Profile of Defection Probability for Nondefectors

The profile of a defector by definition has a step-like shape, with a step between months 6 and 7, as visible in Figure 67. The predicted profiles have a similar step-like shape, except that the height of the step is smaller and the location of the step is not always between months 6 and 7. One profile also has a blip in the first few months of the year. The nondefector profiles in Figure 68 should be a flat line at 0. The shape of the profiles is not flat: they are wavy, but with no distinct step-like shape as the defectors had. The clear difference between a defector and a nondefector profile makes this algorithm very useful. Not only can you distinguish between defectors and nondefectors, as you can with the standard predictive methods, but you can approximate when a customer is likely to defect. This is of critical importance in customer defection problems, because a business activity intended to reduce customer defection should occur very close to the time that the customer is likely to defect. With the standard techniques in this case, our prediction of defection is windowed to within six months. Using the time-series approach, we can narrow this approximation to a particular month. This is a substantial improvement. The added difficulty of predicting the time of defection tends to make the gains curves of the time-series prediction slightly worse than those of the standard neural technique, but in this case the model can still distinguish defectors and nondefectors very easily.

7.5 Business Implementation
Once customers with a high likelihood of defection are identified, it is possible to execute a direct mail campaign to target them. If we are to believe the neural network model results, we could target 95% of the customers likely to defect for 20% of the cost of mailing to all customers (in the Okay Customer Set and the Good Customer Set). This is a substantial cost savings. Using the time-series prediction, we can also time the customer communication to be as close as possible to the time of likely customer defection, to make the contact as relevant as possible. Once we have identified a list of customers we intend to target, we can profile the profitability of those customers to determine the margin available to be spent on increasing customer retention. We can then use this budget to try to change the defectors into nondefectors.
7.5 Business Implementation

Once customers with a high likelihood of defection are identified, it is possible to execute a direct mail campaign to target them. If we are to believe the neural network model results, we could reach 95% of the customers likely to defect for 20% of the cost of mailing to all customers (in the Okay Customer Set and the Good Customer Set). This is a substantial cost savings. Using the time-series prediction, we can also time the customer communication to be as close as possible to the time of likely defection, to make the contact as relevant as possible.

Once we have identified a list of customers we intend to target, we can profile the profitability of those customers to determine the margin available for increasing customer retention. We can then apply this budget to try to change the defectors into nondefectors. The characteristics of nondefectors were primarily high product usage, indicating a strong relationship with the Bank. A key product in building a multiproduct relationship is the personal banking bundle, which was present only in nondefector customers. This bundle includes a savings account, a checking account, a credit card, and other services, all for a low fee. A campaign to cross-sell this product to profitable customers likely to defect may help consolidate the Bank's relationship.

Chapter 8. Intelligent Miner Advantages

In the first case study we created a segmentation model to be used as a basis for CRM. Using several techniques, we were able to create segments of customers with differing levels of shareholder value. The differences in shareholder value of the customer segments allowed us to identify the most profitable customers, high potential customers, and low potential customers. The best customer segment represented 35% of customer revenue from 9% of customers. Several high potential segments were also identified. If, by marketing additional products and services to these customers, we were able to change the purchase behavior of 10% of the high potential customers to be similar to our best customers, we could increase total revenue by 18%.

Selecting one of the high potential customer segments, we used product association techniques to find cross-selling opportunities in the second case study. By contrasting the behavior of the high potential segment against the behavior of a higher potential group, we were able to identify missing products that could be marketed in a cross-selling campaign. The product to cross-sell was identified to be a credit card. If through marketing we were able to activate 10% of the high potential cluster to use the credit card, we could increase the cluster revenue by 25% and the overall revenue by 3%.

Having identified a customer segment and a product for cross-selling, we could execute a promotion. Rather than marketing to every customer in the segment of interest, it would be more cost effective to build a predictive model to target just those customers likely to activate with the bank's credit card. In the third case study, we built several predictive models, using different techniques. The best of these models was able to predict 65% of likely activations with only 30% of the mailing cost. If we targeted 30% of the customer segment in question, our expected ROI would be 160%.

In the fourth case study, we built predictive models to identify which customers were likely to defect. From the customer segmentation model in the first case study we were able to identify the current best customers and the future high potential value customers. An important marketing strategy for these customers is retention. By marketing to these customers, we would be able to reduce the defection rate in our best customer segments, ensuring the corporation's future earning potential and maintaining current revenue levels.

Using IBM's Intelligent Miner for Data in these four case studies, we were able to illustrate how data mining can support a CRM program within an organization. We also showed the power of Intelligent Miner and its ability to work on a wide variety of business problems. In selecting a tool for data mining, an organization should consider the range of problems to be solved and the potential for feeding the results of one business problem into the input of the next.
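The targeting economics quoted in these summaries reduce to simple arithmetic. The sketch below is a minimal illustration, assuming a flat cost per mail piece; the function, its parameters, and the customer count are hypothetical, while the capture rate and depth come from the attrition case study above.

    # Sketch: cost of a targeted mailing versus mailing to everyone.
    def campaign_savings(n_customers, capture_rate, depth, cost_per_mail=1.0):
        """Compare mailing all customers with mailing only the model's top `depth`.
        capture_rate is the fraction of all likely defectors found at that depth
        (the neural model reached ~95% of defectors in its top 20%)."""
        full_cost = n_customers * cost_per_mail
        targeted_cost = n_customers * depth * cost_per_mail
        return {
            "cost_ratio": targeted_cost / full_cost,   # 0.20 -> 20% of the cost
            "defectors_reached": capture_rate,         # ~0.95 of defectors
            "savings": full_cost - targeted_cost,
        }

    print(campaign_savings(100_000, capture_rate=0.95, depth=0.20))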
Intelligent Miner was able to execute a sequence of business activities fundamental to CRM:
• Create strategic marketing initiatives
• Identify marketing opportunities to support strategic initiatives
• Effectively target customers for a particular promotion

Based on the case studies presented, the total impact on the bottom line was approximately a 25% increase in profitability.

IBM's Intelligent Miner is the leading data mining product in today's marketplace, offering these competitive advantages:

• Algorithms based on open academic research
The algorithms within Intelligent Miner are based on open academic research. They were developed in IBM laboratories around the world by world-leading researchers in artificial intelligence and machine learning. This body of research dates back more than 20 years. For end users of this technology, this large body of research means higher quality algorithms that produce better results than other tools in the marketplace.

• Research grown out of IBM core competence
IBM has been cultivating artificial intelligence and machine learning for decades, through billions of dollars of investment in R&D. In the corporate world, IBM research labs are second to none. Organizations building competitive data mining products do not have nearly as strong a competency in the disciplines required for data mining. In addition to a core competency in data mining, IBM has a core competency in software development. The technical challenge of implementing data mining technology that can work against millions of customer records is immense. No other organization has developed such complex algorithms that are as scalable as Intelligent Miner's. Most of the competition produces data mining products for PCs and uniprocessor server platforms. To customers, this advantage means higher quality algorithms implemented so efficiently that the time required to create decision support information is significantly reduced.

• Mature, field-proven algorithms
Some of the algorithms in Intelligent Miner have existed in other IBM products for 10 years, and the newest algorithms are more than three years old, while competitors are only now creating and releasing products. In addition to being first in the marketplace, IBM consultants have used the algorithms in more than 100 engagements around the world. In fact, Intelligent Miner was created because the data mining consultants recognized the need for an integrated data mining product with several different analytical methods. End users benefit from this maturity: the algorithm bugs have been shaken out through years of practical application, so investments in million-dollar marketing campaigns are more secure with Intelligent Miner.

• Wider variety of algorithms and visualizations in one tool
As shown in the case studies presented in this book, the wide variety of algorithms in Intelligent Miner allows for a wider range of analysis than other data mining packages. The powerful combination of visualization tools and data mining algorithms, some of which are unique to Intelligent Miner, permits better business results than other products permit.

• Unique algorithms
Intelligent Miner has two algorithms that are unique and were invented by IBM researchers. The demographic algorithm is the only clustering algorithm that can cluster categorical data.
The product associations algorithms were also invented in IBM research labs. These unique capabilities enable end users to perform analyses that are not possible with other tools.

• "Infinitely scalable"
Intelligent Miner runs on the SP2 MPP platform, which can scale to handle terabyte-sized data warehouses. No competitive products are as scalable. Intelligent Miner can also connect to operational databases for scoring and validation. This scalability enables end users with millions of customers to efficiently integrate data mining technology into their businesses today.

• Open technology
Intelligent Miner runs on other vendor platforms, including HP, Sun, and Windows NT. It can also interface to other databases, using IBM's DataJoiner product. Thus end user customers can use IBM data mining technology on their platforms today.

Appendix A. Special Notices

This publication is intended to help all customers to better understand the different data mining algorithms used by the Intelligent Miner for Data. The information in this publication is not intended as the specification of any programming interfaces that are provided by Intelligent Miner for Data. See the PUBLICATIONS section of the IBM Programming Announcement for Intelligent Miner for Data for more information about what publications are considered to be product documentation.

References in this publication to IBM products, programs, or services do not imply that IBM intends to make these available in all countries in which IBM operates. Any reference to an IBM product, program, or service is not intended to state or imply that only IBM's product, program, or service may be used. Any functionally equivalent program that does not infringe any of IBM's intellectual property rights may be used instead of the IBM product, program, or service.

Information in this book was developed in conjunction with use of the equipment specified, and is limited in application to those specific hardware and software products and levels.

IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to the IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785.

Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact IBM Corporation, Dept. 600A, Mail Drop 1329, Somers, NY 10589 USA. Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.

The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.
Any pointers in this publication to external Web sites are provided for convenience only and do not in any manner serve as an endorsement of these Web sites.

The following terms are trademarks of the International Business Machines Corporation in the United States and/or other countries:
AIX, AIX/6000, DATABASE 2, DB2, DB2 Universal Database, IBM, Intelligent Miner, QMF, RISC System/6000, RS/6000, TextMiner, Visual Warehouse

The following terms are trademarks of other companies:

C-bus is a trademark of Corollary, Inc.

Java and HotJava are trademarks of Sun Microsystems, Incorporated.

Microsoft, Windows, Windows NT, and the Windows 95 logo are trademarks or registered trademarks of Microsoft Corporation.

PC Direct is a trademark of Ziff Communications Company and is used by IBM Corporation under license.

Pentium, MMX, ProShare, LANDesk, and ActionMedia are trademarks or registered trademarks of Intel Corporation in the U.S. and other countries.

UNIX is a registered trademark in the United States and other countries licensed exclusively through X/Open Company Limited.

Other company, product, and service names may be trademarks or service marks of others.

Appendix B. Related Publications

The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this redbook.

B.1 International Technical Support Organization Publications

For information on ordering these ITSO publications see "How to Get ITSO Redbooks" below.
• Discovering Data Mining, SG24-4839
• Mining Relational and Nonrelational Data with IBM Intelligent Miner for Data, SG24-5278

B.2 Redbooks on CD-ROMs

Redbooks are also available on CD-ROMs. Order a subscription and receive updates 2-4 times a year at significant savings.

• System/390 Redbooks Collection: subscription SBOF-7201, collection kit SK2T-2177
• Networking and Systems Management Redbooks Collection: subscription SBOF-7370, collection kit SK2T-6022
• Transaction Processing and Data Management Redbooks Collection: subscription SBOF-7240, collection kit SK2T-8038
• AS/400 Redbooks Collection: subscription SBOF-7270, collection kit SK2T-2849
• RS/6000 Redbooks Collection (HTML, BkMgr): subscription SBOF-7230, collection kit SK2T-8040
• RS/6000 Redbooks Collection (PostScript): subscription SBOF-7205, collection kit SK2T-8041
• Application Development Redbooks Collection: subscription SBOF-7290, collection kit SK2T-8037
• Personal Systems Redbooks Collection: subscription SBOF-7250, collection kit SK2T-8042

B.3 Other Publications

These publications are also relevant as further information sources:
• Using the Intelligent Miner for Data, SH12-6325

How to Get ITSO Redbooks

This section explains how both customers and IBM employees can find out about ITSO redbooks, CD-ROMs, workshops, and residencies. This information was current at the time of publication, but is continually subject to change. The latest information may be found at http://www.redbooks.ibm.com.
How IBM Employees Can Get ITSO Redbooks

Employees may request ITSO deliverables (redbooks, BookManager BOOKs, and CD-ROMs) and information about redbooks, workshops, and residencies in the following ways:

• PUBORDER — to order hardcopies in the United States

• GOPHER link to the Internet — type GOPHER.WTSCPOK.ITSO.IBM.COM

• Tools disks
To get LIST3820s of redbooks, type one of the following commands:
TOOLS SENDTO EHONE4 TOOLS2 REDPRINT GET SG24xxxx PACKAGE
TOOLS SENDTO CANVM2 TOOLS REDPRINT GET SG24xxxx PACKAGE (Canadian users only)
To get BookManager BOOKs of redbooks, type the following command:
TOOLCAT REDBOOKS
To get lists of redbooks, type one of the following commands:
TOOLS SENDTO USDIST MKTTOOLS MKTTOOLS GET ITSOCAT TXT
TOOLS SENDTO USDIST MKTTOOLS MKTTOOLS GET LISTSERV PACKAGE
To register for information on workshops, residencies, and redbooks, type the following command:
TOOLS SENDTO WTSCPOK TOOLS ZDISK GET ITSOREGI 1998
For a list of product area specialists in the ITSO, type the following command:
TOOLS SENDTO WTSCPOK TOOLS ZDISK GET ORGCARD PACKAGE

• Redbooks Web Site on the World Wide Web
http://w3.itso.ibm.com/redbooks

• IBM Direct Publications Catalog on the World Wide Web
http://www.elink.ibmlink.ibm.com/pbl/pbl
IBM employees may obtain LIST3820s of redbooks from this page.

• REDBOOKS category on INEWS

• Online — send orders to: USIB6FPL at IBMMAIL or DKIBMBSH at IBMMAIL

• Internet Listserver
With an Internet e-mail address, anyone can subscribe to an IBM Announcement Listserver. To initiate the service, send an e-mail note to [email protected] with the keyword subscribe in the body of the note (leave the subject line blank). A category form and detailed instructions will be sent to you.

Redpieces

For information so current it is still in the process of being written, look at "Redpieces" on the Redbooks Web Site (http://www.redbooks.ibm.com/redpieces.htm). Redpieces are redbooks in progress; not all redbooks become redpieces, and sometimes just a few chapters will be published this way. The intent is to get the information out much quicker than the formal publishing process allows.

How Customers Can Get ITSO Redbooks

Customers may request ITSO deliverables (redbooks, BookManager BOOKs, and CD-ROMs) and information about redbooks, workshops, and residencies in the following ways:

• Online Orders — send orders to:
In United States: usib6fpl at IBMMAIL, or [email protected]
In Canada: caibmbkz at IBMMAIL, or [email protected]
Outside North America: dkibmbsh at IBMMAIL, or [email protected]

• Telephone Orders
United States (toll free): 1-800-879-2755
Canada (toll free): 1-800-IBM-4YOU
Outside North America (long distance charges apply):
(+45) 4810-1320 Danish
(+45) 4810-1420 Dutch
(+45) 4810-1540 English
(+45) 4810-1670 Finnish
(+45) 4810-1220 French
(+45) 4810-1020 German
(+45) 4810-1620 Italian
(+45) 4810-1270 Norwegian
(+45) 4810-1120 Spanish
(+45) 4810-1170 Swedish

• Mail Orders — send orders to:
IBM Publications, Publications Customer Support, P.O. Box 29570, Raleigh, NC 27626-0570, USA
IBM Publications, 144-4th Avenue, S.W., Calgary, Alberta T2P 3N5, Canada
IBM Direct Services, Sortemosevej 21, DK-3450 Allerød, Denmark
• Fax — send orders to:
United States (toll free): 1-800-445-9269
Canada: 1-403-267-4455
Outside North America: (+45) 48 14 2207 (long distance charge)

• 1-800-IBM-4FAX (United States) or (+1) 001-408-256-5422 (Outside USA) — ask for:
Index # 4421 Abstracts of new redbooks
Index # 4422 IBM redbooks
Index # 4420 Redbooks for last six months

• Direct Services — send note to [email protected]

• On the World Wide Web
Redbooks Web Site: http://www.redbooks.ibm.com
IBM Direct Publications Catalog: http://www.elink.ibmlink.ibm.com/pbl/pbl

• Internet Listserver
With an Internet e-mail address, anyone can subscribe to an IBM Announcement Listserver. To initiate the service, send an e-mail note to [email protected] with the keyword subscribe in the body of the note (leave the subject line blank).

Redpieces

For information so current it is still in the process of being written, look at "Redpieces" on the Redbooks Web Site (http://www.redbooks.ibm.com/redpieces.htm). Redpieces are redbooks in progress; not all redbooks become redpieces, and sometimes just a few chapters will be published this way. The intent is to get the information out much quicker than the formal publishing process allows.

Glossary

A

adaptive connection. A numeric weight used to describe the strength of the connection between two processing units in a neural network. The connection is called adaptive because it is adjusted during training. Values typically range from zero to one, or -0.5 to +0.5.

aggregate. To summarize data in a field.

application program interface (API). A functional interface supplied by the operating system or a separately orderable licensed program that allows an application program written in a high-level language to use specific data or functions of the operating system or the licensed program.

architecture. The number of processing units in the input, output, and hidden layers of a neural network. The number of units in the input and output layers is calculated from the mining data and input parameters. An intelligent data mining agent calculates the number of hidden layers and the number of processing units in those hidden layers.

associations. The relationship of items in a transaction in such a way that items imply the presence of other items in the same transaction.

attribute. Characteristics or properties that can be controlled, usually to obtain a required appearance; for example, color is an attribute of a line. In object-oriented programming, a data element defined within a class.

B

back propagation. A general-purpose neural network named for the method used to adjust weights while learning data patterns. The Classification - Neural mining function uses such a network.

boundary field. The upper limit of an interval as used for discretization using ranges of a processing function.

bucket. One of the bars in a bar chart showing the frequency of a specific value.

C

categorical values. Discrete, nonnumerical data represented by character strings; for example, colors or special brands.

chi-square test. A test to check whether two variables are statistically dependent or not. Chi-square is calculated by subtracting the expected frequencies (imaginary values) from the observed frequencies (actual values). The expected frequencies represent the values that would be expected if the variables in question were statistically independent.

classification. The assignment of objects into groups or categories based on their characteristics.

cluster. A group of records with similar characteristics.

cluster prototype. The attribute values that are typical of all records in a given cluster. Used to compare the input records to determine whether a record should be assigned to the cluster represented by these values.

clustering. A mining function that creates groups of data records within the input data on the basis of similar characteristics. Each group is called a cluster.

confidence factor. Indicates the strength or the reliability of the associations detected.

continuous field. A field that can have any floating point number as its value.
D

DATABASE 2 (DB2). An IBM relational database management system.

database table. A table residing in a database.

database view. An alternative representation of data from one or more database tables. A view can include all or some of the columns contained in the database table or tables on which it is defined.

data field. In a database table, the intersection of a table row and table column where the corresponding data is entered.

data format. There are different kinds of data formats, for example, database tables, database views, pipes, or flat files.

data table. A table of data, regardless of the data format it is stored in.

data type. There are different kinds of Intelligent Miner data types, for example, discrete numeric, discrete nonnumeric, binary, or continuous.

discrete. Pertaining to data that consists of distinct elements, such as characters, or to physical quantities having a finite number of distinctly recognizable values.

discretization. The act of making mathematically discrete.

E

envelope. The area between two curves that are parallel to a curve of time-sequence data. The first curve runs above the curve of time-sequence data, the second one below. Both curves have the same distance to the curve of time-sequence data. The width of the envelope, that is, the distance from the first parallel curve to the second, is defined as epsilon.

epsilon. The maximum width of an envelope that encloses a sequence. Another sequence is epsilon-similar if it fits in this envelope.

epsilon-similar. Two sequences are epsilon-similar if one sequence does not go beyond the envelope that encloses the other sequence.

equality compatible. Pertaining to different data types that can be operands for the = logical operator.

Euclidean distance. The square root of the sum of the squared differences between two numeric vectors. The Euclidean distance is used to calculate the error between the calculated network output and the target output in neural classification, and to calculate the difference between a record and a prototype cluster value in neural clustering. A zero value indicates an exact match; larger numbers indicate greater differences.
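A minimal sketch of the distance just defined, with invented vectors for illustration:

    # Euclidean distance between two numeric vectors, per the definition above.
    import numpy as np

    def euclidean(a, b):
        a, b = np.asarray(a, float), np.asarray(b, float)
        return float(np.sqrt(np.sum((a - b) ** 2)))

    print(euclidean([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0 -> exact match
    print(euclidean([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 5.0 -> greater difference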
F

field. A set of one or more related data items grouped for processing. In this document, with regard to database tables and views, field is synonymous with column.

file. A collection of related data that is stored and retrieved by an assigned name.

file name. (1) A name assigned or declared for a file. (2) The name used by a program to identify a file.

flat file. (1) A one-dimensional or two-dimensional array: a list or table of items. (2) A file that has no hierarchical structure.

formatted information. An arrangement of information into discrete units and structures in a manner that facilitates its access and processing. Contrast with narrative information.

F-test. A statistical test that checks whether two estimates of the variances of two independent samples are the same. In addition, the F-test checks whether the null hypothesis is true or false.

function. Any instruction or set of related instructions that performs a specific operation.

fuzzy logic. In artificial intelligence, a technique using approximate rules of inference in which truth values and quantifiers are defined as possibility distributions that carry linguistic labels.

I

input data. The metadata of the database table, database view, or flat file containing the data you specified to be mined.

input layer. A set of processing units in a neural network that present the numeric values derived from user data to the network. The number of fields and the type of data in those fields are used to calculate the number of processing units in the input layer.

instance. In object-oriented programming, a single, actual occurrence of a particular object. Any level of the object class hierarchy can have instances. An instance can be considered in terms of a copy of the object type frame that is filled in with particular information.

interval. A set of real numbers between two numbers either including or excluding both of them.

interval boundaries. Values that represent the upper and lower limits of an interval.

item category. A categorization of an item. For example, a room in a hotel can have the following categories: Standard, Comfort, Superior, Luxury. The lower category is called the child item category. Each child item category can have several parent item categories. Each parent item category can have several grandparent item categories.

item description. The descriptive name of a character string in a data table.

item ID. The identifier for an item.

item set. A collection of items. For example, all items bought by one customer during one visit to a department store.

K

Kohonen Feature Map. A neural network model comprising processing units arranged in an input layer and an output layer. All processing units in the input layer are connected to each processing unit in the output layer by an adaptive connection. The learning algorithm involves competition between units for each input pattern and the declaration of a winning unit. Used in neural clustering to partition data into similar record groups.

L

large item sets. The total volume of items above the specified support factor returned by the Associations mining function.

learning algorithm. The set of well-defined rules used during the training process to adjust the connection weights of a neural network. The criteria and methods used to adjust the weights define the different learning algorithms.

learning parameters. The variables used by each neural network model to control the training of a neural network, which is accomplished by modifying network weights.

lift. Confidence factor divided by expected confidence.
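The association measures defined in this glossary (support factor, confidence factor, and lift) fit together as in the following sketch; the transactions and item names are invented for illustration.

    # Support, confidence, and lift for a rule "body => head" over transactions.
    def rule_stats(transactions, body, head):
        n = len(transactions)
        with_body = [t for t in transactions if body <= t]
        with_both = [t for t in with_body if head <= t]
        with_head = [t for t in transactions if head <= t]

        support = len(with_both) / n                  # support factor of the rule
        confidence = len(with_both) / len(with_body)  # confidence factor
        expected_confidence = len(with_head) / n
        lift = confidence / expected_confidence       # lift, as defined above
        return support, confidence, lift

    baskets = [{"savings", "card"}, {"savings", "card", "loan"},
               {"savings"}, {"card"}, {"loan", "card"}]
    print(rule_stats(baskets, body={"savings"}, head={"card"}))  # (0.4, 0.67, 0.83)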
M

metadata. In databases, data that describes data objects.

mining. Synonym for analyzing or searching.

mining base. A repository where all information about the input data, the mining run settings, and the corresponding results is stored.

model. A specific type of neural network and its associated learning algorithm. Examples include the Kohonen Feature Map and back propagation.

N

narrative information. Information that is presented according to the syntax of a natural language. Contrast with formatted information.

neural network. A collection of processing units and adaptive connections that is designed to perform a specific processing function.

Neural Network Utility (NNU). A family of IBM application development products for creating neural network and fuzzy rule system applications.

nonsupervised learning. A learning algorithm that requires only input data to be present in the data source during the training process. No target output is provided; instead, the desired output is discovered during the mining run. A Kohonen Feature Map, for example, uses nonsupervised learning.

O

offset. (1) The number of measuring units from an arbitrary starting point in a record, area, or control block, to some other point. (2) The distance from the beginning of an object to the beginning of a particular field.

operator. (1) A symbol that represents an operation to be done. (2) In a language statement, the lexical entity that indicates the action to be performed on operands.

output data. The metadata of the database table, database view, or flat file containing the data being produced or to be produced by a function.

output layer. A set of processing units in a neural network that contain the output calculated by the network. The number of outputs depends on the number of classification categories or the maximum clusters value in neural classification and neural clustering, respectively.

P

pass. One cycle of processing a body of data.

prediction. The dependency and the variation of one field's value within a record on the other fields within the same record. A profile is then generated that can predict a value for the particular field in a new record of the same form, based on its other field values.

processing unit. A processing unit in a neural network is used to calculate an output by summing all incoming values multiplied by their respective adaptive connection weights.

Q

quantile. One of a finite number of nonoverlapping subranges or intervals, each of which is represented by an assigned value. Q is an N%-quantile of a value set S when:
• Approximately N percent of the values in S are lower than or equal to Q.
• Approximately (100-N) percent of the values are greater than or equal to Q.
The approximation is less exact when there are many values equal to Q. N is called the quantile label. The 50%-quantile represents the median.
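A short sketch of the definition above, using NumPy's quantile function on invented scores; it checks that roughly N% of values lie at or below the N%-quantile, and that the 50%-quantile matches the median.

    import numpy as np

    scores = np.array([0.10, 0.30, 0.35, 0.50, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95])
    q50 = np.quantile(scores, 0.50)   # the 50%-quantile is the median
    # Per the definition: about 50% of values are <= Q, about 50% are >= Q.
    print(q50, np.mean(scores <= q50), np.mean(scores >= q50))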
R

radial basis function. In data mining functions, radial basis functions are used to predict values. They represent functions of the distance, or radius, from a particular point. They are used to build up approximations to more complicated functions.

record. A set of one or more related data items grouped for processing. In reference to a database table, record is synonymous with row.

region. A (sub)set of records with similar characteristics in their active fields. Regions are used to visualize a prediction result.

round-robin method. A method by which items are sequentially assigned to units. When an item has been assigned to the last unit in the series, the next item is assigned to the first again. This process is repeated until the last item has been assigned. The Intelligent Miner uses this method, for example, to store records in output files during a partitioning job.

rule. A clause in the form head ⇐ body. It specifies that the head is true if the body is true.

rule body. Represents the specified input data for a mining function.

rule group. Covers all rules containing the same items in different variations.

rule head. Represents the derived items detected by the Associations mining function.

S

scale. A system of mathematical notation: fixed-point or floating-point scale of an arithmetic value.

scale factor. A number used as a multiplier in scaling. For example, a scale factor of 1/1000 would be suitable to scale the values 856, 432, -95, and 182 to lie in the range from -1 to +1, inclusive.

scaling. To adjust the representation of a quantity by a factor in order to bring its range within prescribed limits.

self-organizing feature map. See Kohonen Feature Map.

sensitivity analysis report. An output from the Classification - Neural mining function that shows which input fields are relevant to the classification decision.

sequential patterns. Intertransaction patterns such that the presence of one set of items is followed by another set of items in a database of transactions over a period of time.

similar time sequences. Occurrences of similar sequences in a database of time sequences.

Structured Query Language (SQL). An established set of statements used to manage information stored in a database. By using these statements, users can add, delete, or update information in a table, request information through a query, and display results in a report.

supervised learning. A learning algorithm that requires input and resulting output pairs to be presented to the network during the training process. Back propagation, for example, uses supervised learning and makes adjustments during training so that the value computed by the neural network approaches the actual value as the network learns from the data presented. Supervised learning is used in the techniques provided for predicting classifications as well as for predicting values.

support factor. Indicates the occurrence of the detected association rules and sequential patterns based on the input data.

symbolic name. In a programming language, a unique name used to represent an entity such as a field, file, data structure, or label. In the Intelligent Miner you specify symbolic names, for example, for input data, name mappings, or taxonomies.
T

taxonomy. Represents a hierarchy or a lattice of associations between the item categories of an item. These associations are called taxonomy relations.

taxonomy relation. The hierarchical associations between the item categories you defined for an item. A taxonomy relation consists of a child item category and a parent item category.

trained network. A neural network containing connection weights that have been adjusted by a learning algorithm. A trained network can be considered a virtual processor: it transforms inputs to outputs.

training. The process of developing a model that understands the input data. In neural networks, the model is created by reading the records of the input and modifying the network weights until the network calculates the desired output data.

transaction. A set of items or events that are linked by a common key value, for example, the articles (items) bought by a customer (customer number) on a particular date (transaction identifier). In this example, the customer number represents the key value.

transaction group. The identifier for a set of transactions. For example, a customer number can represent a transaction group that includes all purchases of a particular customer during the month of May.

transaction ID. The identifier for a transaction, for example, the date of a transaction.

translation process. Converting the data provided in the database to scaled numeric values in the appropriate range for a mining kernel using neural networks. Different techniques are used depending on whether the data is numeric or symbolic. Also, converting neural network output back to the units used in the database.

V

vector. A quantity usually characterized by an ordered set of numbers.

W

weight. The numeric value of an adaptive connection representing the strength of the connection between two processing units in a neural network.

winner. The index of the cluster that has the minimum Euclidean distance from the input record. Used in the Kohonen Feature Map to determine which output units will have their weights adjusted.

List of Abbreviations

AMRP   air miles reward program
API    application programming interface
CIM    continuous interactive marketing
CPU    central processing unit
CRM    customer relationship marketing
DB2    DATABASE 2
GB     gigabyte
GIS    graphical information system
IBM    International Business Machines Corporation
IT     information technology
ITSO   International Technical Support Organization
LIS    large item sets
MBA    market basket analysis
MDA    multidimensional database analysis
MDL    minimum description length
MPP    massive parallel processor
OLAP   online analytical processing
PC     personal computer
POS    point of sale
PROFS  Professional Office System
R&D    research and development
RBF    radial basis function
RFM    recency frequency monetary
RMS    root mean square
ROI    return on investment
SQL    structured query language
TB     terabyte