University of Rhode Island
Department of Computer Science and Statistics
March 30, 2007

An Overview and Example of Data Mining

Daniel T. Larose, Ph.D.
Professor of Statistics
Director, Data Mining @CCSU
Editor, Wiley Series on Methods and Applications in Data Mining
[email protected]
www.math.ccsu.edu/larose

URI Dept of Computer Science and Statistics - An Overview and Example of Data Mining - Larose 1

Overview
• Part One:
  – A Brief Overview of Data Mining
• Part Two:
  – An Example of Data Mining:
  – Modeling Response to Direct Mail Marketing
• But first, a shameless plug …

Master of Science in DM at CCSU: Faculty
• Dr. Roger Bilisoly (from Ohio State Univ., Statistics)
  – Text Mining, Intro to Data Mining
• Dr. Darius Dziuda (from Warsaw Polytechnic Univ, CS)
  – Data Mining for Genomics and Proteomics, Biomarker Discovery
• Dr. Zdravko Markov (from Sofia Univ, CS)
  – Data Mining (CS perspective), Machine Learning
• Dr. Daniel Miller (from UConn, Statistics)
  – Applied Multivariate Analysis, Mathematical Statistics II, Intro to Data Mining
• Dr. Krishna Saha (from Univ of Windsor, Statistics)
  – Intro to Data Mining using R
• Dr. Daniel Larose (Program Director) (from UConn, Statistics)
  – Intro to Data Mining, Data Mining Methods, Applied Data Mining, Web Mining

Master of Science in DM at CCSU: Program (36 credits)
• Core Courses (27 credits). All available online.
  – Stat 521 Introduction to Data Mining (4 cr)
  – Stat 522 Data Mining Methods (4 cr)
  – Stat 523 Applied Data Mining (4 cr)
  – Stat 525 Web Mining
  – Stat 526 Data Mining for Genomics and Proteomics
  – Stat 527 Text Mining
  – Stat 416 Mathematical Statistics II
  – Stat 570 Applied Multivariate Analysis
• Electives (6 credits. Choose two)
  – CS 570 Topics in Artificial Intelligence: Machine Learning
  – CS 580 Topics in Advanced Database: Data Mining
  – Stat 455 Experimental Design
  – Stat 551 Applied Stochastic Processes
  – Stat 567 Linear Models
  – Stat 575 Mathematical Statistics III
  – Stat 529 Current Issues in Data Mining
• Capstone Requirement: Stat 599 Thesis (3 credits)

Master of Science in DM at CCSU
• Only MS in DM that is entirely online.
• Some courses available on campus.
• Student must come to CCSU to present Thesis
• We reach students in about 30 US States and a dozen foreign countries
• Half of our students already have master’s degrees
• About 15% already have Ph.D.’s
• Typical student is a mid-career professional
• Backgrounds are diverse: Computer Science, Engineering, Finance, Chemistry, Database Admin, Statistics, etc.
• www.ccsu.edu/datamining

Graduate Certificate in Data Mining
• 18 Credits:
• Required Courses (12 credits)
  – Stat 521 Introduction to Data Mining
  – Stat 522 Data Mining Methods and Models
  – Stat 523 Applied Data Mining
• Elective Courses (6 credits. Choose Two):
  – Stat 525 Web Mining
  – Stat 526 Data Mining for Genomics and Proteomics
  – Stat 527 Text Mining
  – Stat 529 Current Issues in Data Mining
  – Some other graduate-level data mining or statistics course, with approval of advisor.
• No Mathematical Statistics requirement.

Material for Part I Drawn From:
Discovering Knowledge in Data: An Introduction to Data Mining (Wiley, 2005)
• Chapter 1. An Introduction to Data Mining
• Chapter 2. Data Preprocessing
• Chapter 3. Exploratory Data Analysis
• Chapter 4. Statistical Approaches to Estimation and Prediction
• Chapter 5. K-Nearest Neighbor
• Chapter 6. Decision Trees
• Chapter 7. Neural Networks
• Chapter 8. Hierarchical and K-Means Clustering
• Chapter 9. Kohonen networks
• Chapter 10. Association Rules
• Chapter 11. Model Evaluation Techniques

Material for Part II Drawn From:
Data Mining Methods and Models (Wiley, 2006)
• Chapter 1. Dimension Reduction Methods
• Chapter 2. Regression Modeling
• Chapter 3. Multiple Regression and Model Building
• Chapter 4. Logistic Regression
• Chapter 5. Naïve Bayes Classification and Bayesian Networks
• Chapter 6. Genetic Algorithms
• Chapter 7. Case Study: Modeling Response to Direct-Mail Marketing

No Material Drawn From:
Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage (Wiley, April 2007)
• Part One: Web Structure Mining
  – Information Retrieval and Web Search
  – Hyperlink-Based Ranking
• Part Two: Web Content Mining
  – Clustering
  – Evaluating Clustering
  – Classification
• Part Three: Web Usage Mining
  – Data Preprocessing,
  – Exploratory Data Analysis,
  – Association Rules, Clustering, and Classification for Web Usage Mining
• With Dr. Zdravko Markov, Computer Science, CCSU

Call for Book Proposals
Wiley Series on Methods and Applications in Data Mining
• Suggested topics:
  – Data Mining in Bioinformatics
  – Emerging Techniques in Data Mining (e.g., SVM)
  – Data Mining with Evolutionary Algorithms
  – Drug Discovery Using Data Mining
  – Mining Data Streams
  – Visual Analysis in Data Mining
• Books in press:
  – Data Mining for Genomics and Proteomics, by Darius Dziuda
  – Practical Text Mining Using Perl, by Roger Bilisoly
• Contact Series Editor at [email protected]

What is Data Mining?
• “Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.”
  – David Hand, Heikki Mannila & Padhraic Smyth, Principles of Data Mining, MIT Press, 2001

Why Data Mining?
• “We are drowning in information but starved for knowledge.”
  – John Naisbitt, Megatrends, 1984.
• “The problem is that there are not enough trained human analysts available who are skilled at translating all of this data into knowledge, and thence up the taxonomy tree into wisdom.”
  – Daniel Larose, Discovering Knowledge in Data: An Introduction to Data Mining, Wiley, 2005.

Need for Human Direction
• Automation is no substitute for human supervision and input.
  – Humans need to be actively involved at every phase of data mining process.
• “Rather than asking where humans fit into data mining, we should instead inquire about how we may design data mining into the very human process of problem solving.”
  – Daniel Larose, Discovering Knowledge in Data: An Introduction to Data Mining, Wiley, 2005.

“Data Mining is Easy to Do Badly”
• Black box software
  – Powerful, “easy-to-use” data mining algorithms
  – Makes their misuse dangerous.
  – Too easy to point and click your way to disaster.
• What is needed:
  – An understanding of the underlying algorithmic and statistical model structures.
  – An understanding of which algorithms are most appropriate in which situations and for which types of data.

CRISP-DM: Cross-Industry Standard Process for Data Mining

CRISP: DM as a Process
1. Business / Research Understanding Phase
   – Enunciate your objectives
2. Data Understanding Phase: EDA
3. Data Preparation Phase: Preprocessing
4. Modeling Phase: Fun and interesting!
5. Evaluation Phase
   – Confluence of results? Objectives met?
6. Deployment Phase: Use results to solve problem.
   – If desired: Use lessons learned to reformulate business / research objective.

What About Data Dredging?
• “A sufficiently exhaustive search will certainly throw up patterns of some kind. Many of these patterns will simply be a product of random fluctuations, and will not represent any underlying structure.”
  – David J. Hand, Data Mining: Statistics and More?, The American Statistician, May 1998.
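
Hand's warning can be illustrated with a small simulation. The sketch below is not from the talk: the sample sizes, the number of predictors, and the names `pearson` and `dredge` are all assumptions chosen for illustration. It searches 200 pure-noise predictors for the one most correlated with a pure-noise target, then re-measures that "pattern" on records the search never saw.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def dredge(n_search=50, n_fresh=500, n_features=200, seed=0):
    """Pick the noise predictor most correlated with a noise target on the
    searched records, then re-measure it on fresh records."""
    rng = random.Random(seed)
    n = n_search + n_fresh
    target = [rng.gauss(0, 1) for _ in range(n)]
    features = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_features)]

    # Exhaustive search, using only the first n_search records.
    best = max(
        range(n_features),
        key=lambda j: abs(pearson(features[j][:n_search], target[:n_search])),
    )
    r_search = pearson(features[best][:n_search], target[:n_search])
    # The same predictor, checked on records held out from the search.
    r_fresh = pearson(features[best][n_search:], target[n_search:])
    return best, r_search, r_fresh


if __name__ == "__main__":
    best, r_search, r_fresh = dredge()
    print(f"best noise predictor: #{best}")
    print(f"correlation on searched records: {r_search:+.3f}")
    print(f"correlation on fresh records:    {r_fresh:+.3f}")
```

Because every variable here is random noise, any correlation the search reports is spurious by construction; the fresh-record correlation is what an honest check would see.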

Guarding Against Data Dredging: Cross-Validation is the Key
• Partition the data into training set and test set.
• If the pattern shows up in both data sets, the probability that it represents noise decreases.
• More generally, may use n-fold cross-validation.

Inference and Huge Data Sets
• Hypothesis testing becomes sensitive at the huge sample sizes prevalent in data mining applications.
  – Even very tiny effects will be found significant.
  – So, data mining tends to de-emphasize inference

Need for Transparency and Interpretability
• Data mining models should be transparent
  – Results should be interpretable by humans
• Decision Trees are transparent
• Neural Networks tend to be opaque
• If a customer complains about why he/she was turned down for credit, we should be able to explain why, without saying “Our neural net said so.”

Part Two: Modeling Response to Direct Mail Marketing
Business Understanding Phase:
  – Clothing Store Purchase Data
• Results of a direct mail marketing campaign
• Task: Construct a classification model
  – For classifying customers as either responders or non-responders to the marketing campaign,
  – To reduce costs and increase return on investment

Data Understanding: The Clothing Store dataset
List of fields in the dataset (28,799 customers, 51 fields)
• Customer ID: Unique, encrypted customer identification
• Zip Code
• Number of days the customer has been on file
• Product uniformity (Low score = diverse spending patterns)
• Lifetime average time between visits
• Number of purchase visits
• Total net sales
• Average amount spent per visit
• Amount spent at each of four different franchises (four variables)
• Amount spent in the past month, the past three months, and the past six months
• Amount spent the same period last year
• Gross margin percentage
• Number of marketing promotions on file
• Number of days between purchases
• Markdown percentage on customer purchases
• Number of different product classes purchased
• Number of coupons used by the customer
• Total number of individual items purchased by the customer
• Number of stores the customer shopped at
• Number of promotions mailed in the past year
• Number of promotions responded to in the past year
• Promotion response rate for the past year
• Microvision® Lifestyle Cluster Type
• Percent of Returns
• Flag: Credit card user
• Flag: Valid phone number on file
• Flag: Web shopper
• 15 variables providing the percentages spent by the customer on specific classes of clothing, including sweaters, knit tops, knit dresses, blouses, jackets, career pants, casual pants, shirts, dresses, suits, outerwear, jewelry, fashion, legwear, and the collectibles line.
• Also a variable showing the brand of choice (encrypted).
• Target variable: Response to promotion

Data Preparation and EDA Phase
• Not covered in this presentation.

Modeling Strategy
• Apply principal components analysis to address multicollinearity.
• Apply cluster analysis. Briefly profile clusters.
• Balance the training data set.
• Establish baseline model performance
  – In terms of expected profit per customer contacted.
• Apply classification algorithms to training data set:
  – CART
  – C5.0 (C4.5)
  – Neural networks
  – Logistic regression.
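
Of the steps listed above, balancing the training set is the most mechanical, so a minimal sketch may help. It assumes balancing is done by random undersampling of non-responders in the training partition only; the function names, the record layout, and the `drop_fraction` value in the usage lines are hypothetical and chosen for illustration.

```python
import random


def balance_training_set(records, is_responder, drop_fraction, seed=0):
    """Randomly omit a fraction of non-responders; keep every responder.

    Intended for the training partition only.
    """
    rng = random.Random(seed)
    kept = []
    for rec in records:
        # Responders are always retained; each non-responder survives
        # with probability (1 - drop_fraction).
        if is_responder(rec) or rng.random() >= drop_fraction:
            kept.append(rec)
    return kept


def responder_share(records, is_responder):
    """Proportion of records that are responders."""
    return sum(1 for rec in records if is_responder(rec)) / len(records)


if __name__ == "__main__":
    # Toy training data: 166 responders among 1000 customers (illustrative).
    train = [{"id": i, "response": i < 166} for i in range(1000)]
    flag = lambda rec: rec["response"]
    balanced = balance_training_set(train, flag, drop_fraction=0.4)
    print(f"before: {responder_share(train, flag):.1%} responders")
    print(f"after:  {responder_share(balanced, flag):.1%} responders")
```

The price of the more even class mix is the discarded non-responder records, which is why the drop fraction is worth treating as a tunable setting rather than a fixed constant.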

Modeling Strategy continued
• Evaluate each model using test data set.
• Apply misclassification costs in line with cost benefit table.
• Apply overbalancing as a surrogate for misclassification costs.
  – Find best overbalancing proportion.
• Combine predictions from four models
  – Using model voting.
  – Using mean response probabilities.

Principal Components Analysis (PCA)
• Multicollinearity does not degrade prediction accuracy.
  – But muddles individual predictor coefficients.
• Interested in predictor characteristics, customer profiling, etc?
  – Then PCA is required.
• But, if interested solely in classification (prediction, estimation),
  – PCA not strictly required.

Report Two Model Sets:
• Model Set A:
  – Includes principal components
  – All purpose model set
• Model Set B:
  – Includes correlated predictors, not principal components
  – Use restricted to classification

Principal Components Analysis (PCA)
• Seven correlated variables.
  – Two components extracted
  – Account for 87% of variability

Principal Components Analysis (PCA)
• Principal Component 1:
  – Purchasing Habits
  – Customer general purchasing habits
  – Expect component to be strongly indicative of response
• Principal Component 2:
  – Promotion Contacts
  – Unclear whether component will be associated with response
• Components validated by test data set

BIRCH Clustering Algorithm
• Requires only one pass through data set
  – Scalable for large data sets
• Benefit: Analyst need not pre-specify number of clusters
• Drawback: Sensitive to initial records encountered
  – Leads to widely variable cluster solutions
• Requires “outer loop” to find consistent cluster solution
• Zhang, Ramakrishnan and Livny, BIRCH: A New Data Clustering Algorithm and Its Applications, Data Mining and Knowledge Discovery 1, 1997.

BIRCH Clusters
• Cluster 3 shows:
  – Higher response for flag predictors
  – Higher averages for numeric predictors

Variable                                    Cluster 1   Cluster 2   Cluster 3
z ln Purchase Visits                        –0.575      –0.570       1.011
z ln Total Net Sales                        –0.177      –0.804       0.971
z sqrt Spending Last One Month              –0.279      –0.314       0.523
z ln Lifetime Average Time Between Visits    0.455       0.484      –0.835
z ln Product Uniformity                      0.493       0.447      –0.834
z sqrt # Promotion Responses in Past Year   –0.480      –0.573       0.950
z sqrt Spending on Sweaters                 –0.486       0.261       0.116

BIRCH Clusters
• Cluster 3 has highest response rate (red).
  – Cluster 1: 7.6%
  – Cluster 2: 7.1%
  – Cluster 3: 33.0%

Balancing the Data
• For “rare” classes, balancing provides a more equitable distribution.
• Drawback: Loss of data:
  – Here, 40% of non-responders randomly omitted
  – All responders retained
  – Proportion of responders increases from 16.58% to 24.76%
• Test data set should never be balanced

False Positive vs. False Negative: Which is Worse?
• For direct mail marketing, a false negative error is probably worse than a false positive.
• Generate misclassification costs based on the observed data.
  – Construct cost-benefit table

Decision Cost / Benefit Analysis

Outcome          Classified   Actual   Cost       Rationale
True Negative    No           No       $0         No contact made; no revenue lost
True Positive    Yes          Yes      -$26.40    (Anticipated revenue) – (Cost of contact)
False Negative   No           Yes      $28.40     Loss of anticipated revenue
False Positive   Yes          No       $2.00      Cost of contact

Establish Baseline Model Performance
• Benchmarks
  – “Don’t Send a Marketing Promotion to Anyone” Model
  – “Send a Marketing Promotion to Everyone” Model
• Will compare candidate models against this baseline error rate.

Model                 TN ($0)   TP (-$26.40)   FN ($28.40)   FP ($2.00)   Overall Error Rate   Overall Cost
“Don’t Send Anyone”   5908      0              1151          0            16.3%                $32,688.40 ($4.63 per customer)
“Send to Everyone”    0         1151           0             5908         83.7%                -$18,570.40 (-$2.63 per customer)

Model Set A
• No model beats benchmark of $2.63 profit per customer
• Misclassification costs had not been applied

(With 50% Balancing)
Model                 TN ($0)   TP (-$26.40)   FN ($28.40)    FP ($2.00)      Overall Error Rate   Overall Cost per Customer
Neural Network        4694      672            479 (9.3%)     1214 (64.4%)    24.0%                -$0.24
CART                  4348      829            322 (6.9%)     1560 (65.3%)    26.7%                -$1.36
C5.0                  4465      782            369 (7.6%)     1443 (64.9%)    25.7%                -$1.03
Logistic Regression   4293      872            279 (6.1%)     1615 (64.9%)    26.8%                -$1.68

• Now define FN cost = $28.40, FP cost = $2
  – Outperformed baseline “Send to everyone” model

Model                 TN ($0)   TP (-$26.40)   FN ($28.40)    FP ($2.00)      Overall Error Rate   Overall Cost per Customer
CART                  754       1147           4 (0.5%)       5154 (81.8%)    73.1%                -$2.81
C5.0                  858       1143           8 (0.9%)       5050 (81.5%)    71.7%                -$2.81

Model Set A: Effect of Misclassification Costs
• For the 447 highlighted records:
  – Only 20.8% responded.
  – But model predicts positive response.
  – Due to high false negative misclassification cost.
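
The cost columns in the tables above follow directly from the cost/benefit table and the four cell counts. As a worked check, the sketch below recomputes the two baseline rows and the logistic regression row; the function name `evaluate` and the dictionary layout are illustrative choices, not part of the original analysis.

```python
# Cost per record from the decision cost/benefit table (negative = profit).
COSTS = {"TN": 0.00, "TP": -26.40, "FN": 28.40, "FP": 2.00}


def evaluate(tn, tp, fn, fp):
    """Return (overall error rate, total cost, cost per customer)."""
    n = tn + tp + fn + fp
    total = (
        tn * COSTS["TN"] + tp * COSTS["TP"] + fn * COSTS["FN"] + fp * COSTS["FP"]
    )
    return (fn + fp) / n, total, total / n


if __name__ == "__main__":
    models = {
        "Don't send to anyone": dict(tn=5908, tp=0, fn=1151, fp=0),
        "Send to everyone": dict(tn=0, tp=1151, fn=0, fp=5908),
        "Logistic regression (50% balancing)": dict(tn=4293, tp=872, fn=279, fp=1615),
    }
    for name, counts in models.items():
        err, total, per_customer = evaluate(**counts)
        print(
            f"{name}: error {err:.1%}, "
            f"total ${total:,.2f}, per customer ${per_customer:.2f}"
        )
```

Note that a model with a far higher error rate can still carry a far lower cost, because a false negative costs 14.2 times as much as a false positive.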

Model Set A: PCA Component 1 is Best Predictor
• First principal component ($F-PCA-1), Purchasing Habits, represents both the root node split and the secondary split
  – Most important factor for predicting response

Over-Balancing as a Surrogate for Misclassification Costs
• Software limitation:
  – Neural network and logistic regression models in Clementine lack methods for applying misclassification costs
• Over-balancing is an alternate method which can achieve similar results
• Starves the classifier of instances of non-response

Over-Balancing as a Surrogate for Misclassification Costs
• Neural network model results
  – Three over-balanced models outperform baseline
• Properly applied, over-balancing can be used as a surrogate for misclassification costs

Model                          TN ($0)   TP (-$26.40)   FN ($28.40)     FP ($2.00)      Overall Error Rate   Overall Cost per Customer
No Balancing (16.3% - 83.7%)   5865      124            1027 (14.9%)    43 (25.7%)      15.2%                +$3.68
50% - 50% Balancing            4694      672            479 (9.3%)      1214 (64.4%)    24.0%                -$0.24
65% - 35% Over-Balancing       1918      1092           59 (3.0%)       3990 (78.5%)    57.4%                -$2.72
80% - 20% Over-Balancing       1032      1129           22 (2.1%)       4876 (81.2%)    69.4%                -$2.75
90% - 10% Over-Balancing       592       1141           10 (1.7%)       5316 (82.3%)    75.4%                -$2.72

Over-Balancing as a Surrogate for Misclassification Costs
• Apply 80% - 20% over-balancing to the other models.

Model                 TN ($0)   TP (-$26.40)   FN ($28.40)    FP ($2.00)      Overall Error Rate   Overall Cost per Customer
Neural Network        885       1132           19 (2.1%)      5023 (81.6%)    71.4%                -$2.73
CART                  1724      1111           40 (2.3%)      4184 (79.0%)    59.8%                -$2.81
C5.0                  1467      1116           35 (2.3%)      4441 (79.9%)    63.4%                -$2.77
Logistic Regression   2389      1106           45 (1.8%)      3519 (76.1%)    50.5%                -$2.96

Combination Models: Voting
• Smoothes out strengths and weaknesses of each model
  – Each model supplies a prediction for each record
  – Count the votes for each record
• Disadvantage of combination models:
  – Lack of easy interpretability
• Four competing combination models…

Combination Models: Voting
Mail a Promotion only if:
• All four models predict response
  – Protects against false positive
  – All four classification algorithms must agree on a positive prediction
• At least three models predict response
• At least two models predict response
• Any model predicts response
  – Protects against false negatives

Combination Models: Voting
• None beat the logistic regression model: $2.96 profit per customer
• Perhaps combination models will do better with Model Collection B…

Combination Model                                                TN ($0)   TP (-$26.40)   FN ($28.40)   FP ($2.00)      Overall Error Rate   Overall Cost per Customer
Mail a Promotion Only if All Four Models Predict Response        2772      1067           84 (2.9%)     3136 (74.6%)    45.6%                -$2.76
Mail a Promotion Only if Three or Four Models Predict Response   1936      1115           36 (1.8%)     3972 (78.1%)    56.8%                -$2.90
Mail a Promotion Only if At Least Two Models Predict Response    1207      1135           16 (1.3%)     4701 (80.6%)    66.8%                -$2.85
Mail a Promotion if Any Model Predicts Response                  550       1148           3 (0.5%)      5358 (82.4%)    75.9%                -$2.76

Model Collection B: Non-PCA Models
• Models retain correlated variables
  – Use restricted to prediction only
• Since the correlated variables are highly predictive
  – Expect Collection B will outperform the PCA models

Model Collection B: CART and C5.0
• Using misclassification costs, and 50% balancing
• Both models outperform the best PCA model

Model   TN ($0)   TP (-$26.40)   FN ($28.40)   FP ($2.00)      Overall Error Rate   Overall Cost per Customer
CART    1645      1140           11 (0.7%)     4263 (78.9%)    60.5%                -$3.01
C5.0    1562      1147           4 (0.3%)      4346 (79.1%)    61.6%                -$3.04

Model Collection B: Over-Balancing
• Apply over-balancing as a surrogate for misclassification costs for all models
• Best performance thus far.

Model                 TN ($0)   TP (-$26.40)   FN ($28.40)   FP ($2.00)      Overall Error Rate   Overall Cost per Customer
Neural Network        1301      1123           28 (2.1%)     4607 (80.4%)    65.7%                -$2.78
CART                  2780      1100           51 (1.8%)     3128 (74.0%)    45.0%                -$3.02
C5.0                  2640      1121           30 (1.1%)     3268 (74.5%)    46.7%                -$3.15
Logistic Regression   2853      1110           41 (1.4%)     3055 (73.3%)    43.9%                -$3.12

Combination Models: Voting
• Combine the four models via voting and 80%-20% overbalancing
• Synergy: Combination model outperforms any individual model.

Combination Model                                                TN ($0)   TP (-$26.40)   FN ($28.40)   FP ($2.00)      Overall Error Rate   Overall Cost per Customer
Mail a Promotion Only if All Four Models Predict Response        3307      1065           86 (2.5%)     2601 (70.9%)    38.1%                -$2.90
Mail a Promotion Only if Three or Four Models Predict Response   2835      1111           40 (1.4%)     3073 (73.4%)    44.1%                -$3.12
Mail a Promotion Only if At Least Two Models Predict Response    2357      1133           18 (0.7%)     3551 (75.8%)    50.6%                -$3.16
Mail a Promotion if Any Model Predicts Response                  1075      1145           6 (0.6%)      4833 (80.8%)    68.6%                -$2.89

Combining Models Using Mean Response Probabilities
• Combine the confidences that each model reports for its decisions
  – Allows finer tuning of the decision space
• Derive a new variable:
  – Mean Response Probability (MRP):
    • Average of response confidences of the four models.

Combining Models Using Mean Response Probabilities
• Multi-modality due to the discontinuity of the transformation used in derivation of MRP

Combining Models Using Mean Response Probabilities
• Where shall we define response vs. non-response?
  – Recall that FN is 14.2 times worse than FP
  – Set partitions on the low side => fewer FN decisions are made

Combining Models Using Mean Response Probabilities
• Optimal partition: near 50%.
• Mail a promotion to a prospective customer only if the mean response probability is at least 50%
• Best model in case study.
  – MRP = 0.51: $3.1744 profit
  – “Send to everyone”: $2.63 profit
  – 20.7% profit enhancement (54.44 cents)

Combination Model (MRP partition)   TN ($0)   TP (-$26.40)   FN ($28.40)    FP ($2.00)      Overall Error Rate   Overall Cost per Customer
MRP 0.95                            5648      353            798 (12.4%)    260 (42.4%)     15.0%                +$1.96
MRP 0.85                            3810      994            157 (4.0%)     2098 (67.8%)    31.9%                -$2.49
MRP 0.65                            2995      1104           47 (1.5%)      2913 (72.5%)    41.9%                -$3.11
MRP 0.54                            2796      1113           38 (1.3%)      3112 (73.7%)    44.6%                -$3.13
MRP 0.52                            2738      1121           30 (1.1%)      3170 (73.9%)    45.3%                -$3.1736
MRP 0.51                            2686      1123           28 (1.0%)      3222 (74.2%)    46.0%                -$3.1744
MRP 0.50                            2625      1125           26 (1.0%)      3283 (74.5%)    46.9%                -$3.1726
MRP 0.46                            2493      1129           22 (0.9%)      3415 (75.2%)    48.7%                -$3.166
MRP 0.42                            2369      1133           18 (0.8%)      3539 (75.7%)    50.4%                -$3.162

Summary
• For more on this Case Study, see Data Mining Methods and Models (Wiley, 2006)
• So, the best part about all this is:
  – Data mining is fun!
  – If you love to play with data, and you love to construct and evaluate models, then data mining is for you.
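
The two ways of combining models used in the case study, voting and mean response probability, can be sketched for a single customer record as follows. This is an illustration under stated assumptions, not the case study's implementation: the slides do not spell out how a model's reported confidence is turned into a response probability, so `response_probability` uses one plausible conversion, and all function names and the per-model cutoff of 0.5 are hypothetical.

```python
def response_probability(predicts_response, confidence):
    """Assumed conversion: confidence in a 'response' prediction is used
    as-is; confidence in a 'non-response' prediction is flipped."""
    return confidence if predicts_response else 1.0 - confidence


def votes_for_response(probabilities, cutoff=0.5):
    """Number of models whose own prediction is 'response' (assumed cutoff)."""
    return sum(1 for p in probabilities if p >= cutoff)


def mail_by_vote(probabilities, min_votes):
    """Voting rule: mail only if at least min_votes models predict response."""
    return votes_for_response(probabilities) >= min_votes


def mean_response_probability(probabilities):
    """MRP: average of the models' response probabilities."""
    return sum(probabilities) / len(probabilities)


def mail_by_mrp(probabilities, partition=0.51):
    """MRP rule: mail only if the MRP reaches the chosen partition."""
    return mean_response_probability(probabilities) >= partition


if __name__ == "__main__":
    # Four models' (prediction, confidence) pairs for one customer (made up).
    outputs = [(True, 0.9), (True, 0.6), (False, 0.6), (False, 0.8)]
    probs = [response_probability(p, c) for p, c in outputs]
    print("votes for response:", votes_for_response(probs))
    print("mail if all four agree:", mail_by_vote(probs, 4))
    print("mail if at least two agree:", mail_by_vote(probs, 2))
    print("MRP:", round(mean_response_probability(probs), 3))
    print("mail at MRP partition 0.51:", mail_by_mrp(probs))
```

Voting offers only four possible thresholds (one to four votes), whereas the MRP partition can be placed anywhere between 0 and 1, which is what allows the finer tuning described in the slides.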