Science in Business Data Mining?
Sven F. Crone, Lancaster University Management School, Research Centre for Forecasting

Background: support managerial decision making.
Is there a science to data mining (with CI methods)? YES, but it depends (and it may be empirical wizardry driven by efficiency rather than effectiveness!).

Outline
1. Data mining in business & management
2. Rules established in business practice vs. data mining?
   1. Statistics vs. data-driven modelling
   2. A personal view
3. How to develop meta-knowledge

Business Data Mining?
[Figure: customer lifecycle, adapted from Berry and Linoff (2004) and Olafson et al. (2006). Aggregate demand: adoption, extrapolative forecasting (incl. judgement), market experiments, intentions. Individual demand: acquisition, marketing response, activation, relationship. Stages and tasks: prospect (target market, credit scoring), new customer (direct marketing), initial customer, established customer (high-value, high-potential, low-value), churn prediction and retention for voluntary churn (resignation) and forced churn, former customer.]

Main areas for data mining:
• Finance: credit risk (personal & corporate)
• Marketing: customer relationship management (= direct marketing, database marketing)

Best practices
Credit scoring:
• Small & balanced classes: use 2000 of the minority class, use undersampling
• Discretise all (!) variables; binary dummies / WOE to capture non-linearity
• Use logistic regression
• Extensive use of expert domain knowledge
Cross-selling:
• Large & imbalanced sample: use large sample sizes
• Original (imbalanced) class distribution …
GAP: an efficient solution ≠ the best solution.

A personal view:
• Data selection is best done using prior domain knowledge (use filters)
• Pre-processing is more important than the method [Crone et al. 2006; Keogh 2002]
• (Balanced) sampling & pre-processing are method dependent
• Best practices exist & are domain dependent (e.g. homogeneous datasets in credit scoring)
• Flat maximum effect [Lovie & Lovie, 1986]

How to derive (meta-)knowledge?
Lessons from other disciplines, e.g. time series forecasting: more "evidence-based methods" [Armstrong 2000].
• Empirical evidence: conditions under which methods perform well (multiple hypotheses)
• Multiple out-of-sample evaluations (≠ a single fold from one origin)
• Multiple homogeneous datasets from one domain
• Use of valid benchmark methods & unbiased error measures
• Honour the domain & decision context (active learning, cost-sensitive)
• Studies must allow replication: document all steps / parameters
• Domain-specific competitions (valid & reliable); replications
• STOP fine-tuning / marginal extensions of a single method on a single toy dataset
• Develop solutions for the domain (why make life harder?)

Where to start? Follow a high-impact approach:
• Identify the most prominent application domains (e.g. credit risk)
• Select promising application domains for CI methods
• Get a corporate sponsor & run a competition
• Analyse conditions (!) using meta-studies
• Embed findings as methodology in SOFTWARE

Literature
• Ian Ayres (2007) Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart, Bantam.
• Thomas H. Davenport, Jeanne G. Harris (2007) Competing on Analytics: The New Science of Winning, Harvard Business School Press.
• Fildes, Nikolopoulos, Crone, Syntetos (2009) Forecasting and Operational Research: a Review, JORS, forthcoming.
• Finlay, Crone (under review) Sampling Issues in Credit Scoring: the Effect of Sample Size and Sample Distribution on Predictive Accuracy, EJOR.
• Keogh, Kasetty (2002, 2004) On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration, SIGKDD '02 & Data Mining Journal.
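The call above for multiple out-of-sample evaluations across several forecast origins, compared against a valid benchmark, can be sketched as follows. This is a minimal illustration, not material from the slides: the series, the candidate method, and all function names are hypothetical.

```python
# Rolling-origin out-of-sample evaluation: instead of one train/test split,
# the forecast origin advances one step at a time and each method is
# re-evaluated, with a naive benchmark anchoring the comparison.

def naive_forecast(history):
    """Benchmark method: repeat the last observed value."""
    return history[-1]

def mean_forecast(history):
    """A hypothetical candidate method: forecast the running mean."""
    return sum(history) / len(history)

def rolling_origin_mae(series, forecaster, min_train=4):
    """Mean absolute error averaged over every forecast origin."""
    errors = []
    for origin in range(min_train, len(series)):
        history, actual = series[:origin], series[origin]
        errors.append(abs(forecaster(history) - actual))
    return sum(errors) / len(errors)

if __name__ == "__main__":
    series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
    for name, method in [("naive", naive_forecast), ("mean", mean_forecast)]:
        print(f"{name}: MAE = {rolling_origin_mae(series, method):.2f}")
```

A candidate method is only interesting if it beats the naive benchmark across many origins (and, per the slides, across many homogeneous datasets from the domain), not on a single fold.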