Introduction to Data Mining
James Guszcza, FCAS, MAAA
CAS 2004 Ratemaking Seminar
Philadelphia, March 11-12, 2004
© Deloitte Consulting, 2004

Themes
* What is Data Mining? How does it relate to statistics?
* Insurance applications
* Data sources
* The Data Mining Process
* Model Design
* Modeling Techniques (Louise Francis' presentation)

Themes
* How does data mining need actuarial science?
  * Variable creation
  * Model design
  * Model evaluation
* How does actuarial science need data mining?
  * Advances in computing and modeling techniques
  * Ideas from other fields can be applied to insurance problems

Themes
* "The quiet statisticians have changed our world; not by discovering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions." -- Ian Hacking
* Data mining gives us new ways of approaching the age-old problems of risk selection and pricing...
* ...and other problems not traditionally considered "actuarial".

What is Data Mining?

What is Data Mining?
* My definition: "Statistics for the Computer Age"
* Not a radical break with traditional statistics; it complements and builds on them
* Statistics enriched with the brute-force capabilities of modern computing
* Many new techniques have come from computer science, marketing, biology... but all can (and should!) be brought under the framework of "statistics"
* Opens the door to new techniques
* Data mining therefore tends to be associated with industrial-sized data sets

Buzz-words
* Data Mining
* Knowledge Discovery
* Machine Learning
* Statistical Learning
* Predictive Modeling
* Supervised Learning
* Unsupervised Learning
* ...etc.

What is Data Mining?
* Supervised learning: predict the value of a target variable based on several predictive variables ("predictive modeling")
  * Credit / non-credit scoring engines
  * Retention and cross-sell models
* Unsupervised learning: describe associations and patterns along many dimensions, without any target information
  * Customer segmentation / data clustering
  * Market basket analysis ("diapers and beer")

So Why Should Actuaries Do This Stuff?
* Any application of statistics requires subject-matter expertise
* Psychometricians, econometricians, bioinformaticians, marketing scientists... are all applied statisticians with a particular subject-matter expertise and area of specialty
* Add actuarial modelers to this list! "Insurometricians"!?
* Actuarial knowledge is critical to the success of insurance data mining projects

Three Concepts
* Scoring engines: a "predictive model" by any other name...
* Lift curves: how much worse than average are the policies with the worst scores?
* Out-of-sample tests: how well will the model work in the real world? An unbiased estimate of predictive power

Classic Application: Scoring Engines
* Scoring engine: a formula that classifies or separates policies (or risks, accounts, agents...) into profitable vs. unprofitable, retaining vs. non-retaining...
* A (non-)linear equation f( ) of several predictive variables
* Produces a continuous range of scores: score = f(X1, X2, ..., XN)

What "Powers" a Scoring Engine?
* Scoring engine: score = f(X1, X2, ..., XN)
* The X1, X2, ..., XN are at least as important as the f( )!
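The scoring-engine formula score = f(X1, X2, ..., XN) can be sketched in code. This is a minimal illustration assuming a simple linear f( ); the predictor names and weights are hypothetical, not from the talk.

```python
# Minimal sketch of a scoring engine: score = f(X1, X2, ..., XN).
# A linear f() is assumed here; the predictor names and weights
# below are hypothetical, purely for illustration.

def score(policy, weights, intercept=0.0):
    """Combine named predictive variables into a single score."""
    return intercept + sum(w * policy[name] for name, w in weights.items())

weights = {"years_insured": -0.5, "prior_claims": 2.0, "credit_factor": 1.5}
policy = {"years_insured": 10, "prior_claims": 1, "credit_factor": 0.8}
s = score(policy, weights)  # -0.5*10 + 2.0*1 + 1.5*0.8 = -1.8
```

In practice f( ) might come from a GLM, a decision tree, or a neural network, but the interface is the same: predictive variables in, one continuous score out.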
* This is again why actuarial expertise is necessary: think of the predictive power of credit variables
* A large part of the modeling process consists of variable creation and selection
* It is usually possible to generate hundreds of variables
* This is the steepest part of the learning curve

Model Evaluation: Lift Curves
* Sort the data by score and break it into 10 equal pieces
  * Best "decile": lowest score, lowest loss ratio
  * Worst "decile": highest score, highest loss ratio
  * The difference between them is "lift"
* Lift = segmentation power
* Lift translates into the ROI of the modeling project

Out-of-Sample Testing
* Randomly divide the data into 3 pieces: training data, test data, validation data
* Use the training data to fit models
* Score the test data to create a lift curve
* Perform the train/test steps iteratively until you have a model you're happy with
* During this iterative phase, the validation data is set aside in a "lock box"
* Once the model has been finalized, score the validation data and produce a lift curve: an unbiased estimate of future performance

Data Mining: Applications
* The classic: profitability scoring model (underwriting/pricing applications)
* Credit models
* Retention models
* Elasticity models
* Cross-sell models
* Lifetime value models
* Agent/agency monitoring
* Target marketing
* Fraud detection
* Customer segmentation: no target variable ("unsupervised learning")

Skills Needed
* Statistical: beyond college/actuarial exams... a fast-moving field
* Actuarial: the subject-matter expertise
* IT / systems administration: need scalable software and a scalable computing environment
* Programming: data extraction, data load, model implementation
* Project management: absolutely critical because of the scope and multidisciplinary nature of data mining projects

Data Sources
* Company's internal data
  * Policy-level records
  * Loss and premium transactions
  * Billing
  * VIN...
* Externally purchased data
  * Credit
  * CLUE
  * MVR
  * Census
  * ...
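The lift-curve recipe described under Model Evaluation (sort by score, cut the sorted data into ten deciles, compare loss ratios) can be sketched as follows. The book of policies is synthetic, built so that higher scores tend to mean higher losses; all numbers are purely illustrative.

```python
# Sketch of a decile lift calculation: sort policies by model score,
# break the sorted data into 10 equal pieces ("deciles"), and compute
# each decile's loss ratio (losses / premium).

def decile_loss_ratios(policies, n_bins=10):
    """policies: iterable of dicts with 'score', 'loss', 'premium'."""
    ranked = sorted(policies, key=lambda p: p["score"])
    size = len(ranked) // n_bins
    ratios = []
    for i in range(n_bins):
        chunk = ranked[i * size:(i + 1) * size]
        ratios.append(sum(p["loss"] for p in chunk) /
                      sum(p["premium"] for p in chunk))
    return ratios  # ratios[0]: best decile, ratios[-1]: worst decile

import random
random.seed(0)
book = [{"score": s + random.random(),  # score, plus a little noise
         "loss": 40 + 0.1 * s,          # loss rises with the score
         "premium": 100.0}
        for s in range(1000)]
ratios = decile_loss_ratios(book)
lift = ratios[-1] / ratios[0]  # worst decile vs. best decile loss ratio
```

The spread between `ratios[-1]` and `ratios[0]` is the segmentation power the slide calls lift; a model with no predictive power would produce roughly flat deciles.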
The Data Mining Process

Raw Data
* Research and evaluate possible data sources
  * Availability, hit rate, implementability, cost-effectiveness
* Extract/purchase the data
* Check the data for quality (QA)
* At this stage the data is still in "raw" form; you often start with voluminous transactional data
* Much of the data mining process is "messy"

Variable Creation
* Create predictive and target variables
* Needs good programming skills
* Needs domain and business expertise
* The steepest part of the learning curve
* Discuss the specifics of variable creation with company experts: underwriters, actuaries, marketers...
* An opportunity to quantify tribal wisdom

Variable Transformation
* Univariate analysis of the predictive variables
  * Exploratory Data Analysis (EDA)
  * Data visualization
* Use EDA to cap/transform predictive variables
  * Extreme values, missing values ...etc.

Multivariate Analysis
* Examine correlations among the variables
* Weed out redundant, weak, or poorly distributed variables
* Model design
* Build candidate models
  * Regression/GLM
  * Decision trees/MARS
  * Neural networks
* Select the final model

Model Analysis & Implementation
* Perform model analytics
* Calibrate models: create a user-friendly "scale" (the client dictates this)
* Implement models
  * Necessary for the client to gain comfort with the model
  * Programming skills again are critical
* Monitor performance: distribution of scores/variables, usage of the models, ...etc.
* Plan a model maintenance schedule

Model Design: Where Data Mining Needs Actuarial Science

Model Design Issues
* Which target variable to use?
  * Frequency & severity
  * Loss ratio or other profitability measures
  * Binary targets: defection, cross-sell ...etc.
* How to prepare the target variable?
  * Period: 1-year or multi-year?
  * Losses evaluated as of when?
  * Cap large losses? Cat losses?
  * How / whether to re-rate or adjust premium?
  * What counts as a "retaining" policy? ...etc.

Model Design Issues
* Which data points to include/exclude; which variables to consider?
  * Certain classes of business? Certain states? ...etc.
  * Credit, or non-credit only?
  * Include rating variables in the model?
  * Exclude certain variables for regulatory reasons? ...etc.
* What is the "level" of the model?
  * Policy-term level, household level, risk level ...etc.
  * Or should the data be summarized into "cells" à la minimum bias?

Model Design Issues
* How should the model be evaluated?
  * Lift curves, gains chart, ROC curve?
  * How to measure ROI?
* How to split the data into train/test/validation? Or use cross-validation?
* Is there enough data for the lift curve to be "credible"? Are your "incredible" results credible? ...etc.
* This is not an exhaustive list -- every project raises different actuarial issues!

Reference
* My favorite textbook: The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
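The out-of-sample design discussed under Out-of-Sample Testing (random train/test/validation split, with validation kept in a "lock box") can be sketched as follows; the 60/20/20 proportions and fixed seed are illustrative assumptions, not prescriptions from the talk.

```python
import random

def train_test_validation_split(records, fractions=(0.6, 0.2, 0.2), seed=42):
    """Randomly divide data into training, test, and validation pieces.

    Train/test are used iteratively to fit and compare models; the
    validation piece stays in a "lock box" and is scored only once,
    after the model is final, to give an unbiased lift estimate.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(fractions[0] * len(shuffled))
    n_test = int(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation

train, test, validation = train_test_validation_split(range(1000))
```

Every record lands in exactly one piece, which is what makes the validation lift curve an honest estimate of future performance: no record used to tune the model is ever used to grade it.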