Introduction to Data Mining
James Guszcza, FCAS, MAAA
CAS 2004 Ratemaking Seminar
Philadelphia, March 11-12, 2004
© Deloitte Consulting, 2004

Themes
• What is Data Mining?
    - How does it relate to statistics?
• Insurance applications
• Data sources
• The Data Mining Process
• Model Design
• Modeling Techniques
    - Louise Francis’ Presentation

Themes
• How does data mining need actuarial science?
    - Variable creation
    - Model design
    - Model evaluation
• How does actuarial science need data mining?
    - Advances in computing, modeling techniques
    - Ideas from other fields can be applied to insurance problems

Themes
• “The quiet statisticians have changed our world; not by discovering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions.” -- Ian Hacking
• Data mining gives us new ways of approaching the age-old problems of risk selection and pricing…
    - …and other problems not traditionally considered ‘actuarial’.

What is Data Mining?

What is Data Mining?
• My definition: “Statistics for the Computer Age”
    - Not a radical break with traditional statistics
    - Complements, builds on traditional statistics
• Statistics enriched with brute-force capabilities of modern computing
    - Opens the door to new techniques
    - Therefore Data Mining tends to be associated with industrial-sized data sets
• Many new techniques have come from Computer Science, Marketing, Biology… but all can (should!) be brought under the framework of “statistics”

Buzz-words
• Data Mining
• Knowledge Discovery
• Machine Learning
• Statistical Learning
• Predictive Modeling
• Supervised Learning
• Unsupervised Learning
• …etc

What is Data Mining?
• Supervised learning: predict the value of a target variable based on several predictive variables
    - “Predictive Modeling”
    - Credit / non-credit scoring engines
    - Retention, cross-sell models
• Unsupervised learning: describe associations and patterns along many dimensions without any target information
    - Customer segmentation
    - Data Clustering
    - Market basket analysis (“diapers and beer”)

So Why Should Actuaries Do This Stuff?
• Any application of statistics requires subject-matter expertise
    - Psychometricians
    - Econometricians
    - Bioinformaticians
    - Marketing scientists
    - …are all applied statisticians with a particular subject-matter expertise & area of specialty
• Add actuarial modelers to this list!
    - “Insurometricians”!?
• Actuarial knowledge is critical to the success of insurance data mining projects

Three Concepts
• Scoring engines
    - A “predictive model” by any other name…
• Lift curves
    - How much worse than average are the policies with the worst scores?
• Out-of-sample tests
    - How well will the model work in the real world?
    - Unbiased estimate of predictive power

Classic Application: Scoring Engines
• Scoring engine: formula that classifies or separates policies (or risks, accounts, agents…) into
    - Profitable vs. unprofitable
    - Retaining vs. non-retaining…
• (Non-)Linear equation f( ) of several predictive variables
• Produces continuous range of scores

    score = f(X1, X2, …, XN)

What “Powers” a Scoring Engine?
Scoring Engine: score = f(X1, X2, …, XN)
• The X1, X2, …, XN are at least as important as the f( )!
    - Again why actuarial expertise is necessary
    - Think of the predictive power of credit variables
• A large part of the modeling process consists of variable creation and selection
    - Usually possible to generate 100’s of variables
    - Steepest part of the learning curve

Model Evaluation: Lift Curves
• Sort data by score
• Break the dataset into 10 equal pieces
    - Best “decile”: lowest score → lowest LR
    - Worst “decile”: highest score → highest LR
    - Difference: “Lift”
• Lift = segmentation power
• Lift translates into ROI of the modeling project

Out-of-Sample Testing
• Randomly divide data into 3 pieces
    - Training data, Test data, Validation data
• Use Training data to fit models
• Score the Test data to create a lift curve
    - Perform the train/test steps iteratively until you have a model you’re happy with
    - During this iterative phase, validation data is set aside in a “lock box”
• Once model has been finalized, score the Validation data and produce a lift curve
    - Unbiased estimate of future performance

Data Mining: Applications
• The classic: Profitability Scoring Model
    - Underwriting/Pricing applications
• Credit models
• Retention models
• Elasticity models
• Cross-sell models
• Lifetime Value models
• Agent/agency monitoring
• Target marketing
• Fraud detection
• Customer segmentation
    - No target variable (“unsupervised learning”)

Skills needed
• Statistical
    - Beyond college/actuarial exams… fast-moving field
• Actuarial
    - The subject-matter expertise
• Programming!
    - Data extraction, data load, model implementation
• IT - Systems Administration
    - Need scalable software, computing environment
• Project Management
    - Absolutely critical because of the scope & multidisciplinary nature of data mining projects

Data Sources
• Company’s internal data
    - Policy-level records
    - Loss & premium transactions
    - Billing
    - VIN…
• Externally purchased data
    - Credit
    - CLUE
    - MVR
    - Census
    - …

The Data Mining Process

Raw Data
• Research/Evaluate possible data sources
    - Availability
    - Hit rate
    - Implementability
    - Cost-effectiveness
• Extract/purchase data
• Check data for quality (QA)
• At this stage, data is still in a “raw” form
    - Often start with voluminous transactional data
    - Much of the data mining process is “messy”

Variable Creation
• Create predictive and target variables
    - Need good programming skills
    - Need domain and business expertise
• Steepest part of the learning curve
• Discuss specifics of variable creation with company experts
    - Underwriters, Actuaries, Marketers…
• Opportunity to quantify tribal wisdom

Variable Transformation
• Univariate analysis of predictive variables
• Exploratory Data Analysis (EDA)
• Data Visualization
• Use EDA to cap / transform predictive variables
    - Extreme values
    - Missing values
    - …etc

Multivariate Analysis
• Examine correlations among the variables
• Weed out redundant, weak, poorly distributed variables
• Model design
• Build candidate models
    - Regression/GLM
    - Decision Trees/MARS
    - Neural Networks
• Select final model

Model Analysis & Implementation
• Perform model analytics
    - Necessary for client to gain comfort with the model
• Calibrate Models
    - Create user-friendly “scale” - client dictates
• Implement models
    - Programming skills again are critical
• Monitor performance
    - Distribution of scores/variables, usage of the models, …etc
• Plan model maintenance schedule

Model Design
Where Data Mining Needs Actuarial Science

Model Design Issues
• Which target variable to use?
    - Frequency & severity
    - Loss Ratio, other profitability measures
    - Binary targets: defection, cross-sell
    - …etc
• How to prepare the target variable?
    - Period - 1-year or Multi-year?
    - Losses evaluated @?
    - Cap large losses?
    - Cat losses?
    - How / whether to re-rate, adjust premium?
    - What counts as a “retaining” policy?
    - …etc

Model Design Issues
• Which data points to include/exclude
    - Certain classes of business?
    - Certain states?
    - …etc
• Which variables to consider?
    - Credit, or non-credit only?
    - Include rating variables in the model?
    - Exclude certain variables for regulatory reasons?
    - …etc
• What is the “level” of the model?
    - Policy-term level, HH-level, Risk-level …etc
    - Or should data be summarized into “cells” à la minimum bias?

Model Design Issues
• How should model be evaluated?
    - Lift curves, Gains chart, ROC curve?
• How to measure ROI?
• How to split data into train/test/validation? Or cross-validation?
• Is there enough data for lift curve to be “credible”?
    - Are your “incredible” results credible?
• …etc
• Not an exhaustive list - every project raises different actuarial issues!

Reference
My favorite textbook: The Elements of Statistical Learning -- Jerome Friedman, Trevor Hastie, Robert Tibshirani
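
Illustration: the lift-curve steps from the "Model Evaluation: Lift Curves" slide (sort by score, break into 10 equal pieces, compare the loss ratio, LR, of the worst and best deciles) could be sketched in Python roughly as follows. This is a minimal sketch, not code from the presentation: the function name decile_lift, the synthetic scores, premiums, and losses, and the choice to measure lift as the worst-decile LR minus the best-decile LR are all assumptions made for illustration.

```python
import random


def decile_lift(scores, losses, premiums, n_bins=10):
    """Sort policies by score, cut them into n_bins equal pieces,
    and return the loss ratio (total loss / total premium) of each piece.

    Assumes at least n_bins policies and positive premium in every piece.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    loss_ratios = []
    for b in range(n_bins):
        piece = order[b * n // n_bins:(b + 1) * n // n_bins]
        total_loss = sum(losses[i] for i in piece)
        total_premium = sum(premiums[i] for i in piece)
        loss_ratios.append(total_loss / total_premium)
    return loss_ratios


# Synthetic (hypothetical) data: higher scores carry higher expected losses.
rng = random.Random(0)
scores = [rng.random() for _ in range(10_000)]
premiums = [1000.0] * len(scores)
losses = [rng.expovariate(1.0) * 1000.0 * (0.3 + 0.8 * s) for s in scores]

lr = decile_lift(scores, losses, premiums)
lift = lr[-1] - lr[0]  # worst decile LR minus best decile LR

for decile, ratio in enumerate(lr, start=1):
    print(f"decile {decile:2d}: LR = {ratio:.3f}")
print(f"lift (worst - best): {lift:.3f}")
```

On real data the scores would come from the fitted model, and the deciles should be built on data the model has not seen.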

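Illustration: the three-way random split from the "Out-of-Sample Testing" slide could look something like the Python sketch below. The slides do not give split proportions, so the 60/20/20 fractions, the fixed seed, and the name three_way_split are assumptions for illustration only.

```python
import random


def three_way_split(records, fractions=(0.6, 0.2, 0.2), seed=2004):
    """Randomly divide records into training, test, and validation sets.

    The validation set takes whatever remains after the first two pieces,
    so every record lands in exactly one set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_test = round(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # the "lock box"
    return train, test, validation


# Hypothetical policy identifiers standing in for real policy records.
policy_ids = list(range(1000))
train, test, validation = three_way_split(policy_ids)
print(len(train), len(test), len(validation))
```

Only the training and test sets would be used during the iterative modeling phase; the validation set is scored once, after the model has been finalized.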