Models: Do You Trust Them?
2003 CAS Annual Meeting
Louise Francis, FCAS, MAAA
[email protected]
Francis Analytics and Actuarial Data Mining, Inc.

Overview
• Data Quality
  – Data Cleaning
  – Software Errors
• Model Assumptions
  – Questions About Key Assumptions Underlying Popular Models in Finance
    • Option Pricing Theory
    • Value at Risk
    • CAPM

Data Mining Models
• Advanced modeling techniques applied to large databases
  – Many records
  – Many variables
• Some uses
  – Credit scoring
  – Fraud detection
  – Pricing

Data Issues
• “Misplaced faith in black boxes: Data Mining is sometimes perceived as a black box, where you feed the data in and interesting results and patterns emerge. Such an approach is particularly misleading when no prior knowledge or experience is used to validate the results of the mining exercise”
  – Exploratory Data Mining and Data Cleaning, by Dasu and Johnson

Data Exploration and Cleaning
• The overwhelming majority of the effort in data modeling is expended on understanding and cleaning data
• Generally 85% or more of the effort is spent on data issues
• This gets the modeler to the point of applying a modeling technique

Dirty Data
• A fact of life for actuaries
• Even more of a problem when working with large complex databases
  – The information for many variables that are not used to produce key financial numbers is inaccurately or incompletely recorded

Examples of Data Problems
• Examples are based on actual problems encountered in Data Mining projects
• Examples use simulated data

Dirty Data – Incomplete Data
Field                   % Records with Missing Data
Claim Number            0%
Claimant                1%
Accident Date           0%
Report Date             95%
Return to Work Date     100%
Close Date              60%
Incurred Loss           0%
Paid Loss               0%
Injury Type             100%
Body Part               100%
Cause of Loss           100%
Age of Claimant         100%
Occupation              100%
Gender                  100%
Provider 1 Type         100%
Provider 2 Type         100%

Dirty Data: Errors
Claim Number vs. Report Date
[Figure: scatter plot of Report Date (1990 to 2004) against Claim Number (4,000 to 9,000); R Sq Linear = 0.992]

Detecting Unusual Data: Box and Whisker Plot of Workers’ Compensation Payments
[Figure: box-and-whisker plot of Limited Payment; axis from -10,000 to 10,000, with outlying points marked by asterisks on both sides]

Detecting Unusual Data: Histogram
[Figure: histogram of Limited Payment; x-axis from -10,000 to 10,000, y-axis Frequency from 0 to 4,000; Mean = 1,266.12, Std. Dev. = 2,308.801, N = 10,445]

Detecting Unusual Data: Descriptive Statistics
Statistic                   Value
N (Valid)                   10,442
N (Missing)                 0
Mean                        1,916.52
Median                      509.71
Skewness                    18.06
Kurtosis                    461.03
Std. Error of Kurtosis      0.05
Minimum                     (54,777.04)
Maximum                     296,629.88

Percentile    Value
5             7.97
10            49.88
15            93.67
20            134.37
25            180.08
30            235.22
35            288.26
40            351.26
45            424.10
50            509.71
55            605.12
60            723.73
65            876.01
70            1,073.48
75            1,342.50
80            1,712.25
85            2,326.79
90            3,545.21
95            6,896.36

Frequency of Unusual Observations
Negative Payment    Frequency
No
Yes                 4.3%

Data Challenges
• Heterogeneity and Diversity of Data
• Join Keys
• Scale
• Metadata

The Fraud Study Data
• 1993 AIB closed PIP claims
• Dependent Variables
  – Suspicion Score
  – Expert assessment of likelihood of fraud or abuse
• Predictor Variables
  – Red flag indicators
  – Claim file variables
• Errors were introduced into data for two variables, suspicion score and claimant age

Data Cubes: Pivot Table Example
Average of Suspicion Score, by Injury Type (rows) and Legal Representation (columns)
Injury Type    Legal Rep. 1.00    Legal Rep. 2.00    Grand Total
1.00           0.67               1.48               0.79
2.00           0.00               0.00               0.00
4.00           0.39               3.73               1.79
5.00           1.32               3.78               2.91
6.00           0.15               3.32               1.37
7.00           0.00               0.96               0.68
8.00           0.33               0.57               0.45
10.00          0.56               1.40               0.88
99.00          2.08               1.43               1.91
Grand Total    0.92               3.32               2.11

Data Spheres
• Applied to numeric data
• Can apply to a number of variables simultaneously to detect outliers
• Compute the standardized value $y_j$ for each variable $j$
• Compute the Mahalanobis distance for record $i$ over the $v$ variables: $d_i = \sqrt{\sum_{j=1}^{v} y_j^2}$

Data Spheres
• More typical values on variables will fall at the center of the data sphere
• Less typical values and outliers will be in outer layers
• Can look at which variables most influence the Mahalanobis distance

Distribution of Age by Data Sphere Layer
[Figure: Zscore: Age (-2 to 4) by Data Sphere Layer (1 to 10)]

Distribution of Suspicion Score by Data Sphere Layer
[Figure: Zscore: Suspicion Level (-5 to 25) by Data Sphere Layer (1 to 10), with individual outlying cases labeled]

Spreadsheet Errors
• A large percentage of spreadsheets contain errors. One study found errors in 86% of spreadsheets
  – From Raymond Panko, “What We Know About Spreadsheet Errors”
• Methods for finding and correcting errors are fairly well developed for programming in computer languages
• Such methods are much less frequently applied when the model is in a spreadsheet

Questioning Model Assumptions
• Option Pricing Theory
$C = e^{-\delta T} S N(d_1) - e^{-rT} E N(d_2)$
$d_1 = [\ln(S/E) + (n + \tfrac{1}{2}\sigma^2) T] / (\sigma \sqrt{T})$
$d_2 = d_1 - \sigma \sqrt{T}$

Option Pricing Theory
• Option Pricing Formula widely used in finance in pricing options and other derivatives
• The formula assumes asset distributions are normal or lognormal
• Evidence that asset return data does not follow the normal distribution is widely available
  – 1976 Fama paper in Journal of the American Statistical Association

Normal Distribution Assumption
• The normality assumption is common in other finance applications
  – Value at risk
  – CAPM

Test of Normal Distribution Assumption
[Figure: Normal Q-Q Plot of Monthly Return on S&P; x-axis Observed Value (0.8 to 1.3), y-axis Expected Normal Value (0.85 to 1.15)]

Test of Normal Distribution Assumption
Descriptive Statistics
                         N      Mean     Std. Deviation    Skewness (Std. Error)    Kurtosis (Std. Error)
Monthly Return on S&P    251    .9931    .04585            1.410 (.154)             6.081 (.306)
Valid N (listwise)       251

Consequences of Assuming Normality
• The frequency of extreme events is underestimated – often by a lot
• Example: Long Term Capital
  – “Theoretically, the odds against a loss such as August’s had been prohibitive, such a debacle was, according to mathematicians, an event so freakish as to be unlikely to occur even once over the entire life of the universe and even over numerous repetitions of the universe”
    • When Genius Failed by Roger Lowenstein, p. 159
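
A rough numeric check of the normality assumption can be sketched from the descriptive statistics reported above for Monthly Return on S&P. The sketch below is illustrative only, not the author's method. It assumes the reported kurtosis is excess kurtosis (zero for a normal distribution), and the thresholds of 3, 4 and 5 standard deviations are hypothetical choices. The function names are made up for this example.

```python
import math

# Statistics as reported on the Descriptive Statistics slide
# for Monthly Return on S&P.
N_MONTHS = 251
SKEWNESS, SKEWNESS_SE = 1.410, 0.154
KURTOSIS, KURTOSIS_SE = 6.081, 0.306  # ASSUMPTION: excess kurtosis (normal = 0)


def z_ratio(statistic: float, std_error: float) -> float:
    """Statistic divided by its standard error.

    A ratio far above 2 in absolute value is hard to reconcile
    with a normal distribution.
    """
    return statistic / std_error


def normal_upper_tail(k: float) -> float:
    """P(Z > k) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))


def expected_exceedances(n_obs: int, k: float) -> float:
    """Expected number of observations more than k standard deviations
    from the mean (either side) if the data really were normal."""
    return n_obs * 2.0 * normal_upper_tail(k)


if __name__ == "__main__":
    print("skewness / std. error:", round(z_ratio(SKEWNESS, SKEWNESS_SE), 2))
    print("kurtosis / std. error:", round(z_ratio(KURTOSIS, KURTOSIS_SE), 2))
    # Hypothetical thresholds, chosen only for illustration.
    for k in (3, 4, 5):
        count = expected_exceedances(N_MONTHS, k)
        print(f"expected months beyond {k} std. dev. under normality: {count:.6f}")
```

Both ratios come out far above 2, which agrees with the departure from the straight line in the Q-Q plot. Under normality, fewer than one of the 251 months would be expected to lie more than three standard deviations from the mean, so a model built on that assumption treats larger moves as close to impossible.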