Models: Do You Trust Them?
2003 CAS Annual Meeting
Louise Francis, FCAS, MAAA
Francis Analytics and Actuarial Data Mining, Inc.
Overview
• Data Quality
– Data Cleaning
– Software Errors
• Model Assumptions
– Questions About Key Assumptions Underlying
Popular Models in Finance
• Option Pricing Theory
• Value at Risk
• CAPM
Data Mining Models
• Advanced modeling techniques applied to
large data bases
– Many records
– Many variables
• Some uses
– Credit scoring
– Fraud detection
– Pricing
Data Issues
• “Misplaced faith in black boxes: Data Mining is
sometimes perceived as a black box, where you
feed the data in and interesting results and
patterns emerge. Such an approach is
particularly misleading when no prior knowledge
or experience is used to validate the results of
the mining exercise”
– Exploratory Data Mining and Data Cleaning, by Dasu
and Johnson
Data Exploration and Cleaning
• The overwhelming majority of the effort in
data modeling is expended on
understanding and cleaning data
• Generally 85% or more of the effort is
spent on data issues
• This gets the modeler to the point of
applying a modeling technique
Dirty Data
• A fact of life for actuaries
• Even more of a problem when working
with large complex databases
– The information for many variables that are
not used to produce key financial numbers
is inaccurately or incompletely recorded
Examples of Data Problems
• Examples are based on actual problems
encountered in Data Mining projects
• Examples use simulated data
Dirty Data – Incomplete Data
Field                  % Records with Missing Data
Claim Number           0%
Claimant               1%
Accident date          0%
Report Date            95%
Return to Work Date    100%
Close Date             60%
Incurred Loss          0%
Paid Loss              0%
Injury Type            100%
Body Part              100%
Cause of Loss          100%
Age of Claimant        100%
Occupation             100%
Gender                 100%
Provider 1 Type        100%
Provider 2 Type        100%
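A missing-data profile like the table above can be produced with a few lines of code. The following is a minimal sketch, not the method used for the slide: the records, the field names, and the convention that None or an empty string means "missing" are all hypothetical.

```python
# Hypothetical claim records; None or "" is treated as a missing value.
records = [
    {"claim_number": 4001, "claimant": "A", "close_date": None, "injury_type": None},
    {"claim_number": 4002, "claimant": "", "close_date": "1999-03-01", "injury_type": None},
    {"claim_number": 4003, "claimant": "C", "close_date": None, "injury_type": None},
    {"claim_number": 4004, "claimant": "D", "close_date": "2000-07-15", "injury_type": None},
]


def missing_rates(rows):
    """Return {field: percent of rows in which the field is missing}."""
    fields = sorted({name for row in rows for name in row})
    rates = {}
    for name in fields:
        n_missing = sum(1 for row in rows if row.get(name) in (None, ""))
        rates[name] = 100.0 * n_missing / len(rows)
    return rates


rates = missing_rates(records)
for name, pct in rates.items():
    print(f"{name:15s} {pct:5.1f}%")
```

Fields that come back at or near 100% missing cannot be used as predictors at all, which is worth knowing before any modeling starts.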
Dirty Data: Errors
[Figure: Claim Number vs. Report Date. Scatter plot of Report Date (axis 1990 to 2004) against Claim Number (axis 4000 to 9000); R Sq Linear = 0.992]
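One way to turn a plot like this into a screening rule is to fit a straight line of report date on claim number and flag the records that sit far from it. The sketch below assumes claim numbers are issued roughly in date order; the sample records, the function names, and the cutoff of three times the median absolute residual are illustrative choices, not taken from the presentation.

```python
import statistics
from datetime import date

# Hypothetical (claim_number, report_date) pairs; the third date is a planted error.
sample = [
    (4000, date(1991, 1, 15)),
    (5000, date(1993, 6, 1)),
    (6000, date(2003, 11, 20)),
    (7000, date(1998, 2, 10)),
    (8000, date(2000, 8, 5)),
    (9000, date(2003, 1, 30)),
]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def flag_report_dates(pairs, multiple=3.0):
    """Return claim numbers whose report date is far from the fitted line."""
    xs = [float(number) for number, _ in pairs]
    ys = [float(day.toordinal()) for _, day in pairs]  # dates as day counts
    intercept, slope = fit_line(xs, ys)
    residuals = [abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)]
    cutoff = multiple * statistics.median(residuals)
    return [number for (number, _), r in zip(pairs, residuals) if r > cutoff]


flagged = flag_report_dates(sample)
print(flagged)
```

Flagged records are candidates for review, not confirmed errors; a late-reported claim can legitimately fall off the line.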
Detecting Unusual Data: Box and Whisker Plot of Workers’ Compensation Payments
[Figure: box and whisker plot of Limited Payment (axis -10,000 to 10,000), with many outlying points marked above and below the box]
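The points a box and whisker plot marks as unusual are commonly defined by fences set 1.5 interquartile ranges beyond the quartiles. A minimal sketch of that rule follows; the payment values are hypothetical, and the 1.5 multiplier and the "inclusive" quantile method are conventional choices rather than settings known to have been used for the slide.

```python
import statistics

# Hypothetical limited payments, including one negative and one very large value.
payments = [120.0, 250.0, 310.0, 480.0, 510.0, 640.0, 900.0, 1300.0, -4200.0, 9800.0]


def tukey_outliers(values, k=1.5):
    """Return the values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


outliers = tukey_outliers(payments)
print(outliers)
```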
Detecting Unusual Data: Histogram
[Figure: histogram of Limited Payment (axis -10,000 to 10,000) against Frequency (axis 0 to 4,000); Mean = 1,266.12, Std. Dev. = 2,308.801, N = 10,445]
Detecting Unusual Data: Descriptive Statistics
Statistic                  Value
N Valid                    10442
N Missing                  0
Mean                       1,916.52
Median                     509.71
Skewness                   18.06
Kurtosis                   461.03
Std. Error of Kurtosis     0.05
Minimum                    (54,777.04)
Maximum                    296,629.88
Percentiles
5                          7.97
10                         49.88
15                         93.67
20                         134.37
25                         180.08
30                         235.22
35                         288.26
40                         351.26
45                         424.10
50                         509.71
55                         605.12
60                         723.73
65                         876.01
70                         1,073.48
75                         1,342.50
80                         1,712.25
85                         2,326.79
90                         3,545.21
95                         6,896.36
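Shape statistics of this kind are straightforward to compute directly. The sketch below uses the simple moment-based definitions of skewness and excess kurtosis; statistical packages usually apply small-sample adjustments, so their figures will differ slightly. The payment values are hypothetical.

```python
import statistics


def shape_stats(values):
    """Mean, median, moment-based skewness and excess kurtosis of a sample."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return {
        "mean": mean,
        "median": statistics.median(values),
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }


# Hypothetical payments with one very large value in the right tail.
skewed = shape_stats([10.0, 50.0, 90.0, 130.0, 180.0, 510.0, 1300.0, 6900.0, 296000.0])
symmetric = shape_stats([1.0, 2.0, 3.0, 4.0, 5.0])
print(skewed)
```

A mean far above the median together with large positive skewness, as in the table above, points to a small number of very large values that deserve a closer look.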
Frequency of Unusual Observations
Negative Payment
No
Yes     4.3%
Data Challenges
• Heterogeneity and Diversity of Data
• Join Keys
• Scale
• Metadata
The Fraud Study Data
• 1993 AIB closed PIP claims
• Dependent Variables
• Suspicion Score
• Expert assessment of likelihood of fraud or abuse
• Predictor Variables
• Red flag indicators
• Claim file variables
• Errors were introduced into data for two
variables, suspicion score and claimant age
Data Cubes: Pivot Table Example
Average of Suspicion Score by Injury Type (rows) and Legal Representation (columns)
Injury Type    1.00    2.00    Grand Total
1.00           0.67    1.48    0.79
2.00           0.00    0.00    0.00
4.00           0.39    3.73    1.79
5.00           1.32    3.78    2.91
6.00           0.15    3.32    1.37
7.00           0.00    0.96    0.68
8.00           0.33    0.57    0.45
10.00          0.56    1.40    0.88
99.00          2.08    1.43    1.91
Grand Total    0.92    3.32    2.11
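A pivot table of averages like this one is a grouped mean with row, column, and grand totals. The following is a minimal sketch with hypothetical rows and a hypothetical "total" key for the margins; it is not the spreadsheet used for the slide.

```python
from collections import defaultdict

# Hypothetical (injury_type, legal_representation, suspicion_score) rows.
rows = [
    (1, 1, 0.0), (1, 1, 2.0), (1, 2, 3.0),
    (5, 1, 1.0), (5, 2, 4.0), (5, 2, 5.0),
]


def pivot_mean(data):
    """Average score per (injury, legal) cell, with "total" margins."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for injury, legal, score in data:
        for key in ((injury, legal), (injury, "total"),
                    ("total", legal), ("total", "total")):
            cell = sums[key]
            cell[0] += score
            cell[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


table = pivot_mean(rows)
print(table[(5, 2)], table[("total", "total")])
```

Cells whose averages are out of line with their neighbors, or codes such as 99 that look like placeholders, are the places to start checking the data.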
Data Spheres
• Applied to numeric data
• Can apply to a number of variables simultaneously to detect outliers
• Compute standardized value for each variable, y_i
• Compute Mahalanobis distance:
d_i = \sqrt{\sum_{j=1}^{v} y_j^2}
Data Spheres
• More typical values on variables will fall at
the center of the data sphere
• Less typical values and outliers will be in
outer layers
• Can look at which variables most influence
the Mahalanobis distance
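The data sphere steps can be sketched as follows: standardize each variable, combine the standardized values into one distance per record, and cut the ranked distances into layers. Several details here are assumptions rather than facts from the slides: the distance is taken as the square root of the sum of squared standardized values (correlation between variables is ignored, and taking or omitting the root does not change the layer ordering), layers hold equal numbers of records, and the sample data and function name are hypothetical.

```python
import math
import statistics


def sphere_layers(rows, n_layers=10):
    """Return (distances, layers) for a list of equal-length numeric tuples."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    # Distance of each record from the center, in standardized units.
    distances = [
        math.sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(row, means, sds)))
        for row in rows
    ]
    # Rank records by distance and cut the ranking into equal-count layers,
    # layer 1 being the center and layer n_layers the outermost shell.
    order = sorted(range(len(rows)), key=distances.__getitem__)
    layers = [0] * len(rows)
    for rank, index in enumerate(order):
        layers[index] = 1 + rank * n_layers // len(rows)
    return distances, layers


# Hypothetical (age, suspicion score) records; the last age is a planted error.
people = [(25, 1), (32, 0), (41, 2), (38, 1), (29, 3),
          (55, 2), (47, 0), (36, 1), (61, 4), (150, 1)]
distances, layers = sphere_layers(people, n_layers=5)
print(layers)
```

Comparing each variable's squared standardized value within a record shows which variable drives that record into an outer layer.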
Distribution of Age by Data Sphere Layer
[Figure: Zscore: Age (axis -2 to 4) plotted by Data Sphere Layer (1 to 10)]
Distribution of Suspicion Score by Data Sphere Layer
[Figure: Zscore: Suspicion Level (axis -5 to 25) plotted by Data Sphere Layer (1 to 10), with individual outlying records labeled by record number]
Spreadsheet Errors
• A large percentage of spreadsheets contain errors. One study found errors in 86% of spreadsheets
– From Raymond Panko, “What We Know About Spreadsheet Errors”
• Methods for finding and correcting errors are fairly well developed for programming in computer languages
• Such methods are much less frequently applied when the model is in a spreadsheet
Questioning Model Assumptions
• Option Pricing Theory
C = e^{-\delta T} S N(d_1) - e^{-rT} E N(d_2)
d_1 = [\ln(S/E) + (r - \delta + \frac{1}{2}\sigma^2) T] / (\sigma \sqrt{T})
d_2 = d_1 - \sigma \sqrt{T}
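The option pricing formula can be evaluated in a few lines. This sketch reads the symbols in the usual way, which is an assumption because the slide does not define them: S is the asset price, E the exercise price, r the risk-free rate, delta a continuous dividend yield, sigma the volatility, and T the time to expiry in years. N is the standard normal distribution function, so the normality assumption enters through every call to norm_cdf.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def call_price(S, E, r, delta, sigma, T):
    """European call value under the option pricing formula with dividend yield."""
    d1 = (math.log(S / E) + (r - delta + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return math.exp(-delta * T) * S * norm_cdf(d1) - math.exp(-r * T) * E * norm_cdf(d2)


# Illustrative inputs only: at-the-money call, one year, 20% volatility.
price = call_price(S=100.0, E=100.0, r=0.05, delta=0.0, sigma=0.20, T=1.0)
print(round(price, 2))  # about 10.45
```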
Option Pricing Theory
• Option Pricing Formula widely used in finance in pricing options and other derivatives
• The formula assumes asset distributions are normal or lognormal
• Evidence that asset return data does not follow the normal distribution is widely available
– 1976 Fama paper in Journal of the American Statistical Association
Normal Distribution Assumption
• The normality assumption is common in
other finance applications
– Value at risk
– CAPM
Test of Normal Distribution Assumption
[Figure: Normal Q-Q Plot of Monthly Return on S&P; Observed Value (axis 0.8 to 1.3) against Expected Normal Value (axis 0.85 to 1.15)]
Test of Normal Distribution
Assumption
Descriptive Statistics
                         N      Mean     Std. Deviation    Skewness (Std. Error)    Kurtosis (Std. Error)
Monthly Return on S&P    251    .9931    .04585            1.410 (.154)             6.081 (.306)
Valid N (listwise)       251
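A rough check of normality divides each shape statistic by its standard error; under normality the ratio should rarely exceed about 2 in absolute value. The sketch below uses the standard formulas for the standard errors of skewness and kurtosis, which reproduce the .154 and .306 in the table for N = 251. The function names are hypothetical, and the threshold of 2 is a rule of thumb, not a formal test.

```python
import math


def se_skewness(n):
    """Standard error of sample skewness for a sample of size n."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of sample excess kurtosis for a sample of size n."""
    return 2.0 * se_skewness(n) * math.sqrt((n * n - 1.0) / ((n - 3) * (n + 5)))


n = 251                      # months of S&P returns, from the table
skew, kurt = 1.410, 6.081    # statistics from the table

z_skew = skew / se_skewness(n)
z_kurt = kurt / se_kurtosis(n)
print(f"skewness ratio {z_skew:.1f}, kurtosis ratio {z_kurt:.1f}")
```

Both ratios come out far above 2, consistent with the departure from the straight line in the Q-Q plot.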
Consequences of Assuming
Normality
• The frequency of extreme events is underestimated, often by a lot
• Example: Long Term Capital
– “Theoretically, the odds against a loss such as August’s had been prohibitive, such a debacle was, according to mathematicians, an event so freakish as to be unlikely to occur even once over the entire life of the universe and even over numerous repetitions of the universe”
• When Genius Failed by Roger Lowenstein, p. 159
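The size of the underestimate can be illustrated by comparing tail probabilities under a normal distribution with those under a fat-tailed alternative. The Student t distribution with 3 degrees of freedom, rescaled to unit variance, is used here purely as an illustrative stand-in for fat-tailed returns; it is an assumption of this sketch, not a distribution fitted in the presentation.

```python
import math


def normal_tail(k):
    """P(Z > k) for a standard normal variable."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))


def t3_tail(k):
    """P(X > k) where X is a Student t with 3 df rescaled to unit variance.

    Uses the closed-form distribution function of the t distribution with 3 df.
    """
    return 0.5 - (k / (1.0 + k * k) + math.atan(k)) / math.pi


for k in (3, 5, 7):
    ratio = t3_tail(k) / normal_tail(k)
    print(f"{k} std devs: normal {normal_tail(k):.2e}, "
          f"fat-tailed {t3_tail(k):.2e}, ratio {ratio:.0f}")
```

The further into the tail, the larger the ratio grows, which is how a loss judged nearly impossible under normality can occur within a few years of trading.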