Introduction to Bootstrapping
James Guszcza, FCAS, MAAA
CAS Predictive Modeling Seminar
Chicago, September 2005
© Deloitte Consulting, 2005
What’s it all about?
• Actuaries compute point estimates of statistics all the time:
  - Loss ratio / claim frequency for a population
  - Outstanding losses
  - Correlation between variables
  - GLM parameter estimates…
• A point estimate tells us what the data indicate.
• But how can we measure our confidence in this indication?
More Concisely…
• Point estimate says: "what do you think?"
• Variability of the point estimate says: "how sure are you?"

Traditional approaches:
• Credibility theory
• Use distributional assumptions to construct confidence intervals

Is there an easier and more flexible way?
Enter the Bootstrap
• In the late '70s the statistician Brad Efron made an ingenious suggestion.
• Most (sometimes all) of what we know about the "true" probability distribution comes from the data.
• So let's treat the data as a proxy for the true distribution.
• We draw multiple samples from this proxy…
  - This is called "resampling."
• …and compute the statistic of interest on each of the resulting pseudo-datasets.
Philosophy
• "[Bootstrapping] requires very little in the way of modeling, assumptions, or analysis, and can be applied in an automatic way to any situation, no matter how complicated."
• "An important theme is the substitution of raw computing power for theoretical analysis."
  - Efron and Gong 1983
• Bootstrapping fits very nicely into the "data mining" paradigm.
The Basic Idea
Theoretical Picture
[Diagram: the unknown "true" distribution "in the sky," with parameter μ, gives rise to the samples that "might have been":
  Sample 1: Y11, Y12, … Y1k → Y-bar1
  Sample 2: Y21, Y22, … Y2k → Y-bar2
  Sample 3: Y31, Y32, … Y3k → Y-bar3
  …
  Sample N: YN1, YN2, … YNk → Y-barN]

• Any actual sample of data was drawn from the unknown "true" distribution.
• We use the actual data to make inferences about the true parameters (μ).
• Each green oval is a sample that "might have been."
• The distribution of our estimator (Y-bar) depends on both the true distribution and the size (k) of our sample.
The Basic Idea
The Bootstrapping Process
[Diagram: the actual sample Y1, Y2, … Yk, with statistic Y-bar, plays the role of the "true" distribution and gives rise to the re-samples:
  Re-sample 1: Y*11, Y*12, … Y*1k → Y-bar*1
  Re-sample 2: Y*21, Y*22, … Y*2k → Y-bar*2
  Re-sample 3: Y*31, Y*32, … Y*3k → Y-bar*3
  …
  Re-sample N: Y*N1, Y*N2, … Y*Nk → Y-bar*N]

• Treat the actual sample as a proxy for the true distribution.
• Sample with replacement from your actual distribution N times.
• Compute the statistic of interest on each "re-sample."
• {Y-bar*} constitutes an estimate of the distribution of Y-bar.
Sampling With Replacement
• In fact, there is a chance of (1 − 1/500)^500 ≈ 1/e ≈ .368 that any one of the original data points won't appear at all if we sample with replacement 500 times.
  ⇒ Any data point is included with Prob ≈ .632 (see the check below).
• Intuitively, we treat the original sample as the "true population in the sky."
• Each resample simulates the process of taking a sample from the "true" distribution.
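As a quick sanity check, these probabilities are easy to reproduce in R (a minimal sketch, assuming the n = 500 sample size used above):

n <- 500
(1 - 1/n)^n      # probability a given point never appears in one resample; ~ 0.368
1 - (1 - 1/n)^n  # probability it appears at least once; ~ 0.632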
Theoretical vs. Empirical
• Graph on left: Y-bar calculated from an ∞ number of samples from the "true distribution."
• Graph on right: {Y-bar*} calculated in each of 1000 re-samples from the empirical distribution.
• Analogy: μ : Y-bar :: Y-bar : Y-bar*

[Figure: left panel "true distribution (Y-bar)", density of ybar over 70–120; right panel "bootstrap distribution (Y-bar*)", density of y.star.bar over 98.5–101.0]
Summary
• The empirical distribution (your data) serves as a proxy for the "true" distribution.
• "Resampling" means (repeatedly) sampling with replacement.
• Resampling the data is analogous to the process of drawing the data from the "true" distribution.
• We can resample multiple times:
  - Compute the statistic of interest T on each resample.
  ⇒ We get an estimate of the distribution of T.
Motivating Example
• Let's look at a simple case where we all know the answer in advance.
• Pull 500 draws from the n(5000, 100) distribution.
• The sample mean ≈ 5000.
  - It is a point estimate of the "true" mean μ.
  - But how sure are we of this estimate?
• From theory, we know that:
  s.d.(X-bar) = σ/√N = 100/√500 ≈ 4.47

raw data
statistic    value
#obs         500
mean         4995.79
sd           98.78
2.5%ile      4812.30
97.5%ile     5195.58
Visualizing the Raw Data
• 500 draws from n(5000, 100).
• Look at the summary statistics (previous slide), histogram, probability density estimate, and QQ-plot.
• … looks pretty normal.

[Figure: "n(5000,100) data", histogram with density estimate over 4700–5300, and Normal Q-Q plot]
Sampling With Replacement
Now let's use resampling to estimate the s.d. of the sample mean (≈ 4.47).

• Draw a data point at random from the data set.
  - Then throw it back in.
• Draw a second data point.
  - Then throw it back in.
• Keep going until we've got 500 data points.
  - You might call this a "pseudo" data set.
• This is not merely re-sorting the data:
  - Some of the original data points will appear more than once; others won't appear at all.
Resampling
• Sample with replacement 500 data points from the original dataset S.
  - Call this S*1.
• Now do this 999 more times!
  - S*1, S*2, …, S*1000
• Compute X-bar on each of these 1000 samples.

[Diagram: the original dataset S surrounded by its re-samples S*1, S*2, …, S*N]
R Code
# 500 draws from a normal distribution with mean 5000 and s.d. 100
norm.data <- rnorm(500, mean = 5000, sd = 100)

# Draw R bootstrap re-samples; store the mean and s.d. of each
# in the global vectors b.avg and b.sd
boots <- function(data, R) {
  b.avg <<- numeric(R)
  b.sd  <<- numeric(R)
  for (b in 1:R) {
    ystar <- sample(data, length(data), replace = TRUE)  # re-sample with replacement
    b.avg[b] <<- mean(ystar)
    b.sd[b]  <<- sd(ystar)
  }
}

boots(norm.data, 1000)
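Once boots() has run, the bootstrap estimates can be read off the vectors it fills in; a minimal sketch:

mean(b.avg)                     # bootstrap estimate of E[X-bar]
sd(b.avg)                       # bootstrap estimate of s.d.(X-bar); should be ~ 4.47
quantile(b.avg, c(.025, .975))  # 2.5th/97.5th %iles of the bootstrap distribution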
Results
• From theory we know that X-bar ~ n(5000, 4.47).
• Bootstrapping estimates this pretty well!
• And we get an estimate of the whole distribution, not just a confidence interval.

raw data
statistic    value
#obs         500
mean         4995.79
sd           98.78
2.5%ile      4705.08
97.5%ile     5259.27

X-bar        theory     bootstrap
#samples     1,000      1,000
mean         5000.00    4995.98
s.d.         4.47       4.43
2.5%ile      4991.23    4987.60
97.5%ile     5008.77    5004.82

[Figure: "bootstrap X-bar data", histogram over 4985–5010, with Normal Q-Q plot]
Two Ways of Looking at a Confidence Interval
• Approximate normality assumption:
  - X-bar ± 2 × (bootstrap dist s.d.)
• Percentile method:
  - Just take the desired percentiles of the bootstrap histogram.
  - More reliable in cases of asymmetric bootstrap histograms.
> mean(norm.data) - 2 * sd(b.avg)
[1] 4986.926
> mean(norm.data) + 2 * sd(b.avg)
[1] 5004.661
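The percentile method needs no normality assumption; it reads the interval straight off the bootstrap distribution. A minimal sketch, using the b.avg vector filled in by the earlier R code:

quantile(b.avg, probs = c(.025, .975))  # 2.5th/97.5th %iles of the bootstrap dist
                                        # (~ 4987.60 and 5004.82 per the Results table)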
And a Bonus
• Note that we can calculate both the mean and standard deviation of each pseudo-dataset.
• This enables us to estimate the correlation between the mean and s.d.
• The normal distribution is not skew ⇒ mean and s.d. are uncorrelated.
• Our bootstrapping experiment confirms this.

[Figure: scatterplot of sample.sd (90–110) against sample.mean (4985–5010), showing no correlation]
More Interesting Examples
• We've seen that bootstrapping replicates a result we know to be true from theory.
• Often in the real world we either don't know the "true" distributional properties of a random variable…
• …or are too busy to find out.
• This is when bootstrapping really comes in handy.
Severity Data
• 2700 size-of-loss data points.
• Mean = 3052, Median = 1136.

%ile       value
0%         51.84
25%        482.42
50%        1136.10
75%        3094.09
100%       48346.82

• Let's estimate the distributions of the sample mean and the 75th %ile (a sketch follows below).
• Gamma? Lognormal? We don't need to know.

[Figure: "severity distribution", histogram over 0–50,000, density scale 0 to 4e-04]
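A sketch of the resampling in R, assuming the 2700 severity values sit in a vector called sev (a hypothetical name):

R <- 1000
boot.mean <- numeric(R)
boot.p75  <- numeric(R)
for (b in 1:R) {
  sev.star <- sample(sev, length(sev), replace = TRUE)  # re-sample the losses
  boot.mean[b] <- mean(sev.star)
  boot.p75[b]  <- quantile(sev.star, 0.75)
}
quantile(boot.mean, c(.025, .975))  # CI for the sample mean
quantile(boot.p75,  c(.025, .975))  # CI for the 75th %ile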
Bootstrapping Sample Avg, 75th %ile
[Figure: "bootstrap dist of severity sample avg", histogram over 2800–3400 with Normal Q-Q plot; "bootstrap dist of severity 75th %ile", histogram over 2800–3400 with Normal Q-Q plot]
What about the 90th %ile?
• So far so good: bootstrapping shows that many of our sample statistics (even average severity!) are approximately normally distributed.
• But this breaks down if our statistic is not a "smooth" function of the data…
• Often in loss reserving we want to focus our attention way out in the tail…
  - The 90th %ile is an example.

[Figure: "bootstrap dist of severity 90th %ile", histogram over 7000–9000 with Normal Q-Q plot]
Variance Related to the Mean
• As with the normal example, we can calculate both the sample average and s.d. on each pseudo-dataset.
• This time (as one would expect) the variance is a function of the mean.

[Figure: scatterplot of sample.sd (5000–6000) against sample.mean (2800–3400)]
Bootstrapping a Correlation Coefficient #1
• About 700 data points.
• Credit is on a scale of 1–100.
  - 1 is worst; 100 is best.
• Age and credit are linearly related (see plot).
• R² ≈ .08 ⇒ ρ ≈ .28
  - Older people tend to have better credit.
• What is the confidence interval around ρ? (A sketch of the resampling follows below.)

[Figure: "Plot of Age vs Credit", scatterplot of age (20–80) against credit (0–100)]
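To bootstrap a correlation we re-sample whole (age, credit) pairs, so the dependence between the two variables is preserved. A sketch, assuming a data frame dat with columns age and credit (hypothetical names):

R <- 1000
boot.rho <- numeric(R)
n <- nrow(dat)
for (b in 1:R) {
  idx <- sample(n, n, replace = TRUE)        # re-sample rows, keeping pairs intact
  boot.rho[b] <- cor(dat$age[idx], dat$credit[idx])
}
sd(boot.rho)                                 # bootstrap s.d. of rho
quantile(boot.rho, probs = c(.025, .975))    # percentile confidence interval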
Bootstrapping a Correlation Coefficient #1
• ρ appears normally distributed.
  - ρ ≈ .28
  - s.d.(ρ) ≈ .028
• Both confidence interval calculations agree fairly well:

> quantile(boot.avg, probs = c(.025, .975))
     2.5%     97.5%
0.2247719 0.3334889
> rho - 2*sd(boot.avg); rho + 2*sd(boot.avg)
[1] 0.2250254
[1] 0.3354617

[Figure: "correlation coefficient - bootstrap dist", histogram over 0.20–0.35 with Normal Q-Q plot]
Bootstrapping a Correlation Coefficient #2
• Let's try a different example.
• ≈1300 zip-code level data points.
• Variables: population density, median #vehicles/HH.
• R² ≈ .50; ρ ≈ −.70

[Figure: "Median #Vehicles vs Pop Density", scatterplot of veh (0.0–2.5) against density (0–30,000), with regression and loess lines]
Bootstrapping a Correlation Coefficient #2
• ρ is more skew here:
  - ρ ≈ −.70
  - 95% conf interval: (−.75, −.67)
  - Not symmetric around ρ.
  - The effect becomes more pronounced the higher the value of ρ.

[Figure: "correlation coefficient - bootstrap dist", histogram over −0.75 to −0.65 with Normal Q-Q plot]
Bootstrapping Loss Ratio
• Now for what we've all been waiting for…
• The total loss ratio of a segment of business is our favorite point estimate.
• Its variability depends on many things:
  - Size of book
  - Loss distribution
  - Accuracy of rating plan
  - Consistency of underwriting…
• How could we hope to write down the true probability distribution?
  - Bootstrapping to the rescue…
Bootstrapping Loss Ratio & Frequency
• ≈50,000 insurance policies.
  - Severity dist from the previous example.
  - LR = .79
  - Claim frequency = .08
• Let's build confidence intervals around these two point estimates.
• We will resample the data 500 times (a sketch follows below).
• Compute total LR and freq on each sample.
• Plot the histogram.
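A sketch of the resampling step, assuming the policies sit in a data frame pol with columns loss, premium, and claim.count (hypothetical names):

R <- 500
boot.lr   <- numeric(R)
boot.freq <- numeric(R)
n <- nrow(pol)
for (b in 1:R) {
  idx <- sample(n, n, replace = TRUE)            # re-sample whole policies
  boot.lr[b]   <- sum(pol$loss[idx]) / sum(pol$premium[idx])
  boot.freq[b] <- sum(pol$claim.count[idx]) / n  # claims per policy
}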
Results: Distribution of total LR
• A little skew, but somewhat close to normal:
  - LR ≈ .79
  - s.d.(LR) ≈ .05
  - conf interval ≈ ±0.1
• The two confidence interval calculations disagree a bit:

> quantile(boot.avg, probs = c(.025, .975))
     2.5%     97.5%
0.6974607 0.8829664
> lr - 2*sd(boot.avg); lr + 2*sd(boot.avg)
[1] 0.6897653
[1] 0.8888983

[Figure: "bootstrap total LR", histogram over 0.7–1.0 with Normal Q-Q plot]
Dependence on Sample Size
• Let's take a sub-sample of 10,000 policies.
  - How does this affect the variability of LR?
• Again re-sample 500 times.
• Skewness and variance increase considerably:
  - LR: .79 → .78
  - s.d.(LR): .05 → .13

[Figure: "bootstrap total LR", histogram over 0.6–1.4 with Normal Q-Q plot]
Distribution of Capped LR
• Capped LR is analogous to the trimmed mean from robust statistics:
  - Remove the leverage of a few large data points.
• Here we cap policy-level losses at $30,000 (a sketch follows below).
  - This affects 50 out of 2700 claims.
  - Capped LR is closer to frequency.
• The distribution is less skew, close to normal.
• The s.d. is cut in half! .05 → .025
[Figure: "bootstrap LR - losses capped @ $30K", histogram over 0.55–0.70 with Normal Q-Q plot]
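Capping fits naturally into the same resampling loop; a sketch, reusing the hypothetical pol data frame from the earlier sketch:

capped <- pmin(pol$loss, 30000)  # losses above $30,000 are set to the cap
R <- 500
boot.capped.lr <- numeric(R)
n <- nrow(pol)
for (b in 1:R) {
  idx <- sample(n, n, replace = TRUE)
  boot.capped.lr[b] <- sum(capped[idx]) / sum(pol$premium[idx])
}
sd(boot.capped.lr)  # per the slide, roughly half the uncapped s.d.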
Results: Distribution of Frequency
• Much less variance than LR; very close to normal.
  - freq ≈ .08
  - s.d.(freq) ≈ .0017
• The two confidence interval calculations match very well:

> quantile(boot.avg, probs = c(.025, .975))
      2.5%      97.5%
0.07734336 0.08391072
> freq - 2*sd(boot.avg); freq + 2*sd(boot.avg)
[1] 0.07719618
[1] 0.08388898

[Figure: "bootstrap total freq", histogram over 0.074–0.086 with Normal Q-Q plot]
When are LRs statistically different?
• Example: divide our 50,000 policies into two sub-segments: {clean drivers, other}.
  - LRtot = .79
  - LRclean = .58
  - LRother = .84
  - LRRclean = −27%
  - LRRother = +6%
• Clean drivers appear to have a ≈30% lower LR than non-clean drivers.
• How sure are we of this indication?
• Let's use bootstrapping.
Bootstrapping the difference in LRs
• Simultaneously re-sample the two segments 500 times (a sketch follows below).
• At each iteration, calculate LRc*, LRo*, (LRc* − LRo*), and (LRc* / LRo*).
• Analyze the resulting empirical distributions:
  - What is the average difference in loss ratios?
  - What percent of the time is the difference in loss ratios greater than x%?
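A sketch of the simultaneous resampling, assuming the policies have been split into data frames clean and other, each with columns loss and premium (hypothetical names):

R <- 500
lr.diff  <- numeric(R)
lr.ratio <- numeric(R)
for (b in 1:R) {
  c.star <- clean[sample(nrow(clean), replace = TRUE), ]  # re-sample each segment
  o.star <- other[sample(nrow(other), replace = TRUE), ]  # independently
  lr.c <- sum(c.star$loss) / sum(c.star$premium)
  lr.o <- sum(o.star$loss) / sum(o.star$premium)
  lr.diff[b]  <- lr.o - lr.c
  lr.ratio[b] <- lr.o / lr.c
}
mean(lr.diff)         # average difference in loss ratios
mean(lr.diff > 0.10)  # share of re-samples where the gap exceeds 10 points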
LR distributions of the sub-populations
[Figure: bootstrap LR distributions of the two sub-populations: "LR: clean driving record", histogram over 0.4–1.0 with Normal Q-Q plot; "LR: non-clean record", histogram over 0.70–1.05 with Normal Q-Q plot]
LRR distributions of the sub-populations
[Figure: bootstrap LRR distributions of the two sub-populations: "LRR: clean driving record", histogram over 0.5–1.1 with Normal Q-Q plot; "LRR: non-clean record", histogram over 1.00–1.10 with Normal Q-Q plot]
Distribution of LRR Differences
[Figure: "LRR_other - LRR_clean", histogram over −0.1 to 0.6 with Normal Q-Q plot; "LRR_other / LRR_clean", histogram over 1.0–2.5 with Normal Q-Q plot]
Final Example: loss reserve variability
• A major issue in the loss reserving community is reserve variability:
  - the predictive variance of your estimate of outstanding losses.
  - It is hard to find an analytic formula for the variability of these o/s losses.
• Bootstrapping is a natural way to tackle this problem.
• The approach here: bootstrap cases, not residuals.
Bootstrapping Reserves
• S = a database of 5000 claims.
• Sample with replacement all policies in S.
  - Call this S*1.
  - It is the same size as S.
• Now do this 499 more times!
  - S*1, S*2, …, S*500
• Estimate o/s reserves on each sample (a sketch follows below).
• Get a distribution of reserve estimates.

[Diagram: the claim database S surrounded by its re-samples S*1, S*2, …, S*N]
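A sketch of case resampling for reserves. Here claims is the claim database and estimate.reserves() is a hypothetical stand-in for whatever reserving method is applied to each pseudo-database:

R <- 500
boot.res <- numeric(R)
n <- nrow(claims)
for (b in 1:R) {
  S.star <- claims[sample(n, n, replace = TRUE), ]  # pseudo-database, same size as S
  boot.res[b] <- estimate.reserves(S.star)          # hypothetical reserving function
}
quantile(boot.res, c(.025, .975))                   # spread of the reserve estimates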
Simulated Loss Data
• Simulate a database of 5000 claims.
  - 500 claims/year; 10 years.
• Each of the 5000 claims was drawn from a lognormal distribution with parameters μ = 8, σ = 1.3.
• Build in loss development patterns (a sketch follows below):
  - L(i+j) = L(i) × (link + ε)
  - ε is a random error term.
• See the CLRS presentation (2005) for more details.
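A sketch of the stated development recursion for a single claim; the link ratios and error s.d. below are illustrative assumptions, not the values used in the presentation:

# Develop a claim through L(i+1) = L(i) * (link + eps)
develop <- function(L0, links, eps.sd = 0.05) {
  L <- L0
  for (f in links) L <- L * (f + rnorm(1, mean = 0, sd = eps.sd))
  L
}
claim0   <- rlnorm(1, meanlog = 8, sdlog = 1.3)                   # lognormal, mu = 8, sigma = 1.3
ultimate <- develop(claim0, links = c(1.8, 1.4, 1.2, 1.1, 1.05))  # illustrative link ratios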
Bootstrapping Reserves
• Compute our reserve estimate on each S*k.
  - These 500 reserve estimates constitute an estimate of the distribution of outstanding losses.
• Notice that we did this by resampling our original dataset S of claims.
• Note: this bootstrapping method differs from other analyses, which bootstrap the residuals of a model.
  - Those methods rely on the assumption that your model is correct.
Distribution of Outstanding Losses
• Blue bars: the bootstrapped distribution.
• Dotted line: kernel density estimate of the distribution.
• Pink line: superimposed normal.

[Figure: "total reserves - all 10 years", histogram with kernel density and normal overlays, x-axis 19,000–25,000, density 0 to 4e-04]
Distribution of Outstanding Losses
• The simulated dist of outstanding losses appears ≈ normal:
  - Mean: $21.751M
  - Median: $21.746M
  - σ: $0.982M
  - σ/μ ≈ 4.5%
• 95% confidence interval: (19.8M, 23.7M)
• Note: the 2.5 and 97.5 %iles of the bootstrapping distribution roughly agree with $21.75M ± 2σ.

[Figure: "total reserves - all 10 years", histogram, x-axis 19,000–25,000]
Distribution of Outstanding Losses
• We can examine a QQ plot to verify that the distribution of o/s losses is approximately normal.
  - However, the tails are somewhat heavier than normal.
  - Remember, this is just simulated data!
  - Real-life results have been consistent with these results.

[Figure: "total reserves - all 10 years", histogram over 19,000–25,000 with Normal Q-Q plot]
References
• Davison and Hinkley, Bootstrap Methods and their Application.
• Efron and Tibshirani, An Introduction to the Bootstrap.
• Efron and Gong, "A Leisurely Look at the Bootstrap," American Statistician, 1983.
• Efron and Tibshirani, "Bootstrap Methods for Standard Errors," Statistical Science, 1986.
• Derrig, Ostaszewski, and Rempala, "Applications of Resampling Methods in Actuarial Practice," PCAS, 2000.