The zero-adjusted Inverse Gaussian
distribution as a model for insurance claims
Gillian Heller¹, Mikis Stasinopoulos² and Bob Rigby²
¹ Dept of Statistics, Macquarie University, Sydney, Australia. email: [email protected]
² STORM, London Metropolitan University. emails: [email protected] and [email protected]
Abstract: We introduce a method for modelling insurance claim sizes, including
zero claims. A mixed discrete-continuous model, with a probability mass at zero
and an Inverse Gaussian continuous component, is used. The Inverse Gaussian
distribution accommodates the extreme right skewness of the claim distribution.
The model explicitly specifies a logit-linear model for the occurrence of a claim;
and log-linear models for the mean claim size (given a claim has occurred); and the
dispersion of claim sizes (given a claim has occurred). The method is illustrated
on an Australian motor vehicle insurance data set.
Keywords: Inverse Gaussian model; zero-adjusted; insurance claims; gamlss.
1 Introduction
The purpose of modelling claim sizes on insurance policies is to price premiums accurately, and to estimate the risk of extreme claim events. In
a fixed period, a policy will either experience a claim, which is a positive amount typically having an extremely right-skewed distribution,
or no claim, in which case the claim amount is identically zero. The distribution
of the claim size is then mixed discrete-continuous: a continuous, right-skewed distribution mixed with a single probability mass at zero. In this
respect the phenomenon is similar to rainfall, which is either identically
zero on a dry day, or a continuous positive amount on a wet day.
1.1 Models for insurance claims
Much attention has been paid in the actuarial literature to alternative
distributions for claim sizes (e.g. Hogg and Klugman (1984)) and some authors have developed regression models (usually generalized linear models)
for explaining claim sizes as a function of risk factors (e.g. Haberman and
Renshaw (1996)). All of these are models for claim sizes in the subclass of
policies which had a claim in the period of observation.
Jørgensen and de Souza (1994) and Smyth and Jørgensen (2002) considered models for claim sizes, including the zero claims. These are based on
the Tweedie distribution, which may be characterised as a Poisson sum of
Gamma random variates. A problem with the Tweedie distribution model
is that the probability at zero cannot be modelled explicitly as a function
of explanatory variables; and, as we shall see in the example, the Gamma
distribution is inadequate for modelling the extreme right-skewness which
is present in our data.
2 The zero-adjusted Inverse Gaussian model
Let $y_i$ = size of claim on the ith policy, $i = 1, \dots, n$. We can write the distribution of y as a mixed discrete-continuous probability function:
$$
f(y) =
\begin{cases}
1 - \pi & y = 0 \\
\pi \cdot g(y) & y > 0
\end{cases}
\qquad (1)
$$
where g(y) is the density of a continuous, right-skewed distribution and π
is the probability of a claim.
2.1 Continuous part of the model
The extreme right skewness of claims distributions has been well documented. Candidate distributions within the exponential family are the
Gamma and Inverse Gaussian distributions.
Motor vehicle insurance example. We illustrate the method on a class
of motor vehicle insurance policies from an Australian insurance company
in 2004-05. There were 67,856 policies, of which 4,624 (6.8%) had at least
one claim in the period of observation. Of these, 4,333 policies (6.4%) had
one claim, and the remaining 291 policies (0.4%) between 2 and 4 claims.
The maximum claim size was $56,000. A histogram of the non-zero claims,
and the pdfs of the fitted Gamma and Inverse Gaussian distributions, are
shown in Figure 1. (For clarity of display the horizontal axis has been
truncated at $15,000; sixty-five observations were omitted.) The Gamma
clearly does not reproduce the shape of the observed claim size distribution;
the Inverse Gaussian looks to be a far better fit, accommodating both the
mode near zero and the extremely long tail of the distribution.
The density of the Inverse Gaussian is:
$$
g(y) = \frac{1}{\sqrt{2\pi y^{3}}\,\sigma}
\exp\left[ -\frac{1}{2y} \left( \frac{y-\mu}{\mu\sigma} \right)^{2} \right],
\qquad y > 0
$$
which has $E(y) = \mu$ and $Var(y) = \sigma^{2}\mu^{3}$. The use of the Inverse Gaussian distribution for modelling claim sizes has been recommended by, for
example, Berg (1994).
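As a quick sanity check on the parameterisation, the density above can be coded directly and integrated numerically. The sketch below is illustrative only: the function names `ig_pdf` and `midpoint_integral` and the parameter values are assumptions made for this example, and the integration rule is a crude one chosen to avoid any library dependency.

```python
import math

def ig_pdf(y, mu, sigma):
    """Inverse Gaussian density g(y) in the (mu, sigma) parameterisation of the text."""
    if y <= 0.0:
        return 0.0
    z = (y - mu) / (mu * sigma)
    return math.exp(-z * z / (2.0 * y)) / (math.sqrt(2.0 * math.pi * y ** 3) * sigma)

def midpoint_integral(func, lower, upper, steps=100_000):
    """Composite midpoint rule: crude, but adequate for a sanity check."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (k + 0.5) * width) for k in range(steps))

# Illustrative parameter values only; they are not estimates from the claims data.
mu, sigma = 2.0, 0.5
upper = 100.0  # the density is negligible beyond this point for these values

total = midpoint_integral(lambda y: ig_pdf(y, mu, sigma), 0.0, upper)
mean = midpoint_integral(lambda y: y * ig_pdf(y, mu, sigma), 0.0, upper)
second_moment = midpoint_integral(lambda y: y * y * ig_pdf(y, mu, sigma), 0.0, upper)
variance = second_moment - mean ** 2

print(f"integral of g  = {total:.6f}")
print(f"numerical mean = {mean:.6f}   (mu = {mu})")
print(f"numerical var  = {variance:.6f}   (sigma^2 mu^3 = {sigma ** 2 * mu ** 3})")
```

The numerical mean and variance should agree closely with µ and σ²µ³, which confirms that σ here is a dispersion parameter and not the standard deviation of y.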
[Figure 1: histogram of non-zero claim sizes (horizontal axis: claim size, $0 to $15,000; vertical axis: f(y)), with the fitted Inverse Gaussian and Gamma densities overlaid.]
FIGURE 1. Claim size distribution: motor vehicle insurance
2.2 Discrete part of the model
The obvious model for the probability of a claim is the Bernoulli. Let wi be
a binary variable indicating the occurrence of at least one claim, and πi be
the probability of at least one claim, on policy i. Note that the occurrence
of more than one claim in the period of observation is rare. Then
$$
f(w_i) = \pi_i^{w_i} (1 - \pi_i)^{1 - w_i}, \qquad w_i = 0, 1
$$
However, we have to correct for the typical feature of policy-level data, that
not all policies have been in force for the entire period of observation.
Let ti = exposure of policy i, 0 < ti ≤ 1. (Exposure is the proportion of
the period of observation for which the policy has been in force.) We will be
assuming that the $t_i$ are known. If $c_i$ is the number of claims in the period,
and we assume a Poisson process with mean number of claims (per unit
exposure time) $\pi_i$, then $c_i \mid t_i \sim Po(t_i \pi_i)$, $P(c_i = 0 \mid t_i = 1) = e^{-\pi_i} \approx 1 - \pi_i$
and $P(c_i = 0 \mid t_i) = e^{-t_i \pi_i} \approx 1 - t_i \pi_i$, provided $t_i \pi_i$ is small. This gives
$$
f(w_i) = (\pi_i^*)^{w_i} (1 - \pi_i^*)^{1 - w_i}, \qquad w_i = 0, 1
$$
i.e. Bernoulli with $\pi_i^* = t_i \pi_i$. We incorporate covariates through the logit
link function on πi :
$$
\log\left( \frac{\pi_i}{1 - \pi_i} \right) = \eta_i
$$
i.e.
$$
\log\left( \frac{\pi_i^*/t_i}{1 - \pi_i^*/t_i} \right) = \eta_i
\qquad (2)
$$
and the correction for differing periods of exposure enters the model through
the modified link function (2). The predictor $\eta_i$ is defined in the next section.
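The modified link (2) can be inverted to give the exposure-adjusted claim probability for a given predictor value. The following is a minimal sketch under that reading; the function names and the numerical value of the predictor are hypothetical, chosen only to show how close the linear approximation is to the exact Poisson probability when the claim rate is small.

```python
import math

def claim_probability(eta, exposure):
    """Invert link (2): pi* = t * pi, where pi = 1 / (1 + exp(-eta))."""
    if not 0.0 < exposure <= 1.0:
        raise ValueError("exposure must lie in (0, 1]")
    pi = 1.0 / (1.0 + math.exp(-eta))
    return exposure * pi

def poisson_claim_probability(eta, exposure):
    """Exact P(at least one claim) under the Poisson process: 1 - exp(-t * pi)."""
    pi = 1.0 / (1.0 + math.exp(-eta))
    return 1.0 - math.exp(-exposure * pi)

# Hypothetical predictor value corresponding to pi = 0.07 at full exposure.
eta = math.log(0.07 / 0.93)

for t in (0.25, 0.5, 1.0):
    approx = claim_probability(eta, t)
    exact = poisson_claim_probability(eta, t)
    print(f"t = {t:4.2f}  pi* = {approx:.5f}  1 - exp(-t pi) = {exact:.5f}")
```

Because 1 − exp(−x) ≤ x, the linear form slightly overstates the exact probability, and the gap shrinks as the exposure or the claim rate falls.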
2.3 The mixture model
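The zero-adjusted Inverse Gaussian (ZAIG) model is then
$$
f(y_i) =
\begin{cases}
1 - \pi_i^* & y_i = 0 \\
\pi_i^* \cdot \dfrac{1}{\sqrt{2\pi y_i^{3}}\,\sigma_i}
\exp\left[ -\dfrac{1}{2y_i} \left( \dfrac{y_i - \mu_i}{\mu_i \sigma_i} \right)^{2} \right] & y_i > 0
\end{cases}
$$
which has $E(y_i) = \pi_i^* \mu_i$ and $Var(y_i) = \pi_i^* \mu_i^{2} \left( 1 - \pi_i^* + \mu_i \sigma_i^{2} \right)$. Following
Rigby and Stasinopoulos (2005), who specify generalized additive models
for the location, scale and shape parameters of a variety of distributions,
we specify the following models on the parameters $\mu_i$, $\sigma_i$ and $\pi_i^*$:
$$
\begin{aligned}
\log(\mu_i) &= x_{1\mu i}' \beta_\mu + f_\mu(x_{2\mu i}) \\
\log(\sigma_i) &= x_{1\sigma i}' \beta_\sigma + f_\sigma(x_{2\sigma i}) \\
\log\left( \frac{\pi_i^*/t_i}{1 - \pi_i^*/t_i} \right) &= x_{1\pi i}' \beta_\pi + f_\pi(x_{2\pi i})
\end{aligned}
$$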
where x1µi , x2µi , x1σi , x2σi , x1πi and x2πi are covariate vectors for µi , σi
and πi∗ , which may be different, the same, or may have some but not all
elements in common; βµ , βσ and βπ are the corresponding parameter vectors; and fµ , fσ and fπ are nonparametric functions, typically smoothing
splines.
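The mean and variance of the ZAIG quoted above follow from conditioning on whether a claim occurs. A short derivation, using only the Inverse Gaussian moments $E(y \mid y > 0) = \mu_i$ and $Var(y \mid y > 0) = \sigma_i^{2}\mu_i^{3}$ from Section 2.1, may make this explicit:

```latex
\begin{aligned}
E(y_i)   &= (1-\pi_i^*)\cdot 0 + \pi_i^*\, E(y_i \mid y_i > 0) = \pi_i^* \mu_i, \\
E(y_i^2) &= \pi_i^*\, E(y_i^2 \mid y_i > 0)
          = \pi_i^* \left( \sigma_i^{2}\mu_i^{3} + \mu_i^{2} \right), \\
Var(y_i) &= E(y_i^2) - \{E(y_i)\}^2
          = \pi_i^* \left( \sigma_i^{2}\mu_i^{3} + \mu_i^{2} \right) - (\pi_i^*)^{2}\mu_i^{2} \\
         &= \pi_i^* \mu_i^{2} \left( 1 - \pi_i^* + \mu_i \sigma_i^{2} \right).
\end{aligned}
```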
In order to correct for multiple claims in the period, we use the fact that,
if $y_j \sim IG(\mu, \sigma)$, $j = 1, \dots, c$ independently, then the total $t = \sum_j y_j$ has
the distribution
$$
t \sim IG(\mu^*, \sigma^*)
$$
where $\mu^* = c\mu$ and $\sigma^* = \sigma/c$. As
$$
\log(\mu^*) = \log(\mu) + \log(c)
$$
and
$$
\log(\sigma^*) = \log(\sigma) - \log(c)
$$
we use $\log(c_i)$ and $-\log(c_i)$ as offsets in the models for $\mu_i$ and $\sigma_i$ respectively, where $c_i$ is the number of claims on policy i. (A doubtful assumption
here is that multiple claim amounts on the same policy are independent.)
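The offset argument can be checked with a few lines of arithmetic: the variance of a sum of c independent claims must be c times the variance of one claim, and the starred parameters should reproduce exactly that. The sketch below assumes illustrative values of µ and σ (not fitted estimates) and hypothetical function names.

```python
import math

def total_claim_parameters(mu, sigma, n_claims):
    """(mu*, sigma*) for the total of n_claims independent IG(mu, sigma) amounts."""
    if n_claims < 1:
        raise ValueError("n_claims must be at least 1")
    return n_claims * mu, sigma / n_claims

def ig_variance(mu, sigma):
    """Variance of IG(mu, sigma) in the parameterisation of Section 2.1."""
    return sigma ** 2 * mu ** 3

# Illustrative single-claim parameters, not estimates from the data.
mu, sigma = 2000.0, 0.04

for c in (1, 2, 3, 4):
    mu_star, sigma_star = total_claim_parameters(mu, sigma, c)
    # Offsets on the log scale: +log(c) for mu, -log(c) for sigma.
    offset_mu = math.log(mu_star) - math.log(mu)
    offset_sigma = math.log(sigma_star) - math.log(sigma)
    print(c, mu_star, sigma_star, round(offset_mu, 4), round(offset_sigma, 4),
          ig_variance(mu_star, sigma_star), c * ig_variance(mu, sigma))
```

The last two printed columns coincide, since (σ/c)²(cµ)³ = cσ²µ³, which is the variance of a sum of c independent claims.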
3 Estimation
The ZAIG has been incorporated into the gamlss package in R (Stasinopoulos et al. (2006)). Maximum (penalised) likelihood estimation is used. The
penalized log likelihood function of the model is maximized iteratively using
either the RS or CG algorithm of Rigby and Stasinopoulos (2005), which in
turn uses a back-fitting algorithm to perform each step of the Fisher scoring procedure. Both RS and CG algorithms use the log likelihood of the
data, and its first derivatives (and optionally expected second derivatives)
with respect to distributional parameters, which in this case are µ, σ and
ν = π*. The CG algorithm, a generalization of the algorithm used by Cole
and Green (1992), additionally uses the expected cross derivatives.
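For orientation, the unpenalized log likelihood that these algorithms work with can be written down directly from the ZAIG probability function of Section 2.3. The sketch below is not the gamlss implementation and does not perform the RS or CG iterations; the function names and the small data vector are made up for illustration.

```python
import math

def zaig_log_density(y, mu, sigma, pi_star):
    """Log of the ZAIG probability function at a single observation y >= 0."""
    if y < 0.0:
        raise ValueError("claim sizes must be non-negative")
    if y == 0.0:
        return math.log1p(-pi_star)          # log(1 - pi*)
    z = (y - mu) / (mu * sigma)
    return (math.log(pi_star)
            - 0.5 * math.log(2.0 * math.pi)
            - 1.5 * math.log(y)
            - math.log(sigma)
            - z * z / (2.0 * y))

def zaig_log_likelihood(claims, mu, sigma, pi_star):
    """Sum of log densities; mu, sigma and pi_star are per-policy sequences."""
    return sum(zaig_log_density(y, m, s, p)
               for y, m, s, p in zip(claims, mu, sigma, pi_star))

# Made-up claim sizes for six policies (four with no claim).
claims = [0.0, 0.0, 0.0, 1200.0, 0.0, 3400.0]
n = len(claims)

def loglik_constant(mu, sigma, pi_star):
    """Log likelihood when all policies share the same three parameters."""
    return zaig_log_likelihood(claims, [mu] * n, [sigma] * n, [pi_star] * n)

for pi_star in (0.1, 2.0 / 6.0, 0.6):
    print(round(pi_star, 3), loglik_constant(2300.0, 0.03, pi_star))
```

With µ and σ held fixed, the log likelihood is largest at π* equal to the observed proportion of non-zero claims (here 2/6), as expected for the Bernoulli part of the model.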
3.1 Motor vehicle insurance
The following covariates were available:

Variable             Range
Characteristics of policy holder:
  Age band           1, 2, 3, 4, 5, 6 (1 is youngest)
  Gender             male, female
  Area of residence  A, B, C, D, E, F
Characteristics of vehicle:
  Value              $0-$350,000
  Make               A, B, C, D
  Age                1, 2, 3, 4 (1 is recent)
  Body type          bus, convertible, coupe, hatchback, hardtop, motorised
                     caravan/combi, minibus, panel van, roadster, sedan,
                     station wagon, truck, utility
Using the GAIC as model selection criterion, the following final model was
selected:

log(µ)         = age band + gender + area + offset{log(claims)}
log(σ)         = area + offset{-log(claims)}
log(π/(1 − π)) = age band + area + vehicle body + spline(vehicle value)
Comments on the model
• Model for π: The model for the occurrence of a claim has terms for
both policyholder and vehicle characteristics. Policyholder age, area
and vehicle body are all categorical, so their form is not an issue;
vehicle value is the only continuous covariate that we have, and it
enters the model as a smoothing spline. The reason is apparent from
the scatterplot of claim/no claim, with a smoothing
spline, in Figure 2: the relationship is nonlinear, and the probability of a
claim is at a maximum for vehicle value around $40,000.
[Figure 2: claim indicator (vertical axis: Claim, 0 to 1) against vehicle value in $10,000 units (horizontal axis, 0 to 35), with the smoothed data overlaid.]
FIGURE 2. Occurrence of a claim (0/1) plotted against vehicle value, with
smoothing spline
• Model for µ: This contains only policyholder characteristics, which
is surprising. A more complicated model involving vehicle value, make
and some interaction terms, was a close second in the model selection. However, it was felt that this was too complex and difficult to
interpret, so the simpler version was chosen.
• Model for σ: Area is the only covariate for σ. The variation of the
claim size distribution with area is shown in Figure 3: it can be seen
that areas D, E and F have shapes which are different from A, B and
C, reflected in lower values for σ̂. In fact areas D, E and F are rural
whereas A, B and C are urban.
The explanatory variables age band and area appear in the model equations
for both π and µ. It is of interest whether they affect the occurrence of a
claim, and claim size, in the same way. Figure 4.a shows the effect of age
band (eβ̂ ), on both π/(1 − π) and µ; figure 4.b shows the effect of area on
both π/(1 − π) and µ. Note that age band=3 and area=A are the reference
categories. Age band 1 (the youngest drivers) increases both the odds of
a claim and the mean claim size, to a similar extent; age bands 2 and 4
have a similar effect to age band 3; and age bands 5 and 6 (older drivers)
decrease both the odds of a claim, and the mean claim size, their effect
being greater on the odds of a claim. The effect of area on the odds of a
claim, and mean claim size, is less clear: the only clear indication is that
the mean claim size is increased in area F.
[Figure 3: histograms of claim size by area (horizontal axis: claim size, $0 to $15,000; vertical axis: frequency). Panel titles: A. µ̂ = 1909, σ̂ = 0.038; B. µ̂ = 1860, σ̂ = 0.038; C. µ̂ = 2030, σ̂ = 0.038; D. µ̂ = 1837, σ̂ = 0.035; E. µ̂ = 2251, σ̂ = 0.034; F. µ̂ = 2864, σ̂ = 0.033.]
FIGURE 3. Claim size distribution by area
4 Conclusion
We introduce a method for modelling insurance claim sizes using a zero-adjusted
Inverse Gaussian (ZAIG) model, which explicitly specifies a logit-linear model for the occurrence of a claim; and log-linear models for the
mean claim size (given a claim has occurred); and the dispersion of claim
sizes (given a claim has occurred). These three models may incorporate
different covariates, or some of the same covariates, and may depend on
common covariates in different ways. The Inverse Gaussian distribution accommodates the extreme right skewness of the claim distributions. Given
the risk factors for a potential new policyholder, the expected claim size
may easily be computed as the expected value of the ZAIG distribution,
conditional on the covariate values; and quartiles of the claim size distribution may be calculated for each combination of covariate values. The
ZAIG distribution introduced here is a useful distribution for modelling
data where the total amount per unit of time is observed but where zero
amounts are possible. Rainfall data and smoking/drinking habits data are
possible candidates for modelling using the ZAIG distribution.
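The expected claim size and quantile calculations mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the gamlss code: the Inverse Gaussian cdf is taken from its standard closed form with λ = 1/σ², the quantile is found by simple bisection, the function names are hypothetical, and the parameter values are round numbers of the same order as those reported in the example.

```python
import math

def normal_cdf(x):
    """Standard normal cdf via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def ig_cdf(y, mu, sigma):
    """Inverse Gaussian cdf, standard closed form with lambda = 1 / sigma^2.

    Note: exp(2 * lambda / mu) can overflow when lambda / mu is very large;
    this sketch makes no attempt to guard against that case.
    """
    if y <= 0.0:
        return 0.0
    lam = 1.0 / sigma ** 2
    root = math.sqrt(lam / y)
    return (normal_cdf(root * (y / mu - 1.0))
            + math.exp(2.0 * lam / mu) * normal_cdf(-root * (y / mu + 1.0)))

def zaig_mean(mu, pi_star):
    """Expected claim size E(y) = pi* mu."""
    return pi_star * mu

def zaig_quantile(p, mu, sigma, pi_star):
    """Smallest y with P(Y <= y) >= p under the ZAIG, for 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    if p <= 1.0 - pi_star:
        return 0.0                      # the quantile falls in the mass at zero
    target = (p - (1.0 - pi_star)) / pi_star
    lower, upper = 0.0, mu
    for _ in range(200):                # expand the bracket until it covers target
        if ig_cdf(upper, mu, sigma) >= target:
            break
        upper *= 2.0
    for _ in range(200):                # bisection on the continuous part
        mid = 0.5 * (lower + upper)
        if ig_cdf(mid, mu, sigma) < target:
            lower = mid
        else:
            upper = mid
    return 0.5 * (lower + upper)

# Illustrative values only: pi* near the observed claim rate, mu and sigma
# of the same order as the fitted values shown in Figure 3.
mu, sigma, pi_star = 2000.0, 0.038, 0.068

print("expected claim size:", zaig_mean(mu, pi_star))
for p in (0.5, 0.95, 0.99):
    print(f"{p:.2f} quantile:", zaig_quantile(p, mu, sigma, pi_star))
```

Any quantile at or below 1 − π* is exactly zero, which reflects the probability mass at zero; only the upper tail of the distribution carries positive claim amounts.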
References
Berg, P.T. (1994). Deductibles and the inverse Gaussian distribution. ASTIN
Bulletin, 24, 319–323.
[Figure 4: two panels, a. Age band (bands 1 to 6) and b. Area (A to F), each plotting exp(β̂) for occurrence of claim and for claim size.]
FIGURE 4. Effect of age category and area (exp(β̂)) on occurrence of claim and
claim size
Cole, T. and Green, P. (1992). Smoothing reference centile curves: The LMS
method and penalized likelihood. Statist. in Med, 11, 1305-1319.
Hogg, R.V. and Klugman, S.A. (1984). Loss Distributions. New York: Wiley.
Haberman, S. and Renshaw, A.E. (1996). Generalized Linear Models and
Actuarial Science. The Statistician, 45 (4), 407-436.
Jørgensen, B. and de Souza, M.C.P. (1994). Fitting Tweedie’s compound
Poisson model to insurance claims data. Scandinavian Actuarial Journal, 69-93.
Rigby, R.A. and Stasinopoulos, D.M. (2005). Generalized Additive Models for Location, Scale and Shape (with discussion). Appl. Statist.,
54, 1-38.
Smyth, G.K. and Jørgensen, B. (2002). Fitting Tweedie’s compound Poisson model to insurance claims data: dispersion modelling. ASTIN
Bulletin, 32(1), 143-157.
Stasinopoulos, D.M., Rigby, R.A. and Akantziliotou, C. (2006). gamlss: A collection of functions to fit Generalized Additive Models for Location
Scale and Shape. R package version 1.1-0. URL: http://www.londonmet.ac.uk/gamlss/.