Download Estimating claim size and probability in the auto-insurance

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Probability wikipedia , lookup

Statistics wikipedia , lookup

Transcript
Estimating claim size and
probability in the auto-insurance
industry: the zeroadjusted
Inverse Gaussian (ZAIG)
distribution
Danny Pimentel Claro
Insper Working Paper
WPE: 165/2009
Copyright Insper. Todos os direitos reservados.
É proibida a reprodução parcial ou integral do conteúdo deste
documento por qualquer meio de distribuição, digital ou impresso, sem a expressa autorização do
Insper ou de seu autor.
A reprodução para fins didáticos é permitida observando-sea
citação completa do documento
Estimating claim size and probability in the auto-insurance
industry: the zero-adjusted Inverse Gaussian (ZAIG)
distribution
Abstract
This article aims at the estimation of insurance claims from an auto data set. Using a
ZAIG method, we identify factors that influence claim size and probability, and
compared the results with the analysis of a Tweedie method. Results show that ZAIG
can accurately predict claim size and probability. Factors like territory, vehicles´
advanced age, origin and body influence distinctly claim size and probability. The
distinct impact is not always present in Tweedie’s estimated model. Auto insurers
should consider estimating risk premium using ZAIG method. The fitted models may be
useful to develop a strategy for premium pricing.
1. Introduction
There is a well-known problem in the insurance industry that is the proper
pricing of an insurance policy. The pure premium of an insurance company to an
insured individual is made up of two components: claim probability and expected claim
size. The claim probability for any individual is related to the expected number of
claims occurring in a given year. The claim size is simply the dollar cost associated with
each claim. It is extensively reported in literature the difficulty to estimate size and
probability of claims in the insurance industry (e.g. Jong and Heller, 2008). In the past,
the main difficulty was related to the credibility of the data sets of insurance companies
(Weisberg and Tomberling, 1982). Insurance data sets were typically very large, of the
order of ten of thousands up to millions of cases. Problems arise such as missing values
and inconsistent or invalid records. As current IT systems have become
sophisticatedevery year, the processing of information is more credible than ever before.
The challenge is then to employ a proper statistical technique to analyze
insurance data. Claims and risks have long been estimated as a pure algorithmic
technique or a simple stochastic technique (Wuthrich and Merz, 2008). These methods
resulted in poor estimations. Jørgensen and de Souza (1994) suggested a Poisson sum of
Gamma random varieties called Tweedie to estimate insurance risk. One problem with
the Tweedie distribution model is that more explanatory variables are likely to be found
significant (Smyth and Jørgensen, 2002). According to Smyth and Jørgensen (2001),
another problem is that, based on the proposed models, no estimation for claim sizes is
possible.
Recent literature realized that a zero-adjusted Inverse Gaussian (ZAIG)
distribution may be appropriate to estimate claim and risk in insurance data (Heller,
Stasinopoulos and Rigby, 2006). A mixed discrete-continuous model, with a probability
mass at zero and an Inverse Gaussian continuous component, appears to accurately
estimate in extreme right skewness distribution. This suggests that probabilities can be
calculated from data sets with a large number of zero claims. The ZAIG model
explicitly specifies a logit-linear model for the occurrence of a claim (i.e. claim
probability). When a claim has occurred, the ZAIG model also specifies log-linear
models for the mean claim size and the dispersion of claim sizes. It is important to
measure the probability and size of claims separately because it is possible for the
probability to depend on a set of independent variables different from those that
influence claim size. Therefore, ZAIG estimation appears to be more appropriate to
estimate price of insurance policies.
Once a method of estimation is defined, the challenge is to identify potential
explanatory variables. Typically, policy holders are divided into discrete classes on the
basis of certain measurable characteristics predictive of their propensity to generate
losses. We evaluate claims by considering variables related to vehicles that are
frequently used in literature. In addition to territory, claims have also been studied
according to the brand manufacturer and vehicle’s characteristics: age, body and origin.
Following previous research all of these variables must become part of the estimation.
Our objective is to present the ZAIG method of estimation to estimate
probability claims and expected claim size in the insurance industry and to formally test
the results with the estimation of a Tweedie regression model using insurance data set.
An insurance data was collected to analyze the impact of factors estimated by Tweedie
and ZAIG methods.
This work is divided in six sections. In the next section, the theoretical
background is discussed based on previous research on claim insurance estimation. Also
the ZAIG method is presented in the section. The third section discusses the
methodology and the data set. The fourth section presents the analysis of results and the
comparison of findings of the two methods, ZAIG and Tweedie. Finally, concluding
remarks highlight the major contributions of our study.
2. Theoretical Background
Insurance: Importance of predictions and predictors
Insurance companies are constantly looking for ways to better predict claims.
Overall, insurance involves the sum of a large number of individual risks from which
very few insurance claims will be reported. Meulbroek (2001) argues that insurance
companies need to treat risk management as a series of related factors and events.
Boland (2007) suggests that, in order to handle claims arising from incidents that
already occurred, insurers must employ methods of prediction to deal with the extent of
its liability. Therefore, an insurance company has to find ways to predict claims and
appropriately charge a premium to take responsibility for the risk.
The prediction problem has to be considered in the light of competitive market
insurance (Weisberg, Tomberlin and Chatterjee, 1984). It is possible for an insurer to
benefit at least temporarily by identifying segments of the market that are currently
being overcharged and offering coverage at lower rates or by avoiding segments that are
being undercharged (Doherty, 1981). Regulators are usually concerned about the
possibility of rate structures that severely penalize individuals with some characteristics
(e.g. place they live, model of vehicle). Therefore, insurers look for better ways to
capture the characteristics of individuals that affect claim size and probability, and
consequently discriminate among insurer drivers that have high propensity to generate
losses.
Insurance companies attempt to estimate reasonably price insurance policies
based on the losses reported in certain kinds of policy-holders. The estimation has to
consider past data in order to grasp trends that occurred (Weisberg and Tomberlin,
1982). Information available to predict the price for a period in the future usually
consists of the claim experience for a population or a large sample from the population
over a period in the past. Precise estimation may consider a large number of exposures
in a data set and a stable claim generation process over time.
The predictors to estimate price insurance policy were selected from the
automobile industry. In our study, we consider the issue of price prediction in the
context of automobile industry because the most sophisticated proposals have been
developed in this industry (Jong and Heller, 2008). Previous research in the automobile
setting includes predictors such as territory (e.g. Chang and Fairly, 1979) and brand
manufacturer (e.g. Heller, Stasinopoulos and Rigby, 2006). Weisberg, Tomberlin and
Chatterjee (1984) suggest including variables associated with the status of the vehicle,
such as age, body, origin.
Previous studies have recognized the use of Tweedie method of estimating auto
insurance claims (Smyth and Jørgensen, 2001) and recent studies have shown that
ZAIG method may produce accurate models of estimation (Heller, Stasinopoulos and
Rigby, 2006). In order to estimate, it is necessary to let yi be the amount expended with
claims for client i and xi be a vector of independent variables related to this client. One
may represent variable yi as
 0, with probability (1 -  i )
,
yi  
Wi , with probability  i
(1)
where Wi is a positive right skewed distribution. This type of variable belongs to the
class of the zero inflated probability distributions (e.g. Gan, 2001). The parameter  is
the claim probability and Wi represents the claim size.
It is important to note that a claim is, in general, a rare event. A small proportion
of claims in a sample may lead to problems in predicting claim occurrence by a logistic
model because, in this case, the predicted probabilities tend to be small. King and Zeng
(2002) proposed a correction to be used in these situations. They used the fact that, in
the presence of rare events, the independent variable coefficients are consistent, but the
intercept may not be.
Tweedie regression models
A Tweedie distribution (Jørgensen, 1987, 1997) is a member of the class of
exponential dispersion models. It is defined as a distribution of the exponential family
(e.g. Heller and Jong, 2007) with mean  and variance   p; as in Jørgensen and de
Souza (1994) and Smyth and Jørgensen (2002), it will be considered, in this work, the
case 1<p<2. It is possible to write
Ni
 0, if N i  0
, Wi   X ij ,
(2)
yi  
j i
Wi , if N i  0
where Ni is a Poisson random variable and X i1 ,  , X iNi are independent identically
distributed Gamma random variables (continuous variable). As a consequence Wi also
follows a Gamma distribution, which has a positive and right skewed density
probability function. In this work, it was used a log-linear Tweedie regression model,
given by
T
 i  e xi  ,
(3)
ZAIG regression model
The variable yi follows a ZAIG distribution (Heller, Stasinopoulos and Rigby,
2006) if Wi is a Gaussian inverse random variable. The Gaussian inverse is a positive
and highly skewed distribution with two parameters: the mean (i) and a dispersion
parameter (i). It may be proved that
E ( yi )   i  i


and Var( yi )   i i2 1   i  i  i2 .
(4)
In the context of this work, I is the expected claim size and i is a parameter
related to claim size dispersion. It is possible to propose regression models for i, i and
i as
 i  h 1xiT  ,  i  h 2 ziT   ,  i  h 3wiT  ,
(5)
where h1, h2 and h2 are continuous twice differentiable invertible functions, ,  and 
are parametric vectors and xi, zi and wi are vectors of independent variables.
In an insurance context, it is highly convenient to use different sets of independent
variables to model these three parameters. Consider, for instance, a variable that
indicates the local of the residence of the car´s owner. It is well known that robbery
rates vary in a city, but the price of a vehicle do not. It is expected that the local of
residence will be important in explaining i, but not i, then one may include the
variable in the probability model but not in the expected claim size model. This example
illustrates Heller, Stasinopoulos and Rigby (2006) statement “… A problem with the
Tweedie distribution model is that the probabilities at zero can not modeled explicit as
a function of explanatory variables…”. The following models are adjusted
e xi 
T
i 
1 e
xTi 
,  i  e xi  and  i  e xi  .
T
T
(6)
In short this modeling option assures that i and i are, as expected, always positive and
that i is modeled as a logistic regression.
It is important to remember the bad performance of logistic models in predicting
claims, when the frequency of claims in the sample is small.
3. Methodology
A sample was collected from a major automobile insurance company resulting in
a data set of 33,257 passenger vehicle records. The data set was processed to clean for
missing values and selection of relevant variables. We focus the analysis on yearly
claims that refer to robberies or accidents with repair costs greater than the vehicles
value. Claim size refers to the amount of dollars paid as liability of a claim. Claim
probability refers to the percentage of claims of a year period. The average annual claim
probability is 1.15%, and the average claim size is $21,002.30. For every insurance
policy holder, twenty explanatory variables were employed. The variables are coded by
means of binary variables, described in Table 1. Table 2 shows the descriptive statistics
of the variables.
Table 1: List of explanatory variables
Variable
Vehicle’s Age (II)
Vehicle’s Age (III)
Vehicle’s Age (IV)
Vehicle’s Age (V)
Vehicle’s Age (VI)
Origin
Model/Manufacturer
Territory
Vehicle Body (II)
Vehicle Body (III)
Intercept
Description
Equals 1 if the insured vehicle has one or two years (related to contract’s
year), 0 otherwise
Equals 1 if the insured vehicle has three or four years (related to contract’s
year), 0 otherwise
Equals 1 if the insured vehicle has five or six years (related to contract’s year),
0 otherwise
Equals 1 if the insured vehicle has seven to nine years (related to contract’s
year), 0 otherwise
Equals 1 if the insured vehicle has ten or more years (related to contract’s
year), 0 otherwise
Equals 1 if the insured vehicle is imported and 0 if it is national
A combination of different models and manufacturers in the data set. Groups
were assigned on the basis of a CHAID analysis.
Clusters of territory were assigned on the basis of Hierarchical Cluster
Analysis Method for claim size. It represents a division of the area into a set
of exclusive areas thought to be relatively homogeneous in terms of claims.
Vehicles are assigned to territories according to where they are usually
garaged.
Equals 1 for midsize vehicle, 0 otherwise
Equals 1 for luxury vehicle, 0 otherwise
Vehicle with zero years (related to contract’s year - Vehicle’s Age (I)),
national, Model/Manufacturer (I), Territory (I) and small vehicle (Vehicle
Body (I))
Table 2: Descriptive statistics for claim size and for proportion of claims
Variable
Value
Vehicle’s Age
I
II
III
IV
V
VI
National
Imported
I
II
III
IV
V
VI
VII
VIII
IX
X
Origin
Model/
Manufacturer
Proportion
of claim
0.94%
1.06%
1.12%
1.07%
2.08%
2.17%
1.19%
0.68%
1.68%
1.27%
0.37%
1.07%
2.67%
0.81%
0.52%
0.72%
1.35%
0.62%
Claim size
Total sample
Claim size>0
Mean
SD
Mean
SD
N
306.50
244.61
210.90
157.03
221.47
199.84
251.06
136.59
330.31
627.34
73.19
152.81
416.29
260.14
91.11
147.18
424.82
203.19
8947
10,707
6,674
3,361
2,554
1,014
30,771
2,486
11,676
1,102
1,879
6,432
487
620
3,841
2,911
1,708
2,601
3,511.06
2,703.39
2,181.39
1,672.73
1,624.80
1,482.98
2,735.15
1,849.15
2,883.55
6,009.68
1,313.43
1,646.32
3,139.55
3,607.39
1,352.48
2,177.97
4,065.73
2,686.77
32,645.59
22,974.27
18,766.88
14,660.81
10,672.59
9,210.65
21,049.92
19,974.17
19,676.86
49,380.62
19,644.96
14,244.86
15,594.96
32,256.95
17,497.01
20,401.90
31,547.34
33,030.77
16,132.40
12,868.49
8,726.57
7,063.96
3,990.81
4,374.82
13,781.40
10,491.21
10,732.91
21,589.70
9,559.60
7,255.25
11,948.36
26,900.88
7,007.88
16,009.04
16,003.95
9,728.11
Variable
Territory
Vehicle Body
Value
Proportion
of claim
I
II
III
IV
Small
Midsize
Luxury
Complete
sample
7.69%
2.07%
0.45%
1.07%
1.20%
1.44%
0.54%
1.15%
Claim size
Total sample
Claim size>0
Mean
SD
Mean
SD
1,608.42
502.80
55.39
218.91
183.95
368.18
159.92
5,685.66
4,213.33
828.71
2,494.36
1,854.85
3,537.86
2,586.33
20,909.50
24,259.88
12,406.50
20,445.81
15,345.89
25,586.44
29,834.43
1,007.63
16,871.42
716.30
12,964.14
7,390.55
15,029.12
19,322.83
N
26
2,895
448
29,888
15,517
11,397
6,343
242.50 2,679.22 21,002.30 13,643.46 33,257
4. Results and Discussion
ZAIG model was estimated by the package GAMLSS (Stasinopoulos and Rigby,
2007 and Stasinopoulos, Rigby and Akantziliotou, 2008) for R system (R Development
Core Team, 2008). Tweedie model was estimated in SPSS package (version 16.0). Table
1 presents the results of our estimates. The Vehicle’s age, Model/Manufacturer, territory
and Vehicle Body served as the independent variables. The dependent variable is the
amount expended with claims that refers to a robbery or an accident with repair costs
greater than the vehicles value. Coefficients t-test and error term of the estimated
models are reported.
Table 1: Results
Variable
Intercept
Vehicle’s
Age (II)
Vehicle’s
Age (III)
Vehicle’s
Age (IV)
Vehicle’s
Age (V)
Vehicle’s
Age (VI)
Model/Manufacturer (II)
Model/Manu-
Tweedie
Equation 1
Estimate
7.61**
(.69)
-0.05
(.16)
-0.19
(.18)
-0.43
(.23)
-0.16
(.21)
-0.22
(.30)
0.56
(.31)
-1.26**
ZAIG
Equation 2: =1-
Equation 3: 
(Claim probability)
(Claim size)
t value Estimate t value Estimate t value
11.08
-3.11
9.91**
16.16
-2.35**
(.76)
(.61)
-0.30
0.14
0.93
-0.30**
-2.86
(.15)
(.11)
-1.09
0.18
1.08
-0.53
-0.44
(.16)
(1.19)
-1.91
0.16
0.80
-0.81**
-6.62
(.20)
(.12)
-0.78
0.79**
4.31
-1.07**
-10.11
(.18)
(.11)
-0.73
0.77**
3.13
-1.24**
-11.50
(.25)
(.11)
1.79
-0.12
-0.40
0.51**
2.66
(.29)
(.19)
-0.04
-0.20
-2.84
-1.08**
-2.74
Equation 4: 
Estimate
-0.55**
(.09)
-0.22*
(.10)
2.93**
(.11)
-0.06
(.14)
-0.32*
(.12)
-0.56**
(.17)
t value
-58.90
-2.11
25.70
-0.39
-2.54
-3.27
Variable
facturer (III)
Model/Manufacturer (IV)
Model/Manufacturer (V)
Model/Manufacturer (VI)
Model/Manufacturer (VII)
Model/ManuFacturer (VIII)
Model/ManuFacturer (IX)
Model/ManuFacturer (X)
Origin
Territory
(II)
Territory
(III)
Territory
(IV)
Vehicle Body
(II)
Vehicle Body
(II)
Scale
Tweedie
Equation 1
Estimate
(.44)
-0.71**
(.16)
0.47
(.42)
-0.45
(.55)
-1.23**
(.26)
-0.78**
(.28)
0.05
(0.26)
-0.69*
(.29)
-0.19
(.34)
-1.10
(.69)
-3.31**
(.98)
-1.91**
(.68)
0.44**
(.14)
-0.36
(.24)
104.51
ZAIG
Equation 2: =1-
Equation 3: 
(Claim probability)
(Claim size)
t value Estimate t value Estimate t value
(.19)
(.39)
-0.08
-1.25
-4.42
-0.52**
-3.65
(. 07)
(.14)
1.13
0.06
0.41
0.90**
2.86
(. 14)
(.31)
-0.83
-0.76
-1.65
0.25
0.48
(.46)
(.52)
0.19
1.63
-4.82
-1.31**
-5.56
(.12)
(.24)
-0.82
-3.53
0.15
1.45
-2.77
(.23)
(.10)
0.20
-0.19
-0.80
0.22
1.31
(.23)
(.17)
0.33
1.55
-2.41
-1.04**
-3.85
(.22)
(.27)
-0.54
-0.08
-0.28
0.43**
3.58
(.28)
(.12)
-1.58
-1.32
-1.75
0.25
0.40
(.75)
(.62)
-0.23
-0.31
-3.38
-2.90**
-2.82
(.73)
(1.03)
0.20
0.32
-2.81
-1.90**
-2.55
(.62)
(.75)
0.11
0.86
3.19
0.18*
2.50
(.12)
(.07)
-1.51
-1.02**
-4.63
0.46**
6.55
(.22)
(.07)
Equation 4: 
Estimate
t value
0.35**
(.08)
-0.68**
(.13)
4.50
Several explanatory variables were significantly related with dependent
variables. Considering all vehicle´s age variables, we can say that there is a significant
increase in the expected claim probability as the vehicle becomes older (0.79 and 0.77).
On the other hand, the expected claim size decreases for older vehicles. This is in line
with intuition, because old vehicles are less expensive to replace and there is also the
fact that old vehicles are more attractive to be stolen. One might suggest that old
vehicles are more attractive because there is a great auto-part replacement market that is
flooded with parts of these old cars and old cars tend to be poorly maintained, which
increases the probability of accidents.
The variable model/manufacturer is related to claims probability and size. In
general, the model/manufacturer is more related to claim probability than claim size.
Noteworthy, there is no way to clearly identify whether claim size or probability is
causing the significance of Tweedie´s coefficients.
-5.06
The variable for origin of vehicles influences the claim size (Equation 3). This
suggests, ceteris paribus, that imported vehicles tend to be related to great claim size.
Overall, imported vehicles are of high purchase value.
The variable territory is generally related to claim probability, however not
related to claim size. One might suggest that claim size depends solely on the vehicle’s
value and not on the territory of the claim. Based only on Tweedie model, there is no
way to clearly identify whether claim size or probability is causing the significance of
Tweedie´s coefficients for the territory variable.
Vehicles’ body is related to claim size and probability. The claim probably
decreases for luxury vehicles. However luxury cars lead to higher claim size followed
by midsize cars. When taking Tweedie´s results, the difficulty to accurately predict
claims is shown in the non significance of Tweedie’s coefficient for luxury cars (i.e.
vehicle body 3). One might suggest that the not significant coefficient is due to a
negative claim probability effect and a positive claim size effect, as found in ZAIG’s
coefficients.
5. Concluding Remarks
In this work we tackled a well-known problem in the insurance industry that is
the proper pricing of an insurance policy. Employing the ZAIG method of estimation
for claims and risks in the insurance industry, we found distinct factors that influence
claim size and probability. Factors like territory, vehicles´ advanced age, origin and
body influence distinctly claim size and probability. The distinct impact is not always
present in Tweedie’s estimated model. The ZAIG method of estimation also allows for
insurance companies to create a score system to predict claims, based on the logistic
model. This score system intends to identify police holders who tend to be more risky.
The estimated models may be employed to develop a strategy for premium pricing.
Some limitations of this study may be recognized. First, the methods require a
high computational effort which might preclude the use of large data sets. Second, there
is room for developing suitable methods for longitudinal data analysis. Future work may
consider the use of estimating equations techniques or multivariate ZAIG distributions.
We concentrated our research on the auto insurance industry and specific vehicles’
variables. Studies may address other insurance industries and include customer related
variables.
6. References
Boland, Philip J. Statistical and probabilistic methods in actuarial science. Boca
Raton: Chapman & Hall/CRC, 2007. 351 p.
Doherty N. A. (1981). Is Rate Classification Profitable?. The Journal of Risk and
Insurance, Vol. 48, (2), 286-295.
Gan, N. (2000). General zero-inflated models and their applications. (Phd
Thesis). North Carolina State University.
Heller, G., Stasinopoulos, M. and Rigby, B. (2006) The zero-adjusted inverse
Gaussian distribution as a model for insurance claims. Proceedings of the 21th
International Workshop on Statistical Modelling. J. Hinde, J. Einbeck, and J. Newell
(Eds.).
226-233,
Ireland,
Galway.
Available
in
http://studweb.north.londonmet.ac.uk/~stasinom/papers/ZAIG.pdf.
Accessed on 10/27/2008.
King, G. and Zeng, L. (2001). Logistic regression in rare events data. Political
Analysis, 9, 137-163.
Meulbroek L. (2001). A Better Way to Manage Risk. Harvard Business
Review, Feb 2001, 22-23.
Jørgensen, B. (1987). Exponential dispersion models. Journal of the Royall
Statistical Society, ser. B., 49, 127-162.
Jong, Piet de; Heller, Gillian Z. Generalized linear models for insurance data.
Cambridge: Cambridge University Press, 2008. 196 p.
Jørgensen, B. (1997). Theory of Dispersion Models. Chapman & Hall,
London.
Jørgensen, B and de Souza, M.C.P. (1994). Fitting Tweedie´s compound
Poisson model to insurance claims data. Scandinavian Actuarial Journal: 69-93.
R Development Core Team (2008). R: A Language and Environment for
Statistical Computing. Available in http://www.R-project.org. Accessed on 12/01/2008.
Smyth, G.K. and Jørgensen, B. (2002). Fitting tweedie’s compound poisson
model to insurance claims data: dispersion modeling. ASTIN Bulletin 32: 143-157.
Stasinopoulos, D. M. Rigby R.A. (2007). Generalized additive models for
location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, (7),
Dec 2007,
Stasinopoulos, M. , Rigby, B. and Akantziliotou, C.(2008). gamlss:
Generalized Additive Models for Location Scale and Shape. Available in
http://www.gamlss.com/. Accessed on 12/05/08.
Weisberg, H. I.; Tomberlin T. J. and Chatterjee S. (1984). Predicting Insurance
Losses under Cross-Classification: A Comparison of Alternative Approaches. Journal
of Business & Economic Statistics. Vol. 2, (2)m 170-178.
Weisberg, H. I. and Tomberlin, T. J. (1982). A Statistical Perspective on
Actuarial Methods for Estimating Pure Premiums from Cross-Classified Data. The
Journal of Risk and Insurance, Vol. 49, (4), 539-563.
Wüthrich, Mario V.; Merz, Michael. Stochastic claims reserving methods in
insurance. West Sussex: John Wiley & Sons, Ltd, 2008. 424 p.