Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Estimating Claim Size and Probability in the Auto-insurance Industry: the Zero-adjusted Inverse Gaussian (ZAIG) Distribution Adriana Bruscato Bortoluzzo Danny Pimentel Claro Marco Antonio Leonel Caetano Rinaldo Artes Insper Working Paper WPE: 175/2009 Copyright Insper. Todos os direitos reservados. É proibida a reprodução parcial ou integral do conteúdo deste documento por qualquer meio de distribuição, digital ou impresso, sem a expressa autorização do Insper ou de seu autor. A reprodução para fins didáticos é permitida observando-sea citação completa do documento Estimating Claim Size and Probability in the Auto-insurance Industry: the Zero-adjusted Inverse Gaussian (ZAIG) Distribution Adriana Bruscato Bortoluzzo [email protected] Danny Pimentel CLaro Marco Antonio Leonel Caetano Rinaldo Artes This article aims at the estimation of insurance claims from an auto data set. Using a ZAIG method, we identify factors that influence claim size and probability, and compared the results with the analysis of a Tweedie method. Results show that ZAIG can accurately predict claim size and probability. Factors like territory, vehicles´ advanced age, origin and body influence distinctly claim size and probability. The distinct impact is not always present in Tweedie’s estimated model. Auto insurers should consider estimating risk premium using ZAIG method. The fitted models may be useful to develop a strategy for premium pricing. Keywords: ZAIG method, Tweedie method, Premium pricing, Insurance 1. Introduction There is a well-known problem in the insurance industry that is the proper pricing of an insurance policy. The pure premium of an insurance company to an insured individual is made up of two components: claim probability and expected claim size. The claim probability for any individual is related to the expected number of claims occurring in a given year. The claim size is simply the dollar cost associated with each claim. It is extensively reported in literature the difficulty to estimate size and probability of claims in the insurance industry (e.g. Jong and Heller, 2008). In the past, the main difficulty was related to the credibility of the data sets of insurance companies (Weisberg and Tomberlin, 1982). Insurance data sets were typically very large, of the order of ten of thousands up to millions of cases. Problems arise such as missing values and inconsistent or invalid records. As current information technology systems have become sophisticated every year, the processing of information is more credible than ever before. The challenge is then to employ a proper statistical technique to analyze insurance data. Claims and risks have long been estimated as a pure algorithmic technique or a simple stochastic technique (Wüthrich and Merz, 2008). These methods result in poor estimations. Jørgensen and de Souza (1994) suggested a Poisson sum of Gamma random variables called Tweedie to estimate insurance risk. According to Smyth and Jørgensen, another problem is that, based on the proposed models, no estimation for claim sizes is possible. One problem with the Tweedie distribution model is that more explanatory variables are likely to be found significant (Smyth and Jørgensen, 2002). Recent literature realized that a zero-adjusted Inverse Gaussian (ZAIG) distribution may be appropriate to estimate claim and risk in insurance data (Heller, Stasinopoulos and Rigby, 2006). A mixed discrete-continuous model, with a probability mass at zero and an Inverse Gaussian continuous component, appears to accurately estimate in extreme right skewness distribution. This suggests that probabilities can be calculated from data sets with a large number of zero claims. The ZAIG model explicitly specifies a logit-linear model for the occurrence of a claim (i.e. claim probability). When a claim has occurred, the ZAIG model also specifies log-linear models for the mean claim size and the dispersion of claim sizes. It is important to measure the probability and size of claims separately because it is possible for the probability to depend on a set of independent variables different from those that influence claim size. Therefore, ZAIG estimation appears to be more appropriate to estimate price of insurance policies. Once a method of estimation is defined, the challenge is to identify potential explanatory variables. Typically, policy holders are divided into discrete classes on the basis of certain measurable characteristics predictive of their propensity to generate losses. We evaluate claims by considering variables related to vehicles that are frequently used in literature. In addition to territory, claims have also been studied according to the brand manufacturer and vehicle’s characteristics: age, body and origin. Following previous research all of these variables must become part of the estimation. Our objective is to present the ZAIG method of estimation to estimate probability claims and expected claim size in the insurance industry and to formally test the results with the estimation of a Tweedie regression model using insurance data set. An insurance data was collected to analyze the impact of factors estimated by Tweedie and ZAIG methods. This work is divided in six sections. In the next section, the theoretical background is discussed based on previous research on claim insurance estimation. Also the ZAIG method is presented in the section. The third section discusses the methodology and the data set. The fourth section presents the analysis of results and the comparison of findings of the two methods, ZAIG and Tweedie. Finally, concluding remarks highlight the major contributions of our study. 2. Theoretical Background Insurance: Importance of predictions and predictors Insurance companies are constantly looking for ways to better predict claims. Overall, insurance involves the sum of a large number of individual risks from which very few insurance claims will be reported. Meulbroek (2001) argues that insurance companies need to treat risk management as a series of related factors and events. Boland (2007) suggests that, in order to handle claims arising from incidents that already occurred; insurers must employ methods of prediction to deal with the extent of its liability. Therefore, an insurance company has to find ways to predict claims and appropriately charge a premium to take responsibility for the risk. The prediction problem has to be considered in the light of competitive market insurance (Weisberg, Tomberlin and Chatterjee, 1984). It is possible for an insurer to benefit at least temporarily by identifying segments of the market that are currently being overcharged and offering coverage at lower rates or by avoiding segments that are being undercharged (Doherty, 1981). Regulators are usually concerned about the possibility of rate structures that severely penalize individuals with some characteristics (e.g. place they live, model of vehicle). Therefore, insurers look for better ways to capture the characteristics of individuals that affect claim size and probability, and consequently discriminate among insurer drivers that have high propensity to generate losses. Insurance companies attempt to estimate reasonably price insurance policies based on the losses reported in certain kinds of policy-holders. The estimation has to consider past data in order to grasp trends that occurred (Weisberg and Tomberlin, 1982). Information available to predict the price for a period in the future usually consists of the claim experience for a population or a large sample from the population over a period in the past. Precise estimation may consider a large number of exposures in a data set and a stable claim generation process over time. The predictors to estimate price insurance policy were selected from the automobile industry. In our study, we consider the issue of price prediction in the context of automobile industry because the most sophisticated proposals have been developed in this industry (Jong and Heller, 2008). Previous research in the automobile setting includes predictors such as territory (e.g. Chang and Fairley, 1979) and brand manufacturer (e.g. Heller, Stasinopoulos and Rigby, 2006). Weisberg, Tomberlin and Chatterjee (1984) suggest including variables associated with the status of the vehicle, such as age, body, origin. Previous studies have recognized the use of Tweedie method of estimating auto insurance claims (Smyth and Jørgensen, 2002) and recent studies have shown that ZAIG method may produce accurate models of estimation (Heller, Stasinopoulos and Rigby, 2006). In order to estimate, it is necessary to let yi be the amount expended with claims for client i and xi be a vector of independent variables related to this client. One may represent the variable yi as 0, with probability (1 - π i ) yi = , Wi , with probability π i where Wi is a positive right skewed distribution. This type of variable belongs to the class of the zero inflated probability distributions (e.g. Gan, 2000). The parameter π i is the claim probability and Wi represents the claim size, related to client i. It is important to note that a claim is, in general, a rare event. A small proportion of claims in a sample may lead to problems in predicting claim occurrence by a logistic model because, in this case, the predicted probabilities tend to be small. King and Zeng (2001) proposed a correction to be used in these situations. They used the fact that, in the presence of rare events, the independent variable coefficients are consistent, but the intercept may not be. Tweedie regression models A Tweedie distribution (Jørgensen, 1987, 1997) is a member of the class of exponential dispersion models. It is defined as a distribution of the exponential family (e.g. Jong and Heller, 2008) with mean µ and variance φ µ p; as in Jørgensen and de Souza (1994) and Smyth and Jørgensen (2002), it will be considered, in this work, the case 1<p<2. It is possible to write (i 0, if ( i = 0 yi = , Wi = ∑ X ij , j =i Wi , if ( i > 0 (1) where (i is a Poisson random variable that represents the number of claims occurred for the client i and X i1 , L , X i(i are independent identically distributed Gamma random variables (continuous variable). As a consequence Wi also follows a Gamma distribution, which has a positive and right skewed density probability function. In this work, it was used a log-linear Tweedie regression model, given by µ i = ex γ , T i where xi is a matrix of independent variables and γ is the parametric vector. ZAIG regression model The variable yi follows a ZAIG distribution (Heller, Stasinopoulos and Rigby, 2006) if Wi is a Gaussian inverse random variable. The Gaussian inverse is a positive and highly skewed distribution with two parameters: the mean (µi) and a dispersion parameter (λi). It may be proved that E ( yi ) = π i µ i and ( ) Var ( yi ) = π i µ i2 1 − π i + µ i λ i2 . In the context of this work, µi is the expected claim size and λi is a parameter related to claim size dispersion. It is possible to propose regression models for πi, µi and λi as π i = h 1(xiT β ), µ i = h 2 (ziT γ ) , λ i = h 3(wiT δ ), where h1, h2 and h2 are continuous twice differentiable invertible functions, β, γ and δ are parametric vectors and xi, zi and wi are vector of independent variables for client i. In an insurance context, it is highly convenient to use different sets of independent variables to model these three parameters. Consider, for instance, a variable that indicates the local of the residence of the car´s owner. It is well known that robbery rates vary in a city, but the price of a vehicle do not. It is expected that the local of residence will be important in explaining πi, but not µi, then one may include the variable in the probability model but not in the expected claim size model. This example illustrates Heller, Stasinopoulos and Rigby (2006) statement “… A problem with the Tweedie distribution model is that the probabilities at zero can not modeled explicit as a function of explanatory variables…”. The following models are adjusted: e xi β T πi = 1+ e xTi β , µ i = e xi γ and λ i = e xi δ . T T (2) In short this modeling option assures that µi and λi are, as expected, always positive and that πi is modeled as a logistic regression. It is important to remember the bad performance of logistic models in predicting claims, when the frequency of claims in the sample is small. 3. Methodology A sample was collected from a major automobile insurance company resulting in a data set of 33,257 passenger vehicle records. The data set was processed to clean for missing values and selection of relevant variables. We focus the analysis on yearly claims that refer to robberies or accidents with repair costs greater than the vehicles value. Claim size refers to the amount of dollars paid as liability of a claim. Claim probability refers to the percentage of claims of a year period. The average annual claim probability is 1.15%, and the average claim size is $21,002.30. For every insurance policy holder, twenty explanatory variables were employed. The variables are coded by means of binary variables, described in Table 1. Table 2 shows the descriptive statistics of the variables. Table 1. List of explanatory variables Variable Description Vehicle’s Age (II) Equals 1 if the insured vehicle has one or two years (related to contract’s year), 0 otherwise Vehicle’s Age (III) Equals 1 if the insured vehicle has three or four years (related to contract’s year), 0 otherwise Vehicle’s Age (IV) Equals 1 if the insured vehicle has five or six years (related to contract’s year), 0 otherwise Vehicle’s Age (V) Equals 1 if the insured vehicle has seven to nine years (related to contract’s year), 0 otherwise Vehicle’s Age (VI) Equals 1 if the insured vehicle has ten or more years (related to contract’s year), 0 otherwise Origin Equals 1 if the insured vehicle is imported and 0 if it is national Model/Manufacturer A combination of different models and manufacturers in the data set. Groups were assigned on the basis of a CHAID analysis. Territory Clusters of territory were assigned on the basis of Hierarchical Cluster Analysis Method for claim size. It represents a division of the area into a set of exclusive areas thought to be relatively homogeneous in terms of claims. Vehicles are assigned to territories according to where they are usually garaged. Vehicle Body (II) Equals 1 for midsize vehicle, 0 otherwise Vehicle Body (III) Equals 1 for luxury vehicle, 0 otherwise Intercept Vehicle with zero years (related to contract’s year - Vehicle’s Age (I)), national, Model/Manufacturer (I), Territory (I) and small vehicle (Vehicle Body (I)) Table 2. Descriptive statistics for claim size and for proportion of claims Claim size Proportion Variable Value Total sample Claim size>0 2 of claim Mean SD Mean SD I 0.94% 306.50 3,511.06 32,645.59 16,132.40 II 1.06% 244.61 2,703.39 22,974.27 12,868.49 10,707 III 1.12% 210.90 2,181.39 18,766.88 8,726.57 6,674 IV 1.07% 157.03 1,672.73 14,660.81 7,063.96 3,361 V 2.08% 221.47 1,624.80 10,672.59 3,990.81 2,554 VI 2.17% 199.84 1,482.98 9,210.65 4,374.82 1,014 National 1.19% 251.06 2,735.15 21,049.92 13,781.40 30,771 Imported 0.68% 136.59 1,849.15 19,974.17 10,491.21 Model/ I 1.68% 330.31 2,883.55 19,676.86 10,732.91 11,676 Manufacturer II 1.27% 627.34 6,009.68 49,380.62 21,589.70 Vehicle’s Age Origin 8947 2,486 1,102 Claim size Proportion Variable Value Total sample Claim size>0 2 of claim Mean Territory Vehicle Body SD Mean SD III 0.37% 73.19 1,313.43 19,644.96 9,559.60 1,879 IV 1.07% 152.81 1,646.32 14,244.86 7,255.25 6,432 V 2.67% 416.29 3,139.55 15,594.96 11,948.36 487 VI 0.81% 260.14 3,607.39 32,256.95 26,900.88 620 VII 0.52% 91.11 1,352.48 17,497.01 7,007.88 3,841 VIII 0.72% 147.18 2,177.97 20,401.90 16,009.04 2,911 IX 1.35% 424.82 4,065.73 31,547.34 16,003.95 1,708 X 0.62% 203.19 2,686.77 33,030.77 9,728.11 2,601 I 7.69% 1,608.42 5,685.66 20,909.50 1,007.63 26 II 2.07% 502.80 4,213.33 24,259.88 16,871.42 2,895 III 0.45% 55.39 828.71 12,406.50 716.30 IV 1.07% 218.91 2,494.36 20,445.81 12,964.14 29,888 Small 1.20% 183.95 1,854.85 15,345.89 7,390.55 15,517 Midsize 1.44% 368.18 3,537.86 25,586.44 15,029.12 11,397 Luxury 0.54% 159.92 2,586.33 29,834.43 19,322.83 1.15% 242.50 2,679.22 21,002.30 13,643.46 33,257 448 6,343 Complete sample 4. Results and Discussion ZAIG model was estimated by the package GAMLSS (Stasinopoulos and Rigby, 2007; Stasinopoulos, Rigby and Akantziliotou, 2008) for R system (R Development Core Team, 2008). Tweedie model was estimated in SPSS package (version 16.0). Table 1 presents the results of our estimates. The Vehicle’s age, Model/Manufacturer, territory and Vehicle Body served as the independent variables. The dependent variable is the amount expended with claims that refers to a robbery or an accident with repair costs greater than the vehicles value. Coefficients t-statistic and standard error for the estimated models are reported. Table 3. Results ZAIG Tweedie Variable Equation 2: ν=1-π π Equation 3: µ (Claim probability) (Claim size) Equation 1 Estimate t value Estimate Intercept 7.61** 11.08 (.69) Vehicle’s -0.05 Age (II) (.16) Vehicle’s -0.19 Age (III) (.18) Vehicle’s -0.43 Age (IV) (.23) Vehicle’s -0.16 Age (V) (.21) Vehicle’s -0.22 Age (VI) (.30) Model/Manu- 0.56 facturer (II) (.31) Model/Manu- -1.26** facturer (III) (.44) Model/Manu- -0.71** facturer (IV) (.16) Model/Manu- 0.47 facturer (V) (.42) Model/Manu- -0.45 -2.35** t value Estimate t value Estimate t value -3.11 9.91** 16.16 -0.55** -58.90 (.76) -0.30 0.14 (.61) 0.93 (.15) -1.09 0.18 0.16 1.08 0.79** 0.80 0.77** 4.31 -0.12 3.13 -1.08** -0.40 -0.52** -2.74 0.90** -3.65 -0.76 -10.11 -1.24** 0.51** -0.04 -0.08 2.86 0.06 -11.50 0.25 -0.06 -0.32* -0.56** (.17) 2.66 -0.20 -1.25 0.41 (. 14) -1.65 25.70 -0.39 -2.54 (.12) (. 07) (.31) -0.83 -1.07** 2.93** (.14) (.19) (.14) 1.13 -6.62 (.19) (.39) -4.42 -0.81** -2.11 (.11) (.11) (.29) -2.84 -0.44 (.11) (.25) 1.79 -0.53 -0.22* (.10) (.12) (.18) -0.73 -2.86 (1.19) (.20) -0.78 -0.30** (.09) (.11) (.16) -1.91 Equation 4: λ 0.48 -3.27 ZAIG Tweedie Variable Equation 2: ν=1-π π Equation 3: µ (Claim probability) (Claim size) Equation 1 Estimate t value Estimate facturer (VI) (.55) Model/Manu- -1.23** facturer (VII) (.26) Model/Manu- -0.78** Facturer (VIII) (.28) Model/Manu- 0.05 Facturer (IX) (0.26) Model/Manu- -0.69* Facturer (X) (.29) Origin -0.19 (.46) -4.82 -1.10 (II) (.69) Territory -3.31** (III) (.98) Territory -1.91** (IV) (.68) Vehicle Body 0.44** (II) (.14) Vehicle Body -0.36 (II) (.24) Scale 104.51 -1.31** -2.77 -0.82 -5.56 -0.19 -3.53 -1.04** -0.80 -0.08 -3.85 -1.32 -0.28 -2.90** -1.75 -1.90** -2.82 0.11 -2.55 -1.02** (.22) 0.22 4.50 1.31 0.33 1.55 0.43** 3.58 0.25 0.40 -0.23 -0.31 0.20 0.32 (.62) 0.86 (.12) -1.51 0.35** 1.45 (.73) (.75) 3.19 0.15 (.62) (1.03) -2.81 1.63 (.12) (.75) -3.38 0.19 (.22) (.28) -1.58 t value (.17) (.27) -0.54 Estimate (.10) (.23) -2.41 t value (.12) (.23) 0.20 Estimate (.52) (.24) (.34) Territory t value Equation 4: λ 0.18* 2.50 (.07) -4.63 0.46** (.07) (.08) 6.55 -0.68** -5.06 (.13) *p<0.05; **p<0.01. Regression coefficients are standardized coefficients (β) and standard error within parentheses. Several explanatory variables were significantly related with dependent variables. Considering all vehicle´s age variables, we can say that there is a significant increase in the expected claim probability as the vehicle becomes older (0.79 and 0.77). On the other hand, the expected claim size decreases for older vehicles. This is in line with intuition, because old vehicles are less expensive to replace and there is also the fact that old vehicles are more attractive to be stolen. One might suggest that old vehicles are more attractive because there is a great auto-part replacement market that is flooded with parts of these old cars and old cars tend to be poorly maintained, which increases the probability of accidents. The variable model/manufacturer is related to claims probability and size. In general, the model/manufacturer is more related to claim probability than claim size. Noteworthy, there is no way to clearly identify whether claim size or probability is causing the significance of Tweedie´s coefficients. The variable for origin of vehicles influences the claim size (Equation 3). This suggests, ceteris paribus, that imported vehicles tend to be related to great claim size. Overall, imported vehicles are of high purchase value. The variable territory is generally related to claim probability, however not related to claim size. One might suggest that claim size depends solely on the vehicle’s value and not on the territory of the claim. Based only on Tweedie model, there is no way to clearly identify whether claim size or probability is causing the significance of Tweedie´s coefficients for the territory variable. Vehicles’ body is related to claim size and probability. The claim probably decreases for luxury vehicles. However luxury cars lead to higher claim size followed by midsize cars. When taking Tweedie´s results, the difficulty to accurately predict claims is shown in the non significance of Tweedie’s coefficient for luxury cars (i.e. vehicle body 3). One might suggest that the not significant coefficient is due to a negative claim probability effect and a positive claim size effect, as found in ZAIG’s coefficients. 5. Concluding Remarks In this work we tackled a well-known problem in the insurance industry that is the proper pricing of an insurance policy. Employing the ZAIG method of estimation for claims and risks in the insurance industry, we found distinct factors that influence claim size and probability. Factors like territory, vehicles´ advanced age, origin and body influence distinctly claim size and probability. The distinct impact is not always present in Tweedie’s estimated model. The ZAIG method of estimation also allows for insurance companies to create a score system to predict claims, based on the logistic model. This score system intends to identify police holders who tend to be more risky. The estimated models may be employed to develop a strategy for premium pricing. Some limitations of this study may be recognized. First, the methods require a high computational effort which might preclude the use of large data sets. Second, there is room for developing suitable methods for longitudinal data analysis. Future work may consider the use of estimating equations techniques or multivariate ZAIG distributions. We concentrated our research on the auto insurance industry and specific vehicles’ variables. Studies may address other insurance industries and include customer related variables. 6. References [1] Boland, Philip J. Statistical and probabilistic methods in actuarial science. Boca Raton: Chapman & Hall/CRC, 2007. 351 p. [2] Chang, L. and Fairley W. B. Pricing Automobile Insurance under Multivariate Classification of Risks: Additive versus Multiplicative. The Journal of Risk and Insurance. Vol. 46, (2), 75-98, 1979. [3] Doherty, N. A., Is Rate Classification Profitable? The Journal of Risk and Insurance, Vol. 48, (2), 286-295, 1981. [4] Gan, N., General zero-inflated models and their applications. North Carolina State University, 2000. Phd Thesis. [5] Heller, G., Stasinopoulos, M. and Rigby, B. The zero-adjusted inverse Gaussian distribution as a model for insurance claims. Proceedings of the 21th International Workshop on Statistical Modelling. J. Hinde, J. Einbeck, and J. Newell (Eds.). 226-233, Ireland, Galway, 2006. Available in http://studweb.north.londonmet.ac.uk/~stasinom/papers. Accessed on 10/27/2008. [6] Jong, Piet de; Heller, Gillian Z. Generalized linear models for insurance data. Cambridge: Cambridge University Press, 2008. 196 p. [7] Jørgensen, B. Exponential dispersion models. Journal of the Royall Statistical Society, ser. B., 49, 127-162, 1987. [8] Jørgensen, B. Theory of Dispersion Models. London: Chapman & Hall, 1997. [9] Jørgensen, B and De Souza, M. C. P. Fitting Tweedie´s compound Poisson model to insurance claims data. Scandinavian Actuarial Journal, 69-93, 1994. [10] King, G. and Zeng, L. Logistic regression in rare events data. Political Analysis, 9, 137- 163, 2001. [11] Meulbroek, L. A Better Way to Manage Risk. Harvard Business Review, 22-23, 2001. [12] R Development Core Team. R: A Language and Environment for Statistical Computing, 2008. Available in http://www.R-project.org. Accessed on 12/01/2008. [13] Smyth, G. K. and Jørgensen, B. Fitting tweedie’s compound poisson model to insurance claims data: dispersion modeling. ASTI( Bulletin 32: 143-157, 2002. [14] Stasinopoulos, D. M. and Rigby, R. A. Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, (7), 2007. [15] Stasinopoulos, M., Rigby, B. and Akantziliotou, C. gamlss: Generalized Additive Models for Location Scale and Shape, 2008. Available in http://www.gamlss.com/. Accessed on 12/05/08. [16] Weisberg, H. I.; Tomberlin, T. J. and Chartterjee, S. Predicting Insurance Losses under Cross-Classification: A Comparison of Alternative Approaches. Journal of Business & Economic Statistics. Vol. 2, (2)m 170-178, 1984. [17] Weisberg, H. I. and Tomberlin, T. J. A Statistical Perspective on Actuarial Methods for Estimating Pure Premiums from Cross-Classified Data. The Journal of Risk and Insurance, Vol. 49, (4), 539-563, 1982. [18] Wüthrich, M. V.; Merz. M. Stochastic claims reserving methods in insurance. West Sussex: John Wiley & Sons, 2008. 424 p.