Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Estimating claim size and probability in the auto-insurance industry: the zeroadjusted Inverse Gaussian (ZAIG) distribution Danny Pimentel Claro Insper Working Paper WPE: 165/2009 Copyright Insper. Todos os direitos reservados. É proibida a reprodução parcial ou integral do conteúdo deste documento por qualquer meio de distribuição, digital ou impresso, sem a expressa autorização do Insper ou de seu autor. A reprodução para fins didáticos é permitida observando-sea citação completa do documento Estimating claim size and probability in the auto-insurance industry: the zero-adjusted Inverse Gaussian (ZAIG) distribution Abstract This article aims at the estimation of insurance claims from an auto data set. Using a ZAIG method, we identify factors that influence claim size and probability, and compared the results with the analysis of a Tweedie method. Results show that ZAIG can accurately predict claim size and probability. Factors like territory, vehicles´ advanced age, origin and body influence distinctly claim size and probability. The distinct impact is not always present in Tweedie’s estimated model. Auto insurers should consider estimating risk premium using ZAIG method. The fitted models may be useful to develop a strategy for premium pricing. 1. Introduction There is a well-known problem in the insurance industry that is the proper pricing of an insurance policy. The pure premium of an insurance company to an insured individual is made up of two components: claim probability and expected claim size. The claim probability for any individual is related to the expected number of claims occurring in a given year. The claim size is simply the dollar cost associated with each claim. It is extensively reported in literature the difficulty to estimate size and probability of claims in the insurance industry (e.g. Jong and Heller, 2008). In the past, the main difficulty was related to the credibility of the data sets of insurance companies (Weisberg and Tomberling, 1982). Insurance data sets were typically very large, of the order of ten of thousands up to millions of cases. Problems arise such as missing values and inconsistent or invalid records. As current IT systems have become sophisticatedevery year, the processing of information is more credible than ever before. The challenge is then to employ a proper statistical technique to analyze insurance data. Claims and risks have long been estimated as a pure algorithmic technique or a simple stochastic technique (Wuthrich and Merz, 2008). These methods resulted in poor estimations. Jørgensen and de Souza (1994) suggested a Poisson sum of Gamma random varieties called Tweedie to estimate insurance risk. One problem with the Tweedie distribution model is that more explanatory variables are likely to be found significant (Smyth and Jørgensen, 2002). According to Smyth and Jørgensen (2001), another problem is that, based on the proposed models, no estimation for claim sizes is possible. Recent literature realized that a zero-adjusted Inverse Gaussian (ZAIG) distribution may be appropriate to estimate claim and risk in insurance data (Heller, Stasinopoulos and Rigby, 2006). A mixed discrete-continuous model, with a probability mass at zero and an Inverse Gaussian continuous component, appears to accurately estimate in extreme right skewness distribution. This suggests that probabilities can be calculated from data sets with a large number of zero claims. The ZAIG model explicitly specifies a logit-linear model for the occurrence of a claim (i.e. claim probability). When a claim has occurred, the ZAIG model also specifies log-linear models for the mean claim size and the dispersion of claim sizes. It is important to measure the probability and size of claims separately because it is possible for the probability to depend on a set of independent variables different from those that influence claim size. Therefore, ZAIG estimation appears to be more appropriate to estimate price of insurance policies. Once a method of estimation is defined, the challenge is to identify potential explanatory variables. Typically, policy holders are divided into discrete classes on the basis of certain measurable characteristics predictive of their propensity to generate losses. We evaluate claims by considering variables related to vehicles that are frequently used in literature. In addition to territory, claims have also been studied according to the brand manufacturer and vehicle’s characteristics: age, body and origin. Following previous research all of these variables must become part of the estimation. Our objective is to present the ZAIG method of estimation to estimate probability claims and expected claim size in the insurance industry and to formally test the results with the estimation of a Tweedie regression model using insurance data set. An insurance data was collected to analyze the impact of factors estimated by Tweedie and ZAIG methods. This work is divided in six sections. In the next section, the theoretical background is discussed based on previous research on claim insurance estimation. Also the ZAIG method is presented in the section. The third section discusses the methodology and the data set. The fourth section presents the analysis of results and the comparison of findings of the two methods, ZAIG and Tweedie. Finally, concluding remarks highlight the major contributions of our study. 2. Theoretical Background Insurance: Importance of predictions and predictors Insurance companies are constantly looking for ways to better predict claims. Overall, insurance involves the sum of a large number of individual risks from which very few insurance claims will be reported. Meulbroek (2001) argues that insurance companies need to treat risk management as a series of related factors and events. Boland (2007) suggests that, in order to handle claims arising from incidents that already occurred, insurers must employ methods of prediction to deal with the extent of its liability. Therefore, an insurance company has to find ways to predict claims and appropriately charge a premium to take responsibility for the risk. The prediction problem has to be considered in the light of competitive market insurance (Weisberg, Tomberlin and Chatterjee, 1984). It is possible for an insurer to benefit at least temporarily by identifying segments of the market that are currently being overcharged and offering coverage at lower rates or by avoiding segments that are being undercharged (Doherty, 1981). Regulators are usually concerned about the possibility of rate structures that severely penalize individuals with some characteristics (e.g. place they live, model of vehicle). Therefore, insurers look for better ways to capture the characteristics of individuals that affect claim size and probability, and consequently discriminate among insurer drivers that have high propensity to generate losses. Insurance companies attempt to estimate reasonably price insurance policies based on the losses reported in certain kinds of policy-holders. The estimation has to consider past data in order to grasp trends that occurred (Weisberg and Tomberlin, 1982). Information available to predict the price for a period in the future usually consists of the claim experience for a population or a large sample from the population over a period in the past. Precise estimation may consider a large number of exposures in a data set and a stable claim generation process over time. The predictors to estimate price insurance policy were selected from the automobile industry. In our study, we consider the issue of price prediction in the context of automobile industry because the most sophisticated proposals have been developed in this industry (Jong and Heller, 2008). Previous research in the automobile setting includes predictors such as territory (e.g. Chang and Fairly, 1979) and brand manufacturer (e.g. Heller, Stasinopoulos and Rigby, 2006). Weisberg, Tomberlin and Chatterjee (1984) suggest including variables associated with the status of the vehicle, such as age, body, origin. Previous studies have recognized the use of Tweedie method of estimating auto insurance claims (Smyth and Jørgensen, 2001) and recent studies have shown that ZAIG method may produce accurate models of estimation (Heller, Stasinopoulos and Rigby, 2006). In order to estimate, it is necessary to let yi be the amount expended with claims for client i and xi be a vector of independent variables related to this client. One may represent variable yi as 0, with probability (1 - i ) , yi Wi , with probability i (1) where Wi is a positive right skewed distribution. This type of variable belongs to the class of the zero inflated probability distributions (e.g. Gan, 2001). The parameter is the claim probability and Wi represents the claim size. It is important to note that a claim is, in general, a rare event. A small proportion of claims in a sample may lead to problems in predicting claim occurrence by a logistic model because, in this case, the predicted probabilities tend to be small. King and Zeng (2002) proposed a correction to be used in these situations. They used the fact that, in the presence of rare events, the independent variable coefficients are consistent, but the intercept may not be. Tweedie regression models A Tweedie distribution (Jørgensen, 1987, 1997) is a member of the class of exponential dispersion models. It is defined as a distribution of the exponential family (e.g. Heller and Jong, 2007) with mean and variance p; as in Jørgensen and de Souza (1994) and Smyth and Jørgensen (2002), it will be considered, in this work, the case 1<p<2. It is possible to write Ni 0, if N i 0 , Wi X ij , (2) yi j i Wi , if N i 0 where Ni is a Poisson random variable and X i1 , , X iNi are independent identically distributed Gamma random variables (continuous variable). As a consequence Wi also follows a Gamma distribution, which has a positive and right skewed density probability function. In this work, it was used a log-linear Tweedie regression model, given by T i e xi , (3) ZAIG regression model The variable yi follows a ZAIG distribution (Heller, Stasinopoulos and Rigby, 2006) if Wi is a Gaussian inverse random variable. The Gaussian inverse is a positive and highly skewed distribution with two parameters: the mean (i) and a dispersion parameter (i). It may be proved that E ( yi ) i i and Var( yi ) i i2 1 i i i2 . (4) In the context of this work, I is the expected claim size and i is a parameter related to claim size dispersion. It is possible to propose regression models for i, i and i as i h 1xiT , i h 2 ziT , i h 3wiT , (5) where h1, h2 and h2 are continuous twice differentiable invertible functions, , and are parametric vectors and xi, zi and wi are vectors of independent variables. In an insurance context, it is highly convenient to use different sets of independent variables to model these three parameters. Consider, for instance, a variable that indicates the local of the residence of the car´s owner. It is well known that robbery rates vary in a city, but the price of a vehicle do not. It is expected that the local of residence will be important in explaining i, but not i, then one may include the variable in the probability model but not in the expected claim size model. This example illustrates Heller, Stasinopoulos and Rigby (2006) statement “… A problem with the Tweedie distribution model is that the probabilities at zero can not modeled explicit as a function of explanatory variables…”. The following models are adjusted e xi T i 1 e xTi , i e xi and i e xi . T T (6) In short this modeling option assures that i and i are, as expected, always positive and that i is modeled as a logistic regression. It is important to remember the bad performance of logistic models in predicting claims, when the frequency of claims in the sample is small. 3. Methodology A sample was collected from a major automobile insurance company resulting in a data set of 33,257 passenger vehicle records. The data set was processed to clean for missing values and selection of relevant variables. We focus the analysis on yearly claims that refer to robberies or accidents with repair costs greater than the vehicles value. Claim size refers to the amount of dollars paid as liability of a claim. Claim probability refers to the percentage of claims of a year period. The average annual claim probability is 1.15%, and the average claim size is $21,002.30. For every insurance policy holder, twenty explanatory variables were employed. The variables are coded by means of binary variables, described in Table 1. Table 2 shows the descriptive statistics of the variables. Table 1: List of explanatory variables Variable Vehicle’s Age (II) Vehicle’s Age (III) Vehicle’s Age (IV) Vehicle’s Age (V) Vehicle’s Age (VI) Origin Model/Manufacturer Territory Vehicle Body (II) Vehicle Body (III) Intercept Description Equals 1 if the insured vehicle has one or two years (related to contract’s year), 0 otherwise Equals 1 if the insured vehicle has three or four years (related to contract’s year), 0 otherwise Equals 1 if the insured vehicle has five or six years (related to contract’s year), 0 otherwise Equals 1 if the insured vehicle has seven to nine years (related to contract’s year), 0 otherwise Equals 1 if the insured vehicle has ten or more years (related to contract’s year), 0 otherwise Equals 1 if the insured vehicle is imported and 0 if it is national A combination of different models and manufacturers in the data set. Groups were assigned on the basis of a CHAID analysis. Clusters of territory were assigned on the basis of Hierarchical Cluster Analysis Method for claim size. It represents a division of the area into a set of exclusive areas thought to be relatively homogeneous in terms of claims. Vehicles are assigned to territories according to where they are usually garaged. Equals 1 for midsize vehicle, 0 otherwise Equals 1 for luxury vehicle, 0 otherwise Vehicle with zero years (related to contract’s year - Vehicle’s Age (I)), national, Model/Manufacturer (I), Territory (I) and small vehicle (Vehicle Body (I)) Table 2: Descriptive statistics for claim size and for proportion of claims Variable Value Vehicle’s Age I II III IV V VI National Imported I II III IV V VI VII VIII IX X Origin Model/ Manufacturer Proportion of claim 0.94% 1.06% 1.12% 1.07% 2.08% 2.17% 1.19% 0.68% 1.68% 1.27% 0.37% 1.07% 2.67% 0.81% 0.52% 0.72% 1.35% 0.62% Claim size Total sample Claim size>0 Mean SD Mean SD N 306.50 244.61 210.90 157.03 221.47 199.84 251.06 136.59 330.31 627.34 73.19 152.81 416.29 260.14 91.11 147.18 424.82 203.19 8947 10,707 6,674 3,361 2,554 1,014 30,771 2,486 11,676 1,102 1,879 6,432 487 620 3,841 2,911 1,708 2,601 3,511.06 2,703.39 2,181.39 1,672.73 1,624.80 1,482.98 2,735.15 1,849.15 2,883.55 6,009.68 1,313.43 1,646.32 3,139.55 3,607.39 1,352.48 2,177.97 4,065.73 2,686.77 32,645.59 22,974.27 18,766.88 14,660.81 10,672.59 9,210.65 21,049.92 19,974.17 19,676.86 49,380.62 19,644.96 14,244.86 15,594.96 32,256.95 17,497.01 20,401.90 31,547.34 33,030.77 16,132.40 12,868.49 8,726.57 7,063.96 3,990.81 4,374.82 13,781.40 10,491.21 10,732.91 21,589.70 9,559.60 7,255.25 11,948.36 26,900.88 7,007.88 16,009.04 16,003.95 9,728.11 Variable Territory Vehicle Body Value Proportion of claim I II III IV Small Midsize Luxury Complete sample 7.69% 2.07% 0.45% 1.07% 1.20% 1.44% 0.54% 1.15% Claim size Total sample Claim size>0 Mean SD Mean SD 1,608.42 502.80 55.39 218.91 183.95 368.18 159.92 5,685.66 4,213.33 828.71 2,494.36 1,854.85 3,537.86 2,586.33 20,909.50 24,259.88 12,406.50 20,445.81 15,345.89 25,586.44 29,834.43 1,007.63 16,871.42 716.30 12,964.14 7,390.55 15,029.12 19,322.83 N 26 2,895 448 29,888 15,517 11,397 6,343 242.50 2,679.22 21,002.30 13,643.46 33,257 4. Results and Discussion ZAIG model was estimated by the package GAMLSS (Stasinopoulos and Rigby, 2007 and Stasinopoulos, Rigby and Akantziliotou, 2008) for R system (R Development Core Team, 2008). Tweedie model was estimated in SPSS package (version 16.0). Table 1 presents the results of our estimates. The Vehicle’s age, Model/Manufacturer, territory and Vehicle Body served as the independent variables. The dependent variable is the amount expended with claims that refers to a robbery or an accident with repair costs greater than the vehicles value. Coefficients t-test and error term of the estimated models are reported. Table 1: Results Variable Intercept Vehicle’s Age (II) Vehicle’s Age (III) Vehicle’s Age (IV) Vehicle’s Age (V) Vehicle’s Age (VI) Model/Manufacturer (II) Model/Manu- Tweedie Equation 1 Estimate 7.61** (.69) -0.05 (.16) -0.19 (.18) -0.43 (.23) -0.16 (.21) -0.22 (.30) 0.56 (.31) -1.26** ZAIG Equation 2: =1- Equation 3: (Claim probability) (Claim size) t value Estimate t value Estimate t value 11.08 -3.11 9.91** 16.16 -2.35** (.76) (.61) -0.30 0.14 0.93 -0.30** -2.86 (.15) (.11) -1.09 0.18 1.08 -0.53 -0.44 (.16) (1.19) -1.91 0.16 0.80 -0.81** -6.62 (.20) (.12) -0.78 0.79** 4.31 -1.07** -10.11 (.18) (.11) -0.73 0.77** 3.13 -1.24** -11.50 (.25) (.11) 1.79 -0.12 -0.40 0.51** 2.66 (.29) (.19) -0.04 -0.20 -2.84 -1.08** -2.74 Equation 4: Estimate -0.55** (.09) -0.22* (.10) 2.93** (.11) -0.06 (.14) -0.32* (.12) -0.56** (.17) t value -58.90 -2.11 25.70 -0.39 -2.54 -3.27 Variable facturer (III) Model/Manufacturer (IV) Model/Manufacturer (V) Model/Manufacturer (VI) Model/Manufacturer (VII) Model/ManuFacturer (VIII) Model/ManuFacturer (IX) Model/ManuFacturer (X) Origin Territory (II) Territory (III) Territory (IV) Vehicle Body (II) Vehicle Body (II) Scale Tweedie Equation 1 Estimate (.44) -0.71** (.16) 0.47 (.42) -0.45 (.55) -1.23** (.26) -0.78** (.28) 0.05 (0.26) -0.69* (.29) -0.19 (.34) -1.10 (.69) -3.31** (.98) -1.91** (.68) 0.44** (.14) -0.36 (.24) 104.51 ZAIG Equation 2: =1- Equation 3: (Claim probability) (Claim size) t value Estimate t value Estimate t value (.19) (.39) -0.08 -1.25 -4.42 -0.52** -3.65 (. 07) (.14) 1.13 0.06 0.41 0.90** 2.86 (. 14) (.31) -0.83 -0.76 -1.65 0.25 0.48 (.46) (.52) 0.19 1.63 -4.82 -1.31** -5.56 (.12) (.24) -0.82 -3.53 0.15 1.45 -2.77 (.23) (.10) 0.20 -0.19 -0.80 0.22 1.31 (.23) (.17) 0.33 1.55 -2.41 -1.04** -3.85 (.22) (.27) -0.54 -0.08 -0.28 0.43** 3.58 (.28) (.12) -1.58 -1.32 -1.75 0.25 0.40 (.75) (.62) -0.23 -0.31 -3.38 -2.90** -2.82 (.73) (1.03) 0.20 0.32 -2.81 -1.90** -2.55 (.62) (.75) 0.11 0.86 3.19 0.18* 2.50 (.12) (.07) -1.51 -1.02** -4.63 0.46** 6.55 (.22) (.07) Equation 4: Estimate t value 0.35** (.08) -0.68** (.13) 4.50 Several explanatory variables were significantly related with dependent variables. Considering all vehicle´s age variables, we can say that there is a significant increase in the expected claim probability as the vehicle becomes older (0.79 and 0.77). On the other hand, the expected claim size decreases for older vehicles. This is in line with intuition, because old vehicles are less expensive to replace and there is also the fact that old vehicles are more attractive to be stolen. One might suggest that old vehicles are more attractive because there is a great auto-part replacement market that is flooded with parts of these old cars and old cars tend to be poorly maintained, which increases the probability of accidents. The variable model/manufacturer is related to claims probability and size. In general, the model/manufacturer is more related to claim probability than claim size. Noteworthy, there is no way to clearly identify whether claim size or probability is causing the significance of Tweedie´s coefficients. -5.06 The variable for origin of vehicles influences the claim size (Equation 3). This suggests, ceteris paribus, that imported vehicles tend to be related to great claim size. Overall, imported vehicles are of high purchase value. The variable territory is generally related to claim probability, however not related to claim size. One might suggest that claim size depends solely on the vehicle’s value and not on the territory of the claim. Based only on Tweedie model, there is no way to clearly identify whether claim size or probability is causing the significance of Tweedie´s coefficients for the territory variable. Vehicles’ body is related to claim size and probability. The claim probably decreases for luxury vehicles. However luxury cars lead to higher claim size followed by midsize cars. When taking Tweedie´s results, the difficulty to accurately predict claims is shown in the non significance of Tweedie’s coefficient for luxury cars (i.e. vehicle body 3). One might suggest that the not significant coefficient is due to a negative claim probability effect and a positive claim size effect, as found in ZAIG’s coefficients. 5. Concluding Remarks In this work we tackled a well-known problem in the insurance industry that is the proper pricing of an insurance policy. Employing the ZAIG method of estimation for claims and risks in the insurance industry, we found distinct factors that influence claim size and probability. Factors like territory, vehicles´ advanced age, origin and body influence distinctly claim size and probability. The distinct impact is not always present in Tweedie’s estimated model. The ZAIG method of estimation also allows for insurance companies to create a score system to predict claims, based on the logistic model. This score system intends to identify police holders who tend to be more risky. The estimated models may be employed to develop a strategy for premium pricing. Some limitations of this study may be recognized. First, the methods require a high computational effort which might preclude the use of large data sets. Second, there is room for developing suitable methods for longitudinal data analysis. Future work may consider the use of estimating equations techniques or multivariate ZAIG distributions. We concentrated our research on the auto insurance industry and specific vehicles’ variables. Studies may address other insurance industries and include customer related variables. 6. References Boland, Philip J. Statistical and probabilistic methods in actuarial science. Boca Raton: Chapman & Hall/CRC, 2007. 351 p. Doherty N. A. (1981). Is Rate Classification Profitable?. The Journal of Risk and Insurance, Vol. 48, (2), 286-295. Gan, N. (2000). General zero-inflated models and their applications. (Phd Thesis). North Carolina State University. Heller, G., Stasinopoulos, M. and Rigby, B. (2006) The zero-adjusted inverse Gaussian distribution as a model for insurance claims. Proceedings of the 21th International Workshop on Statistical Modelling. J. Hinde, J. Einbeck, and J. Newell (Eds.). 226-233, Ireland, Galway. Available in http://studweb.north.londonmet.ac.uk/~stasinom/papers/ZAIG.pdf. Accessed on 10/27/2008. King, G. and Zeng, L. (2001). Logistic regression in rare events data. Political Analysis, 9, 137-163. Meulbroek L. (2001). A Better Way to Manage Risk. Harvard Business Review, Feb 2001, 22-23. Jørgensen, B. (1987). Exponential dispersion models. Journal of the Royall Statistical Society, ser. B., 49, 127-162. Jong, Piet de; Heller, Gillian Z. Generalized linear models for insurance data. Cambridge: Cambridge University Press, 2008. 196 p. Jørgensen, B. (1997). Theory of Dispersion Models. Chapman & Hall, London. Jørgensen, B and de Souza, M.C.P. (1994). Fitting Tweedie´s compound Poisson model to insurance claims data. Scandinavian Actuarial Journal: 69-93. R Development Core Team (2008). R: A Language and Environment for Statistical Computing. Available in http://www.R-project.org. Accessed on 12/01/2008. Smyth, G.K. and Jørgensen, B. (2002). Fitting tweedie’s compound poisson model to insurance claims data: dispersion modeling. ASTIN Bulletin 32: 143-157. Stasinopoulos, D. M. Rigby R.A. (2007). Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, Vol. 23, (7), Dec 2007, Stasinopoulos, M. , Rigby, B. and Akantziliotou, C.(2008). gamlss: Generalized Additive Models for Location Scale and Shape. Available in http://www.gamlss.com/. Accessed on 12/05/08. Weisberg, H. I.; Tomberlin T. J. and Chatterjee S. (1984). Predicting Insurance Losses under Cross-Classification: A Comparison of Alternative Approaches. Journal of Business & Economic Statistics. Vol. 2, (2)m 170-178. Weisberg, H. I. and Tomberlin, T. J. (1982). A Statistical Perspective on Actuarial Methods for Estimating Pure Premiums from Cross-Classified Data. The Journal of Risk and Insurance, Vol. 49, (4), 539-563. Wüthrich, Mario V.; Merz, Michael. Stochastic claims reserving methods in insurance. West Sussex: John Wiley & Sons, Ltd, 2008. 424 p.