Christopher Boersma
Exam C
Contents

1  Random Variables
2  Basic Distributional Quantities
   2.1  Moments
   2.2  Percentiles
   2.3  Generating Functions
3  Classifying and Creating Distributions
   3.1  Introduction
   3.2  The Role of Parameters
   3.3  Tail Weight
   3.4  Value-at-Risk
   3.5  Creating New Distributions
   3.6  Selected Distributions and Their Relationships
   3.7  Linear Exponential Family
   3.8  Discrete Distributions
4  Frequency and Severity with Coverage Modifications
   4.1  Comments on Inflation
   4.2  Deductibles
   4.3  The Loss Elimination Ratio
   4.4  Policy Limits
   4.5  Co-insurance, Deductibles and Limits
   4.6  Claim Frequency
5  Aggregate Loss Models
   5.1  Model Choices
   5.2  Compound Model for Aggregate Claims
   5.3  The Recursive Method
   5.4  The Impact of Individual Policy Modifications
6  Review of Mathematical Statistics
   6.1  Point Estimation
   6.2  Interval Estimation
   6.3  Tests of Hypothesis
   6.4  Log-Transformed Confidence Interval
7  Estimation for Complete Data
   7.1  The Empirical Distribution
   7.2  Nelson-Åalen Estimate
   7.3  Kaplan-Meier Estimator
   7.4  Empirical Distribution for Grouped Data
   7.5  Estimation for Large Data
8  Kernel Density Models
9  Parameter Estimation
   9.1  Method of Moments
   9.2  Maximum Likelihood
   9.3  Variance and Interval Estimation
   9.4  Non-normal Confidence Intervals
   9.5  Bayesian Estimation
10 Model Selection
   10.1  Representations
   10.2  Graphical Comparison
   10.3  Hypothesis Tests
   10.4  Selecting a Model
11 Full Credibility
   11.1  Introduction
   11.2  Compound Distribution
   11.3  Poisson
   11.4  Partial Credibility
12 Bayesian Estimation
   12.1  Introduction
   12.2  Definition of Bayes' Theorem
   12.3  Conjugate Prior: Common Examples
13 Linear Credibility
   13.1  Introduction
   13.2  Bühlmann Credibility
   13.3  Types of Bühlmann Problems
   13.4  Empirical Bayes Parameter Estimation
14 Simulation
   14.1  Basics
   14.2  Simulation for Some Common Models
   14.3  Estimating Mean or a Probability
   14.4  Compound Models
   14.5  Bootstrap Method
A  Key Equations to Memorize
B  Calculator Tips
   B.1  Table
   B.2  Data or Mini-Spread Sheet
C  Calculator Integration
D  Exact Credibility
E  Beta Distribution
   E.1  beta (b = 1)
   E.2  beta (b = 1, a = 1)
1 Random Variables
Definition 1.1. The cumulative distribution function (AKA distribution function, cdf), usually denoted F_X(x) or F(x), is defined as

    F_X(x) = Pr(X ≤ x)    (1.1)

Definition 1.2. The survival function, usually denoted S_X(x) or S(x), for a random variable X is the probability that X is greater than a given number. That is,

    S_X(x) = Pr(X > x) = 1 − F_X(x)    (1.2)

Definition 1.3. The probability density function, also called the density function, usually denoted f_X(x) or f(x), is the derivative of the distribution function (Definition 1.1) or, equivalently, the negative of the derivative of the survival function (Definition 1.2). That is,

    f(x) = F′(x) = −S′(x)    (1.3)

The density function is defined only at those points where the derivative exists. The abbreviation pdf is often used.

Definition 1.4. The probability function, also called the probability mass function, usually denoted p_X(x) or p(x), gives the probability at each point where it is not 0. The formal definition is

    p_X(x) = Pr(X = x)    (1.4)

Definition 1.5. The hazard rate (AKA force of mortality, failure rate), usually denoted h_X(x) or h(x), is defined as

    h_X(x) = f_X(x) / S_X(x)    (1.5)
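As a numerical sanity check of these definitions, the sketch below uses an exponential distribution with rate λ = 0.5 (an illustrative choice, not from the text) and verifies that S = 1 − F, f = −S′ in spirit, and that h(x) = f(x)/S(x) is the constant λ for the exponential:

```python
import math

lam = 0.5  # rate of an exponential distribution (illustrative choice)

def cdf(x):       # F(x) = 1 − e^{−λx}
    return 1.0 - math.exp(-lam * x)

def survival(x):  # S(x) = 1 − F(x), equation (1.2)
    return 1.0 - cdf(x)

def pdf(x):       # f(x) = F′(x) = λ e^{−λx}
    return lam * math.exp(-lam * x)

def hazard(x):    # h(x) = f(x)/S(x), equation (1.5)
    return pdf(x) / survival(x)

# For the exponential, the hazard rate is constant and equal to λ.
for x in [0.1, 1.0, 7.3]:
    assert abs(hazard(x) - lam) < 1e-12
```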
2 Basic Distributional Quantities

Definition 2.1. An observation is truncated at d if, when it is below d, it is not recorded, but when it is above d it is recorded at its observed value.

Definition 2.2. An observation is censored at u if, when it is above u, it is recorded as being equal to u, but when it is below u it is recorded at its observed value.
2.1 Moments

Definition 2.3. The kth raw moment of a random variable is the expected value of the kth power of the variable, provided it exists. It is denoted by E(X^k) or by µ′_k. The first raw moment is called the mean of the random variable and is usually denoted by µ.

The formulas are

    µ′_k = E(X^k) = ∫_{−∞}^{∞} x^k f(x) dx    (2.1)
                  = Σ_j x_j^k p(x_j)    (2.2)

For a nonnegative random variable, the mean can also be computed from the survival function:

    E(X) = ∫_0^{∞} S(x) dx    (2.3)
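Equation (2.3) can be checked numerically. The sketch below (using an exponential with mean θ = 2, an illustrative choice) integrates the survival function with the trapezoidal rule and recovers the mean:

```python
import math

theta = 2.0  # exponential mean (illustrative choice)
S = lambda x: math.exp(-x / theta)   # survival function of the exponential

# trapezoidal integration of S over [0, 60]; the tail beyond 60 is negligible
n, upper = 100000, 60.0
h = upper / n
integral = sum(0.5 * h * (S(i * h) + S((i + 1) * h)) for i in range(n))

# E(X) = ∫_0^∞ S(x) dx, equation (2.3)
assert abs(integral - theta) < 1e-4
```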
Definition 2.4. The empirical model is a discrete distribution, based on a sample of size n, which assigns probability 1/n to each data point.

Definition 2.5. The kth central moment of a random variable is the expected value of the kth power of the deviation of the variable from its mean. It is denoted by E[(X − µ)^k] or by µ_k. The second central moment is usually called the variance and denoted σ², and its square root, σ, is called the standard deviation. The ratio of the standard deviation to the mean, σ/µ, is called the coefficient of variation. The ratio of the third central moment to the cube of the standard deviation, µ_3/σ³, is called the skewness. The ratio of the fourth central moment to the fourth power of the standard deviation, µ_4/σ⁴, is called the kurtosis. The central moments equal

    µ_k = E[(X − µ)^k] = ∫_{−∞}^{∞} (x − µ)^k f(x) dx    (2.4)
                       = Σ_j (x_j − µ)^k p(x_j)    (2.5)

Theorem 2.1. Let X have a symmetric distribution; then its skewness (µ_3/σ³) is 0.

Definition 2.6. The limited loss variable (AKA right censored variable) is

    Y = X ∧ u = { X,  X < u
                  u,  X ≥ u

Its expected value, E[X ∧ u], is called the limited expected value.
Its moments are

    E[(X ∧ u)^k] = ∫_0^u x^k f(x) dx + u^k [1 − F(u)]    (2.6)

and, for k = 1,

    E[X ∧ u] = ∫_0^u S(x) dx    (2.7)
Definition 2.7. The left censored and shifted variable is

    Y = (X − d)_+ = { 0,      X < d
                      X − d,  X ≥ d

Its expected value is

    E[(X − d)_+] = E(X) − E(X ∧ d)    (2.8)
                 = ∫_d^{∞} S(x) dx    (2.9)

Definition 2.8. For a given value of d with Pr(X > d) > 0, the excess loss variable [AKA mean residual life function, complete expectation of life (e_d), or left truncated (at d) and shifted variable] is Y = X − d given that X > d. Its expected value,

    e_X(d) = e(d) = E(Y) = E(X − d | X > d),    (2.10)

is called the mean excess loss function, and

    E(X − d | X > d) = [E(X) − E(X ∧ d)] / [1 − F(d)]    (2.11)
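Equation (2.11) can be checked numerically. For an exponential severity (mean θ, an illustrative choice), E[X ∧ d] = θ(1 − e^{−d/θ}), and memorylessness forces the mean excess loss to be θ at every deductible d:

```python
import math

theta = 3.0  # exponential mean (illustrative choice)
F = lambda x: 1.0 - math.exp(-x / theta)
# closed form of E[X ∧ d] for the exponential
lim_ev = lambda d: theta * (1.0 - math.exp(-d / theta))

def mean_excess(d):
    # e(d) = [E(X) − E(X ∧ d)] / [1 − F(d)], equation (2.11)
    return (theta - lim_ev(d)) / (1.0 - F(d))

# Memorylessness: the mean excess loss of an exponential is θ for every d.
for d in [0.0, 1.0, 5.0]:
    assert abs(mean_excess(d) - theta) < 1e-9
```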
2.2 Percentiles

Definition 2.9. The 100pth percentile of a random variable is any value π_p such that F(π_p−) ≤ p ≤ F(π_p). The 50th percentile, π_{0.5}, is called the median.

2.3 Generating Functions

Consider the sum

    S_k = X_1 + X_2 + · · · + X_k    (2.12)

Theorem 2.2. For a random variable following equation (2.12), E(S_k) = E(X_1) + · · · + E(X_k). Also, if X_1, · · · , X_k are independent, Var(S_k) = Var(X_1) + · · · + Var(X_k). If the random variables X_1, X_2, . . . are independent and their first two moments meet certain conditions, then as k → ∞,

    [S_k − E(S_k)] / √Var(S_k)

converges to a normal distribution with mean 0 and variance 1.

Definition 2.10. For a random variable X, the moment generating function is

    M_X(t) = E(e^{tX})    (2.13)

for all t for which the expected value exists. The probability generating function is

    P_X(z) = E(z^X) = Σ_{k=0}^{∞} p_k z^k    (2.14)

for all z for which the expectation exists.
Some useful properties of the generating functions:

1. M_X(t) = P_X(e^t) and P_X(z) = M_X(ln z).
2. E(X^n) = M_X^(n)(0).
3. P_X^(m)(0) = m! · p_m.
4. P_X^(m)(1) = E[(d^m/dz^m) z^X]|_{z=1} = E[X(X − 1) · · · (X − m + 1)] = E[X!/(X − m)!].
5. P′_X(1) = E(X); P″_X(1) = E[X(X − 1)] = E[X²] − P′_X(1).

Theorem 2.3. Let S_k = X_1 + · · · + X_k, where the random variables in the sum are independent. Then

    M_{S_k}(t) = Π_{j=1}^{k} M_{X_j}(t)    (2.15)

and

    P_{S_k}(z) = Π_{j=1}^{k} P_{X_j}(z)    (2.16)

provided all the component mgfs and pgfs exist.
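Properties 5 above can be checked numerically with finite differences. The sketch below uses the Poisson pgf P(z) = e^{λ(z−1)} (λ chosen for illustration) and verifies P′(1) = λ and P″(1) = λ²:

```python
import math

lam = 1.7                                  # Poisson mean (illustrative choice)
P = lambda z: math.exp(lam * (z - 1.0))    # Poisson pgf

# central finite differences for P′(1) and P″(1)
h = 1e-5
p1 = (P(1 + h) - P(1 - h)) / (2 * h)
p2 = (P(1 + h) - 2 * P(1) + P(1 - h)) / h**2

assert abs(p1 - lam) < 1e-6        # P′(1) = E(N) = λ
assert abs(p2 - lam**2) < 1e-4     # P″(1) = E[N(N−1)] = λ²
```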
3 Classifying and Creating Distributions

3.1 Introduction

3.2 The Role of Parameters

Definition 3.1. A parametric distribution is a set of distribution functions, each member of which is determined by specifying one or more values called parameters. The number of parameters is fixed and finite.

Definition 3.2. A parametric distribution is a scale distribution if, when a random variable from that set of distributions is multiplied by a positive constant, the resulting random variable is also in that set of distributions.

Definition 3.3. For random variables with nonnegative support, a scale parameter is a parameter of a scale distribution that meets two conditions. First, when a member of the scale distribution is multiplied by a positive constant, the scale parameter is multiplied by the same constant. Second, when a member of the scale distribution is multiplied by a positive constant, all other parameters are unchanged.
Definition 3.4. A parametric distribution family is a set of parametric distributions that are related in some
meaningful way.
Definition 3.5. A random variable Y is a k-point mixture of the random variables X1 , X2 , · · · , Xk if its cdf is
given by
FY (y) = a1 FX1 (y) + a2 FX2 (y) + · · · + ak FXk (y)
where all aj > 0 and a1 + a2 + · · · + ak = 1.
Definition 3.6. A variable-component mixture distribution has a distribution function that can be written as

    F(x) = Σ_{j=1}^{K} a_j F_j(x),    Σ_{j=1}^{K} a_j = 1,  a_j > 0    (3.1)

Definition 3.7. A data-dependent distribution is at least as complex as the data or knowledge that produced it, and the number of "parameters" increases as the number of data points or amount of knowledge increases.
3.3 Tail Weight

For the purposes of this section, define two density functions f_1(x) and f_2(x), where the tail of f_1(x) is heavier than the tail of f_2(x).

3.3.1 Existence of Moments

If the kth moment exists, then ∫_0^{∞} x^k f(x) dx converges. The heavier the tail, the less likely the expression is to converge. Thus, existence of all positive moments indicates a light right tail, while existence of only the positive moments up to a certain value indicates a heavy tail.
3.3.2 Limiting Ratios

An indication that one distribution has a heavier tail than another is that the ratio of the two survival functions diverges to infinity:

    lim_{x→∞} S_1(x)/S_2(x) = lim_{x→∞} S′_1(x)/S′_2(x) = lim_{x→∞} f_1(x)/f_2(x) = ∞    (3.2)

(the middle step is L'Hôpital's rule).

3.3.3 Hazard Rate

The behavior of the hazard rate function h(x) = f(x)/S(x) also reveals information about the tail of the distribution: in the limit as x → ∞,

    (d/dx) h_1(x) < (d/dx) h_2(x)    (3.3)
3.3.4 Equilibrium Distribution

Definition 3.8. Assume X is a continuous distribution with density f(x), survival function S(x), and mean E(X). Then the equilibrium distribution is

    f_e(x) = S(x)/E(X),    x ≥ 0

Theorem 3.1. Let X have an equilibrium distribution f_e(x) following Definition 3.8. Then the corresponding survival function is

    S_e(x) = ∫_x^{∞} f_e(t) dt = [∫_x^{∞} S(t) dt] / E(X),    x ≥ 0

and hazard rate

    h_e(x) = f_e(x)/S_e(x) = S(x) / ∫_x^{∞} S(t) dt = 1/e(x)

Theorem 3.1 provides an alternative expression for the survival function:

    S(x) = [E(X)/e(x)] · exp{ −∫_0^x [1/e(t)] dt }

3.4 Value-at-Risk
Definition 3.9. A coherent risk measure is a risk measure ρ(X) that has the following four properties for any two loss random variables X and Y:

1. Subadditivity: ρ(X + Y) ≤ ρ(X) + ρ(Y).
2. Monotonicity: If X ≤ Y for all possible outcomes, then ρ(X) ≤ ρ(Y).
3. Positive homogeneity: For any positive constant c, ρ(cX) = cρ(X).
4. Translation invariance: For any positive constant c, ρ(X + c) = ρ(X) + c.

Definition 3.10. Let X denote a loss random variable. The Value-at-Risk of X at the 100p% level, denoted VaR_p(X) or π_p, is the 100p percentile (or quantile) of the distribution of X. For continuous distributions it is the value of π_p satisfying

    Pr(X > π_p) = 1 − p    (3.4)

VaR is not "coherent" because it fails the subadditivity requirement in some cases.

Definition 3.11. Let X denote a loss random variable. The Tail-Value-at-Risk of X at the 100p% security level, denoted TVaR_p(X), is the expected loss given that the loss exceeds the 100p percentile (or quantile) of the distribution of X. For continuous distributions it can be expressed as

    TVaR_p(X) = E(X | X > π_p) = [∫_{π_p}^{∞} x f(x) dx] / [1 − F(π_p)]    (3.5)
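For the exponential distribution (mean θ, an illustrative choice), equation (3.4) gives VaR_p = −θ ln(1 − p), and memorylessness gives the closed form TVaR_p = VaR_p + θ. The sketch below checks the TVaR integral in (3.5) against that closed form:

```python
import math

theta, p = 10.0, 0.95   # exponential mean and security level (illustrative)

# VaR_p solves Pr(X > π_p) = 1 − p, i.e. exp(−π_p/θ) = 1 − p
var_p = -theta * math.log(1.0 - p)

# Memorylessness of the exponential gives TVaR_p = VaR_p + θ.
tvar_analytic = var_p + theta

# Numeric TVaR via trapezoidal integration of x f(x) over [π_p, upper]
f = lambda x: math.exp(-x / theta) / theta
n, upper = 200000, var_p + 40 * theta
h = (upper - var_p) / n
g = lambda x: x * f(x)
num = sum(0.5 * h * (g(var_p + i * h) + g(var_p + (i + 1) * h)) for i in range(n))
tvar_numeric = num / (1.0 - p)       # divide by 1 − F(π_p), equation (3.5)

assert abs(tvar_numeric - tvar_analytic) < 1e-3
```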
3.4.1 Mean Residual Life Patterns

The mean residual life, E(X − d | X > d), also gives information about tail weight. If the mean residual life function is increasing in d, then at large values the expected outcome is much larger and probability is moved to the right, indicating a heavier tail than a model where the mean residual life function is decreasing or increasing at a slower rate.
3.5 Creating New Distributions

3.5.1 Multiplication by a Constant

Theorem 3.2. Let X be a continuous random variable with pdf f_X(x) and cdf F_X(x). Let Y = θX with θ > 0. Then

    F_Y(y) = F_X(y/θ)
    f_Y(y) = (1/θ) f_X(y/θ)    (3.6)

3.5.2 Raising to a Power

Theorem 3.3. Let X be a continuous random variable with pdf f_X(x) and cdf F_X(x) with F_X(0) = 0. Let Y = X^{1/τ}. Then, if τ > 0,

    F_Y(y) = F_X(y^τ)
    f_Y(y) = τ y^{τ−1} f_X(y^τ)    (3.7)

Definition 3.12. When raising a distribution to a power, if τ > 0 the resulting distribution is called transformed, if τ = −1 it is called inverse, and if τ < 0 (but not −1) it is called inverse transformed.

Definition 3.13. The incomplete gamma function with parameter α > 0 is denoted and defined by

    Γ(α; x) = [1/Γ(α)] ∫_0^x t^{α−1} e^{−t} dt

while the gamma function is denoted and defined by

    Γ(α) = ∫_0^{∞} t^{α−1} e^{−t} dt    (3.8)

In addition, Γ(α) = (α − 1)Γ(α − 1) and, for positive integer values of n, Γ(n) = (n − 1)!.
3.5.3 Exponentiation

Theorem 3.4. Let X be a continuous random variable with pdf f_X(x) and cdf F_X(x) with f_X(x) > 0 for all real x. Let Y = e^X. Then, for y > 0,

    F_Y(y) = F_X(ln y)
    f_Y(y) = (1/y) f_X(ln y)    (3.9)

3.5.4 General

For a transformation whose inverse x = g(y) is increasing,

    F_Y(y) = F_X[g(y)]    (3.10)
    f_Y(y) = |g′(y)| f_X[g(y)]    (3.11)
3.5.5 Mixing

Theorem 3.5. Let X have pdf f_{X|Λ}(x|λ) and cdf F_{X|Λ}(x|λ), where λ is a parameter of X. While X may have other parameters, they are not relevant. Let λ be a realization of the random variable Λ with pdf f_Λ(λ). The unconditional pdf of X is

    f_X(x) = ∫ f_{X|Λ}(x|λ) f_Λ(λ) dλ,

where the integral is taken over all values of λ with positive probability. The resulting distribution is a mixture distribution. The distribution function can be determined from

    F_X(x) = ∫ F_{X|Λ}(x|λ) f_Λ(λ) dλ

Moments of the mixture distribution can be found from

    E(X^k) = E[E(X^k | Λ)]    (3.12)

and, in particular,

    Var(X) = E[Var(X | Λ)] + Var[E(X | Λ)]    (3.13)

Examples:

Conditional Distribution Y|Λ    Distribution of Λ        Unconditional Distribution of Y
Y = Poisson(Λ)                  Λ = Gamma(α, θ)          Neg.Bin(r = α, β = θ)
Y = Exponential(Λ)              Λ = Inv.Gamma(α, θ)      Pareto(α, θ)
Y = Inv.Exponential(Λ)          Λ = Gamma(α, θ)          Inv.Pareto(τ = α, θ)
Y = Normal(Λ, σ_c²)             Λ = Normal(µ, σ_d²)      Normal(µ, σ_c² + σ_d²)
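The first row of the table can be verified numerically: integrating the Poisson pmf against a gamma density (shape α, scale θ, values chosen for illustration) reproduces the negative binomial pmf with r = α, β = θ:

```python
import math

alpha, theta = 2.5, 1.5  # gamma shape and scale (illustrative choice)

def gamma_pdf(lam):
    # gamma density with shape alpha and scale theta
    return lam**(alpha - 1) * math.exp(-lam / theta) / (math.gamma(alpha) * theta**alpha)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

def negbin_pmf(k):
    # negative binomial with r = alpha, beta = theta
    coef = math.gamma(k + alpha) / (math.gamma(alpha) * math.factorial(k))
    return coef * (1 / (1 + theta))**alpha * (theta / (1 + theta))**k

def mixture_pmf(k, n=50000, upper=60.0):
    # trapezoidal integration of Poisson(k | λ) · gamma(λ) over λ in [0, upper]
    h = upper / n
    g = lambda lam: poisson_pmf(k, lam) * gamma_pdf(lam)
    return sum(0.5 * h * (g(i * h) + g((i + 1) * h)) for i in range(n))

for k in range(5):
    assert abs(mixture_pmf(k) - negbin_pmf(k)) < 1e-5
```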
3.5.6 Frailty Models

Definition 3.14. A frailty model has a hazard rate of

    h_{X|Λ}(x|λ) = λ a(x)    (3.14)

where a(x) is a known (or specified) function and Λ is a frailty random variable.

Theorem 3.6. Let X follow a frailty model and let A(x) = ∫_0^x a(t) dt. Then the marginal survival function is

    S_X(x) = E[e^{−ΛA(x)}] = M_Λ[−A(x)]    (3.15)

The type of mixture to be used determines the choice of a(x) and hence A(x). The most important subclass of the frailty models is the class of exponential mixtures, with a(x) = 1 and A(x) = x, so that S_{X|Λ}(x|λ) = e^{−λx}, x ≥ 0. Other useful mixtures include Weibull mixtures, with a(x) = γx^{γ−1} and A(x) = x^γ.
3.5.7 Splicing

Definition 3.15. A k-component spliced distribution has a density function that can be expressed as follows:

    f_X(x) = { a_1 f_1(x),  c_0 < x < c_1
               a_2 f_2(x),  c_1 < x < c_2
               ...
               a_k f_k(x),  c_{k−1} < x < c_k

For j = 1, · · · , k, each a_j > 0 and each f_j(x) must be a legitimate density function with all probability on the interval (c_{j−1}, c_j). Also, a_1 + · · · + a_k = 1.
3.6 Selected Distributions and Their Relationships

3.6.1 Two Parametric Families

[Figure 1 and Figure 2: diagrams of the two parametric distribution families and their relationships.]
3.6.2 Limiting Distributions

Example 3.1. Show that the transformed gamma distribution,

    f(x) = τ(x/θ)^{τα} e^{−(x/θ)^τ} / [x Γ(α)],

is a limiting case of the transformed beta distribution,

    f(x) = Γ(α+τ) γ (x/θ)^{γτ} / { Γ(α) Γ(τ) x [1 + (x/θ)^γ]^{α+τ} },

as θ → ∞, α → ∞, and θ/α^{1/γ} → ξ, a constant. All steps can be found in the text. This example uses two common limits:

    lim_{α→∞} e^{−α} α^{α−0.5} (2π)^{0.5} / Γ(α) = 1    (3.16)

    lim_{a→∞} (1 + x/a)^{a+b} = e^x    (3.17)

Note that in order to make ξ a constant we need θ = ξα^{1/γ}. Substituting this and applying (3.16) to the ratio Γ(α+τ)/Γ(α) gives

    f(x) = Γ(α+τ) γ x^{γτ−1} / { Γ(α) Γ(τ) θ^{γτ} (1 + x^γ θ^{−γ})^{α+τ} }
         = e^{−τ} (1 + τ/α)^{α+τ−0.5} γ x^{γτ−1} / { Γ(τ) ξ^{γτ} [1 + (x/ξ)^γ/α]^{α+τ} }

The two limits

    lim_{α→∞} (1 + τ/α)^{α+τ−0.5} = e^τ
    lim_{α→∞} [1 + (x/ξ)^γ/α]^{α+τ} = e^{(x/ξ)^γ}

can be substituted to yield

    lim_{α→∞} f(x) = γ x^{γτ−1} e^{−(x/ξ)^γ} / [Γ(τ) ξ^{γτ}]

And this is the transformed gamma distribution (with the parameter correspondence α → τ, τ → γ, θ → ξ). A similar argument shows that the inverse transformed gamma distribution is obtained by letting τ go to infinity instead of α (exercise in the text).
3.7 Linear Exponential Family

Definition 3.16. A random variable X (discrete or continuous) has a distribution from the linear exponential family if its pdf may be parameterized in terms of a parameter θ and expressed as

    f(x; θ) = p(x) e^{r(θ)x} / q(θ)    (3.18)

The function p(x) depends only on x (not on θ), and the function q(θ) is a normalizing constant. Also, the support of the random variable must not depend on θ. The parameter r(θ) is called the canonical parameter of the distribution.

Theorem 3.7. Let f(x; θ) be a member of the linear exponential family as represented in Definition 3.16. Then the mean of f(x; θ) is

    E(X) = µ(θ) = q′(θ) / [r′(θ) q(θ)]    (3.19)

and the variance is

    Var(X) = µ′(θ) / r′(θ)    (3.20)

The proof can be found in the text.
3.8 Discrete Distributions

3.8.1 Poisson Distribution

Definition 3.17. The probability function of the Poisson distribution is given by

    p_k = e^{−λ} λ^k / k!    (3.21)

It has a pgf of P(z) = e^{λ(z−1)}, from which

    E(N) = P′(1) = λ
    E[N(N − 1)] = P″(1) = λ²
    Var(N) = E[N²] − (E[N])² = E[N(N − 1)] + E(N) − [E(N)]² = λ² + λ − λ² = λ

Theorem 3.8. Let N_1, · · · , N_n be independent Poisson variables with parameters λ_1, · · · , λ_n. Then N = N_1 + · · · + N_n has a Poisson distribution with parameter λ_1 + · · · + λ_n.

Theorem 3.9. Suppose that the number of events N is a Poisson random variable with mean λ. Further suppose that each event can be classified into one of m types with probabilities p_1, · · · , p_m, independent of all other events. Then the numbers of events N_1, · · · , N_m corresponding to event types 1, · · · , m, respectively, are mutually independent Poisson random variables with means λp_1, · · · , λp_m, respectively.
3.8.2 The Negative Binomial Distribution

Definition 3.18. The probability function of the negative binomial distribution is given by

    p_k = C(k + r − 1, k) [1/(1 + β)]^r [β/(1 + β)]^k    (3.22)

It has a pgf of P(z) = [1 − β(z − 1)]^{−r}, and

    E(N) = rβ
    Var(N) = rβ(1 + β)

Definition 3.19. The geometric distribution is the special case of the negative binomial distribution with r = 1. Namely,

    p_k = β^k / (1 + β)^{k+1}    (3.23)

Definition 3.20. A distribution is memoryless if

    Pr(X > x + y | X > x) = Pr(X > y)    (3.24)

The geometric and exponential distributions are both examples of memoryless distributions.
3.8.3 The Binomial Distribution

Definition 3.21. The probability function of the binomial distribution is given by

    p_k = C(m, k) q^k (1 − q)^{m−k}    (3.25)

It has a pgf of P(z) = [1 + q(z − 1)]^m, and

    E(N) = mq
    Var(N) = mq(1 − q)

3.8.4 The (a,b,0) Class

Definition 3.22. Let p_k be the pf of a discrete random variable. It is a member of the (a,b,0) class of distributions provided that there exist constants a and b such that

    p_k / p_{k−1} = a + b/k,    k = 1, 2, 3, · · ·    (3.26)

Three distributions are part of this class: Poisson, binomial, and negative binomial (which includes the geometric as well). The expression can be rewritten as

    k · p_k / p_{k−1} = ak + b    (3.27)

Distribution           a               b                    p_0
Poisson                0               λ                    e^{−λ}
Binomial               −q/(1−q)        (m+1) q/(1−q)        (1 − q)^m
Negative binomial      β/(1+β)         (r−1) β/(1+β)        (1 + β)^{−r}
Geometric (r = 1)      β/(1+β)         0                    (1 + β)^{−1}
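The recursion (3.26) and the a, b values in the table can be checked directly. The sketch below does so for the Poisson and negative binomial rows (parameter values chosen for illustration):

```python
import math

# Poisson row: a = 0, b = λ
lam = 2.0
poisson = lambda k: math.exp(-lam) * lam**k / math.factorial(k)

# Negative binomial row: a = β/(1+β), b = (r−1)β/(1+β)
r, beta = 3.0, 0.8
def negbin(k):
    coef = math.gamma(k + r) / (math.gamma(r) * math.factorial(k))
    return coef * (1 / (1 + beta))**r * (beta / (1 + beta))**k

a, b = beta / (1 + beta), (r - 1) * beta / (1 + beta)
for k in range(1, 10):
    # p_k / p_{k−1} = a + b/k, equation (3.26)
    assert abs(poisson(k) / poisson(k - 1) - (0 + lam / k)) < 1e-12
    assert abs(negbin(k) / negbin(k - 1) - (a + b / k)) < 1e-12
```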
3.8.5 Truncation and Modification at Zero

Definition 3.23. Let p_k be the pf of a discrete random variable. It is a member of the (a,b,1) class of distributions provided that there exist constants a and b such that

    p_k / p_{k−1} = a + b/k,    k = 2, 3, . . .    (3.28)

with

    p_0 = 1 − Σ_{k=1}^{∞} p_k

Definition 3.24. Let p_k be the pf of a discrete random variable that is a member of the (a,b,1) class. It is called zero-truncated if p_0 = 0. (AKA truncated)

Definition 3.25. Let p_k be the pf of a discrete random variable that is a member of the (a,b,1) class. It is called zero-modified if p_0 > 0; it is a mixture of an (a,b,0) class distribution and a degenerate distribution with p_0 = 1. (AKA truncated with zeros)
4 Frequency and Severity with Coverage Modifications

4.1 Comments on Inflation

"In economics, inflation is a rise in the general level of prices of goods and services in an economy over a period of time" [Wikipedia, August 23, 2009].

Inflation makes everything more expensive, which in turn makes any fixed-price item relatively cheaper. For an insurance company, deductibles and limits rarely change from year to year, but inflation impacts them because, in real terms, they become smaller every year. Inflation also makes all claims more expensive, as costs have increased. Throughout this section let c* = c/(1 + r), where r is the inflation rate. Two useful properties of the expectation will be used:

1. E[(1 + r)X] = (1 + r)E[X]
2. E[(1 + r)X ∧ c] = (1 + r)E[X ∧ c*]

4.2 Deductibles

The per-loss variable is

    Y^L = { 0,      X ≤ d
            X − d,  X > d

while the per-payment variable is

    Y^P = { undefined,  X ≤ d
            X − d,      X > d

Definition 4.1. An ordinary deductible modifies a random variable into either the excess loss (Def. 2.8) or the left censored and shifted variable (Def. 2.7). Which one depends on whether the result of applying the deductible is to be per payment or per loss, respectively.

Definition 4.2. If f meets certain conditions [not sure how to define them; linear?], then

    f(Y^P) = f(Y^L) / S_X(d)    (4.1)

For example,

    E(Y^P) = E(Y^L) / S_X(d)

However,

    Var(Y^P) ≠ Var(Y^L) / S_X(d);

instead,

    Var(Y^P) = E[(Y^L)²]/S_X(d) − [E(Y^L)/S_X(d)]²    (4.2)

Definition 4.3. A franchise deductible modifies the ordinary deductible by adding the deductible whenever there is a positive amount paid.

Theorem 4.1. For an ordinary deductible, the expected cost per loss is

    E(X) − E(X ∧ d)

For a franchise deductible, the expected cost per loss is E(X) − E(X ∧ d) + d[1 − F(d)].
4.3 The Loss Elimination Ratio

Definition 4.4. The loss elimination ratio is the ratio of the decrease in the expected payment with an ordinary deductible to the expected payment without the deductible:

    {E(X) − [E(X) − E(X ∧ d)]} / E(X) = E(X ∧ d) / E(X)    (4.3)
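The ratio (4.3) can be computed numerically by integrating the survival function, using E[X ∧ d] = ∫_0^d S(x) dx from equation (2.7). The sketch below does this for an exponential severity with mean θ (values chosen for illustration), where the closed form is LER = 1 − e^{−d/θ}:

```python
import math

theta, d = 1000.0, 250.0              # exponential mean and deductible (illustrative)
S = lambda x: math.exp(-x / theta)

# E[X ∧ d] = ∫_0^d S(x) dx via the trapezoidal rule
n = 100000
h = d / n
ex_lim = sum(0.5 * h * (S(i * h) + S((i + 1) * h)) for i in range(n))

ler = ex_lim / theta                       # E(X ∧ d)/E(X), equation (4.3)
closed_form = 1.0 - math.exp(-d / theta)   # exponential closed form
assert abs(ler - closed_form) < 1e-9
assert 0.0 < ler < 1.0   # a deductible eliminates part, not all, of the loss
```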
Theorem 4.2. For an ordinary deductible of d, after uniform inflation of 1 + r, the expected cost per loss is

    (1 + r){E(X) − E[X ∧ d*]}    (4.4)

4.4 Policy Limits

Theorem 4.3. For a policy limit of u, after uniform inflation of 1 + r, the expected cost is

    (1 + r)E[X ∧ u*]    (4.5)

4.5 Co-insurance, Deductibles and Limits

If co-insurance is the only modification, the loss variable X becomes the payment variable Y = αX. This is discussed in Section 3.5.1. When all four items covered in this chapter are present (deductible, limit, coinsurance, and inflation), we create the following per-loss random variable:

    Y^L = { 0,                 X < d*
            α[(1 + r)X − d],   d* ≤ X < u*
            α(u − d),          u* ≤ X

Theorem 4.4. For the per-loss variable,

    E(Y^L) = α(1 + r)[E(X ∧ u*) − E(X ∧ d*)]    (4.6)

Theorem 4.5. For the per-loss variable,

    E[(Y^L)²] = α²(1 + r)²{E[(X ∧ u*)²] − E[(X ∧ d*)²] − 2d* E(X ∧ u*) + 2d* E(X ∧ d*)}

4.6 Claim Frequency

The results can be further generalized to an increase or decrease in the deductible. Let N^d be the frequency when the deductible is d and let N^{d*} be the frequency when the deductible is d*. Let v = [1 − F_X(d*)] / [1 − F_X(d)]. As long as d* > d, we will have v < 1 and the formulas will lead to a legitimate distribution for N^{d*}. This includes the special case of d = 0. If d* < d, then v > 1 and there is no assurance that a legitimate distribution will result. This includes the special case d* = 0 (removal of the deductible).
5 Aggregate Loss Models

Definition 5.1. The collective risk model has the representation in (2.12) with the X_j's being independent and identically distributed random variables, unless otherwise specified. More formally, the independence assumptions are:

1. Conditional on N = n, the random variables X_1, X_2, . . . , X_n are i.i.d. random variables.
2. Conditional on N = n, the common distribution of the random variables X_1, X_2, . . . , X_n does not depend on n.
3. The distribution of N does not depend in any way on the values of X_1, X_2, . . .

Definition 5.2. The individual risk model represents the aggregate loss as a sum, S = X_1 + . . . + X_n, of a fixed number, n, of insurance contracts. The loss amounts for the n contracts are (X_1, . . . , X_n), where the X_j's are assumed to be independent but are not assumed to be identically distributed. The distribution of the X_j's usually has a probability mass at zero, corresponding to the probability of no loss or payment.

The collective risk model is used for combining identical risks, whereas the individual risk model is used for combining non-identical risks, such as medical claims for different people (different ages, etc.). These models have a number of advantages, which can be found in the text. However, a more accurate and flexible model can be constructed by examining frequency and severity separately. We will refer to N as the claim count random variable and to its distribution as the claim count distribution (AKA: number of claims, claims, frequency distribution). The X_j's are the individual or single-loss random variables. The modifier individual or single will be dropped when the reference is clear. Strictly speaking, the X_j's are payments because they represent a real cash transaction. However, the term loss is more customary, and we will continue with it (AKA: severity). Finally, S is the aggregate loss random variable or the total loss random variable.
5.1
Model Choices
5.2
Compound model for Aggregate claims
5.2.1
Compound Model
The text is extremely wordy here; there are a lot of details that may be important, however my Actex manual
simplifies the first three sections of chapter 6 into on concise 4 page summary.
Definition 5.3. S = X1 + X2 + . . . + XN has a compound distribution if it is a combination of N representing
the number of losses and the Xi ’s representing the size of the loss j.
I will use the following interchangeably:
µx = E(X),  σx^2 = Var(X)
µn = E(N ),  σn^2 = Var(N )
µn = λ and σn^2 = λ when N is Poisson.
Some common properties of the compound distribution are:
E[S] = E[E(S|N )] = E(N )E(X) = µn µx
Var[E(S|N )] = Var(N )[E(X)]^2 = σn^2 µx^2
E[Var(S|N )] = E(N )Var(X) = µn σx^2
Var[S] = Var[E(S|N )] + E[Var(S|N )] = σn^2 µx^2 + µn σx^2
When N is Poisson with mean λ:
E[S] = µn µx = λµx
Var[S] = λ(σx^2 + µx^2 ) = λE(X^2 )
Many of the aggregate loss problems on the exam involve identifying N and the severity X, finding E[S]
and Var[S], and then applying the normal approximation to S to compute probabilities.
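That exam workflow can be sketched in code. The frequency and severity figures below are hypothetical, chosen only to illustrate the computation for a compound Poisson model.

```python
from math import erf, sqrt

def compound_poisson_moments(lam, ex, ex2):
    """Mean and variance of S = X1 + ... + XN with N ~ Poisson(lam)."""
    mean_s = lam * ex   # E[S] = lambda * E[X]
    var_s = lam * ex2   # Var[S] = lambda * E[X^2]
    return mean_s, var_s

def normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Hypothetical: lambda = 50 expected claims, severity with E[X] = 100, E[X^2] = 15000.
mean_s, var_s = compound_poisson_moments(50, 100.0, 15000.0)
# Normal approximation to Pr(S > 5500)
p = 1.0 - normal_cdf((5500 - mean_s) / sqrt(var_s))
```

The same two-moment recipe works for any frequency distribution by substituting the general formula Var[S] = σn^2 µx^2 + µn σx^2.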
5.2.2 Stop-Loss Insurance
Definition 5.4. Insurance on the aggregate losses, subject to a deductible, is called stop-loss insurance. The
expected cost of this insurance is called the net stop-loss premium and can be computed as E[(S − d)+ ], where d
is the deductible and the notation (·)+ means to use the value in parentheses if it is positive and zero otherwise.
The next few theorems relate to discrete distributions. They require that Pr(k < S < k + 1) = 0, which you
will generally only find in a discrete distribution. I have decided to use slightly different notation and variables
than the textbook uses. The textbook does these problems for a general case where k = jh (h could be any
number). I will assume h = 1 to simplify and clarify these theorems (I doubt the SOA would use the case where
h ≠ 1).
Theorem 5.1. Suppose Pr(k < S < k + 1) = 0. Then, for k ≤ d ≤ k + 1,
E[(S − d)+ ] = (k + 1 − d)E[(S − k)+ ] + (d − k)E[(S − (k + 1))+ ]   (5.1)
That is, the net stop-loss premium can be calculated via linear interpolation for values between two discrete
support points.
Theorem 5.2. Assume Pr(S = k) = fk ≥ 0 (with Σk Pr(S = k) = 1) for k = 0, 1, . . .. Then, provided that d is a
non-negative integer,
E[(S − d)+ ] = Σ_{k=0}^∞ [1 − FS (k + d)]   (5.2)
E[S ∧ d] = E[S] − Σ_{k=0}^∞ [1 − FS (k + d)] = Σ_{k=0}^∞ [FS (k + d) − FS (k)]   (5.3)
Corollary. Under the conditions of Theorem 5.2,
E[(S − (d + 1))+ ] = E[(S − d)+ ] − 1 + FS (d)   (5.4)
E[S ∧ (d + 1)] = E[S ∧ d] + 1 − FS (d)   (5.5)
For example: E[S ∧ 1] = E[S ∧ 0] + 1 − FS (0).
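The corollary gives a simple recursion over integer deductibles, starting from E[(S − 0)+ ] = E[S]. A minimal sketch, using a hypothetical three-point distribution for S:

```python
def stop_loss_premiums(fs, mean_s, max_d):
    """Net stop-loss premiums E[(S-d)+] for d = 0..max_d via the recursion (5.4).

    fs: sequence with Pr(S = k) for k = 0, 1, ...
    """
    e = [mean_s]                  # E[(S-0)+] = E[S]
    cdf = 0.0
    for d in range(max_d):
        cdf += fs[d]              # F_S(d)
        e.append(e[-1] - 1 + cdf) # E[(S-(d+1))+] = E[(S-d)+] - 1 + F_S(d)
    return e

# Hypothetical S: Pr(S=0)=0.5, Pr(S=1)=0.3, Pr(S=2)=0.2, so E[S] = 0.7.
prem = stop_loss_premiums([0.5, 0.3, 0.2], 0.7, 2)
```

Checking against a direct computation: E[(S − 1)+ ] = 0.2 × 1 = 0.2, which the recursion reproduces as 0.7 − 1 + 0.5.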
5.2.3 Convolution Method
An alternative approach to find the distribution of the sum of random variables is the method of convolution.
In this method, we would first find the distribution of X1 + X2 . Then we would use that to find the distribution
of (X1 + X2 ) + X3 , and so on. To apply the method of convolution to discrete integer-valued random variables,
we apply a combinatorial approach. For instance, to find P (X1 + X2 = k), we look at all X1 , X2 pairs that add
up to k, and add up those probabilities:
P (X1 + X2 = k) = Σ_{all j} P (X1 = j ∩ X2 = k − j)
And if X1 and X2 are independent:
P (X1 + X2 = k) = Σ_{all j} P (X1 = j)P (X2 = k − j)
From this point forward, all equations will assume independence.
Theorem 5.3. Let S = X1 + X2 , where fX1 is the pdf of X1 and FX2 is the cdf of X2 . Then:
FS (k) = Σ_{all j} fX1 (j)FX2 (k − j)
We can also apply the convolution method to continuous random variables:
fS (s) = ∫ fX1 (t)fX2 (s − t) dt
FS (s) = ∫ fX1 (t)FX2 (s − t) dt
Convolutions may arise in the context of compound distributions.
Definition 5.5. Let X1 , X2 , . . . , Xn be independent random variables that all have the same pdf or pf fX . Then
we can define an n-fold convolution for the distribution of X1 + X2 + · · · + Xn as
Pr( Σ_{i=1}^n Xi ≤ x ) = FX^{∗n} (x) = ∫_0^x FX^{∗(n−1)} (x − t)fX (t) dt   (5.6)
and the pdf is
fX^{∗n} (x) = ∫_0^x fX^{∗(n−1)} (x − y)fX (y) dy
These can be converted to a discrete pf by using Σ instead of ∫.
Example 5.1. Dice example:
1. Find the probability that the sum of two rolls of a fair die is 7.
2. Find the probability that the sum of three rolls of a fair die is 11.
To find the probability of rolling a 7 on two dice, we would first roll die 1 and then, depending on the value, we
would know what we need on die 2. Convolution states that if we look at all the possibilities for die 2 that
produce the required total for a given roll on die 1, we can generate the overall probability. For rolling a 7:
Pr(X1 + X2 = 7) = Pr(X1 = 1) Pr(X2 = 6) + Pr(X1 = 2) Pr(X2 = 5) + Pr(X1 = 3) Pr(X2 = 4)
                + Pr(X1 = 4) Pr(X2 = 3) + Pr(X1 = 5) Pr(X2 = 2) + Pr(X1 = 6) Pr(X2 = 1)
                = 6 × (1/6) × (1/6) = 1/6
If you complete the entire distribution you get:
f (2) = f (12) = 1/36; f (3) = f (11) = 2/36; f (4) = f (10) = 3/36; f (5) = f (9) = 4/36; f (6) = f (8) = 5/36; f (7) = 6/36
And we can use this to get f (X1 + X2 + X3 = 11):
f (X1 + X2 + X3 = 11) = Pr(X1 + X2 = 5) Pr(X3 = 6) + Pr(X1 + X2 = 6) Pr(X3 = 5)
                      + Pr(X1 + X2 = 7) Pr(X3 = 4) + Pr(X1 + X2 = 8) Pr(X3 = 3)
                      + Pr(X1 + X2 = 9) Pr(X3 = 2) + Pr(X1 + X2 = 10) Pr(X3 = 1)
                      = (4/36)(1/6) + (5/36)(1/6) + (6/36)(1/6) + (5/36)(1/6) + (4/36)(1/6) + (3/36)(1/6)
                      = (4 + 5 + 6 + 5 + 4 + 3)/216 = 27/216 = 1/8
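The dice calculation is exactly a discrete convolution, and is easy to automate. A minimal sketch:

```python
def convolve(f1, f2):
    """Pf of X1 + X2 for independent integer-valued X1, X2 given as dicts."""
    out = {}
    for j, pj in f1.items():
        for k, pk in f2.items():
            # accumulate Pr(X1 = j) * Pr(X2 = k) into the mass at j + k
            out[j + k] = out.get(j + k, 0.0) + pj * pk
    return out

die = {i: 1.0 / 6.0 for i in range(1, 7)}
two = convolve(die, die)      # distribution of the sum of two rolls
three = convolve(two, die)    # (X1 + X2) + X3, exactly as in the method above
```

Here `two[7]` recovers 1/6 and `three[11]` recovers 27/216 from the worked example.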
5.2.4 Describing S by Conditioning on N
A relationship we can use to find fS is:
P (A) = E[P (A|U )]
We can use the above expression to solve:
FS (y) = P (S ≤ y) = E[P (S ≤ y | N )] = Σ_{n=0}^∞ P (S ≤ y | N = n)P (N = n) = Σ_{n=0}^∞ FX^{∗n} (y)P (N = n)
This allows us to solve problems by breaking the problem into three steps:
1. Find the distribution of N , i.e. P (N = n)
2. Find FX^{∗n}
3. Add up all P (N = n)FX^{∗n} (y)
Another useful equation:
Pr(S = k) = Σ_{n=0}^∞ Pr( Σ_{i=1}^n Xi = k ) Pr(N = n)   (5.7)

5.3 The Recursive Method
Theorem 5.4. For the (a, b, 1) class,
fS (x) = { [p1 − (a + b)p0 ] fX (x) + Σ_{y=1}^{x∧m} (a + by/x)fX (y)fS (x − y) } / [1 − afX (0)]
Corollary. For the (a, b, 0) class, Theorem 5.4 reduces to
fS (x) = Σ_{y=1}^{x∧m} (a + by/x)fX (y)fS (x − y) / [1 − afX (0)]
Corollary. For the Poisson distribution, Theorem 5.4 reduces to
fS (x) = (λ/x) Σ_{y=1}^{x∧m} y fX (y)fS (x − y),   x = 1, 2, . . .
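The Poisson corollary turns into a short recursion once a starting value is supplied; for Poisson frequency, fS (0) = Pr(S = 0) = exp(λ(fX (0) − 1)). A minimal sketch with a hypothetical two-point severity:

```python
from math import exp

def compound_poisson_pf(lam, fx, smax):
    """Pf of S via the Poisson case of the recursion:
    f_S(x) = (lam / x) * sum_{y=1}^{x ^ m} y * f_X(y) * f_S(x - y).

    fx: list with f_X(y) for y = 0, 1, ..., m (fx[0] = Pr(X = 0)).
    Starting value (Poisson frequency): f_S(0) = exp(lam * (fx[0] - 1)).
    """
    fs = [exp(lam * (fx[0] - 1.0))]
    m = len(fx) - 1
    for x in range(1, smax + 1):
        total = sum(y * fx[y] * fs[x - y] for y in range(1, min(x, m) + 1))
        fs.append(lam / x * total)
    return fs

# Hypothetical: lam = 2 expected claims, severity Pr(X=1) = 0.6, Pr(X=2) = 0.4.
fs = compound_poisson_pf(2.0, [0.0, 0.6, 0.4], 4)
```

The recursion replaces the convolution sum over n in (5.7) with a single pass over x, which is why it is the standard computational tool for (a, b, 0) frequencies.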
5.4 The Impact of Individual Policy Modifications
Suppose that for a compound distribution S, the frequency distribution is N . If a modification (such as a deductible,
limit, or co-insurance) is applied to X, and we wish to find the aggregate amount paid by the insurer, then we can
model the aggregate payment in two ways:
1. Find Y L , the cost per loss (which might be 0 if there is a deductible), and then the aggregate amount
paid is S ∗ = Y1L + · · · + YNL , where N has the original frequency distribution or
2. Find Y P , the cost per payment, and find the distribution N ∗ , the number of payments made, and then
S ∗ = Y1P + · · · + YNP∗ is the aggregate amount paid.
where
N^∗ = Σ_{j=1}^N Ij ,   with Ij = 1 if loss j results in a payment and Ij = 0 otherwise.
6 Review of Mathematical Statistics

6.1 Point Estimation
Definition 6.1. An estimator, θ̂, is unbiased if E(θ̂ | θ) = θ for all θ. The bias is
biasθ̂ = E(θ̂ | θ) − θ
(6.1)
Definition 6.2. Let θ̂n be an estimator of θ based on a sample size of n. The estimator is asymptotically
unbiased if
lim_{n→∞} E(θ̂n | θ) = θ   (6.2)
for all θ.
Definition 6.3. An estimator is consistent (often called, in this context, weakly consistent) if, for all δ > 0 and
any θ,
lim_{n→∞} Pr(|θ̂n − θ| > δ) = 0   (6.3)
Definition 6.4. The mean-squared error (MSE) of an estimator is
MSEθ̂ (θ) = E[(θ̂ − θ)^2 | θ]   (6.4)
          = Var(θ̂ | θ) + [biasθ̂ (θ)]^2   (6.5)
Definition 6.5. An estimator, θ̂, is called a uniformly minimum variance unbiased estimator (UMVUE)
if it is unbiased and for any true value of θ there is no other unbiased estimator that has a smaller variance.
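The decomposition in (6.5) can be checked numerically: for any collection of simulated estimates, the empirical MSE about θ equals the empirical variance plus the squared bias. The uniform-distribution setup below is hypothetical.

```python
import random

def mse_decomposition(estimator, sample_gen, theta, trials=2000, seed=42):
    """Empirically check MSE = Var + bias^2 for a given estimator."""
    rng = random.Random(seed)
    ests = [estimator(sample_gen(rng)) for _ in range(trials)]
    mean_est = sum(ests) / trials
    bias = mean_est - theta
    var = sum((e - mean_est) ** 2 for e in ests) / trials
    mse = sum((e - theta) ** 2 for e in ests) / trials
    return mse, var + bias ** 2

# Hypothetical setup: theta = 10, X ~ uniform(0, theta), estimator 2 * x-bar (unbiased).
theta = 10.0
mse, decomposed = mse_decomposition(
    lambda xs: 2 * sum(xs) / len(xs),
    lambda rng: [rng.uniform(0, theta) for _ in range(5)],
    theta,
)
```

The two returned values agree to floating-point precision, since the decomposition is an algebraic identity for the empirical moments.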
6.2 Interval Estimation
Definition 6.6. A 100(1 − α)% confidence interval for a parameter θ is a pair of random values, L and U ,
computed from a random sample such that P r(L ≤ θ ≤ U ) ≥ 1 − α for all θ.
6.3 Tests of Hypothesis
A statistical hypothesis test is a method of making statistical decisions using experimental data.
... In frequency probability, these decisions are almost always made using null-hypotheses tests; that
is, ones that answer the question: “Assuming that the null hypothesis is true, what is the probability
of observing a value for the test statistic that is at least as extreme as the value that was actually
observed”. One use of hypothesis testing is deciding whether experimental results contain enough
information to cast doubt on conventional wisdom. (Wikipedia, July 14, 2009).
The decision on whether to choose H0 (null) or HA (alternative) is made by calculating a quantity called a test
statistic. It is a function of the observations and is treated as a random variable. That is, in designing the
test procedure we are concerned with the samples that might have been obtained and not with the particular
sample that was obtained. The test specification is completed by constructing a rejection region. It is a
subset of the possible values of the test statistic. If the value of the test statistic for the observed sample is in
the rejection region, the null hypothesis is rejected and the alternative hypothesis is announced as the result
that is supported by the data. Otherwise, the null hypothesis is not rejected. The boundaries of the rejection
region are called the critical values. In hypothesis testing there is a difference in meaning between accepting
23
6.4 Log-Transformed Confidence Interval
6
REVIEW OF MATHEMATICAL STATISTICS
and not rejecting. By not rejecting H0 we are not saying it is true; rather, we are saying that the data doesn’t
tell us anything. There are two types of errors we can make when doing hypothesis testing: incorrectly rejecting the null
hypothesis (in favour of the alternative HA ) when H0 is in fact true, and incorrectly “not rejecting” the
null hypothesis when the alternative hypothesis is in fact true. Since “not rejecting” is equivalent to saying nothing,
the first error (rejecting H0 in favour of HA ) is generally considered worse than the
second error. These errors are summarized below.
                H0 true    HA true
Reject H0       Type I     success
Not Reject H0   success    Type II
Definition 6.7. The significance level of a hypothesis test is the probability of making a Type I error given that
the null hypothesis is true. If it can be true in more than one way, the level of significance is the maximum of such
probabilities. The significance level is usually denoted by the letter α.
Definition 6.8. A hypothesis test is uniformly most powerful if no other test exists that has the same or lower
significance level and for a particular value within the alternative hypothesis has a smaller probability of making a
Type II error.
Definition 6.9. The p-value is the smallest level of significance at which H0 would be rejected when a specified
test procedure is used on a given data set. Once the p-value has been determined, the conclusion at any particular
level α results from comparing the p-value to α:
1. p-value ≤ α ⇒ reject H0 at level α.
2. p-value > α ⇒ do not reject H0 at level α.
[Probability and Statistics, Devore, 2000]
6.4 Log-Transformed Confidence Interval
Definition 6.10. The 100(1 − α)% log-transformed confidence interval for Sn (t) is
( Sn (t)^U , Sn (t)^{1/U} )
where
U = exp( zα/2 √(V̂ar[Sn (t)]) / (Sn (t) ln Sn (t)) )   (6.6)
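A sketch of (6.6) with hypothetical values for Sn (t) and its estimated variance. The endpoints are returned in increasing order, since U < 1 whenever ln Sn (t) < 0.

```python
from math import exp, log, sqrt

def log_transformed_ci(sn, var_sn, z=1.96):
    """Log-transformed CI for S_n(t): endpoints S_n(t)**U and S_n(t)**(1/U),
    with U = exp(z * sqrt(var) / (S_n(t) * ln S_n(t)))."""
    u = exp(z * sqrt(var_sn) / (sn * log(sn)))
    return tuple(sorted((sn ** u, sn ** (1.0 / u))))

# Hypothetical: S_n(t) = 0.8 with estimated variance 0.005
lo, hi = log_transformed_ci(0.8, 0.005)
```

Unlike the linear interval Sn (t) ± zα/2 √V̂ar, these endpoints always stay inside (0, 1), which is the motivation for the log transformation.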
7 Estimation for Complete Data
Definition 7.1. A data-dependent distribution is at least as complex as the data or knowledge that produced
it, and the number of “parameters” increases as the number of data points or amount of knowledge increases.
Definition 7.2. A parametric distribution is a set of distribution functions, each member of which is determined
by specifying one or more values called parameters. The number of parameters is fixed and finite.
Definition 7.3. The empirical distribution is obtained by assigning probability 1/n to each data point.
Definition 7.4. A kernel smoothed distribution is obtained by replacing each data point with a continuous
random variable and then assigning probability 1/n to each such random variable. The random variables used must
be identical except for a location or scale change that is related to the associated data point.
The text uses a number of different variables in the next few sections and chapters. I think the variables are
poorly explained. I will attempt to do a better job below.
Definition 7.5. The data set is defined as follows. There are n insureds in the data set. The i-th insured has
three different variables: the entry time into the data set is labeled di , the death time for the insured is labeled xi ,
and the censored time (for someone who leaves the study before they die) is labeled ui . Either ui is defined or xi is
defined, but not both.
Definition 7.6. The data summary is defined as follows. There are m data points. The j-th point has three
different variables: the time for the data point is yj , the quantity of deaths at time yj is labeled sj , and the
quantity at risk or the number of people who could die at time yj are labeled rj .
With Y the number of survivors past x out of n:
E[Sn (x)] = E(Y /n) = nS(x)/n = S(x)   (7.1)
Var[Sn (x)] = S(x)[1 − S(x)]/n   (7.2)
V̂ar[Sn (x)] = Sn (x)[1 − Sn (x)]/n

7.1 The Empirical Distribution
Definition 7.7. The empirical distribution function is
Fn (x) = (number of observations ≤ x)/n
where n is the total number of observations. Or, in terms of the data summary,
Fn (x) = 0 for x < y1 ;  Fn (x) = 1 − rj /n for yj−1 ≤ x < yj , j = 2, . . . , k;  Fn (x) = 1 for x ≥ yk
Definition 7.8. The empirical survival function is
Sn (t) = 1 − Fn (t)   (7.3)
1. Sample variance: [1/(n − 1)] Σ_{i=1}^n (xi − x̄)^2
2. Empirical estimate of the variance: (1/n) Σ_{i=1}^n (xi − x̄)^2
3. Empirical estimate of E[(X ∧ u)^k ]:
(1/n) [ Σ_{xi ≤ u} xi^k + u^k · (number of xi ’s > u) ]
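The empirical limited moment is a one-line computation; the sample below is hypothetical.

```python
def empirical_limited_moment(data, u, k=1):
    """Empirical estimate of E[(X ^ u)^k]: average of x_i^k for x_i <= u,
    with each x_i > u contributing u^k."""
    n = len(data)
    capped = sum(x ** k for x in data if x <= u)
    excess = u ** k * sum(1 for x in data if x > u)
    return (capped + excess) / n

# Hypothetical sample, limit u = 4
elm = empirical_limited_moment([1, 3, 5, 7], 4)
```

With k = 1 this is the empirical E[X ∧ u], the quantity that appears in loss elimination ratio and deductible calculations.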
Definition 7.9. The cumulative hazard rate function is defined as
H(x) = − ln S(x)   (7.4)

7.2 Nelson-Åalen Estimate
Definition 7.10. The Nelson-Åalen estimate ([1],[99]) of the cumulative hazard rate function is
Ĥ(x) = 0 for x < y1 ;  Ĥ(x) = Σ_{i=1}^{j−1} si /ri for yj−1 ≤ x < yj , j = 2, . . . , k;  Ĥ(x) = Σ_{i=1}^k si /ri for x ≥ yk   (7.5)
Because the Nelson-Åalen estimate is a step function, its derivatives are not interesting.
Definition 7.11. Assuming that the ri ’s are fixed, that the si ’s follow Poisson distributions, and that events are
independent, the variance of the Nelson-Åalen estimator is
V̂ar[Ĥ(yj )] = Σ_{i=1}^j si /ri^2   (7.6)
The linear confidence interval would be Ĥ(yj ) ± zα/2 √(V̂ar[Ĥ(yj )]). The log-transformed interval is
Ĥ(t)U , where U = exp( ± zα/2 √(V̂ar[Ĥ(yj )]) / Ĥ(t) )

7.3 Kaplan-Meier Estimator
Definition 7.12. Assuming the data follows Definition 7.6, the Kaplan-Meier product-limit estimator, with
S(0) = 1, is given by the general formula
Sn (t) = 1 for 0 ≤ t < y1 ;
Sn (t) = Π_{i=1}^{j−1} (ri − si )/ri for yj−1 ≤ t < yj , j = 2, . . . , k;
Sn (t) = Π_{i=1}^k (ri − si )/ri , or 0, for t ≥ yk   (7.7)
Definition 7.13. The Greenwood approximation for the variance of the product-limit estimator (Definition
7.12) is
V̂ar[Sn (yj )] = [Sn (yj )]^2 Σ_{i=1}^j si / [ri (ri − si )]   (7.8)
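Both the product-limit estimate (7.7) and the Greenwood variance (7.8) are running products and sums over the data summary, so they can be computed in a single pass. The data summary below is hypothetical.

```python
def kaplan_meier(points):
    """Product-limit estimator with the Greenwood variance approximation.

    points: list of (y_j, s_j, r_j) = (death time, deaths, number at risk).
    Returns a list of (y_j, S_n(y_j), Var-hat[S_n(y_j)]).
    """
    s_hat, green = 1.0, 0.0
    out = []
    for y, s, r in points:
        s_hat *= (r - s) / r             # multiply in the next survival factor
        green += s / (r * (r - s))       # Greenwood running sum
        out.append((y, s_hat, s_hat ** 2 * green))
    return out

# Hypothetical data summary: deaths observed at times 1, 2, 3
km = kaplan_meier([(1, 1, 10), (2, 2, 8), (3, 1, 5)])
```

At the first death time the Greenwood value reduces to Sn (1 − Sn )/n, matching (7.2) for complete data.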
7.4 Empirical Distribution for Grouped Data
In this section I will define a simple variable a:
a = (x − cj−1 )/(cj − cj−1 );   1 − a = (cj − x)/(cj − cj−1 )
Definition 7.14. For grouped data, the distribution function obtained by connecting the values of the empirical
distribution function at the group boundaries with straight lines is called the ogive. The formula is
Fn (x) = (1 − a)Fn (cj−1 ) + aFn (cj ),   cj−1 ≤ x ≤ cj
Definition 7.15. For grouped data, the empirical density function can be obtained by differentiating the ogive.
The resulting function is called a histogram. The formula is
fn (x) = [Fn (cj ) − Fn (cj−1 )]/(cj − cj−1 ) = nj /[n(cj − cj−1 )],   cj−1 ≤ x < cj
E(Sn (x)) = (1 − a)S(cj−1 ) + aS(cj )   (7.9)
Var(Sn (x)) = [(1 − a)^2 S(cj−1 )(1 − S(cj−1 )) + a^2 S(cj )(1 − S(cj )) + 2a(1 − a)S(cj )(1 − S(cj−1 ))]/n   (7.10)
7.5 Estimation for Large Data
I left this out of my first copies and added it after I took the exam. I found the time needed to understand the
material outweighs the benefits of knowing it, but I was able to pick up enough to add this material to this summary.
These techniques use “death intervals” instead of “death points”; as a result, we need to specify when in the
interval certain events occur. In order to accommodate these changes we re-define rj and create a new variable
Pj that is the number of people under observation at cj .
Because we are now using intervals, we need to clarify the new definitions of uj , dj and xj , following
the definition provided in Loss Models. It’s worth noting that the table below assumes that all observations
between cj and cj+1 occur at time cj .
enter @ dj :  left truncated observations in [cj , cj+1 )
left @ uj :   right censored observations in (cj , cj+1 ]
died @ xj :   uncensored observations in (cj , cj+1 ]

7.5.1 Kaplan-Meier Type
Definition 7.16. Let
Pj = Σ_{i=0}^{j−1} (di − ui − xi )   (7.11)
be the number of people under observation at age cj . Continue to assume that all the uncensored observations
(deaths) occur at a fixed point in the interval. Further assume that at this time 100α% of those who enter between
cj and cj+1 or dj have done so at cj (the remaining 100(1 − α)% enter at cj+1 ). Also, that 100β% of those who
will be censored (uj ) have done so at cj (the remaining 100(1 − β)% are censored at cj+1 ). We can re-define rj as
follows:
rj′ = Pj + αdj − βuj   (7.12)
There are two possible options for α and β that the SOA might use.
                                           α     β
entrants and exits occur on birthdays      1     0
entrants and exits are spread uniformly    0.5   0.5
Keep in mind that equation (7.11) only goes up to j − 1. For the risk set rj′ we then also add α percent of the
entrants from cj to cj+1 and subtract β percent of the censored observations that occur between cj and cj+1 .
Example 7.1. The next table comes from Loss Models’ “Data Set D2” (page 334 of Loss Models). The dj , uj and
xj columns should be quite straightforward for most readers. However, a quick comment on P0 .
Technically speaking, the equation for Pj at j = 0 is undefined, as the sum doesn’t make sense:
Pj = Σ_{i=0}^{j−1} (di − ui − xi ) = Σ_{i=0}^{−1} (di − ui − xi )
It therefore depends on the situation how we define P0 . The textbook’s example uses a situation where a study
starts with 30 lives; it therefore makes sense to define P0 = 30. If we don’t know how many lives a study starts
with, we would have to assume that P0 = 0.
In summary, normally by definition P0 is 0. However, in the event that a study “starts with” n lives we can state
that P0 is n. In order to clarify what Loss Models has done, I added time “-1”, or c−1 to c0 . In theory it doesn’t
make sense, but Loss Models chose to state that there are 30 observations between c−1 and c0 when they stated
P0 = 30, or that there are 30 lives added to the study prior to the study starting.
 j    dj   uj   xj    Pj     rj′      qj      F̂ (j)
-1    30    0    0     0      —       —      0.0000
 0     2    3    1    30    29.5    0.0339   0.0339
 1     2    2    0    28    28.0    0.0000   0.0339
 2     3    3    2    28    28.0    0.0714   0.1029
 3     3    3    3    26    26.0    0.1154   0.2064
 4     0   21    2    23    21.0    0.0952   0.2820
 5                   0(d)
This table is constructed with the assumption that α = 0.5 and β = 0.5.
r0′ = P0 + 0.5d0 − 0.5u0 = 30 + 0.5 × 2 − 0.5 × 3 = 29.5
   29.5 = 0.5(P0 + P1 + x0 ) = 0.5(30 + 28 + 1)
q0 = 1/29.5 = 0.0339
r1′ = P1 + 0.5d1 − 0.5u1 = 28 + 0.5 × 2 − 0.5 × 2 = 28.0
   28.0 = 0.5(P1 + P2 + x1 ) = 0.5(28 + 28 + 0)
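The early rows of such a table can be reproduced mechanically, with α = β = 0.5 as above. The helper below is a sketch that tracks Pj forward and builds F̂ as 1 minus a running survival product; the input rows mirror the first part of a Data Set D2 style study.

```python
def large_data_mortality(d, u, x, p0, alpha=0.5, beta=0.5):
    """Interval-based mortality estimate: r_j' = P_j + alpha*d_j - beta*u_j,
    q_j = x_j / r_j', and F-hat(j) = 1 - prod(1 - q_i)."""
    P, surv, out = p0, 1.0, []
    for dj, uj, xj in zip(d, u, x):
        r = P + alpha * dj - beta * uj
        q = xj / r
        surv *= 1.0 - q
        out.append((r, q, 1.0 - surv))
        P += dj - uj - xj          # people under observation at the next c_j
    return out

# First rows of a study that starts with P0 = 30 lives (j = 0..3)
res = large_data_mortality([2, 2, 3, 3], [3, 2, 3, 3], [1, 0, 2, 3], 30)
```

This reproduces r0′ = 29.5, r1′ = 28.0 and the F̂ values 0.1029 and 0.2064 from the table.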
7.5.2 Multiple-Decrement Tables
I’m not sure why the SOA wants to include this material on Exam C. MLC does a much better job with
the material, and those who pass MLC know it already. I’ve extracted the following proof from Actuarial
Mathematics for those who have forgotten. I’ve kept the notation that Actuarial Mathematics uses to avoid
typos (the reader should be able to see how they compare).
Let us examine specific assumptions concerning the incidence of decrements. First let us use an assumption
of a constant force for decrement j and for the total decrement over the interval (x, x + 1). This implies
µx^(j) (t) = µx^(j) (0) and µx^(τ) (t) = µx^(τ) (0),   0 ≤ t < 1
Then, for 0 ≤ s < 1, we have
s qx^(j) = ∫_0^s t px^(τ) µx^(j) (t) dt = [µx^(j) (0)/µx^(τ) (0)] ∫_0^s t px^(τ) µx^(τ) (t) dt = [µx^(j) (0)/µx^(τ) (0)] s qx^(τ)   (7.13)
But also, for any r in (0, 1), under the constant force assumption,
r px^(τ) = e^{−r µx^(τ) (0)} , or r µx^(τ) (0) = − ln r px^(τ)
and
r px′^(j) = e^{−r µx^(j) (0)} , or r µx^(j) (0) = − ln r px′^(j)
so that from (7.13)
s qx^(j) = [ln r px′^(j) / ln r px^(τ)] s qx^(τ)
Which is the equation found in Loss Models:
qj^(g) = [ln(1 − qj′^(g) ) / ln(1 − qj^(T ) )] qj^(T )   (7.14)
qj^(T ) = 1 − Π_{g=1}^m (1 − qj′^(g) )
qj^(T ) = Σ_{g=1}^m qj^(g)
Note: all you have to do is memorize the above equation; you don’t need to know the assumptions, as this is the
only multi-decrement model Exam C can use.
Single decrements are multiplicatively combined and are considered to come from a world where all other
decrements are not possible (other “deaths” are considered to be withdrawals). Multiple decrements are additively
combined (this makes them easier to work with) and come from a world where all other decrements are possible.
8 Kernel Density Models
For notation let p(yj ) be the probability assigned to the value yj (j = 1, . . . , k) by the empirical distribution.
Let Ky (x) be a distribution function for a continuous distribution such that its mean is y. Let ky (x) be the
corresponding density function.
Definition 8.1. A kernel density estimator of a distribution function is
F̂ (x) = Σ_{j=1}^k p(yj )Kyj (x)   (8.1)
and the estimator of the density function is
fˆ(x) = Σ_{j=1}^k p(yj )kyj (x)
The function ky (x) is called the kernel. Three kernels will now be introduced.
Definition 8.2. The uniform kernel is given by
ky (x) = 0 for x < y − b;  ky (x) = 1/(2b) for y − b ≤ x ≤ y + b;  ky (x) = 0 for x > y + b
Ky (x) = 0 for x < y − b;  Ky (x) = (x − y + b)/(2b) for y − b ≤ x ≤ y + b;  Ky (x) = 1 for x > y + b
Definition 8.3. The triangular kernel is given by
ky (x) = 0 for x < y − b;  ky (x) = (x − y + b)/b^2 for y − b ≤ x ≤ y;  ky (x) = (y − x + b)/b^2 for y ≤ x ≤ y + b;  ky (x) = 0 for x > y + b
Ky (x) = 0 for x < y − b;  Ky (x) = (x − y + b)^2 /(2b^2 ) for y − b ≤ x ≤ y;  Ky (x) = 1 − (y − x + b)^2 /(2b^2 ) for y ≤ x ≤ y + b;  Ky (x) = 1 for x > y + b
Definition 8.4. The gamma kernel is given by letting the kernel have a gamma distribution with shape parameter
α and scale parameter y/α. That is,
ky (x) = x^{α−1} e^{−xα/y} / [(y/α)^α Γ(α)]
Note that the gamma distribution has a mean of α(y/α) = y and a variance of α(y/α)2 = y 2 /α.
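A kernel-smoothed cdf is just the weighted sum (8.1); with the empirical distribution, each data point gets weight 1/n. A minimal sketch with the uniform kernel and a hypothetical three-point sample:

```python
def uniform_kernel_cdf(x, y, b):
    """K_y(x) for the uniform kernel with bandwidth b."""
    if x < y - b:
        return 0.0
    if x > y + b:
        return 1.0
    return (x - y + b) / (2 * b)

def kernel_smoothed_cdf(data, x, b):
    """F-hat(x) = sum_j p(y_j) K_{y_j}(x), with empirical p(y_j) = 1/n."""
    n = len(data)
    return sum(uniform_kernel_cdf(x, y, b) for y in data) / n

# Hypothetical sample with bandwidth b = 2, evaluated at x = 4
est = kernel_smoothed_cdf([2.0, 4.0, 7.0], 4.0, 2.0)
```

Swapping in the triangular or gamma kernel only changes the `K_y` function; the mixture structure of the estimator is identical.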
9 Parameter Estimation

9.1 Method of Moments
Definition 9.1. A method-of-moments estimate of θ is any solution of the p equations
µ′k (θ) = µ̂′k ,  k = 1, 2, . . . , p   (9.1)
Definition 9.2. A percentile matching estimate of θ is any solution of the p equations
πgk (θ) = π̂gk ,  k = 1, 2, . . . , p   (9.2)
where g1 , g2 , . . . , gp are p arbitrarily chosen percentiles. From the definition of percentile, the equations can also be
written
F (π̂gk | θ) = gk ,  k = 1, 2, . . . , p   (9.3)
Definition 9.3. The smoothed empirical estimate of a percentile is found by
π̂g = (1 − h)x(j) + hx(j+1) , where j = ⌊(n + 1)g⌋ and h = (n + 1)g − j   (9.4)
Here ⌊·⌋ indicates the greatest integer function and x(1) ≤ x(2) ≤ . . . ≤ x(n) are the order statistics from the sample.
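The smoothed empirical percentile (9.4) is a direct translation into code; the sample below is hypothetical.

```python
from math import floor

def smoothed_empirical_percentile(data, g):
    """pi-hat_g = (1-h) x_(j) + h x_(j+1) with j = floor((n+1)g), h = (n+1)g - j."""
    xs = sorted(data)               # order statistics x_(1) <= ... <= x_(n)
    pos = (len(xs) + 1) * g
    j = floor(pos)
    h = pos - j
    return (1 - h) * xs[j - 1] + h * xs[j]   # xs[j-1] is the j-th order statistic

# Hypothetical sample of 9; 35th smoothed empirical percentile
p35 = smoothed_empirical_percentile([12, 15, 7, 19, 26, 27, 29, 30, 33], 0.35)
```

Here (n + 1)g = 3.5, so the result interpolates halfway between the 3rd and 4th order statistics (15 and 19), giving 17.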
9.2 Maximum Likelihood
First I want to be clear about a couple different variables: Xj is a random variable with a distribution function
FXj (x) and has a density function of fXj (x). Secondly, we will define: Aj as an observation [where Aj could
be an exact observation (Aj = 4) or a censored (Aj > 4)] from the random variable Xj . In many cases
Xj = Xi ∀i, j (that is all observations are from the same distribution).
Definition 9.4. The likelihood function is
L(θ) = Π_{j=1}^n Pr(Xj ∈ Aj | θ)   (9.5)
and the loglikelihood function is
l(θ) = ln L(θ)
and the maximum likelihood estimate of θ is the vector that maximizes the likelihood function or the loglikelihood
function.
There is no guarantee that the function has a maximum at eligible parameter values. It is possible that as
various parameters become zero or infinite the likelihood function will continue to increase. A few examples for
the expression Pr(Xj ∈ Aj | θ), where a is the observed value, are:
Complete data: Pr(X = a)
Grouped data: Pr(cj−1 ≤ X ≤ cj )
Censored data: Pr(X > a)
A further complication is truncation, which can be dealt with in two ways: shifting and fitting the model
as is. Shifting means we model the distribution after a deductible has been applied. The other approach is to
fit the model using the total value (assuming no deductible was applied), but adjust each observation by the
probability of it not being truncated, 1 − F (d) (for example, Pr(X = a) = f (a)/[1 − F (d)]).
9.2.1 MLE of the Exponential Distribution
Complete data:
L(θ) = Π_{i=1}^n f (xi ) = Π_{i=1}^n (1/θ)e^{−xi /θ} = (1/θ^n ) e^{−(Σ xi )/θ}
θ̂ = (1/n) Σ xi = x̄
Censored data (m censored data points at u, n uncensored data points):
L(θ) = Π_{i=1}^n f (xi ) [1 − F (u)]^m = (1/θ^n ) e^{−(Σ xi + mu)/θ}
θ̂ = (Σ xi + mu)/n
Deductible data (fitting the model as is):
L(θ) = Π_{i=1}^n f (xi )/[1 − F (d)] = [(1/θ^n ) e^{−(Σ xi )/θ}] / (e^{−d/θ} )^n = (1/θ^n ) e^{−[Σ xi − nd]/θ}
θ̂ = (Σ xi − nd)/n = Σ (xi − d)/n

9.3 Variance and Interval Estimation
9.3.1 Information Method
Definition 9.5. The information is defined as follows:
I(θ) = −E[ (∂^2 /∂θ^2 ) ln L(θ) ] = E[ ( (∂/∂θ) ln L(θ) )^2 ]   (9.6)
The information matrix also forms the Cramér-Rao lower bound. That is, under the usual conditions,
no unbiased estimator has a smaller variance than that given by the inverse of the information. Therefore, at
least asymptotically, no unbiased estimator is more accurate than the maximum likelihood estimator.
For further clarification:
I(θ) = −nE[ (∂^2 /∂θ^2 ) ln f (x; θ) ] = −n ∫ f (x; θ) (∂^2 /∂θ^2 ) ln[f (x; θ)] dx
[I(θ)]^{−1} is a useful approximation for Var(θ̂n ). This result means that the maximum likelihood estimator is
asymptotically unbiased and consistent. To be clear, the expectation is taken over all values of x; the likelihood
function used should not include the substituted values of x. This can be seen clearly in the example below.
Example 9.1. Suppose that X has an exponential distribution with a mean of θ. The mle of θ is found from the
following random sample of 12 data points:
7, 12, 15, 19, 26, 27, 29, 29, 30, 33, 38, 53
What is the approximate 95% confidence interval for θ?
The pdf of the exponential distribution is f (x; θ) = (1/θ)e^{−x/θ} .
The likelihood function is L(θ) = Π_{j=1}^{12} (1/θ)e^{−xj /θ} = (1/θ^{12} ) e^{−(Σ xj )/θ} = (1/θ^{12} ) e^{−318/θ} .
The loglikelihood function is l(θ) = ln L(θ) = −12 ln(θ) − 318/θ.
To find θ̂ using the mle we solve (d/dθ) l(θ) = 0, which equals −12/θ + 318/θ^2 = 0 → θ̂ = 26.5 = x̄.
The “information” is (using E(xj ) = θ):
I(θ) = −E[ (∂^2 /∂θ^2 ) l(θ) ] = −E[ (∂^2 /∂θ^2 ) ( −12 ln θ − (Σ xj )/θ ) ]
     = −E[ 12/θ^2 − 2(Σ xj )/θ^3 ] = −12/θ^2 + (2/θ^3 ) nE[xj ] = −12/θ^2 + 2(12θ)/θ^3 = 12/θ^2
Therefore the approximate variance of the mle is 1/I(θ̂) = θ̂^2 /12 = 58.5. The approximate 95% confidence
interval for θ is 26.5 ± 1.96 √58.5 = (11.5, 41.5).
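Example 9.1 can be reproduced in a few lines: for exponential data, I(θ) = n/θ^2 , so the approximate variance of the mle is θ̂^2 /n.

```python
from math import sqrt

def exponential_mle_ci(data, z=1.96):
    """MLE theta-hat = x-bar; information n / theta^2 gives Var ~= theta-hat^2 / n."""
    n = len(data)
    theta = sum(data) / n
    se = sqrt(theta ** 2 / n)
    return theta, (theta - z * se, theta + z * se)

data = [7, 12, 15, 19, 26, 27, 29, 29, 30, 33, 38, 53]
theta, (lo, hi) = exponential_mle_ci(data)
```

This recovers θ̂ = 26.5 and the interval (11.5, 41.5) from the example above.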
9.3.2 Delta Method
Theorem 9.1. Let θ̂n be a parameter estimated using a sample size of n. Assume that θ̂n is asymptotically normal
with mean θ and variance σ^2 /n, where neither θ nor σ^2 depend on n. Let g be a function that is differentiable.
Let Gn = g(θ̂n ). Then Gn is asymptotically normal with mean g(θ) and variance [g ′ (θ)]^2 σ^2 /n.
The multivariable version is presented below:
Theorem 9.2. Let θ̂ n = (θ̂1n , . . . , θ̂kn )^T be a multivariate parameter vector of dimension k based on a sample size
of n. Assume that θ̂ n is asymptotically normal with mean θ and covariance matrix Ω/n, where neither θ nor
Ω depend on n. Let g be a function of k variables that is totally differentiable. Let Gn = g(θ̂1n , . . . , θ̂kn ). Then Gn
is asymptotically normal with mean g(θ) and variance (∂g)^T Ω(∂g)/n.
The whole point of the Delta method is to help us find the variance of distributional properties that may be
hard to find otherwise. If we can frame them as some function of the distribution’s parameters then we can find
the variance.
Example 9.2. Loss payments for a group health policy follow an exponential distribution with unknown mean. A
sample of losses is:
100, 200, 400, 800, 1400, 3100
Use the delta method to approximate the variance of the maximum likelihood estimator of S(1500).
Solution:
θ̂ = x̄ = 1000
Var(θ̂) = 1000^2 /6 = 166,667
g(θ) = exp(−1500/θ)
g ′ (θ) = (1500/θ^2 ) exp(−1500/θ)
[g ′ (θ)]^2 = (1500^2 /θ^4 ) exp(−3000/θ)
[g ′ (1000)]^2 = 0.000000112
Var[Ŝ(1500)] ≈ [g ′ (1000)]^2 × 1000^2 /6 = 0.01867
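The same delta-method computation in code, following the steps of Example 9.2:

```python
from math import exp

def delta_method_var_survival(theta_hat, n, limit):
    """Delta-method variance of S-hat(limit) = exp(-limit/theta) for exponential data.

    g'(theta) = (limit / theta^2) * exp(-limit / theta);
    Var ~= [g'(theta-hat)]^2 * theta-hat^2 / n.
    """
    gprime = (limit / theta_hat ** 2) * exp(-limit / theta_hat)
    return gprime ** 2 * theta_hat ** 2 / n

# Values from Example 9.2: theta-hat = 1000 from a sample of 6, limit 1500
v = delta_method_var_survival(1000.0, 6, 1500.0)
```

Framing S(1500) as g(θ) is the whole trick: once the quantity of interest is a function of the parameter, the delta method converts Var(θ̂) into Var of the quantity.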
9.4 Non-normal Confidence Intervals
Definition 9.6. Let l(θ) be the loglikelihood function for f (x; θ) and θ̂ the maximum likelihood estimate; then
the 100(1 − α)% confidence region is the set of values of θ that satisfy
{ θ : l(θ) ≥ l(θ̂) − χ^2 /2 }   (9.7)
where the first term is the loglikelihood value at the maximum likelihood estimate and χ^2 is the 1 − α
percentile from the chi-square distribution with degrees of freedom equal to the number of estimated parameters.
9.5 Bayesian Estimation
It makes more sense to provide this material in section 12.
10 Model Selection

10.1 Representations
In order to keep things consistent and in agreement with the text, when a distribution function or density function
is indicated, a subscript equal to the sample size indicates that it is the empirical model (from Kaplan-Meier,
Nelson-Åalen, the ogive, the empirical distribution, etc.), while no adornment, or the use of a (∗ ), indicates the
estimated parametric model. There is no notation for the true, underlying distribution because it is unknown or
unknowable.
10.2 Graphical Comparison
Definition 10.1. A p-p plot or probability plot is created by ordering the observations as x1 ≤ . . . ≤ xn . A
point is then plotted corresponding to each value:
(Fn (xj ), F ∗ (xj )), where Fn (xj ) = j/(n + 1)   (10.1)
The points on the p-p plot should be close to the 45° line running from (0,0) to (1,1) to conclude that the fitted
F ∗ describes the data well.
Definition 10.2. A D(x) plot is created by plotting
D(x) = Fn (x) − F ∗ (x), where Fn (xj ) = j/n   (10.2)

10.3 Hypothesis Tests
Hypothesis test:
H0 : The data came from a population with the stated model
H1 : The data did not come from such a population
It is more often the case that the null hypothesis (H0 ) states the name of the model but not its parameters.
When the parameters are estimated from the data, the test statistic tends to be smaller than it would have
been had the parameter values been pre-specified.
10.3.1 Kolmogorov-Smirnov Test
D = max_{t≤x≤u} |Fn (x) − F ∗ (x)|   (10.3)
Only for individual data.
10.3.2 Anderson-Darling Test
A^2 = n ∫_t^u [Fn (x) − F ∗ (x)]^2 / ( F ∗ (x)[1 − F ∗ (x)] ) f ∗ (x) dx   (10.4)
    = −nF ∗ (u) + n Σ_{j=0}^k [1 − Fn (yj )]^2 ( ln[1 − F ∗ (yj )] − ln[1 − F ∗ (yj+1 )] )
      + n Σ_{j=1}^k Fn (yj )^2 [ ln F ∗ (yj+1 ) − ln F ∗ (yj ) ]   (10.5)
10.3.3 Chi-Square Goodness-of-Fit
χ^2 = Σ_{j=1}^k n(p̂j − pnj )^2 /p̂j = Σ_{j=1}^k (Ej − Oj )^2 /Ej   (10.6)
where p̂j = F ∗ (cj ) − F ∗ (cj−1 ) is the probability a truncated observation falls into the interval from cj−1 to
cj , and pnj = Fn (cj ) − Fn (cj−1 ) is the same probability according to the empirical distribution. Finally, n is the
sample size. In equation (10.6), Ej = np̂j is the expected number of observations in the interval and Oj = npnj
is the actual number of observations. The critical values for this test come from the chi-square distribution
with (k − 1 − r) degrees of freedom, where k is the number of terms in the sum and r is the number of
estimated parameter values.
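The statistic in (10.6) is a one-line sum once the observed and expected counts per interval are known; the counts below are hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (E_j - O_j)^2 / E_j."""
    return sum((e - o) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical: 4 intervals, n = 100, fitted interval probabilities p-hat
p_hat = [0.2, 0.3, 0.3, 0.2]
observed = [25, 28, 30, 17]
expected = [100 * p for p in p_hat]
stat = chi_square_statistic(observed, expected)
# Compare stat against a chi-square critical value with k - 1 - r degrees of freedom.
```

With r parameters estimated from the data, the degrees of freedom here would be 4 − 1 − r.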
Compound distributions applied to the chi-square goodness-of-fit test. If the observed variable is the sum of independent and identical random variables, then the central limit theorem indicates that a normal approximation is appropriate. The expected count (Ej) is the exposure times the expected value for one exposure unit, while the variance (Vj) is the exposure times the estimated variance for one exposure unit. The test statistic becomes:

χ² = Σ_{j=1}^{k} (Ej − Oj)²/Vj   (10.7)
10.3.4 Likelihood ratio test

Definition 10.3. The likelihood ratio test is conducted as follows. First, let the likelihood function be written as L(θ). Let θ0 be the value of the parameters that maximizes the likelihood function; however, only parameter values within the null hypothesis may be considered. Let L0 = L(θ0). Let θ1 be the maximum likelihood estimate where the parameters can vary over all possible values from the alternative hypothesis, and let L1 = L(θ1). The test statistic is

T = 2 ln(L1/L0) = 2(ln L1 − ln L0)   (10.8)
The null hypothesis is rejected if T > c, where c is calculated from α = Pr(T > c), where T has a chi-square
distribution with degrees of freedom equal to the number of free parameters in the model from the alternative
hypothesis less the number of free parameters in the model from the null hypothesis.
10.4 Selecting a Model

10.4.1 Score-based approaches
1. Lowest value of the Kolmogorov-Smirnov test statistic
2. Lowest value of the Anderson-Darling test statistic
3. Lowest value of the chi-square goodness-of-fit test statistic
4. Highest p-value for the chi-square goodness-of-fit test
5. Highest value of the likelihood function at its maximum
Definition 10.4. The Schwarz Bayesian Criterion recommends that when ranking models a deduction of
(r/2) ln n should be made from the loglikelihood value, where r is the number of estimated parameters and n is the
sample size.
11 Full Credibility

11.1 Introduction
Credibility attempts to answer the question: how much data would it take for the information the data provides to be credible when compared to some already-known existing standard (the manual rate)? Full credibility requires that we have that much data. Partial credibility deals with situations in which we haven't met the requirements for full credibility. There are other techniques for finding a credible solution, which will be discussed in the next chapter.
We say that the full credibility standard is satisfied if the probability relation below holds:

Pr(|X̄ − µ| ≤ rµ) ≥ p

1. Where r is the percentage we are willing to be off in our approximation of µ (typically around 5%).
2. And p is the probability that X̄ is within that r-percentage range.
Pr( |X̄ − µ|/√(σ²/n) ≤ rµ/√(σ²/n) ) ≥ p

(X̄ − µ)/√(σ²/n) ≈ Normal(0, 1)

A p-percent confidence interval would be 0 ± z(1−p)/2. We want the above interval to include rµ/√(σ²/n), i.e., we want to find n such that

z(1−p)/2 = rµ/√(σ²/n)

n = (z(1−p)/2 / r)² (σ/µ)²

Here n is the number of observations of X needed to establish that x̄ is within r percent of µ, p percent of the time. To simplify things I will quickly define n0:

n0 = (z(1−p)/2 / r)²   (11.1)
Example 11.1. µ = 10; r = 0.05; p = 0.95; σ² = 4;

Pr(|X̄ − 10| ≤ 0.5) ≥ 0.95

Pr( |X̄ − 10|/√(4/n) ≤ 0.5/√(4/n) ) ≥ 0.95

A 95% confidence interval would be 0 ± 1.96. We want the above interval to include 0.5/√(4/n):

0.5/√(4/n) = 1.96 → n = 61.5

We could also use the result from above:

n = n0 (σ²/µ²) = (1.96/0.05)² (4/10²) = 61.5

Here n is the number of observations of X needed to establish that x̄ is within 5% of the correct mean 95% of the time.
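Both routes through Example 11.1 can be checked in a couple of lines. The function name and the hard-coded z-value of 1.96 (for p = 0.95) are my own choices, not from the text.

```python
def full_credibility_n(r, z, sigma2, mu):
    """n = (z/r)^2 * (sigma^2/mu^2), the observation count for full credibility."""
    n0 = (z / r) ** 2          # equation (11.1)
    return n0 * sigma2 / mu ** 2

# Example 11.1: mu = 10, r = 0.05, p = 0.95 (z = 1.96), sigma^2 = 4
n = full_credibility_n(r=0.05, z=1.96, sigma2=4, mu=10)
```

Rounding up, 62 observations would be required in practice.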
Now, we are not always just looking for a credibility estimate for the number of observations required. We may want to know how large the "total sum of observations" is required to be before we have a credible estimate for x̄. Since this just makes everything bigger by a factor of µ, we can just multiply the entire expression by that:

n = µ n0 (σ/µ)²   (11.2)
Here we have the sum of all observations of X needed to establish that x̄ is within r% of the correct mean p%
of the time.
11.2 Compound Distribution
As you may have become aware, the language in these problems is critical to solving them. I will now throw a further wrench into the language complexities. What if we wanted to know how many observations of a compound distribution were required in order to establish credibility? We can use the same equation as before:

n = n0 (µn σx² + µx² σn²)/(µn² µx²)   (11.3)

Here n is the number of observations of S needed to establish that s̄ is within r% of the correct mean p% of the time.
As in the simple case, we can ask different questions, such as:

1. What is the sum of all observations required...

µn µx n0 (µn σx² + µx² σn²)/µs²   (11.4)

2. What is the total number of claims required...

µn n0 (µn σx² + µx² σn²)/µs²   (11.5)

If you notice, question 2 could be answered for both S and N. It's important to find out if we are talking about s̄ or n̄.
11.3 Poisson

In many cases we can simplify the above expressions by assuming a random variable has a Poisson distribution:

E(N) = λ;   Var(N) = λ

The one-variable case has:

1. What is the number of observations of N...

n = n0 (λ/λ²) = n0/λ   (11.6)

2. What is the total sum of observations of N...

n = λ n0 (λ/λ²) = n0   (11.7)
The two-variable case can be summarized as:

Quantity       General Case                             N = Poisson(λ)          Relationship
# of Si's      n0 (σn² µy² + µn σy²)/(µn µy)²           (n0/λ)[1 + σy²/µy²]     = A
Σ Si           n0 (σn² µy² + µn σy²)/(µn µy)            n0 [µy + σy²/µy]        = µn µy A
Σ Si / µy      n0 (σn² µy² + µn σy²)/(µn µy²)           n0 [1 + σy²/µy²]        = µn A

11.4 Partial Credibility
This is a reasonably straightforward extension of what we've been doing. All you have to do is find

Z = √(have/required)

and make sure "have" and "required" are of the same type (sum/exposures/number of claims). Then the credibility premium is (M is the manual rate, or given rate):

Q = Z X̄ + (1 − Z)M
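The partial credibility premium can be sketched as below; the function name and the sample figures are hypothetical.

```python
import math

def partial_credibility_premium(have, required, x_bar, manual):
    """Z = sqrt(have/required), capped at 1; Q = Z*x_bar + (1 - Z)*M."""
    z = min(1.0, math.sqrt(have / required))
    return z * x_bar + (1 - z) * manual

# Hypothetical: 803 claims observed against a 3212-claim full standard.
q = partial_credibility_premium(have=803, required=3212, x_bar=1200, manual=1000)
```

Note the cap at Z = 1: once the full credibility standard is met, the premium is just the sample mean.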
Example 11.2. You are given the following information about a commercial auto liability book of business:
1. Each insured’s claim count has a Poisson distribution with mean λ, where λ has a gamma distribution with
α = 1.5 and θ = 0.2.
2. Individual claim size amounts are independent and exponentially distributed with mean 5000.
3. The full credibility standard is for aggregate losses to be within 5% of the expected with probability 0.90.
Using classical credibility, determine the expected number of claims required for full credibility.
Solution. A few quick comments:

- We're applying credibility to "aggregate losses", S = ΣX.
- A gamma-Poisson mixture is Neg.Bin(r = α, β = θ).
- n0 = (1.645/0.05)² = 1082.4

µn = αθ = 0.3
σn² = αθ(1 + θ) = 0.36
µs = µn µx = 5000αθ = 1500   (11.8)
σs² = µn σx² + µx² σn² = αθ(5000²) + 5000² αθ(1 + θ) = 16,500,000   (11.9)

Exposures of S required: n0 (σs²/µs²) = 1082.4 × 16,500,000/1500² = 7937. Now we have to convert to what the problem is asking for: the expected number of claims. There are µn claims per exposure: αθ × 7937 = 2381. So it would take 2381 claims to get a credible estimate for S.
12 Bayesian Estimation

12.1 Introduction
The texts I have used in the past go straight into the concept of discrete priors, posterior distributions, etc. I'm going to start with a discussion of the concept, or purpose, behind Bayesian estimation. I always like to start with one of the simplest examples:

Example 12.1. You have two dice, a normal one (D1) and one with 5 ones and 1 six (D2). Now imagine choosing one of these two dice at random and rolling it; X = the value shown on the die (in item 6 we roll the same randomly chosen die twice). Find the probability for a number of different possibilities:
1. Pr(X = 1)
2. Pr(X = 2)
3. Pr(X = 6)
4. Pr(Dice = D2 |X = 2)
5. Pr(Dice = D2 |X = 1)
6. Pr(Dice = D2 |X1 = 1 ∩ X2 = 1)
The solution for each follows:

1. Pr(X = 1) = Pr(Dice = D1) Pr(X = 1|D1) + Pr(Dice = D2) Pr(X = 1|D2) = (1/2)(1/6) + (1/2)(5/6) = 6/12

2. Pr(X = 2) = 1/12

3. Pr(X = 6) = 1/6

4. Pr(Dice = D2 |X = 2) = Pr(Dice = D2 ∩ X = 2)/Pr(X = 2) = 0/(1/12) = 0

   In this situation, since there are no 2's on die 2, if you roll a 2 it has to have come from die 1. Therefore if you roll a 2 you know it had to have been die 1.

5. Pr(Dice = D2 |X = 1) = Pr(Dice = D2 ∩ X = 1)/Pr(X = 1) = (5/12)/(1/2) = 5/6

   In this situation, if you were asked which die you think was rolled given that you rolled a 1, you should most certainly conclude you rolled the die that has 5 ones rather than just 1 one.

6. Pr(Dice = D2 |X1 = 1 ∩ X2 = 1) = Pr(Dice = D2 ∩ X1 = 1 ∩ X2 = 1)/[Pr(Dice = D1 ∩ X1 = 1 ∩ X2 = 1) + Pr(Dice = D2 ∩ X1 = 1 ∩ X2 = 1)] = 0.5(5/6)²/[0.5(5/6)² + 0.5(1/6)²] = 25/26
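The dice calculations above can be reproduced mechanically with Bayes' theorem. This is a minimal sketch; the dictionary layout is my own.

```python
# Priors: each die chosen with probability 1/2; D1 is fair, D2 has five 1s and one 6.
prior = {"D1": 0.5, "D2": 0.5}
model = {"D1": {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6},
         "D2": {1: 5/6, 6: 1/6}}

def posterior(rolls):
    """Pr(die | observed rolls) via Bayes' theorem; unseen faces get probability 0."""
    joint = {d: prior[d] for d in prior}
    for x in rolls:
        for d in joint:
            joint[d] *= model[d].get(x, 0.0)   # multiply in the model probability
    total = sum(joint.values())                # marginal probability of the rolls
    return {d: j / total for d, j in joint.items()}

p = posterior([1, 1])   # two 1s in a row, as in item 6
```

With two observed 1s the posterior on the loaded die is already 25/26, showing how quickly data overwhelms a 50/50 prior.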
The Prior event in the above example is the choice of either die 1 or die 2: Pr(Dice = D1) = 0.5, Pr(Dice = D2) = 0.5. The model events for the conditional probability are generally known as P(X|Dice), which would be {1, 2, 3, 4, 5, 6} for die 1 and {1, 1, 1, 1, 1, 6} for die 2. The Joint probabilities are the Pr(X ∩ Dice). The model events for the marginal probabilities are Pr(X). Finally, the reason we go through all these hoops: the posterior distribution is Pr(Dice|X), similar to what we found in 4 and 5. The point of Bayesian credibility is to use the results (the X's in the above example) along with what we know about the prior distribution (the choice of die) to determine which part of the prior distribution makes the most sense given the data we have available. In the above example it means that we use the roll of the die (say X = 1) to determine the probability that Dice = D1 or D2. Let me now try to formalize the above situation along with a few additional concepts. I have added some simpler notations to make the expressions cleaner; these will be in brackets after the name. It may be obvious to some readers, but considering the number of conditional distributions we are dealing with I want to make sure it is understood: if there is a conditional distribution function, then everything on the left of the | is the random variable we are trying to model and everything on the right is the existing information we already know.
12.2 Definition of Bayes' Theorem
Definition 12.1. The prior distribution (πθ ) is a probability over the space of possible parameter values. It is
denoted π(θ) and represents our opinion concerning the relative chances that various values of θ are the true value
of the parameter.
This is generally an existing known distribution about the population in question.
Definition 12.2. The model distribution (mx|θ ) is the probability distribution for the data as collected given a
particular value for the parameter. Its pdf is denoted mx|θ = fX|Θ (x|θ).
This is the main distribution for the losses within the population. This states that once we know the parameter
for the model (from the prior) we know the distribution of losses from this distribution. In order to be able to
use the bayesian procedure we need to know (or be able to estimate) both the distribution for the population
(prior) and the distribution for the losses (model). If we assume that the observations are independent and
identically distributed random variables, then
mx|θ = fX|Θ(x|θ) = ∏_{i=1}^{n} fX|Θ(xi|θ)   (12.1)
Definition 12.3. The joint distribution (jx,θ ) has pdf
jx,θ = fX,Θ (x, θ) = mx|θ π(θ) = fX|Θ (x|θ)π(θ)
(12.2)
We expand the model by adding in the parameter variable. In the model distribution we assume the parameter equals θ; in the joint pdf we make no assumptions about the parameter and we look at all possible values of θ along with all values of the xi.
Definition 12.4. The marginal distribution (gx ) of x has pdf
gx = fX(x) = ∫ jx,θ dθ = ∫ fX|Θ(x|θ)π(θ) dθ   (12.3)
The marginal distribution can be thought of as the average distribution once all possible parameter values have
been considered.
Definition 12.5. The posterior distribution (pθ|x ) is the conditional probability distribution of the parameters
given the observed data. Its pdf is
pθ|x = πΘ|X(θ|x) = jx,θ/gx = fX|Θ(x|θ)π(θ) / ∫ fX|Θ(x|θ)π(θ) dθ   (12.4)
The posterior distribution gives us the distribution of the parameter given a set of observed data {x1 , . . . , xn }.
Definition 12.6. The predictive distribution is the conditional probability distribution of a new observation y
given the data x = x1 , . . . , xn . Its pdf
fY|X(y|x) = ∫ fY|Θ(y|θ) pθ|x dθ = ∫ fY|Θ(y|θ) πΘ|X(θ|x) dθ   (12.5)
Finally, using the posterior distribution we can find the predictive distribution, which is the distribution of the
xn+1 th observation given the {x1 , . . . , xn } observations. We can also calculate E(xn+1 |X = x) as follows:
E(xn+1|X = x) = ∫ xn+1 fXn+1|X(xn+1|x) dxn+1 = ∫ E(Xn+1|Θ = θ) πΘ|X(θ|x) dθ   (12.6)
Example 12.2. You have been asked to find the probability that an individual driver (Driver A) has one or more accidents in the next year. The company already knows that claims for an individual driver follow a Poisson distribution with mean λ. The mean for the general population is normally distributed with a mean of 0.2 and a variance of 0.01. Over the last 10 years Driver A has had the following claim counts: {0, 0, 1, 2, 0, 0, 0, 0, 1, 0} (this driver is significantly worse than average, as he had 4 claims in 10 years).
Solution:

1. Prior distribution: π(λ) = (10/√(2π)) e^{−50(λ−0.2)²}.

2. Model distribution: mx|λ = e^{−λ} λ^x / x!.

3. Joint distribution: jx,λ = (10/√(2π)) e^{−50(λ−0.2)²} (e^{−λ}λ⁰/0!)⁷ (e^{−λ}λ¹/1!)² (e^{−λ}λ²/2!) = (10/√(2π)) e^{−50(λ−0.2)²} e^{−10λ} λ⁴/2.

   The joint distribution is a distribution over n + 1 variables ({x1, . . . , xn} and θ); strictly, the x's should not be substituted into the expression until we are at the posterior distribution, where we have θ|x, but doing so early makes the expression cleaner and for our purposes works.

4. Marginal distribution: gx = ∫₀^∞ (10/√(2π)) e^{−50(λ−0.2)²} e^{−10λ} (λ⁴/2) dλ = 0.00011.

5. Posterior distribution: pλ|x = (10/√(2π)) e^{−50(λ−0.2)²} e^{−10λ} (λ⁴/2) / 0.00011.

One thing the text highlights well is the fact that since the posterior distribution is a proper pdf, the integral over all values will be equal to one. In many cases the posterior distribution will fit a known distribution and we can match parameter values without having to actually calculate integrals as I did above. I'm not sure there's a simple distribution for the expression above, though (due to the fact that we have both e^{aλ} and e^{aλ²}), and that is not the point of this exercise.

6. The predictive distribution is: fy|x(y|x) = ∫₀^∞ (e^{−λ}λ^y/y!) (10/√(2π)) e^{−50(λ−0.2)²} e^{−10λ} λ⁴/(2 × 0.00011) dλ.

7. Predictive for y = 0: fy|x(0|x) = ∫₀^∞ e^{−λ} (10/√(2π)) e^{−50(λ−0.2)²} e^{−10λ} λ⁴/(2 × 0.00011) dλ = 0.76877.

8. Finally, the predictive for y > 0: 1 − 0.76877 = 0.23123. So the Bayesian estimate for the probability this driver has 1 or more claims in the next year is 23%, which isn't much higher than the 18% we would expect from an average driver in this population.
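The integrals in the solution above can be evaluated numerically; a crude midpoint rule is enough here since the integrand dies off well before λ = 1. This is a sketch, with function names of my own choosing.

```python
import math

def prior(lam):
    """normal(0.2, 0.01) prior density for the population mean."""
    return 10 / math.sqrt(2 * math.pi) * math.exp(-50 * (lam - 0.2) ** 2)

def likelihood(lam):
    """Poisson likelihood of the claim history {0,0,1,2,0,0,0,0,1,0}."""
    return math.exp(-10 * lam) * lam ** 4 / 2

def integrate(f, a, b, steps=20000):
    """Simple midpoint-rule integration."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

marginal = integrate(lambda l: prior(l) * likelihood(l), 0.0, 1.0)
pred_0 = integrate(lambda l: math.exp(-l) * prior(l) * likelihood(l), 0.0, 1.0) / marginal
prob_claim = 1 - pred_0        # probability of one or more claims next year
```

The computed marginal lands near the 0.00011 quoted in the solution, and prob_claim near 0.231.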
Definition 12.7. The double-expectation formulas for any random variables X and Y are
E(Y ) = E[E(Y |X)]
(12.7)
Var(Y ) = E [Var(Y |X)] + Var [E(Y |X)]
(12.8)
Definition 12.8. The points a < b define a 100(1 − α)% credibility interval for θj provided that Pr(a ≤ Θj ≤
b|x) ≥ 1 − α.
The ≥ allows for a solution when θj is discrete; the following theorem provides a unique solution when θj is continuous:
Theorem 12.1. If the posterior random variable θj|x is continuous and unimodal, then the 100(1 − α)% credibility interval with smallest width b − a is the unique solution to

∫_a^b πΘj|X(θj|x) dθj = 1 − α   (12.9)

πΘj|X(a|x) = πΘj|X(b|x)

This interval is a special case of the highest posterior density credibility set.
Definition 12.9. For any posterior distribution the 100(1 − α)% HPD credibility set is the set of parameter values C such that

Pr(θj ∈ C) ≥ 1 − α   (12.10)

and

C = {θj : πΘj|X(θj|x) ≥ c} for some c,

where c is the largest value for which the inequality Pr(θj ∈ C) ≥ 1 − α holds.
Theorem 12.2. If π(θ) and fX|Θ(x|θ) are both twice differentiable in the elements of θ and other commonly satisfied assumptions hold, then the posterior distribution of Θ given X = x is asymptotically normal.
Definition 12.10. A prior distribution is said to be a conjugate prior distribution for a given model if the resulting posterior distribution is of the same type as the prior (but perhaps with different parameter values).
Theorem 12.3. Suppose that given Θ = θ the random variables X1, . . . , Xn are independent and identically distributed with pf

fXj|Θ(xj|θ) = p(xj) e^{−θxj} / q(θ)   (12.11)

where Θ has pf

π(θ) = [q(θ)]^{−k} e^{−θµk} / c(µ, k)

where k and µ are parameters of the distribution and c(µ, k) is the normalizing constant. Then the posterior pf πΘ|X(θ|x) is of the same form as π(θ).
12.3 Conjugate Prior: Common Examples

     Prior                     Model                Posterior
1.   λ = gamma(α, θ)           x = Poisson(λ)       gamma(α + Σxi, θ/(nθ + 1))
2.   λ = gamma⁻¹(α, θ)         x = exp(λ)           gamma⁻¹(α + n, θ + Σxi)
3.   q = beta(a, b, 1)         x = bin(m, q)        beta(a + Σxi, b + km − Σxi, 1)
4.   λ = gamma(α, θ)           x = exp⁻¹(λ)         gamma(α + n, [1/θ + Σ(1/xi)]⁻¹)
5.   λ = normal(µ, a²)         x = normal(λ, σ²)    normal([Σxi/σ² + µ/a²]/[n/σ² + 1/a²], [n/σ² + 1/a²]⁻¹)
6.   λ = single.pareto(α, θ)   x = uniform(0, λ)    single.pareto(α + n, max(x, θ))
An important related property of the "gamma-Poisson" pair (#1) is that the marginal distribution of X can be shown to be negative binomial with r = α and β = θ. We can also find E[X] by the double-expectation rule: E[X] = E[E(X|λ)] = E[λ] = αθ. Also, the predictive distribution will be negative binomial with the same parameters as the gamma posterior distribution: r′ = α + Σxi and β′ = θ/(nθ + 1). The predictive mean is the same as the mean of the posterior distribution: r′β′ = (α + Σxi)(θ/[nθ + 1]), and the variance is r′β′(1 + β′). Finally, a special case arises when α = 1 in the prior distribution: the prior becomes an exponential distribution and the marginal distribution of X becomes a geometric distribution. The posterior will still be a gamma distribution.
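The gamma-Poisson update from row 1 of the table is simple enough to script directly. The function name and the sample parameters below (the prior from Example 11.2 with three hypothetical claim counts) are my own.

```python
def gamma_poisson_update(alpha, theta, counts):
    """Posterior for a gamma(alpha, theta) prior with Poisson data (table row 1):
    gamma(alpha + sum(x_i), theta/(n*theta + 1))."""
    n = len(counts)
    return alpha + sum(counts), theta / (n * theta + 1)

# Prior gamma(1.5, 0.2), three observed annual claim counts (hypothetical).
a_post, t_post = gamma_poisson_update(alpha=1.5, theta=0.2, counts=[0, 2, 1])
pred_mean = a_post * t_post     # predictive (negative binomial) mean r' * beta'
```

The predictive mean is just the posterior gamma mean, consistent with the double-expectation rule above.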
13 Linear Credibility

13.1 Introduction
While Bayesian credibility may provide excellent results, a problem that can quickly arise is that solutions become impossible to compute. Imagine having a couple thousand xi's: the marginal distribution becomes impossible to calculate, as the values are too tiny for most computers to process properly. It would appear that an alternative, more robust approach is needed to find a credibility estimate. In order to determine a linear credibility estimate we need some way of determining what linear factors to use and the values for those factors. This can be done by finding the values of αj that minimize Q and using those results to estimate µn+1(Θ), where
Q = E[ (µn+1(Θ) − α0 − Σ_{j=1}^{n} αj Xj)² ]   (13.1)

The steps can be found in my linear credibility supplement. The resulting credibility premium would be

α̃0 + Σ_{j=1}^{n} α̃j Xj   (13.2)
13.2 Bühlmann Credibility
Figure 3: Tree structure of the Bühlmann-Straub model
For the purposes of this section I find it necessary to clarify what is meant by Xij:

Definition 13.1. The claims ratio is defined as

Xij = Sij/mij   (13.3)

where Sij is the aggregate loss during the period (or exposure, time, weight, etc.) of size mij. It could also be called the average claim size per unit of exposure.
Also,

mi = Σ_{j=1}^{ni} mij;    m = Σ_{i=1}^{r} Σ_{j=1}^{ni} mij
Definition 13.2. A Bühlmann-Straub model can be constructed as follows. Assume that for each policyholder (conditional on Θi = θ) past losses Xi1, . . . , Xi,ni have the same mean (µ(θ)) and variance v(θ)/mij (inversely proportional to the weight), and are independent conditional on Θ. Then:

hypothetical mean, or collective premium     µ(θ) = E(Xij|Θi = θ)          for risk i
process variance                             v(θ) = mij Var(Xij|Θi = θ)    for risk i
expected value of the hypothetical means     µ = E[µ(θ)]                   collective/portfolio
expected value of the process variance       v = E[v(θ)]                   collective/portfolio
variance of the hypothetical means           a = Var[µ(θ)]                 collective/portfolio
Bühlmann's k, or credibility coefficient     k = v/a                       collective/portfolio
Bühlmann credibility factor                  Zi = mi/(mi + v/a)            for risk i
The credibility premium is
Zi X̄i + (1 − Zi )µ
(13.4)
Definition 13.3. A Bühlmann model is a special case of the Bühlmann-Straub model where mij = 1 and ni = n for all i and j.
The larger v/a is, the smaller Z will be. So, if a is much smaller than v we will find that the estimate provided by the mean of the sample is less credible. To see how this works, imagine a population where there is a huge gap in the means of two groups (say good and bad drivers): group A (50%) has 1 claim every 20 years and group B (50%) has 10 claims per year (assume the claim distribution is Poisson). This would mean that a = Var[µ(θ)] = 0.5((0.05 − 5.025)² + (10 − 5.025)²) = 24.75 and v = E[v(θ)] = (10 + 0.05)(0.5) = 5.025; even with just 1 year of data we would have an estimate that we could call (1/(1 + 0.2)) = 83% credible. A quick second example would have two almost identical drivers (driver A has 1 claim every 20 years, driver B has 1 claim every 19 years). The variance of the hypothetical means would be a = Var[µ(θ)] = 0.0000017 and the expected value of the process variance v = E[v(θ)] = 0.051. In this example it would take over 29,000 years of data to come to a 50% credible estimate for the population (Z = n/[n + (0.051/0.0000017)]). In other words: a measures the amount of variance there is within the population and v measures the variance expected in the individual observations for a specific insured. If there is a lot of variance within the population it is easy to tell insureds apart; however, if there is a lot of variance within the results then it is hard to determine if a result is from population A or population B.
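The good/bad driver example can be checked numerically; this sketch computes a, v and Z for a discrete prior over risk classes (function name my own).

```python
def buhlmann_z(means, variances, probs, n):
    """Z = n/(n + v/a) for a discrete prior over risk classes:
    a = variance of the hypothetical means, v = expected process variance."""
    mu = sum(p * m for p, m in zip(probs, means))
    a = sum(p * (m - mu) ** 2 for p, m in zip(probs, means))
    v = sum(p * s for p, s in zip(probs, variances))
    return n / (n + v / a)

# Good/bad drivers: Poisson means 0.05 and 10, equally likely, one year of data.
z = buhlmann_z(means=[0.05, 10], variances=[0.05, 10], probs=[0.5, 0.5], n=1)
```

For Poisson claims the per-class variance equals the mean, which is why the same numbers appear in both lists.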
A useful relationship between a, v and Var(X) is

Var(X) = Var[E(Xi|θ)] + E[Var(Xi|θ)] = a + v   (13.5)

Var(X̄) = Var[E(X̄|θ)] + E[Var(X̄|θ)] = Var[µ(Θi)] + E[v(Θi)/n] = a + v/n   (13.6)

13.3 Types of Bühlmann Problems
You should find that all of the problems found on this exam fit into one of three categories. The basic type is by far the easiest to solve; the other two (semi-parametric and non-parametric) are more challenging because the estimates for a and v must be made unbiased.
13.3.1 Basic
• You’re given fX|θ (X|θ) (model)
• You’re given π(θ) (prior)
• You’re given a set of n data points.
• Model, fX|θ (X|θ), provides the µ(θ) and v(θ) distributions.
• Prior, π(θ), provides µ, v, a
13.3.2 Semi-Parametric
• You’re given fX|θ (X|θ)
• You’re not given π(θ) (prior)
• Model provides unbiased estimates for µ(θ), v(θ)
• Sample statistics are used to estimate θi .
• Distribution of µ(θi) and v(θi) allows us to find µ, v, a
13.3.3 Non-parametric
• You’re given a set of data points: Xij for insured i = 1 . . . r and observations j = 1 . . . ni .
• Using the data alone, find a, v, µ using the equations provided by Loss Models.
13.4 Empirical Bayes Parameter Estimation

Previously, we have been provided with the distributions of µ(θ) and v(θ). In many cases we will want to estimate the true model distribution and the distributions will not be known, in which case we will have to estimate a, v and µ using only the observations. The credibility premium for next year's losses (per exposure unit) for policyholder i is Zi X̄i + (1 − Zi)µ̂.
13.4.1 Non-Parametric Estimation

µ̂ = X̄   (13.7)

v̂ = [Σ_{i=1}^{r} (ni − 1)]⁻¹ Σ_{i=1}^{r} Σ_{j=1}^{ni} mij (Xij − X̄i)²   (13.8)

c = ((r − 1)/r) [ Σ_{i=1}^{r} (mi/m)(1 − mi/m) ]⁻¹   (13.9)

â = c [ (r/(r − 1)) Σ_{i=1}^{r} (mi/m)(X̄i − X̄)² − v̂r/m ]   (13.10)

â = c ( Var(X̄i) − v̂r/m )   (13.11)

If mi = mj for all i and j then c = 1 (because [Σ_{i=1}^{r} (1/r)(1 − 1/r)]⁻¹ = [1 − 1/r]⁻¹ = r/(r − 1), which cancels the (r − 1)/r factor). Loss Models expands and simplifies the above expression into something that isn't as intuitive; I have chosen to use the above expression. The expression for a is from "A Course in Credibility Theory and its Applications" by Hans Bühlmann himself.
Definition 13.4. We can also define the method that preserves total losses (or the credibility-weighted average) estimate of µ by

µ̂ = Σ_{i=1}^{r} Zi X̄i / Σ_{i=1}^{r} Zi
13.4.2 Known µ
If µ is known then we can calculate a as

â = Σ_{i=1}^{r} (mi/m)(X̄i − µ)² − (r/m) v̂   (13.12)

If µ is known and the only data available is for policyholder i:

v̂i = Σ_{j=1}^{ni} mij (Xij − X̄i)² / (ni − 1);    âi = (X̄i − µ)² − v̂i/mi   (13.13)

13.4.3 Semi-Parametric Estimation
Assuming that the random variables Xij have a particular distribution can simplify some of the calculations. For example, the probability distribution for Xij might be Poisson or binomial. Then use the properties of the distribution to find µ and v. I have put a few examples in the table below (where µ = X̄).

Distribution      µ(θ)   µ(x̄)   v(θ)        v(x̄)
Poisson(θ)        θ      = x̄    θ           = x̄
Binomial(n, θ)    nθ     = x̄    nθ(1 − θ)   = x̄(1 − x̄/n)
Exponential(θ)    θ      = x̄    θ²          = x̄²
Gamma(α, θ)       αθ     = x̄    αθ²         = x̄²/α

13.4.4 Known Var(X)
One additional technique for finding a applies when all individual observations are known or when Var(X) is given. Using equation (13.5),

a = Var(X) − v

I haven't mentioned it elsewhere, but if Var(X) < v then a < 0. If this happens (it doesn't make sense), we assume a = 0, i.e., that the variance between groups is zero and the groups are indistinguishable.
Example 13.1. (SOA 233)

1. A region is comprised of three territories. Claims experience for Year 1 is as follows:

Territory   Number of Insureds   Number of Claims
A           10                   4
B           20                   5
C           30                   3

2. The number of claims for each insured each year has a Poisson distribution.

3. Each insured in a territory has the same expected claim frequency.

4. The number of insureds is constant over time for each territory.

Determine the Bühlmann-Straub empirical Bayes estimate of the credibility factor Z for Territory A.
Solution...

Territory   Number of Insureds   Number of Claims   X̄ = λ̂i
A           10                   4                  0.40
B           20                   5                  0.25
C           30                   3                  0.10
Total       60                   12                 0.20
The problem asks for the "empirical Bayes" estimate, which would suggest we should find the v's using equation (13.8). However, we are not given mij or Xij, so we need to find v another way. We are given the model distribution (Poisson), so we can find v using v = µ = λ, where λ = X̄i as per the maximum likelihood estimate of λ (note: I would consider this a semi-parametric problem; the SOA does not require you to state that explicitly). We do the same for µ and can find a using the usual definition.

µ = E[µ(λi)] = E[λi] = Σ_{i=1}^{3} X̄i (mi/m) = 0.4(10/60) + 0.25(20/60) + 0.1(30/60) = 12/60 = 0.2

v = E[v(λi)] = E[λi] = µ = 0.2
As we do not have a known distribution for the λ's, we will need to use the data to estimate a using (13.10):

c = (2/3)[(1/6)(5/6) + (1/3)(2/3) + (1/2)(1/2)]⁻¹ = (2/3)(18/11) = 12/11

a = (12/11)[(3/2)((1/6)(0.2)² + (1/3)(0.05)² + (1/2)(0.1)²) − 3(0.2)/60] = 0.0095455

ZA = mA/(mA + v/a) = 10/(10 + 0.2/0.0095455) = 0.3231
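The arithmetic in Example 13.1 can be packaged into a short routine; the function name and list layout are my own choices.

```python
def bs_credibility_factors(insureds, claims):
    """Semi-parametric Buhlmann-Straub with Poisson claims: mu = v = the overall
    rate; a from (13.10) with c = ((r-1)/r) * [sum (mi/m)(1 - mi/m)]^(-1)."""
    r, m = len(insureds), sum(insureds)
    xbar = [cl / mi for cl, mi in zip(claims, insureds)]
    mu = sum(claims) / m        # overall claim rate
    v = mu                      # Poisson: process variance equals the mean
    c = ((r - 1) / r) / sum((mi / m) * (1 - mi / m) for mi in insureds)
    a = c * ((r / (r - 1)) * sum((mi / m) * (x - mu) ** 2
                                 for mi, x in zip(insureds, xbar)) - v * r / m)
    return [mi / (mi + v / a) for mi in insureds]

z = bs_credibility_factors([10, 20, 30], [4, 5, 3])   # Z for territories A, B, C
```

Larger territories get larger credibility factors, as expected from Zi = mi/(mi + v/a).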
Example 13.2. (SOA 18)

1. Two risks have the following severity distributions:

Amount of Claim   Probability of Claim Amount for Risk 1   Probability of Claim Amount for Risk 2
250               0.5                                      0.7
2,500             0.3                                      0.2
60,000            0.2                                      0.1

2. Risk 1 is twice as likely to be observed as Risk 2.

A claim of 250 is observed.

Determine the Bühlmann credibility estimate of the second claim amount from the same risk.
Solution...

Point 1 is the "model" distribution given θ (f(X|θ)).
Point 2 is the "prior" distribution (π(θ)).

Since the table values are given, they are fixed and therefore we calculate "empirical" moments. There is no need to adjust estimates for bias.

Risk      mean          variance         Prior probability
Risk 1    12,875        556,140,625      2/3
Risk 2    6,675         316,738,125      1/3
Average   µ = 10,808    v = 476,339,791
          a = 8,542,222

Note: the above can be quickly calculated using the TI-30X MultiView. I suggest trying to create the tables and use the "stat" feature to generate the values; it's good practice. σx² = Variance.

k = 476,339,791/8,542,222 = 55.76

Estimate = (1/(1 + 55.76))(250) + (55.76/(1 + 55.76))(10,808) = 10,622
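Example 13.2 can also be reproduced in code; this is a sketch with names of my own.

```python
def buhlmann_estimate(amounts, probs_by_risk, prior, observed, n=1):
    """Buhlmann estimate Z*xbar + (1 - Z)*mu for discrete severity risks."""
    means, variances = [], []
    for probs in probs_by_risk:
        m1 = sum(p * x for p, x in zip(probs, amounts))
        m2 = sum(p * x * x for p, x in zip(probs, amounts))
        means.append(m1)
        variances.append(m2 - m1 ** 2)
    mu = sum(pr * m for pr, m in zip(prior, means))                 # overall mean
    v = sum(pr * s for pr, s in zip(prior, variances))              # E[process var]
    a = sum(pr * (m - mu) ** 2 for pr, m in zip(prior, means))      # Var[hyp. means]
    z = n / (n + v / a)
    return z * observed + (1 - z) * mu

est = buhlmann_estimate([250, 2500, 60000],
                        [[0.5, 0.3, 0.2], [0.7, 0.2, 0.1]],
                        prior=[2/3, 1/3], observed=250)
```

With k near 55.8 the single observation of 250 barely moves the estimate away from the overall mean.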
14 Simulation

14.1 Basics
For the purposes of simulation I will be using a common programming function called rand(min, max), where
the function returns a random (decimal) value between min and max using the uniform distribution. Typically
I will use rand(0, 1) as we will be dealing with probabilities and it will have a subscript “i” if we are generating
random integers. One of the most important equations when doing simulation is the inverse of the cdf
Definition 14.1. The inverse transform method of simulation is defined as

x = FX⁻¹(rand(0, 1))                     (Continuous)
F(xj−1) ≤ rand(0, 1) < F(xj)             (Discrete)
The above equation is very simple but very powerful. For example, if we are dealing with a uniform distribution on (0, 1), then the cdf is FX(x) = x and rand(0, 1) = x; so if rand(0, 1) = 0.3 then x = 0.3. The normal tables can help us find x if X is normal. There are millions of ways of doing problems using just the inverse transform method. Generally speaking, it is best to avoid doing intermediate calculations, as they complicate the FX(x) function and make finding an inverse more difficult. Below you will find a few techniques the SOA might use.
14.2 Simulation for some common models

14.2.1 Exponential Distribution

FX(x) = 1 − e^{−λx}

x = −(1/λ) ln(1 − rand(0, 1))   (14.1)
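Equation (14.1) is a one-liner in code (the function name is mine):

```python
import math
import random

def sim_exponential(lam, u=None):
    """Invert F(x) = 1 - exp(-lam*x): x = -(1/lam) * ln(1 - u)."""
    if u is None:
        u = random.random()       # rand(0, 1)
    return -math.log(1 - u) / lam

x = sim_exponential(lam=1 / 1000, u=0.5)   # median of an exponential with mean 1000
```

Feeding u = 0.5 recovers the median, 1000 ln 2 ≈ 693.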
14.2.2 Poisson Distribution

Keep in mind that the times between successive Poisson events are distributed exponentially, meaning we can simulate the Poisson distribution in two ways:

px = e^{−λ}λ^x/x!

The standard solution is:

x = max{ n : Σ_{i=0}^{n} e^{−λ}λ^i/i! ≤ rand(0, 1) }   (14.2)

However, we can also use a method that simulates the time between events individually; we count up events until we "run out of time", or

N = max{ n : ∏_{i=1}^{n} randi(0, 1) ≥ e^{−λ} }
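Both Poisson methods can be sketched as below. The first follows the convention of equation (14.2) as written (the largest n whose cdf does not exceed the uniform); function names are mine.

```python
import math

def sim_poisson_cdf(lam, u):
    """Largest n with F(n) <= u, following equation (14.2); 0 if F(0) > u."""
    n, term, cdf = 0, math.exp(-lam), math.exp(-lam)
    while True:
        term *= lam / (n + 1)       # Poisson probability p_{n+1}
        if cdf + term > u:
            return n
        n += 1
        cdf += term

def sim_poisson_product(lam, uniforms):
    """Product method: count factors until the running product drops below e^{-lam}."""
    n, prod, limit = 0, 1.0, math.exp(-lam)
    for u in uniforms:
        prod *= u
        if prod < limit:
            return n
        n += 1
    return n

k = sim_poisson_cdf(1.0, 0.765)   # matches Example 14.1 below: n = 1
```

For λ = 1 and u = 0.765, F(1) = 0.736 ≤ 0.765 < F(2) = 0.920, so the count is 1.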
14.3 Estimating Mean or a Probability

This is where simulation becomes useful. Once we can generate values of xi from a distribution we are able to calculate statistics, some of which might not be easy (or even possible) to obtain using standard mathematical techniques. It is often easier to simulate the problem by creating a sample that is large enough to produce a reasonable solution. We can use the equations we used for full credibility to determine the correct number of simulations:

Pr[|X̄ − µ| ≤ 0.05µ] = 0.9

The standard error for the above problem is σ/√n, and we want the standard error multiplied by the Z score, Z0.9 σ/√n, to be below 0.05µ. That is, Z0.9 σ/√n < 0.05µ (here Zx is the inverse of the standard normal cdf at x), so

n ≥ (Z0.9 σ/(0.05µ))²   (14.3)
If we want to calculate a probability, the same logic applies (Pn is the sum of all positive outcomes and Qn = Pn/n):

Pr[|Qn − p| ≤ 0.05p] = 0.9

n ≥ (Z0.9/0.05)² (n − Pn)/Pn   (14.4)
14.4 Compound Models

I have found that the SOA likes to use compound distributions in their questions, and as such I am including a short explanation of simulating these models, for example when we have a claim count distribution and a claim severity distribution.
Example 14.1. Assume that the claim count follows a Poisson distribution (λ = 1) and the claim severity follows a normal distribution (µ = 10,000; σ = 2,500). What is the sum of all claims given these random numbers: {0.765, 0.01, 0.256, 0.912, 0.562}?

n = max{ n : Σ_{i=0}^{n} e^{−1}/i! ≤ 0.765 } = 1

x1 = 10,000 − 2.326 × 2,500 = 4,184

S = Σ_{i=1}^{n} xi = 4,184

The extra random numbers can be tossed, as they are unnecessary.
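Example 14.1 can be simulated end to end. This sketch (function name mine) uses the stdlib NormalDist for the severity inverse cdf rather than a printed normal table.

```python
import math
from statistics import NormalDist

def simulate_aggregate(uniforms, lam, mu, sigma):
    """One draw of S = X1 + ... + XN for a compound Poisson-normal model:
    the first uniform drives the claim count N (largest n with F(n) <= u,
    as in equation 14.2); the next N uniforms drive the severities."""
    u = uniforms[0]
    n, term, cdf = 0, math.exp(-lam), math.exp(-lam)
    while True:
        term *= lam / (n + 1)        # Poisson probability p_{n+1}
        if cdf + term > u:
            break
        n += 1
        cdf += term
    sev = NormalDist(mu, sigma)
    return sum(sev.inv_cdf(v) for v in uniforms[1:1 + n])

s = simulate_aggregate([0.765, 0.01, 0.256, 0.912, 0.562],
                       lam=1, mu=10_000, sigma=2_500)
```

The exact inverse cdf gives z = −2.3263 for u = 0.01, so the result agrees with the 4,184 above.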
14.5 Bootstrap Method

In statistics, bootstrapping is a modern, computer-intensive, general-purpose approach to statistical inference, falling within a broader class of re-sampling methods. Bootstrapping is the practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution. One standard choice for an approximating distribution is the empirical distribution of the observed data. In the case where a set of observations can be assumed to be from an independent and identically distributed population, this can be implemented by constructing a number of re-samples of the observed dataset (and of equal size to the observed dataset), each of which is obtained by random sampling with replacement from the original dataset.

It may also be used for constructing hypothesis tests. It is often used as an alternative to inference based on parametric assumptions when those assumptions are in doubt, or where parametric inference is impossible or requires very complicated formulas for the calculation of standard errors. The advantage of bootstrapping over analytical methods is its great simplicity: it is straightforward to apply the bootstrap to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution, such as percentile points, proportions, odds ratios, and correlation coefficients. The disadvantage of bootstrapping is that while (under some conditions) it is asymptotically consistent, it does not provide general finite-sample guarantees, and has a tendency to be overly optimistic. The apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis (e.g. independence of samples) where these would be more formally stated in other approaches. [Wikipedia, July 29, 2009]
Definition 14.2. Assume that we have a data set y = {y₁, …, yₙ} and a statistic, θ, from the empirical distribution function. The bootstrap estimate of the mean squared error of the estimator θ̂ = g(y) is computed from m resamples (where m could be all possible samples, or a subset of all possible samples):

xᵢⱼ = y_randᵢ(1,n),  i = 1, …, m;  j = 1, …, n

θ̂ᵢ = g(xᵢ)

MSE(θ̂) = (1/m) Σᵢ₌₁ᵐ (θ̂ᵢ − θ)² = Var(θ̂) + bias²(θ̂)   (14.5)

bias(θ̂) = E[θ̂|θ] − θ
Some common choices for g(y):
1. Mean: g(y) = (1/n) Σᵢ₌₁ⁿ yᵢ
2. Variance: g(y) = (1/(n−1)) Σᵢ₌₁ⁿ (yᵢ − ȳ)²
3. Probability X > c: g(y) = 1 − F(c), where F is determined using an MLE procedure, method of moments, etc.
4. If Bias = 0, then MSE = Var.

Technically g(y) could be any function of all the observations in the sample. Also, if we are finding the MSE of the mean, we already know the answer: MSE(µ̂) = σ²/n, where σ² is the empirical variance. If you want an example please refer to SOA sample #144 (note: g(X) = E(X ∧ d)/E(X)).
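A minimal sketch of the resampling recipe in Definition 14.2 (the function and data are mine; it uses a Monte Carlo subset of resamples rather than all nⁿ possible ones):

```python
import random
import statistics

def bootstrap_mse(y, g, theta, m=10_000, seed=1):
    """Average of (g(resample) - theta)^2 over m resamples drawn
    with replacement from y, as in equation (14.5)."""
    rng = random.Random(seed)
    n = len(y)
    return sum((g([rng.choice(y) for _ in range(n)]) - theta) ** 2
               for _ in range(m)) / m

y = [1, 3, 4, 7]
theta = statistics.fmean(y)                # mean of the empirical distribution
mse = bootstrap_mse(y, statistics.fmean, theta)
exact = statistics.pvariance(y) / len(y)   # sigma^2/n = 1.171875 for the mean
```

With enough resamples `mse` converges to `exact`, illustrating the σ²/n shortcut for the mean.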
A Key Equations to Memorize

∫₀^∞ xᵃ e^(−cx) dx = Γ(a + 1)/c^(a+1) = a!/c^(a+1),  'a' an integer   (A.1)

∫₀^∞ e^(−c/x)/xᵏ dx = Γ(k − 1)/c^(k−1), k > 1;  = (k − 2)!/c^(k−1), k ≥ 2   (A.2)

Σₖ₌₀^∞ xᵏ/k! = eˣ   (A.3)

B Calculator Tips
I won't mention it elsewhere, but when in doubt about how to proceed to the next screen or accept a value, you must hit "enter".
B.1 Table

Get used to this feature. You get the ability to type any function, and the table feature will create a table of values for that function. Useful for getting Poisson/Binomial values quickly without retyping the equation. In order to enter the input x into the function you need to use the [x y z t a b c] key found near the lower left corner of the calculator. You'll also have to choose an initial value and the spacing between points. Note: there is also an "Ask" option that will allow you to enter specific x values of your choosing.
Example B.1. Ok, say we are looking at SOA problem 172, where we want to do a K-S test with the null hypothesis that

F(x) = 1 − 1/(1 + x)⁴

and our data values are 0.2, 0.7, 0.9, 1.1, and 1.3. On the Multiview, you can push the Table button; it asks for y as a function of x, so you type in y = F(x), and then it asks for the increments. Set that to manual, and then on the table you just type in the 5 x values and it tells you the 5 values of F(x). There is an old exam problem with 10 data values, so you save even more time there. You can do the same thing with the inversion method.
David Revelle from TIA
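For comparison, the same fitted values sketched in code (the K-S statistic line is my own addition, not part of the calculator recipe):

```python
# fitted CDF from the null hypothesis above
F = lambda x: 1 - (1 + x) ** -4
data = [0.2, 0.7, 0.9, 1.1, 1.3]
fitted = [F(x) for x in data]      # the five values the table feature gives
n = len(data)
# K-S statistic: largest gap between F and the empirical step function
D = max(max(abs(f - i / n), abs(f - (i + 1) / n))
        for i, f in enumerate(fitted))
```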
B.2 Data or Mini-Spreadsheet
This is where the calculator gets the most power. The basic features are the same as on most standard scientific calculators: you enter values and then use the "stat" option to get general statistics about the datasets, but the Multiview has the added benefit that you can see what you're doing while you're doing it. You can also create functions of the data you entered.
The ability to create functions of your data is really what sets this calculator apart. Try making two columns, L1 and L2, with any data you like. Now slide the "cursor" over to column L3. Hitting the "data" key will get you to a new screen with options to clear data. If you tab over to the "FORMULA" tab you'll get access to the "function" features. Generally speaking you'll just use the first option: "Add/Edit...". If you select that you'll get a line on the bottom of the screen that shows: "L3=". To get access to the "L1" and "L2" column data you have to hit the "data" key again (yes, I know we've hit it 3x now). You now have access to the three column variables: "L1", "L2" and "L3". Enter the equation "L1+L2" and hit "Enter". Then the result
will show up in L3. The calculator computes the table quite slowly: each time you add one value it will update the columns that depend on that value (this can take a few seconds).
Also, you can use a column as a frequency variable when calculating statistics in the "stat" screen. So if you get a table like:

x         0    1    2    3
P(X = x)  0.1  0.2  0.3  0.4

just enter the two columns in L1 and L2 and calculate the "1-Var stat" with DATA as "L1" and FRQ as "L2"; the calculator will calculate the mean and variance for you.
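The same frequency-table calculation, sketched in code as a check on the calculator's 1-Var output:

```python
xs = [0, 1, 2, 3]
ps = [0.1, 0.2, 0.3, 0.4]               # probabilities used as frequencies (FRQ)
mean = sum(x * p for x, p in zip(xs, ps))
var = sum(x * x * p for x, p in zip(xs, ps)) - mean ** 2
# mean = 2.0, variance = 1.0
```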
Example B.2. Instead of entering a long equation that might include typos, I use the [data] lists to calculate and sum my Chi-square test statistic.
1. [data] → L1: enter your expected occurrences
2. L2: enter your observed occurrences
3. With your cursor in L3, [data] → FORMULA → Add/Edit Frmla
4. Enter: ([data] → L2 − [data] → L1)∧2/[data] → L1. When you are done it should look like L3=(L2−L1)∧2/L1
5. [2nd][stat] → 1-Var Stats → DATA=L3 FRQ=ONE → CALC
6. Find Σx and compare it to the Chi-square table
no driver @ Actuarial Outpost
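The same chi-square sum, sketched with hypothetical counts (the numbers are mine, only to show the (O − E)²/E pattern built in steps 4–6):

```python
expected = [10, 20, 30, 40]   # L1 in the recipe above (hypothetical)
observed = [12, 15, 33, 40]   # L2 (hypothetical)
l3 = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2 = sum(l3)                # the "sum x" statistic; 1.95 for these counts
```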
Example B.3. This feature really shines on SOA 144. I can solve this problem very quickly with this calculator. First, find E[(X − 100)₊]/E[X] for each of the 10 simulations. You can do this in your head pretty easily. Next, input these values into column L1. In L2, use the formula ((1 − L1) − 0.125)². Then go to stat and find the average of L2, and that is your answer.
uclatommy @ Actuarial Outpost
C Calculator Integration
One thing I love to do with integrals is to use an approximating technique rather than, say, integration by parts (which I always make small mistakes on). Table mode allows me to get a lot of function results without having to edit the values all the time, so I can find

∫₀^∞ (x³ e^(−x/100)/100³) dx

using Simpson's rule with step size 125, going from 0 to 1250.
1. Type the x values you want to use into L1 (find out what the function looks like using table mode)
2. Type the sequence 1, 4, 2, 4, 2, …, 2, 4, 1 into L2.
3. Click [data] → go to column L3 → [data] → Formula → Add/Edit Frmla
4. Type the function; when you need an "x" value click [data] → L1
5. When done hit [enter]
6. Go to the Stat screen "1-Var stats" and select the function values as DATA: L3 and FRQ as the 1, 4, 2, … values, or L2
Then

∫₀^∞ f(x) dx ≈ (step size × Σx)/3

where Σx is the weighted sum from the stat screen. Be careful with this technique: make sure your spacing is such that each step covers only about 10% of the area where the function is concentrated. If your steps are too large you'll get the wrong answer. [For example, a step size of 156.25 instead would give an answer of 608; step size = 208.3 would give 626.]
Of course, the solution to the above integral should be memorized:

3! × 100⁴/100³ = 600
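Here is the same Simpson computation sketched in code (same integrand, step size 125, nodes 0 to 1250), confirming the approximation lands essentially on the memorized answer of 600:

```python
import math

f = lambda x: x ** 3 * math.exp(-x / 100) / 100 ** 3
h, m = 125, 10                                   # step size, number of intervals
xs = [i * h for i in range(m + 1)]
weights = [1] + [4, 2] * (m // 2 - 1) + [4, 1]   # 1, 4, 2, ..., 2, 4, 1
approx = h / 3 * sum(w * f(x) for w, x in zip(weights, xs))
# approx is about 599.9, versus the exact 3! * 100 = 600
```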
D Exact Credibility
The term exact credibility is used to describe the situation when the credibility premium equals the Bayesian premium.
Bühlmann credibility is an approximation of the Bayesian premium. Bayesian quantities are calculated with integrals and sums which are often difficult or impossible to do without a computer. When the model is a member of the linear exponential family and a conjugate prior is used as the prior, then Bühlmann = Bayesian. (If that sounds like something I memorized out of a study manual, it probably is.)
The majority of the conjugate priors commonly used by the SOA are examples of situations where Bayesian = Bühlmann, so you can simplify Bayesian problems or Bühlmann problems by using either method.
The point here is to show that the Bühlmann credibility calculations produce the same results as the Bayesian (in specific cases). They provide good examples to practice with. But also, if you forget the conjugate priors for any of the combinations below, you can generate them reasonably easily using the Bühlmann technique (these are easier if you're given the variables in advance, by the way).
Poisson-Gamma

E(X|λ) = λ
Var(X|λ) = λ
µ = E[E(X|λ)] = E[λ] = αθ
v = E[Var(X|λ)] = E[λ] = αθ = µ
a = Var[E(X|λ)] = Var(λ) = αθ²
k = v/a = 1/θ
Z = n/(n + 1/θ)

E(xₙ₊₁) = x̄ · n/(n + 1/θ) + αθ · (1/θ)/(n + 1/θ)
        = (x̄n + αθ/θ)/(n + 1/θ)
        = θ(x̄n + α)/(θn + 1)
        = θ′α′

New Gamma dist: θ′ = θ/(θn + 1), α′ = nx̄ + α.

This looks quite time consuming, but can probably be done in under 30 seconds...
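A quick numeric check of the Poisson-Gamma case with made-up numbers (α = 3, θ = 0.5, five observations averaging 0.8): the Bühlmann premium and the posterior gamma mean agree exactly.

```python
alpha, theta = 3.0, 0.5      # hypothetical prior parameters
n, xbar = 5, 0.8             # hypothetical sample

# Buhlmann side: k = 1/theta, Z = n/(n + k)
k = 1 / theta
Z = n / (n + k)
buhlmann = Z * xbar + (1 - Z) * alpha * theta

# Bayesian side: posterior Gamma with alpha' = n*xbar + alpha, theta' = theta/(theta*n + 1)
bayes = (n * xbar + alpha) * theta / (theta * n + 1)
# both equal 1.0 here
```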
Exponential(λ)-Inverse.Gamma(α, θ)

E(X|λ) = λ
Var(X|λ) = λ²
µ = E[E(X|λ)] = E[λ] = θ/(α − 1)
v = E[Var(X|λ)] = E[λ²] = θ²/((α − 1)(α − 2))
a = Var[E(X|λ)] = Var(λ) = θ²/((α − 1)²(α − 2))
k = v/a = α − 1
Z = n/(n + α − 1)

E(xₙ₊₁) = x̄ · n/(n + α − 1) + θ/(α − 1) · (α − 1)/(n + α − 1)
        = (x̄n + θ)/(n + α − 1)
        = θ′/(α′ − 1)

New Inverse.Gamma dist: θ′ = θ + x̄n, α′ = α + n.
Normal(λ, a²) - λ = Normal(µ, σ²)

E(X|λ) = λ
Var(X|λ) = a²
µ = E[E(X|λ)] = E[λ] = µ
v = E[Var(X|λ)] = E[a²] = a²
a = Var[E(X|λ)] = Var(λ) = σ²
k = a²/σ²
Z = n/(n + a²/σ²)

µ′ = E(xₙ₊₁) = x̄ · n/(n + a²/σ²) + µ · (a²/σ²)/(n + a²/σ²)
   = (x̄n + µa²/σ²)/(n + a²/σ²)
   = (x̄n/a² + µ/σ²)/(n/a² + 1/σ²)

Note: σ′² cannot be solved using this method as E(X) does not depend on σ.
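And the Normal-Normal check with made-up numbers (a² = 4, σ² = 1, µ = 10, three observations averaging 12):

```python
a2, sigma2, mu = 4.0, 1.0, 10.0   # hypothetical process variance, prior variance, prior mean
n, xbar = 3, 12.0                 # hypothetical sample

k = a2 / sigma2
Z = n / (n + k)
buhlmann = Z * xbar + (1 - Z) * mu

# posterior mean in the (x-bar*n/a^2 + mu/sigma^2)/(n/a^2 + 1/sigma^2) form
bayes = (xbar * n / a2 + mu / sigma2) / (n / a2 + 1 / sigma2)
# both equal 76/7, about 10.857
```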
Comment about Inv.Exponential-Gamma: you cannot use Bühlmann to solve these problems, as E(X|λ) does not exist. However, the two functions in question are just inverse transformed functions. So by transforming θ′ = 1/θ and x′ᵢ = 1/xᵢ you will actually get Exponential-Inv.Gamma, which is what the solution is (and transform back when you're done).
Binomial(m, q)-q=Beta(α, β)

E(X|q) = mq
Var(X|q) = mq(1 − q)
µ = E[E(X|q)] = E[mq] = m · α/(α + β)
v = E[Var(X|q)] = E[mq(1 − q)] = mE(q) − mE(q²)
  = m · α/(α + β) − m · α(α + 1)/((α + β)(α + β + 1))
  = ⋯ = m · αβ/((α + β)(α + β + 1))
a = Var[E(X|q)] = Var(mq) = m²E(q²) − m²E(q)²
  = m² · α(α + 1)/((α + β)(α + β + 1)) − m² · α²/(α + β)²
  = ⋯ = m² · αβ/((α + β)²(α + β + 1))
k = v/a = (α + β)/m
Z = n/(n + (α + β)/m)

mq′ = E(xₙ₊₁) = x̄ · n/(n + (α + β)/m) + µ · ((α + β)/m)/(n + (α + β)/m)
    = (x̄n + m · α/(α + β) · (α + β)/m)/(n + (α + β)/m)
    = (x̄n + α)/(n + (α + β)/m)
    = (mx̄n + mα)/(mn + α + β)
    = m · α′/(α′ + β′) = m(x̄n + α)/(mn + α + β)

It should be clear at least that α′ = x̄n + α, which results in a denominator that looks like:

α′ + β′ = x̄n + α + β′ = mn + α + β
β′ = mn + α + β − (x̄n + α) = mn + β − x̄n

(The ⋯ means I'm too lazy to post the steps; actually I cheated.)
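Same check for the Binomial-Beta case with made-up numbers (m = 3 trials per year, Beta(2, 5) prior, four years totaling 5 claims): the Bühlmann estimate matches the posterior mean mα′/(α′ + β′).

```python
m, alpha, beta = 3, 2.0, 5.0     # hypothetical: 3 trials/year, Beta(2, 5) prior
n, xbar = 4, 1.25                # hypothetical: 4 years, 5 claims total

k = (alpha + beta) / m
Z = n / (n + k)
mu = m * alpha / (alpha + beta)
buhlmann = Z * xbar + (1 - Z) * mu

a_post = xbar * n + alpha          # alpha' = x-bar*n + alpha = 7
b_post = m * n + beta - xbar * n   # beta'  = mn + beta - x-bar*n = 12
bayes = m * a_post / (a_post + b_post)   # 21/19
```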
E Beta Distribution

f(x) = Γ(a + b)/(Γ(a)Γ(b)) · uᵃ(1 − u)^(b−1) · (1/x),  0 < x < θ,  u = x/θ
     = (a + b − 1)!/((a − 1)!(b − 1)!) · uᵃ(1 − u)^(b−1) · (1/x),  0 < x < θ,  u = x/θ
     = (a + b − 1)!/((a − 1)!(b − 1)!) · x^(a−1)(1 − x)^(b−1),  0 < x < 1
In general, moments of the beta distribution are a nuisance to calculate because of the number of factorials or gamma functions. Here's the special case for a and b as integers:

E(X) = θ · a/(a + b)
E(X²) = θ² · a(a + 1)/((a + b)(a + b + 1))
Var(X) = θ² · a(a + 1)/((a + b)(a + b + 1)) − θ² · a²/(a + b)²
       = θ² · ab/((a + b)²(a + b + 1))

The beta distribution is extremely flexible. I would suggest playing around in some graphing program so you can see what sorts of functions you can create. If you ever get a Bayesian problem you'll probably have a beta prior. However, some simple beta densities aren't recognized that quickly:
E.1 beta (b = 1)

f(x) = ax^(a−1),  0 < x < 1

SOA likes this function.
Example E.1. You are given:
1. The annual number of claims for a policyholder has a binomial distribution with probability function:

f(x|q) = C(2, x) qˣ(1 − q)^(2−x)   (E.1)

The prior distribution is:

π(q) = 4q³,  0 < q < 1

2. This policyholder had one claim in each of Years 1 and 2.

Determine the Bayesian estimate of the number of claims in Year 3.
Solution... Using the conjugate prior: m = 2, n = 2, Σxᵢ = 2, a = 4, b = 1 (only to be used when θ = 1).
f(q|x) = beta(a′ = a + Σxᵢ, b′ = b + nm − Σxᵢ)
       = beta(a′ = 4 + 2, b′ = 1 + 4 − 2) = beta(a′ = 6, b′ = 3)

The mean of q (using the posterior parameters):

E[q] = (a + b − 1)! a!/((a − 1)!(a + b)!) = a/(a + b) = 6/9 = 2/3

The actual average is mE[q], or 4/3 = 1.33̄.
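Example E.1 can also be checked without conjugacy, by integrating likelihood × prior numerically (a brute-force sketch; the grid size is arbitrary):

```python
from math import comb

like = lambda q: (comb(2, 1) * q * (1 - q)) ** 2   # one claim in each of 2 years
prior = lambda q: 4 * q ** 3
N = 100_000
qs = [(i + 0.5) / N for i in range(N)]             # midpoint grid on (0, 1)
w = [like(q) * prior(q) for q in qs]
post_mean_q = sum(q * wi for q, wi in zip(qs, w)) / sum(w)   # about 2/3
claims = 2 * post_mean_q                           # m * E[q | data], about 4/3
```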
E.2 beta (b = 1, a = 1)
A rather special case of the beta distribution that many might not recognize: the uniform distribution is a beta distribution with a = 1 and b = 1.

f(x) = Γ(a + b)/(Γ(a)Γ(b)) · (x/θ)ᵃ(1 − x/θ)^(b−1) · (1/x)
     = 1/θ,  0 < x < θ

With moments:

E(X) = θ · a/(a + b) = θ/2
Var(X) = θ² · a(a + 1)/((a + b)(a + b + 1)) − θ² · a²/(a + b)²
       = θ² · ab/((a + b)²(a + b + 1)) = θ² · 1/(2² · 3) = θ²/12
Example E.2. In a portfolio of insureds, each insured will have either 0 or 1 claim in a year, with independence from one year to another. The probability that an individual will have a claim in a given year is x. The portfolio of insureds is such that for a randomly chosen individual from the portfolio, the probability x is uniformly distributed on (0, 1). A randomly chosen individual is found to have no claims in n consecutive years, where n ≥ 1. Determine the expected number of claims that the individual will have in the (n + 1)-st year.
A) 1/(n − 2);  B) 1/(n − 1);  C) 1/n;  D) 1/(n + 1);  E) 1/(n + 2)
Solution...

f(q|x) = beta(a′ = a + Σxᵢ, b′ = b + nm − Σxᵢ)
       = beta(a′ = 1 + 0, b′ = 1 + n·1 − 0) = beta(a′ = 1, b′ = 1 + n)

The mean of q:

E[q] = a/(a + b) = 1/(n + 2)

So the answer is E.
Index
(a,b,0) class of distributions, 15
(a,b,1) class of distributions, 15
aggregate loss random variable, 18
asymptotically unbiased, 23
data summary, 25
data-dependent distribution, 8, 25
death time, 25
Delta Method, 33
density function, 4
distribution function, 4
double-expectation formulas, 42
Bühlmann credibility factor, 45
Bühlmann model, 45
Bühlmann’s k, 45
Bühlmann-Straub model, 45
beta distribution, 57
bias, 23
binomial distribution, 15
bootstrap estimate, 51
empirical distribution, 25
empirical distribution function, 25
empirical model, 5
empirical survival function, 25
entry time, 25
equations to memorize, 52
equilibrium distribution, 9
excess loss variable, 6
expected value of the hypothetical means, 45
expected value of the process variance, 45
calculator tips, 52
canonical parameter, 13
cdf, 4
censored
censored observation, 5
left censored and shifted, 6
right censored, 5
time, 25
truncated observation, 5
central moment, 5
claim count distribution, 18
claim count random variable, 18
claims ratio, 44
coefficient of variation, 5
coherent risk measure, 9
collective premium, 45
collective risk model, 18
complete data, 32
complete expectation of life, 6
compound distribution, 18
confidence interval, 23
confidence region, 34
conjugate prior
Binomial-Beta, 56
Exponential-Inverse.Gamma, 55
Inv.Exponential-Gamma, 55
Normal-Normal, 55
Poisson-Gamma, 54
conjugate prior distribution, 43
consistent, 23
credibility coefficient, 45
credibility interval, 42
credibility-weighted average, 46
critical values, 23
cumulative distribution function, 4
cumulative hazard rate function, 26
failure rate, 4
force of mortality, 4
frailty model, 11
frailty random variable, 11
franchise deductible, 16
frequency distribution, 18
Full Credibility, 37
gamma function, 10
gamma kernel, 30
geometric distribution, 14
Greenwood Approximation, 26
hazard rate, 4
histogram, 27
HPD credibility set, 42
hypothesis test, 23
hypothetical mean, 45
incomplete gamma function, 10
individual risk model, 18
individual-loss random variables, 18
inverse, 10
inverse transformed, 10
inverse transformed method, 49
is not rejected, 23
is rejected, 23
joint distribution, 41
k-component spliced distribution, 12
k-point mixture, 8
Kaplan-Meier product-limit Estimator, 26
kernel density estimator, 30
kernel smoothed distribution, 25
D(x) plot, 35
data set, 25
kurtosis, 5
probability mass function, 4
process variance, 45
likelihood function, 31
likelihood ratio test, 36
limited expected value, 5
limited loss variable, 5
linear exponential family, 13
log-transformed confidence interval, 24
loglikelihood function, 31
loss elimination ratio, 17
quantity at risk, 25
quantity of deaths, 25
raw moment, 5
rejection region, 23
scale distribution, 8
scale parameter, 8
Schwarz Bayesian Criterion, 36
severity distribution, 18
shifted, 6
significance level, 24
single-loss random variables, 18
skewness, 5
smoothed empirical estimate, 31
standard deviation, 5
stop-loss insurance, 19
survival function, 4
marginal distribution, 41
marginal probabilities, 40
maximum likelihood estimate, 31
censored data, 32
deductible data, 32
Information (Variance), 32
mean, 5
mean excess loss function, 6
mean residual life function, 6
mean-squared error (MSE), 23
median, 6
method that preserves total losses, 46
method-of-moments estimate, 31
mixture distribution, 11
model distribution, 41
model events, 40
moment generating function, 6
Tail-Value-at-Risk, 10
test statistic, 23
total loss random variable, 18
transformed, 10
triangular kernel, 30
truncated
left truncated, 6
truncated with zeros, 15
zero-truncated, 15
n-fold convolution, 20
negative binomial distribution, 14
Nelson-Åalen estimate, 26
net stop-loss premium, 19
normalizing constant, 13
number of claims, 18
unbiased, 23
uniform kernel, 30
uniformly minimum variance unbiased estimator, 23
uniformly most powerful, 24
ogive, 27
ordinary deductible, 16
Value-at-Risk, 9
variable-component mixture distribution, 8
variance, 5
variance of the hypothetical means, 45
p-p plot, 35
p-value, 24
parameters, 8, 25
parametric distribution, 8, 25
parametric distribution family, 8
Partial Credibility, 39
pdf, 4
per-loss, 16
per-payment, 16
percentile, 6
percentile matching estimate, 31
posterior distribution, 40, 41
predictive distribution, 41
prior distribution, 40
probability density function, 4
probability function, 4
probability generating function, 6
weakly consistent, 23
zero-modified, 15