STATISTICS 450/850
Estimation and Hypothesis
Testing
Supplementary Lecture Notes
Don L. McLeish and Cyntha A. Struthers
Dept. of Statistics and Actuarial Science
University of Waterloo
Waterloo, Ontario, Canada
Winter 2013
Contents

1 Properties of Estimators
  1.1 Prerequisite Material
  1.2 Introduction
  1.3 Unbiasedness and Mean Square Error
  1.4 Sufficiency
  1.5 Minimal Sufficiency
  1.6 Completeness
  1.7 The Exponential Family
  1.8 Ancillarity

2 Maximum Likelihood Estimation
  2.1 Maximum Likelihood Method - One Parameter
  2.2 Principles of Inference
  2.3 Properties of the Score and Information - Regular Model
  2.4 Maximum Likelihood Method - Multiparameter
  2.5 Incomplete Data and The E.M. Algorithm
  2.6 The Information Inequality
  2.7 Asymptotic Properties of M.L. Estimators - One Parameter
  2.8 Interval Estimators
  2.9 Asymptotic Properties of M.L. Estimators - Multiparameter
  2.10 Nuisance Parameters and M.L. Estimation
  2.11 Problems with M.L. Estimators
  2.12 Historical Notes

3 Other Methods of Estimation
  3.1 Best Linear Unbiased Estimators
  3.2 Equivariant Estimators
  3.3 Estimating Equations
  3.4 Bayes Estimation

4 Hypothesis Tests
  4.1 Introduction
  4.2 Uniformly Most Powerful Tests
  4.3 Locally Most Powerful Tests
  4.4 Likelihood Ratio Tests
  4.5 Score and Maximum Likelihood Tests
  4.6 Bayesian Hypothesis Tests

5 Appendix
  5.1 Inequalities and Useful Results
  5.2 Distributional Results
  5.3 Limiting Distributions
  5.4 Proofs
Chapter 1
Properties of Estimators
1.1 Prerequisite Material
The following topics should be reviewed:
1. Tables of special discrete and continuous distributions including the
multivariate normal distribution. Location and scale parameters.
2. Distribution of a transformation of one or more random variables
including change of variable(s).
3. Moment generating function of one or more random variables.
4. Multiple linear regression.
5. Limiting distributions: convergence in probability and convergence in
distribution.
1.2 Introduction
Before beginning a discussion of estimation procedures, we assume that
we have designed and conducted a suitable experiment and collected data
X1 , . . . , Xn , where n, the sample size, is fixed and known. These data
are expected to be relevant to estimating a quantity of interest θ which
we assume is a statistical parameter, for example, the mean of a normal
distribution. We assume we have adopted a model which specifies the link
between the parameter θ and the data we obtained. The model is the
framework within which we discuss the properties of our estimators. Our
model might specify that the observations X1 , . . . , Xn are independent with
a normal distribution, mean θ and known variance σ2 = 1. Usually, as here,
the only unknown is the parameter θ. We have specified completely the joint
distribution of the observations up to this unknown parameter.
1.2.1 Note
We will sometimes denote our data more compactly by the random vector
X = (X1 , . . . , Xn ).
The model, therefore, can be written in the form {f (x; θ) ; θ ∈ Ω} where
Ω is the parameter space or set of permissible values of the parameter and
f (x; θ) is the probability (density) function.
1.2.2 Definition
A statistic, T (X), is a function of the data X which does not depend on
the unknown parameter θ.
Note that although a statistic, T (X), is not a function of θ, its distribution can depend on θ.
An estimator is a statistic considered for the purpose of estimating a
given parameter. It is our aim to find a “good” estimator of the parameter
θ.
In the search for good estimators of θ it is often useful to know if θ is a
location or scale parameter.
1.2.3 Location and Scale Parameters
Suppose X is a continuous random variable with p.d.f. f (x; θ).
Let F0 (x) = F (x; θ = 0) and f0 (x) = f (x; θ = 0). The parameter θ is
called a location parameter of the distribution if
F (x; θ) = F0 (x − θ),   θ ∈ ℝ
or equivalently
f (x; θ) = f0 (x − θ),   θ ∈ ℝ.
Let F1 (x) = F (x; θ = 1) and f1 (x) = f (x; θ = 1). The parameter θ is
called a scale parameter of the distribution if
F (x; θ) = F1 (x/θ),   θ > 0
or equivalently
f (x; θ) = (1/θ) f1 (x/θ),   θ > 0.

1.2.4 Problem
(1) If X ∼ EXP(1, θ) then show that θ is a location parameter of the
distribution. See Figure 1.1
(2) If X ∼ EXP(θ) then show that θ is a scale parameter of the distribution.
See Figure 1.2
[Figure 1.1: EXP(1, θ) p.d.f.’s for θ = −1, 0, 1]
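The curves in Figure 1.1 are easy to redraw numerically (a sketch only, not part of the original notes; it assumes NumPy and matplotlib are available and that EXP(1, θ) denotes the shifted exponential with p.d.f. f(x; θ) = e^{−(x−θ)} for x ≥ θ, as used in these notes):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-1, 5, 600)
for theta in (-1, 0, 1):
    f = np.where(x >= theta, np.exp(-(x - theta)), 0.0)  # EXP(1, theta) p.d.f.
    plt.plot(x, f, label=f"theta = {theta}")
plt.xlabel("x"); plt.ylabel("f(x)"); plt.legend(); plt.show()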
1.2.5 Problem
(1) If X ∼ CAU(1, θ) then show that θ is a location parameter of the distribution.
(2) If X ∼ CAU(θ, 0) then show that θ is a scale parameter of the distribution.
[Figure 1.2: EXP(θ) p.d.f.’s for θ = 0.5, 1, 2]
1.3 Unbiasedness and Mean Square Error
How do we ensure that a statistic T (X) is estimating the correct parameter?
How do we ensure that it is not consistently too large or too small, and
that as much variability as possible has been removed? We consider the
problem of estimating the correct parameter first.
We begin with a review of the definition of the expectation of a random
variable.
1.3.1 Definition
If X is a discrete random variable with p.f. f (x; θ) and support set A then
E[h(X); θ] = Σ_{x∈A} h(x) f (x; θ)
provided the sum converges absolutely, that is, provided
E[|h(X)|; θ] = Σ_{x∈A} |h(x)| f (x; θ) < ∞.
If X is a continuous random variable with p.d.f. f (x; θ) then
E[h(X); θ] = ∫_{−∞}^{∞} h(x) f (x; θ) dx,
provided the integral converges absolutely, that is, provided
E[|h(X)|; θ] = ∫_{−∞}^{∞} |h(x)| f (x; θ) dx < ∞.
If E [|h (X)| ; θ] = ∞ then we say that E [h (X) ; θ] does not exist.
1.3.2 Problem
Suppose that X has a CAU(1, θ) distribution. Show that E(X; θ) does not
exist and that this implies E(X^k; θ) does not exist for k = 2, 3, . . . .
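A small simulation (a sketch, not part of the original problem; NumPy assumed) hints at what the non-existence of E(X; θ) looks like in practice: running means of Cauchy data never settle down, unlike those of normal data.

import numpy as np

rng = np.random.default_rng(4)
n = 100_000
cauchy = rng.standard_cauchy(n) + 1.0     # Cauchy sample centred near 1
normal = rng.normal(1.0, 1.0, n)

running_mean = lambda z: np.cumsum(z) / np.arange(1, n + 1)
for k in (10**3, 10**4, 10**5):
    print(k, "obs: cauchy mean =", round(running_mean(cauchy)[k - 1], 3),
          " normal mean =", round(running_mean(normal)[k - 1], 3))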
1.3.3 Problem
Suppose that X is a random variable with probability density function
f (x; θ) = θ/x^{θ+1},   x ≥ 1.
For what values of θ do E(X; θ) and V ar(X; θ) exist?
1.3.4 Problem
If X ∼ GAM(α, β) show that
E(X^p; α, β) = β^p Γ(α + p)/Γ(α).
For what values of p does this expectation exist?
1.3.5 Problem
Suppose X is a non-negative continuous random variable with moment
generating function M(t) = E(e^{tX}) which exists for t ∈ ℝ. The function
M(−t) is often called the Laplace transform of the probability density
function of X. Show that
E(X^{−p}) = (1/Γ(p)) ∫_0^{∞} M(−t) t^{p−1} dt,   p > 0.
1.3.6 Definition
A statistic T (X) is an unbiased estimator of θ if E[T (X); θ] = θ for all
θ ∈ Ω.
1.3.7 Example
Suppose Xi ∼ POI(iθ), i = 1, . . . , n, independently. Determine whether the
following estimators are unbiased estimators of θ:
T1 = (1/n) Σ_{i=1}^n Xi/i,    T2 = (2/(n + 1)) X̄ = (2/(n(n + 1))) Σ_{i=1}^n Xi.
Is unbiased estimation preserved under transformations? For example,
if T is an unbiased estimator of θ, is T^2 an unbiased estimator of θ^2?
1.3.8 Example
Suppose X1, . . . , Xn are uncorrelated random variables with E(Xi) = μ
and Var(Xi) = σ^2, i = 1, 2, . . . , n. Show that
T = Σ_{i=1}^n ai Xi
is an unbiased estimator of μ if Σ_{i=1}^n ai = 1. Find an unbiased estimator of
σ^2 assuming (i) μ is known (ii) μ is unknown.
If (X1, . . . , Xn) is a random sample from the N(μ, σ^2) distribution then
show that S is not an unbiased estimator of σ where
S^2 = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)^2 = (1/(n − 1)) [Σ_{i=1}^n Xi^2 − n X̄^2]
is the sample variance. What happens to E (S) as n → ∞?
1.3.9 Example
Suppose X ∼ BIN(n, θ). Find an unbiased estimator, T(X), of θ. Is
[T(X)]^{−1} an unbiased estimator of θ^{−1}? Does there exist an unbiased
estimator of θ^{−1}?
1.3.10 Problem
Let X1, . . . , Xn be a random sample from the POI(θ) distribution. Find
E[X^{(k)}; θ] = E[X(X − 1) · · · (X − k + 1); θ],
the kth factorial moment of X, and thus find an unbiased estimator of θ^k,
k = 1, 2, . . . .
We now consider the properties of an estimator from the point of view
of Decision Theory. In order to determine whether a given estimator or
statistic T = T (X) does well for estimating θ we consider a loss function
or distance function between the estimator and the true value which we
denote L(θ, T ). This loss function is averaged over all possible values of the
data to obtain the risk:
Risk = E[L(θ, T ); θ].
A good estimator is one with little risk; a bad estimator is one whose risk
is high. One particular loss function is L(θ, T) = (T − θ)^2, which is called
the squared error loss function. Its corresponding risk, called mean squared
error (M.S.E.), is given by
MSE(T; θ) = E[(T − θ)^2; θ].
Another loss function is L(θ, T) = |T − θ|, which is called the absolute error
loss function. Its corresponding risk, called the mean absolute error, is
given by
Risk = E(|T − θ|; θ).
1.3.11 Problem
Show that
MSE(T; θ) = Var(T; θ) + [Bias(T; θ)]^2
where
Bias(T; θ) = E(T; θ) − θ.
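One possible route to this identity (a sketch, writing b(θ) = Bias(T; θ) = E(T; θ) − θ): expand
E[(T − θ)^2; θ] = E[{(T − E(T; θ)) + b(θ)}^2; θ] = E[(T − E(T; θ))^2; θ] + 2 b(θ) E[T − E(T; θ); θ] + [b(θ)]^2.
The middle term vanishes because E[T − E(T; θ); θ] = 0, leaving Var(T; θ) + [Bias(T; θ)]^2.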
1.3.12 Example
Let X1, . . . , Xn be a random sample from a UNIF(0, θ) distribution. Compare
the M.S.E.’s of the following three estimators of θ:
T1 = 2X̄,   T2 = X(n),   T3 = (n + 1)X(1)
where
X(n) = max(X1 , . . . , Xn ) and X(1) = min(X1 , . . . , Xn ).
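The comparison can also be approximated by simulation (an illustrative sketch; the values n = 10 and θ = 1 are arbitrary and NumPy is assumed):

import numpy as np

rng = np.random.default_rng(1)
n, theta, reps = 10, 1.0, 100_000          # arbitrary illustration values
x = rng.uniform(0, theta, size=(reps, n))  # reps samples of size n from UNIF(0, theta)

t1 = 2 * x.mean(axis=1)                    # T1 = 2 * sample mean
t2 = x.max(axis=1)                         # T2 = X_(n)
t3 = (n + 1) * x.min(axis=1)               # T3 = (n + 1) * X_(1)

for name, t in [("T1", t1), ("T2", t2), ("T3", t3)]:
    print(name, "estimated MSE:", np.mean((t - theta) ** 2))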
1.3.13 Problem
Let X1, . . . , Xn be a random sample from a UNIF(θ, 2θ) distribution with
θ > 0. Consider the following estimators of θ:
T1 = (1/2) X(n),  T2 = X(1),  T3 = (1/3) X(n) + (1/3) X(1),  T4 = (5/14) X(n) + (2/7) X(1).
(a) Show that all four estimators can be written in the form
Za = a X(1) + (1/2)(1 − a) X(n)     (1.1)
for suitable choice of a.
(b) Find E(Za; θ) and thus show that T3 is the only unbiased estimator of
θ of the form (1.1).
(c) Compare the M.S.E.’s of these estimators and show that T4 has the
smallest M.S.E. of all estimators of the form (1.1).
Hint: Find Var(Za; θ), show
Cov(X(1), X(n); θ) = θ^2/[(n + 1)^2 (n + 2)],
and thus find an expression for MSE(Za; θ).
1.3.14 Problem
Let X1, . . . , Xn be a random sample from the N(μ, σ^2) distribution. Consider
the following estimators of σ^2:
S^2,   T1 = ((n − 1)/n) S^2,   T2 = ((n − 1)/(n + 1)) S^2.
Compare the M.S.E.’s of these estimators by graphing them as a function
of σ^2 for n = 5.
1.3.15 Example
Let X ∼ N(θ, 1). Consider the following three estimators of θ:
T1 = X,   T2 = X/2,   T3 = 0.
Which estimator is better in terms of M.S.E.?
Now
MSE(T1; θ) = E[(X − θ)^2; θ] = Var(X; θ) = 1,
MSE(T2; θ) = E[(X/2 − θ)^2; θ] = Var(X/2; θ) + [E(X/2; θ) − θ]^2
           = 1/4 + (θ/2 − θ)^2 = (θ^2 + 1)/4,
MSE(T3; θ) = E[(0 − θ)^2; θ] = θ^2.
The M.S.E.’s can be compared by graphing them as functions of θ. See
Figure 1.3.
[Figure 1.3: Comparison of the M.S.E.’s of T1, T2 and T3 for Example 1.3.15]
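The three curves are easy to reproduce (a sketch assuming NumPy and matplotlib):

import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(-2, 2, 400)
plt.plot(theta, np.ones_like(theta), label="MSE(T1) = 1")
plt.plot(theta, (theta**2 + 1) / 4, label="MSE(T2) = (theta^2 + 1)/4")
plt.plot(theta, theta**2, label="MSE(T3) = theta^2")
plt.xlabel("theta"); plt.ylabel("MSE"); plt.legend(); plt.show()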
One of the conclusions of the above example is that there is no estimator,
not even the natural one T1 = X, which outperforms all other estimators.
One is better for some values of the parameter in terms of smaller risk,
while another, even the trivial estimator T3 , is better for other values of
the parameter. In order to achieve a best estimator, it is unfortunately
necessary to restrict ourselves to a specific class of estimators and select
the best within the class. Of course, the best within this class will only be
as good as the class itself, and therefore we must ensure that restricting
ourselves to this class is sensible and not unduly restrictive. The class of
all estimators is usually too large to obtain a meaningful solution. One
possible restriction is to the class of all unbiased estimators.
1.3.16 Definition
An estimator T = T (X) is said to be a uniformly minimum variance unbiased estimator (U.M.V.U.E.) of the parameter θ if (i) it is an unbiased
estimator of θ and (ii) among all unbiased estimators of θ it has the smallest
M.S.E. and therefore the smallest variance.
1.3.17 Problem
Suppose X has a GAM(2, θ) distribution and consider the class of estimators
{aX; a > 0}. Find the estimator in this class which minimizes the
mean absolute error for estimating the scale parameter θ. Hint: Show
E(|aX − θ|; θ) = θ E(|aX − 1|; θ = 1).
Is this estimator unbiased? Is it the best estimator in the class of all
functions of X?
1.4 Sufficiency
A sufficient statistic is one that, from a certain perspective, contains all the
necessary information for making inferences about the unknown parameters in a given model. By making inferences we mean the usual conclusions
about parameters such as estimators, significance tests and confidence intervals.
Suppose the data are X and T = T (X) is a sufficient statistic. The
intuitive basis for sufficiency is that if X has a conditional distribution given
T (X) that does not depend on θ, then X is of no value in addition to T in
estimating θ. The assumption is that random variables carry information on
a statistical parameter θ only insofar as their distributions (or conditional
distributions) change with the value of the parameter. All of this, of course,
assumes that the model is correct and θ is the only unknown. It should
be remembered that the distribution of X given a sufficient statistic T
may have a great deal of value for some other purpose, such as testing the
validity of the model itself.
1.4.1 Definition
A statistic T (X) is sufficient for a statistical model {f (x; θ) ; θ ∈ Ω} if the
distribution of the data X1 , . . . , Xn given T = t does not depend on the
unknown parameter θ.
To understand this definition suppose that X is a discrete random variable and T = T (X) is a sufficient statistic for the model {f (x; θ) ; θ ∈ Ω}.
Suppose we observe data x with corresponding value of the sufficient statistic T (x) = t. To Experimenter A we give the observed data x while
to Experimenter B we give only the value of T = t. Experimenter A
can obviously calculate T (x) = t as well. Is Experimenter A “better off”
than Experimenter B in terms of making inferences about θ? The answer
is no since Experimenter B can generate data which is “as good as” the
data which Experimenter A has in the following manner. Since T (X) is
a sufficient statistic, the conditional distribution of X given T = t does
not depend on the unknown parameter θ. Therefore Experimenter B can
use this distribution and a randomization device such as a random number
generator to generate an observation y from the random variable Y such
that
P(Y = y | T = t) = P(X = y | T = t)     (1.2)
and such that X and Y have the same unconditional distribution. So Experimenter A who knows x and experimenter B who knows y have equivalent
information about θ. Obviously Experimenter B did not gain any new information about θ by generating the observation y. All of her information
for making inferences about θ is contained in the knowledge that T = t.
Experimenter B has just as much information as Experimenter A, who
knows the entire sample x.
Now X and Y have the same unconditional distribution because
P(X = x; θ) = P[X = x, T(X) = T(x); θ]
              since the event {X = x} is a subset of the event {T(X) = T(x)}
            = P[X = x | T(X) = T(x)] P[T(X) = T(x); θ]
            = P(X = x | T = t) P(T = t; θ)   where t = T(x)
            = P(Y = x | T = t) P(T = t; θ)   using (1.2)
            = P[Y = x | T(X) = T(x)] P[T(X) = T(x); θ]
            = P[Y = x, T(X) = T(x); θ]
            = P(Y = x; θ)
              since the event {Y = x} is a subset of the event {T(X) = T(x)}.
The use of a sufficient statistic is formalized in the following principle:
1.4.2 The Sufficiency Principle
Suppose T (X) is a sufficient statistic for a model {f (x; θ) ; θ ∈ Ω}. Suppose
x1 , x2 are two different possible observations that have identical values of
the sufficient statistic:
T (x1 ) = T (x2 ).
Then whatever inference we would draw from observing x1 we should draw
exactly the same inference from x2 .
If we adopt the sufficiency principle then we partition the sample space
(the set of all possible outcomes) into mutually exclusive sets of outcomes
in which all outcomes in a given set lead to the same inference about θ.
This is referred to as data reduction.
1.4.3 Example
Let (X1, . . . , Xn) be a random sample from the POI(θ) distribution. Show
that T = Σ_{i=1}^n Xi is a sufficient statistic for this model.
1.4.4 Problem
Let X1, . . . , Xn be a random sample from the Bernoulli(θ) distribution and
let T = Σ_{i=1}^n Xi.
(a) Find the conditional distribution of (X1 , . . . , Xn ) given T = t and thus
show T is a sufficient statistic for this model.
(b) Explain how you would generate data with the same distribution as the
original data using the value of the sufficient statistic and a randomization
device.
(c) Let U = U (X1 ) = 1 if X1 = 1 and 0 otherwise. Find E(U ) and
E(U |T = t).
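For part (b), one concrete way to picture the randomization device (an illustrative sketch, not part of the notes; NumPy assumed): given T = t, the conditional distribution of (X1, . . . , Xn) is uniform over all arrangements of t ones and n − t zeros, so Experimenter B can regenerate an equivalent sample as follows.

import numpy as np

rng = np.random.default_rng(0)

def regenerate_bernoulli(t, n):
    """Given only T = t, produce a vector with the same conditional
    distribution as the original Bernoulli sample given T = t:
    t ones placed uniformly at random among n positions."""
    y = np.zeros(n, dtype=int)
    y[rng.choice(n, size=t, replace=False)] = 1
    return y

print(regenerate_bernoulli(t=3, n=10))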
1.4.5 Problem
Let X1, . . . , Xn be a random sample from the GEO(θ) distribution and let
T = Σ_{i=1}^n Xi.
(a) Find the conditional distribution of (X1 , . . . , Xn ) given T = t and thus
show T is a sufficient statistic for this model.
(b) Explain how you would generate data with the same distribution as the
original data using the value of the sufficient statistic and a randomization
device.
(c) Find E(X1 | T = t).
1.4.6 Problem
Let X1 , . . . , Xn be a random sample from the EXP(1, θ) distribution and
let T = X(1) .
(a) Find the conditional distribution of (X1 , . . . , Xn ) given T = t and thus
show T is a sufficient statistic for this model.
(b) Explain how you would generate data with the same distribution as the
original data using the value of the sufficient statistic and a randomization
device.
(c) Find E [(X1 − 1) ; θ] and E [(X1 − 1) |T = t].
1.4.7 Problem
Let X1 , . . . , Xn be a random sample from the distribution with probability
density function f (x; θ) . Show that the order statistic
T (X) = (X(1) , . . . , X(n) ) is sufficient for the model {f (x; θ) ; θ ∈ Ω}.
The following theorem gives a straightforward method for identifying
sufficient statistics.
1.4.8 Factorization Criterion for Sufficiency
Suppose X has probability (density) function {f (x; θ) ; θ ∈ Ω} and T (X)
is a statistic. Then T (X) is a sufficient statistic for {f (x; θ) ; θ ∈ Ω} if and
only if there exist two non-negative functions g(·) and h(·) such that
f (x; θ) = g(T(x); θ) h(x)   for all x, θ ∈ Ω.
Note that this factorization need only hold on a set A of possible values of
X which carries the full probability, that is,
f (x; θ) = g(T(x); θ) h(x)   for all x ∈ A, θ ∈ Ω
where P(X ∈ A; θ) = 1 for all θ ∈ Ω.
Note that the function g(T (x); θ) depends on both the parameter θ and
the sufficient statistic T (X) while the function h(x) does not depend on
the parameter θ.
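As a quick illustration of how the criterion is typically applied (a sketch using the Poisson model of Example 1.4.3): if X1, . . . , Xn are independent POI(θ) observations then
f (x; θ) = Π_{i=1}^n e^{−θ} θ^{xi}/xi! = [e^{−nθ} θ^{T(x)}] · [Π_{i=1}^n 1/xi!]   where T(x) = Σ_{i=1}^n xi,
which has the form g(T(x); θ) h(x) with g(t; θ) = e^{−nθ} θ^t and h(x) = Π_{i=1}^n 1/xi!, so T(X) = Σ_{i=1}^n Xi is sufficient.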
1.4.9 Example
Let X1, . . . , Xn be a random sample from the N(μ, σ^2) distribution. Show
that (Σ_{i=1}^n Xi, Σ_{i=1}^n Xi^2) is a sufficient statistic for this model. Show that
(X̄, S^2) is also a sufficient statistic for this model.
1.4.10 Example
Let X1 , . . . , Xn be a random sample from the WEI(1, θ) distribution. Find
a sufficient statistic for this model.
1.4.11 Example
Let X1 , . . . , Xn be a random sample from the UNIF(0, θ) distribution. Show
that T = X(n) is a sufficient statistic for this model. Find the conditional
probability density function of (X1 , . . . , Xn ) given T = t.
1.4.12 Problem
Let X1 , . . . , Xn be a random sample from the EXP(1, θ) distribution. Show
that X(1) is a sufficient statistic for this model and find the conditional
probability density function of (X1 , . . . , Xn ) given X(1) = t.
1.4.13 Problem
Use the Factorization Criterion for Sufficiency to show that if T (X) is
a sufficient statistic for the model {f (x; θ) ; θ ∈ Ω} then any one-to-one
function of T is also a sufficient statistic.
We have seen above that sufficient statistics are not unique. One-to-one functions of a statistic contain the same information as the original
statistic. Fortunately, we can characterise all one-to-one functions of a
statistic in terms of the way in which they partition the sample space. Note
that the partition induced by the sufficient statistic provides a partition of
the sample space into sets of observations which lead to the same inference
about θ. See Figure 1.4.
1.4.14 Definition
The partition of the sample space induced by a given statistic T (X) is the
partition or class of sets of the form {x; T (x) = t} as t ranges over its
possible values.
[Figure 1.4: Partition of the sample space induced by T into the sets {x; T(x) = t}]
From the point of view of statistical information on a parameter, a statistic is sufficient if it contains all of the information available in a data set
about a parameter. There is no guarantee that the statistic does not contain
more information than is necessary. For example, the data (X1 , . . . , Xn ) is
always a sufficient statistic (why?), but in many cases, there is a further
data reduction possible. For example, for independent observations from
a N(θ, 1) distribution, the sample mean X is also a sufficient statistic but
it is reduced as much as possible. Of course, T = (X̄)^3 is a sufficient statistic since T and X̄ are one-to-one functions of each other. From X̄ we
can obtain T and from T we can obtain X̄ so both of these statistics are
equivalent in terms of the amount of information they contain about θ.
Now suppose the function g is a many-to-one function, which is not invertible. Suppose further that g (X1 , . . . , Xn ) is a sufficient statistic. Then
the reduction from (X1 , . . . , Xn ) to g (X1 , . . . , Xn ) is a non-trivial reduction of the data. Sufficient statistics that have experienced as much data
reduction as is possible without losing the sufficiency property are called
minimal sufficient statistics.
1.5 Minimal Sufficiency
Now we wish to consider those circumstances under which a given statistic
(actually the partition of the sample space induced by the given statistic)
allows no further real reduction. Suppose g(.) is a many-to-one function
and hence is a real reduction of the data. Is g(T ) still sufficient? In some
cases, as in the example below, the answer is “no”.
1.5.1 Problem
Let X1, . . . , Xn be a random sample from the Bernoulli(θ) distribution.
Show that T(X) = Σ_{i=1}^n Xi is sufficient for this model. Show that if g is not
a one-to-one function (g(t1) = g(t2) = g0 for some integers t1 and t2 where
0 ≤ t1 < t2 ≤ n) then g(T) cannot be sufficient for {f (x; θ); θ ∈ Ω}.
Hint: Find P(T = t1 | g(T) = g0).
1.5.2 Definition
A statistic T (X) is a minimal sufficient statistic for {f (x; θ) ; θ ∈ Ω} if it
is sufficient and if for any other sufficient statistic U (X), there exists a
function g(.) such that T (X) = g(U (X)).
This definition says that a minimal sufficient statistic is a function of
every other sufficient statistic. In terms of the partition induced by the
minimal sufficient statistic, this implies that the minimal sufficient statistic induces
the coarsest partition possible of the sample space among all sufficient
statistics. This partition is called the minimal sufficient partition.
1.5.3 Problem
Prove that if T1 and T2 are both minimal sufficient statistics, then they
induce the same partition of the sample space.
The following theorem is useful in showing a statistic is minimally sufficient.
1.5.4 Theorem - Minimal Sufficient Statistic
Suppose the model is {f (x; θ); θ ∈ Ω} and let A = support of X. Partition
A into the equivalence classes defined by
Ay = {x; f (x; θ)/f (y; θ) = H(x, y) for all θ ∈ Ω},   y ∈ A.
This is a minimal sufficient partition. The statistic T(X) which induces
this partition is a minimal sufficient statistic.
The proof of this theorem is given in Section 5.4.2 of the Appendix.
1.5.5 Example
Let (X1, . . . , Xn) be a random sample from the distribution with probability
density function
f (x; θ) = θ x^{θ−1},   0 < x < 1,   θ > 0.
Find a minimal sufficient statistic for the model {f (x; θ) ; θ ∈ Ω}.
1.5.6 Example
Let X1 , . . . , Xn be a random sample from the N(θ, θ2 ) distribution. Find a
minimal sufficient statistic for the model {f (x; θ) ; θ ∈ Ω}.
1.5.7 Problem
Let X1 , . . . , Xn be a random sample from the LOG(1, θ) distribution. Prove
that the order statistic (X(1) , . . . , X(n) ) is a minimal sufficient statistic for
the model {f (x; θ) ; θ ∈ Ω}.
1.5.8 Problem
Let X1 , . . . , Xn be a random sample from the CAU(1, θ) distribution. Find
a minimal sufficient statistic for the model {f (x; θ) ; θ ∈ Ω}.
1.5.9 Problem
Let X1 , . . . , Xn be a random sample from the UNIF(θ, θ + 1) distribution.
Find a minimal sufficient statistic for the model {f (x; θ) ; θ ∈ Ω}.
1.5.10 Problem
Let Ω denote the set of all probability density functions. Let (X1 , . . . , Xn )
be a random sample from a distribution with probability density function
f ∈ Ω. Prove that the order statistic (X(1) , . . . , X(n) ) is a minimal sufficient statistic for the model {f (x); f ∈ Ω}. Note that in this example the
unknown “parameter” is f .
1.5.11 Problem - Linear Regression
Suppose E(Y ) = Xβ where Y = (Y1 , . . . , Yn )T is a vector of independent
and normally distributed random variables with V ar(Yi ) = σ 2 , i = 1, . . . , n,
X is a n × k matrix of known constants of rank k and β = (β1 , . . . , βk )T is
a vector of unknown parameters. Let
β̂ = (X^T X)^{−1} X^T Y   and   Se^2 = (Y − X β̂)^T (Y − X β̂)/(n − k).
Show that (β̂, Se^2) is a minimal sufficient statistic for this model.
Hint: Show
(Y − Xβ)^T (Y − Xβ) = (n − k) Se^2 + (β̂ − β)^T X^T X (β̂ − β).
1.6 Completeness
The property of completeness is one which is useful for determining the
uniqueness of estimators, for verifying, in some cases, that a minimal sufficient statistic has been found, and for finding U.M.V.U.E.’s.
Let X1 , . . . , Xn denote the observations from a distribution with probability (density) function {f (x; θ) ; θ ∈ Ω}. Suppose T (X) is a statistic and
u(T ), a function of T , is an unbiased estimator of θ so that E[u(T ); θ] = θ
for all θ ∈ Ω. Under what circumstances is this the only unbiased estimator which is a function of T ? To answer this question, suppose u1 (T )
and u2 (T ) are both unbiased estimators of θ and consider the difference
h(T ) = u1 (T ) − u2 (T ). Since u1 (T ) and u2 (T ) are both unbiased estimators we have E[h(T ); θ] = 0 for all θ ∈ Ω. Now if the only function h(T )
which satisfies E[h(T ); θ] = 0 for all θ ∈ Ω is the function h(t) = 0, then the
two unbiased estimators must be identical. A statistic T with this property
is said to be complete. The property of completeness is really a property of
the family of distributions of T generated as θ varies.
1.6.1 Definition
The statistic T = T (X) is a complete statistic for {f (x; θ) ; θ ∈ Ω} if
E[h(T ); θ] = 0, for all θ ∈ Ω
implies
P [h(T ) = 0; θ] = 1 for all θ ∈ Ω.
1.6.2 Example
Let X1, . . . , Xn be a random sample from the N(θ, 1) distribution. Consider
T = T(X) = (X1, Σ_{i=2}^n Xi). Prove that T is a sufficient statistic for the model
{f (x; θ); θ ∈ Ω} but not a complete statistic.
1.6.3 Example
Let X1, . . . , Xn be a random sample from the Bernoulli(θ) distribution.
Prove that T = T(X) = Σ_{i=1}^n Xi is a complete sufficient statistic for the
model {f (x; θ); θ ∈ Ω}.
1.6.4 Example
Let X1 , . . . , Xn be a random sample from the UNIF(0, θ) distribution. Show
that T = T (X) = X(n) is a complete statistic for the model
{f (x; θ) ; θ ∈ Ω}.
1.6.5 Problem
Prove that any one-to-one function of a complete sufficient statistic is a
complete sufficient statistic.
1.6.6 Problem
Let X1 , . . . , Xn be a random sample from the N(θ, aθ2 ) distribution where
a > 0 is a known constant and θ > 0. Show that the minimal sufficient
statistic is not a complete statistic.
1.6.7 Theorem
If T (X) is a complete sufficient statistic for the model {f (x; θ) ; θ ∈ Ω}
then T (X) is a minimal sufficient statistic for {f (x; θ) ; θ ∈ Ω}.
The proof of this theorem is given in Section 5.4.3 of the Appendix.
1.6.8 Problem
The converse to the above theorem is not true. Let X1 , . . . , Xn be a
random sample from the UNIF(θ − 1, θ + 1) distribution. Show that T =
T (X) = (X(1) , X(n) ) is a minimal sufficient statistic for the model. Show
also that for the non-zero function
h(T) = (X(n) − X(1))/2 − (n − 1)/(n + 1),
E[h(T ); θ] = 0 for all θ ∈ Ω and therefore T is not a complete statistic.
1.6.9 Example
Let X = X1 , . . . , Xn be a random sample from the UNIF(0, θ) distribution. Prove that T = T (X) = X(n) is a minimal sufficient statistic for
{f (x; θ) ; θ ∈ Ω}.
1.6.10 Problem
Let X = X1 , . . . , Xn be a random sample from the EXP(1, θ) distribution. Prove that T = T (X) = X(1) is a minimal sufficient statistic for
{f (x; θ) ; θ ∈ Ω}.
1.6.11 Theorem
For any random variables X and Y ,
E(X) = E[E(X|Y )]
and
V ar(X) = E[V ar(X|Y )] + V ar[E(X|Y )]
1.6.12 Theorem
If T = T (X) is a complete statistic for the model {f (x; θ) ; θ ∈ Ω}, then
there is at most one function of T that provides an unbiased estimator of
the parameter τ (θ).
1.6.13 Problem
Prove Theorem 1.6.12.
1.6.14 Theorem (Lehmann-Scheffé)
If T = T (X) is a complete sufficient statistic for the model {f (x; θ) ; θ ∈ Ω}
and E [g (T ) ; θ] = τ (θ), then g(T ) is the unique U.M.V.U.E. of τ (θ).
1.6.15 Example
Let X1 , . . . , Xn be a random sample from the Bernoulli(θ) distribution.
Find the U.M.V.U.E. of τ (θ) = θ2 .
1.6.16 Example
Let X1 , . . . , Xn be a random sample from the UNIF(0, θ) distribution. Find
the U.M.V.U.E. of τ (θ) = θ.
1.6.17 Problem
Let X1, . . . , Xn be a random sample from the Bernoulli(θ) distribution.
Find the U.M.V.U.E. of τ (θ) = θ(1 − θ).
1.6.18 Problem
Suppose X has a Hypergeometric distribution with p.f.
f (x; θ) = [(Nθ choose x)(N − Nθ choose n − x)] / (N choose n),   x = 0, 1, . . . , min(Nθ, N − Nθ);
θ ∈ Ω = {0, 1/N, 2/N, . . . , 1}.
Show that X is a complete sufficient statistic. Find the U.M.V.U.E. of θ.
1.6.19 Problem
Let X1 , . . . , Xn be a random sample from the EXP(β, μ) distribution where
β is known. Show that T = X(1) is a complete sufficient statistic for this
model. Find the U.M.V.U.E. of μ and the U.M.V.U.E. of μ2 .
1.6.20 Problem
Suppose X1 , ..., Xn is a random sample from the UNIF(a, b) distribution.
Show that T = (X(1) , X(n) ) is a complete sufficient statistic for this model.
Find the U.M.V.U.E.’s of a and b. Find the U.M.V.U.E. of the mean of Xi .
1.6.21 Problem
Let T (X) be an unbiased estimator of τ (θ). Prove that T (X) is a U.M.V.U.E.
of τ (θ) if and only if E(U T ; θ) = 0 for all U (X) such that E(U ) = 0 for all
θ ∈ Ω.
1.6.22 Theorem (Rao-Blackwell)
If T = T (X) is a complete sufficient statistic for the model {f (x; θ) ; θ ∈ Ω}
and U = U (X) is any unbiased estimator of τ (θ), then E(U |T ) is the
U.M.V.U.E. of τ (θ).
1.6.23 Problem
Let X1, . . . , Xn be a random sample from the EXP(β, μ) distribution where
β is known. Find the U.M.V.U.E. of τ(μ) = P(X1 > c; μ) where c ∈ ℝ is
a known constant. Hint: Let U = U (X1 ) = 1 if X1 ≥ c and 0 otherwise.
1.6.24 Problem
Let X1 , . . . , Xn be a random sample from the DU(θ) distribution. Show
that T = X(n) is a complete sufficient statistic for this model. Find the
U.M.V.U.E. of θ.
1.7 The Exponential Family

1.7.1 Definition
Suppose X = (X1, . . . , Xp) has a (joint) probability (density) function of
the form
f (x; θ) = C(θ) exp[Σ_{j=1}^k qj(θ) Tj(x)] h(x)     (1.3)
for functions qj(θ), Tj(x), h(x), C(θ). Then we say that f (x; θ) is a member
of the exponential family of densities. We call (T1(X), . . . , Tk(X)) the
natural sufficient statistic.
It should be noted that the natural sufficient statistic is not unique.
Multiplication of Tj by a constant and division of qj by the same constant
results in the same function f (x; θ). More generally linear transformations
of the Tj and the qj can also be used.
1.7.2 Example
Prove that T (X) = (T1 (X), . . . , Tk (X)) is a sufficient statistic for the model
{f (x; θ) ; θ ∈ Ω} where f (x; θ) has the form (1.3).
1.7.3 Example
Show that the BIN(n, θ) distribution has an exponential family distribution
and find the natural sufficient statistic.
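A sketch of the kind of rearrangement Example 1.7.3 calls for: if X ∼ BIN(n, θ) then
f (x; θ) = (n choose x) θ^x (1 − θ)^{n−x} = (1 − θ)^n exp[x log(θ/(1 − θ))] (n choose x),
which is of the form (1.3) with C(θ) = (1 − θ)^n, q1(θ) = log(θ/(1 − θ)), T1(x) = x and h(x) = (n choose x), so the natural sufficient statistic is X itself.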
One of the important properties of the exponential family is its closure
under repeated independent sampling.
1.7.4 Theorem
Let X1, . . . , Xn be a random sample from the distribution with probability
(density) function given by (1.3). Then (X1, . . . , Xn) also has an exponential
family form, with joint probability (density) function
f (x1, . . . , xn; θ) = [C(θ)]^n exp{ Σ_{j=1}^k qj(θ) [Σ_{i=1}^n Tj(xi)] } Π_{i=1}^n h(xi).
In other words, C is replaced by C^n and Tj(x) by Σ_{i=1}^n Tj(xi). The natural
sufficient statistic is
(Σ_{i=1}^n T1(Xi), . . . , Σ_{i=1}^n Tk(Xi)).

1.7.5 Example
Let X1 , . . . , Xn be a random sample from the POI(θ) distribution. Show
that X1 , . . . , Xn is a member of the exponential family.
1.7.6 Canonical Form of the Exponential Family
It is usual to reparameterize equation (1.3) by replacing qj (θ) by a new
parameter ηj . This results in the canonical form of the exponential family
f (x; η) = C(η) exp[Σ_{j=1}^k ηj Tj(x)] h(x).
The natural parameter space in this form is the set of all values of η for
which the above function is integrable; that is,
{η; ∫_{−∞}^{∞} f (x; η) dx < ∞}.
If X is discrete the integral is replaced by the sum over all x such that
f (x; η) > 0.
If the statistic satisfies a linear constraint, for example,
P(Σ_{j=1}^k Tj(X) = 0; η) = 1,
then the number of terms k can be reduced. Unless this is done, the parameters ηj are not all statistically meaningful. For example the data may
permit us to estimate η1 + η2 but not allow estimation of η1 and η2 individually. In this case we call the parameter “unidentifiable”. We will need to
assume that the exponential family representation is minimal in the sense
that neither the ηj nor the Tj satisfy any linear constraints.
1.7.7 Definition
We will say that X has a regular exponential family distribution if it is in
canonical form, is of full rank in the sense that neither the Tj nor the ηj
satisfy any linear constraints, and the natural parameter space contains a
k-dimensional rectangle. By Theorem 1.7.4 if Xi has a regular exponential
family distribution then X = (X1 , . . . , Xn ) also has a regular exponential
family distribution.
1.7.8 Example
Show that X ∼ BIN(n, θ) has a regular exponential family distribution.
1.7.9 Theorem
If X has a regular exponential family distribution with natural sufficient
statistic T (X) = (T1 (X), . . . , Tk (X)) then T (X) is a complete sufficient
statistic. Reference: Lehmann and Romano (2005), Testing Statistical
Hypotheses (3rd edition), pp. 116-117.
1.7.10 Differentiating under the Integral
In Chapter 2, it will be important to know if a family of models has the
property that differentiation under the integral is possible. We state that
for a regular exponential family, it is possible to differentiate under the
integral, that is,
(∂^m/∂ηi^m) ∫ C(η) exp[Σ_{j=1}^k ηj Tj(x)] h(x) dx = ∫ (∂^m/∂ηi^m) C(η) exp[Σ_{j=1}^k ηj Tj(x)] h(x) dx
for any m = 1, 2, . . . and any η in the interior of the natural parameter
space.
1.7.11 Example
Let X1 , . . . , Xn be a random sample from the N(μ, σ 2 ) distribution. Find
a complete sufficient statistic for this model. Find the U.M.V.U.E.’s of μ
and σ 2 .
1.7.12 Example
Show that X ∼ N(θ, θ2 ) does not have a regular exponential family distribution.
1.7.13 Example
Suppose (X1, X2, X3) have joint density
f (x1, x2, x3; θ1, θ2, θ3) = P(X1 = x1, X2 = x2, X3 = x3; θ1, θ2, θ3)
= [n!/(x1! x2! x3!)] θ1^{x1} θ2^{x2} θ3^{x3},
xi = 0, 1, . . . ; i = 1, 2, 3; x1 + x2 + x3 = n,
0 < θi < 1; i = 1, 2, 3; θ1 + θ2 + θ3 = 1.
Find the U.M.V.U.E. of θ1, θ2, and θ1 θ2.
Since
f (x1, x2, x3; θ1, θ2, θ3) = exp[Σ_{j=1}^3 qj(θ1, θ2, θ3) Tj(x1, x2, x3)] h(x1, x2, x3)
where
qj(θ1, θ2, θ3) = log θj, Tj(x1, x2, x3) = xj, j = 1, 2, 3 and h(x1, x2, x3) = n!/(x1! x2! x3!),
(X1, X2, X3) is a member of the exponential family. But
Σ_{j=1}^3 Tj(x1, x2, x3) = n and θ1 + θ2 + θ3 = 1
and thus (X1, X2, X3) is not a member of the regular exponential family.
However by substituting X3 = n − X1 − X2 and θ3 = 1 − θ1 − θ2 we can
show that (X1, X2) has a regular exponential family distribution.
Let
η1 = log(θ1/(1 − θ1 − θ2)),   η2 = log(θ2/(1 − θ1 − θ2))
then
θ1 = e^{η1}/(1 + e^{η1} + e^{η2}),   θ2 = e^{η2}/(1 + e^{η1} + e^{η2}).
Let
T1(x1, x2) = x1,   T2(x1, x2) = x2,
C(η1, η2) = [1/(1 + e^{η1} + e^{η2})]^n,   and   h(x1, x2) = n!/[x1! x2! (n − x1 − x2)!].
In canonical form (X1, X2) has p.f.
f (x1, x2; η1, η2) = C(η1, η2) exp[η1 T1(x1, x2) + η2 T2(x1, x2)] h(x1, x2)
with natural parameter space {(η1, η2); η1 ∈ ℝ, η2 ∈ ℝ} which contains a
two-dimensional rectangle. The ηj’s and the Tj’s do not satisfy any linear
constraints. Therefore (X1, X2) has a regular exponential family distribution
with natural sufficient statistic T(X1, X2) = (X1, X2) and thus
T(X1, X2) is a complete sufficient statistic.
By the properties of the multinomial distribution (see Section 5.2.2)
we have X1 ∼ BIN(n, θ1), X2 ∼ BIN(n, θ2) and Cov(X1, X2) = −nθ1θ2.
Since
E(X1/n; θ1, θ2) = nθ1/n = θ1   and   E(X2/n; θ1, θ2) = nθ2/n = θ2,
then by the Lehmann-Scheffé Theorem X1 /n is the U.M.V.U.E. of θ1 and
X2 /n is the U.M.V.U.E. of θ2 .
Since
−nθ1θ2 = Cov(X1, X2; θ1, θ2)
       = E(X1X2; θ1, θ2) − E(X1; θ1, θ2) E(X2; θ1, θ2)
       = E(X1X2; θ1, θ2) − n^2 θ1θ2,
or
E(X1X2/[n(n − 1)]; θ1, θ2) = θ1θ2,
then by the Lehmann-Scheffé Theorem X1 X2 / [n (n − 1)] is the U.M.V.U.E.
of θ1 θ2 .
1.7.14 Example
Let X1, . . . , Xn be a random sample from the POI(θ) distribution. Find the
U.M.V.U.E. of τ(θ) = e^{−θ}. Show that the U.M.V.U.E. is also a consistent
estimator of τ(θ).
Since (X1, . . . , Xn) is a member of the regular exponential family with
natural sufficient statistic T = Σ_{i=1}^n Xi, therefore T is a complete sufficient
statistic. Consider the random variable U(X1) = 1 if X1 = 0 and 0 otherwise.
Then
E[U(X1); θ] = 1 · P(X1 = 0; θ) = e^{−θ},   θ > 0,
and U(X1) is an unbiased estimator of τ(θ) = e^{−θ}. Therefore by the
Rao-Blackwell Theorem E(U|T) is the U.M.V.U.E. of τ(θ) = e^{−θ}.
Since X1, . . . , Xn is a random sample from the POI(θ) distribution,
X1 ∼ POI(θ),  T = Σ_{i=1}^n Xi ∼ POI(nθ)  and  Σ_{i=2}^n Xi ∼ POI((n − 1)θ).
Thus
E(U|T = t) = 1 · P(X1 = 0 | T = t)
           = P(X1 = 0, Σ_{i=1}^n Xi = t; θ) / P(T = t; θ)
           = P(X1 = 0, Σ_{i=2}^n Xi = t − 0; θ) / P(T = t; θ)
           = [e^{−θ} · [(n − 1)θ]^t e^{−(n−1)θ}/t!] ÷ [(nθ)^t e^{−nθ}/t!]
           = (1 − 1/n)^t,   t = 0, 1, . . . .
Therefore E(U|T) = (1 − 1/n)^T is the U.M.V.U.E. of τ(θ) = e^{−θ}.
Since X1, . . . , Xn is a random sample from the POI(θ) distribution then
by the W.L.L.N. X̄ →p θ and by the Limit Theorems (see Section 5.3)
E(U|T) = (1 − 1/n)^T = [(1 − 1/n)^n]^{X̄} →p e^{−θ}
and therefore E(U|T) is a consistent estimator of e^{−θ}.
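A quick numerical sanity check of this result (a sketch; the values θ = 1.5 and n = 20 are arbitrary and NumPy is assumed): averaging (1 − 1/n)^T over many simulated samples should give a value close to e^{−θ}.

import numpy as np

rng = np.random.default_rng(2)
n, theta, reps = 20, 1.5, 200_000                     # arbitrary illustration values
t = rng.poisson(theta, size=(reps, n)).sum(axis=1)    # T = sum of the sample
estimate = (1 - 1 / n) ** t                           # the U.M.V.U.E. of exp(-theta)

print("mean of estimator:", estimate.mean())
print("exp(-theta):      ", np.exp(-theta))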
1.7.15 Example
Let X1 , . . . , Xn be a random sample from the N(θ, 1) distribution. Find the
U.M.V.U.E. of τ (θ) = Φ(c − θ) = P (Xi ≤ c; θ) for some constant c where
Φ is the standard normal cumulative distribution function. Show that the
U.M.V.U.E. is also a consistent estimator of τ (θ).
Since (X1, . . . , Xn) is a member of the regular exponential family with
natural sufficient statistic T = Σ_{i=1}^n Xi, therefore T is a complete sufficient
statistic. Consider the random variable U(X1) = 1 if X1 ≤ c and 0 otherwise.
Then
E[U(X1); θ] = 1 · P(X1 ≤ c; θ) = Φ(c − θ),   θ ∈ ℝ,
and U(X1) is an unbiased estimator of τ(θ) = Φ(c − θ). Therefore by the
Rao-Blackwell Theorem E(U|T) is the U.M.V.U.E. of τ(θ) = Φ(c − θ).
Since X1, . . . , Xn is a random sample from the N(θ, 1) distribution,
X1 ∼ N(θ, 1),  T = Σ_{i=1}^n Xi ∼ N(nθ, n)  and  Σ_{i=2}^n Xi ∼ N((n − 1)θ, n − 1).
The conditional p.d.f. of X1 given T = t is
f (x1 | T = t)
= (1/√(2π)) exp[−(x1 − θ)^2/2] × (1/√(2π(n − 1))) exp{−[t − x1 − (n − 1)θ]^2/[2(n − 1)]}
  ÷ (1/√(2πn)) exp{−(t − nθ)^2/(2n)}
= (1/√(2π(1 − 1/n))) exp{−(1/2)[(t − x1)^2/(n − 1) + x1^2 − t^2/n]}
= (1/√(2π(1 − 1/n))) exp{−(x1 − t/n)^2/[2(1 − 1/n)]}
which is the p.d.f. of a N(t/n, 1 − 1/n) random variable. Since X1 | T = t has a
N(t/n, 1 − 1/n) distribution,
E(U|T) = 1 · P(X1 ≤ c | T) = Φ((c − T/n)/√(1 − 1/n))
is the U.M.V.U.E. of τ(θ) = Φ(c − θ).
Since X1, . . . , Xn is a random sample from the N(θ, 1) distribution then
by the W.L.L.N. X̄ →p θ and by the Limit Theorems
E(U|T) = Φ((c − T/n)/√(1 − 1/n)) = Φ((c − X̄)/√(1 − 1/n)) →p Φ(c − θ)
and therefore E(U|T) is a consistent estimator of τ(θ) = Φ(c − θ).
1.7.16 Problem
Let X1 , . . . , Xn be a random sample from the distribution with probability
density function
f (x; θ) = θ x^{θ−1},   0 < x < 1, θ > 0.
Show that the geometric mean of the sample, (Π_{i=1}^n Xi)^{1/n}, is a complete
sufficient statistic and find the U.M.V.U.E. of θ.
Hint: − log Xi ∼ EXP(1/θ).
1.7.17 Problem
Let X1, . . . , Xn be a random sample from the EXP(β, μ) distribution where
μ is known. Show that T = Σ_{i=1}^n Xi is a complete sufficient statistic. Find
the U.M.V.U.E. of β^2.
1.7.18 Problem
Let X1 , . . . , Xn be a random sample from the GAM(α, β) distribution and
θ = (α, β). Find the U.M.V.U.E. of τ (θ) = αβ.
1.7.19 Problem
Let X ∼ NB(k, θ). Find the U.M.V.U.E. of θ.
Hint: Find E[(X + k − 1)^{−1}; θ].
1.7.20 Problem
Let X1 , . . . , Xn be a random sample from the N(θ, 1) distribution. Find
the U.M.V.U.E. of τ (θ) = θ2 .
1.7.21 Problem
Let X1 , . . . , Xn be a random sample from the N(0, θ) distribution. Find
the U.M.V.U.E. of τ (θ) = θ2 .
1.7.22 Problem
Let X1 , . . . , Xn be a random sample from the POI(θ) distribution. Find
the U.M.V.U.E. for τ (θ) = (1 + θ)e−θ .
Hint: Find P (X1 ≤ 1; θ).
Member of the REF                 Complete Sufficient Statistic
POI(θ)                            Σ_{i=1}^n Xi
BIN(n, θ)                         Σ_{i=1}^n Xi
NB(k, θ)                          Σ_{i=1}^n Xi
N(μ, σ^2), σ^2 known              Σ_{i=1}^n Xi
N(μ, σ^2), μ known                Σ_{i=1}^n (Xi − μ)^2
N(μ, σ^2)                         (Σ_{i=1}^n Xi, Σ_{i=1}^n Xi^2)
GAM(α, β), α known                Σ_{i=1}^n Xi
GAM(α, β), β known                Π_{i=1}^n Xi
GAM(α, β)                         (Σ_{i=1}^n Xi, Π_{i=1}^n Xi)
EXP(β, μ), μ known                Σ_{i=1}^n Xi

Not a Member of the REF           Complete Sufficient Statistic
UNIF(0, θ)                        X(n)
UNIF(a, b)                        (X(1), X(n))
EXP(β, μ), β known                X(1)
EXP(β, μ)                         (X(1), Σ_{i=1}^n Xi)
1.7.23 Problem
Let X1, . . . , Xn be a random sample from the POI(θ) distribution. Find
the U.M.V.U.E. for τ(θ) = e^{−2θ}. Hint: Find E[(−1)^{X1}; θ]. Show that this
estimator has some undesirable properties when n = 1 and n = 2 but when
n is large, it is approximately equal to the maximum likelihood estimator.
1.7.24 Problem
Let X1 , . . . , Xn be a random sample from the GAM(2, θ) distribution. Find
the U.M.V.U.E. of τ1 (θ) = 1/θ and the U.M.V.U.E. of
τ2 (θ) = P (X1 > c; θ) where c > 0 is a constant.
1.7.25 Problem
In Problem 1.5.11 show that β̂ is the U.M.V.U.E. of β and Se^2 is the
U.M.V.U.E. of σ 2 .
1.7.26 Problem
A Brownian Motion process is a continuous-time stochastic process X (t)
which is often used to describe the value of an asset. Assume X (t) represents the market price of a given asset such as a portfolio of stocks at
time t and x0 is the value of the portfolio at the beginning of a given time
period (assume that the analysis is conditional on x0 so that x0 is fixed
and known). The distribution of X (t) for any fixed time t is assumed to be
N(x0 + μt, σ2 t) for 0 < t ≤ 1. The parameter μ is the drift of the Brownian
motion process and the parameter σ is the diffusion coefficient. Assume
that t = 1 corresponds to the end of the time period so X (1) is the closing
price.
Suppose that we record both the period high max_{0≤t≤1} X(t) and the close
X(1). Define random variables
M = max_{0≤t≤1} X(t) − x0   and   Y = X(1) − x0.
The joint probability density function of (M, Y) can be shown to be
f (m, y; μ, σ^2) = [2(2m − y)/(√(2π) σ^3)] exp{−[(2m − y)^2 − 2μy + μ^2]/(2σ^2)},
m > 0, −∞ < y < m, μ ∈ ℝ and σ^2 > 0.
(a) Show that (M, Y ) has a regular exponential family distribution.
(b) Let Z = M(M − Y). Show that Y ∼ N(μ, σ^2) and Z ∼ EXP(σ^2/2)
independently.
(c) Suppose we record independent pairs of observations (Mi, Yi),
i = 1, . . . , n on the portfolio for a total of n distinct time periods. Find the
U.M.V.U.E.’s of μ and σ^2.
(d) Show that the estimators
V1 = (1/(n − 1)) Σ_{i=1}^n (Yi − Ȳ)^2
and
V2 = (2/n) Σ_{i=1}^n Zi = (2/n) Σ_{i=1}^n Mi(Mi − Yi)
are also unbiased estimators of σ^2. How do we know that neither of these
estimators is the U.M.V.U.E. of σ^2? Show that the U.M.V.U.E. of σ^2 can
be written as a weighted average of V1 and V2. Compare the variances of
all three estimators.
(e) An up-and-out call option on the portfolio is an option with exercise
price ξ (a constant) which pays a total of (X (1) − ξ) dollars at the end of
one period provided that this quantity is positive and provided that X (t)
never exceeded the value of a barrier throughout this period of time, that
is, provided that M < a. Thus the option pays
g(M, Y ) = max {Y − (ξ − x0 ), 0}
if M < a
and otherwise g(M, Y ) = 0. Find the expected value of such an option,
that is, find the expected value of g(M, Y ).
1.8 Ancillarity
Let X = (X1 , . . . , Xn ) denote observations from a distribution with probability (density) function {f (x; θ) ; θ ∈ Ω} and let U (X) be a statistic. The
information on the parameter θ is provided by the sensitivity of the distribution of a statistic to changes in the parameter. For example, suppose a
modest change in the parameter value leads to a large change in the expected value of the distribution resulting in a large shift in the data. Then
the parameter can be estimated fairly precisely. On the other hand, if a
statistic U has no sensitivity at all in distribution to the parameter, then
it would appear to contain little information for point estimation of this
parameter. A statistic of the second kind is called an ancillary statistic.
1.8.1 Definition
U (X) is an ancillary statistic if its distribution does not depend on the
unknown parameter θ.
Ancillary statistics are, in a sense, orthogonal or perpendicular to minimal sufficient statistics. Ancillary statistics are analogous to the residuals in
a multiple regression, while the complete sufficient statistics are analogous
to the estimators of the regression coefficients. It is well-known that the
residuals are uncorrelated with the estimators of the regression coefficients
(and independent in the case of normal errors). However, the “irrelevance”
of the ancillary statistic seems to be limited to the case when it is not part
of the minimal (preferably complete) sufficient statistic as the following
example illustrates.
1.8.2 Example
Suppose a fair coin is tossed to determine a random variable N = 1 with
probability 1/2 and N = 100 otherwise. We then observe a Binomial random variable X with parameters (N, θ). Show that the minimal sufficient
statistic is (X, N ) but that N is an ancillary statistic. Is N irrelevant to
inference about θ?
In this example it seems reasonable to condition on an ancillary component of the minimal sufficient statistic. Conducting inference conditionally
on the ancillary statistic essentially means treating the observed number of
trials as if it had been fixed in advance instead of the result of the toss of
a fair coin. This example also illustrates the use of the following principle:
1.8.3 The Conditionality Principle
Suppose the minimal sufficient statistic can be written in the form
T = (U, A) where A is an ancillary statistic. Then all inference should be
conducted using the conditional distribution of the data given the value of
the ancillary statistic, that is, using the distribution of X|A.
Some difficulties arise from the application of this principle since there
is no general method for constructing the ancillary statistic and ancillary
statistics are not necessarily unique.
The following theorem allows us to use the properties of completeness
and ancillarity to prove the independence of two statistics without finding
their joint distribution.
1.8.4 Basu’s Theorem
Consider X with probability (density) function {f (x; θ) ; θ ∈ Ω}. Let T (X)
be a complete sufficient statistic. Then T (X) is independent of every ancillary statistic U (X).
1.8.5 Proof
We need to show
P [U (X) ∈ B, T (X) ∈ C; θ] = P [U (X) ∈ B; θ] · P [T (X) ∈ C; θ]
for all sets B, C and all θ ∈ Ω.
Let
g(t) = P [U (X) ∈ B|T (X) = t] − P [U (X) ∈ B]
for all t ∈ A where P (T ∈ A; θ) = 1. By sufficiency, P [U (X) ∈ B|T (X) = t]
does not depend on θ, and by ancillarity, P [U (X) ∈ B] also does not
depend on θ. Therefore g(T ) is a statistic.
Let
I{U(X) ∈ B} = 1 if U(X) ∈ B, and 0 otherwise.
Then
E[I{U (X) ∈ B}] = P [U (X) ∈ B],
E[I{U (X) ∈ B}|T = t] = P [U (X) ∈ B|T = t],
and
g(t) = E[I{U (X) ∈ B}|T (X) = t] − E[I{U (X) ∈ B}].
This gives
E[g(T )] = E[E[I{U (X) ∈ B}|T ]] − E[I{U (X) ∈ B}]
= E[I{U (X) ∈ B}] − E[I{U (X) ∈ B}]
= 0 for all θ ∈ Ω,
and since T is complete this implies P [g(T ) = 0; θ] = 1 for all θ ∈ Ω.
Therefore
P [U (X) ∈ B|T (X) = t] = P [U (X) ∈ B] for all t ∈ A and all B. (1.4)
Suppose T has probability density function h(t; θ). Then
P[U(X) ∈ B, T(X) ∈ C; θ] = ∫_C P[U(X) ∈ B | T = t] h(t; θ) dt
                          = ∫_C P[U(X) ∈ B] h(t; θ) dt        by (1.4)
                          = P[U(X) ∈ B] · ∫_C h(t; θ) dt
                          = P[U(X) ∈ B] · P[T(X) ∈ C; θ],
true for all sets B, C and all θ ∈ Ω as required.
1.8.6 Example
Let X1, . . . , Xn be a random sample from the EXP(θ) distribution. Show
that T(X1, . . . , Xn) = Σ_{i=1}^n Xi and U(X1, . . . , Xn) = (X1/T, . . . , Xn/T) are
independent random variables. Find E(X1/T).
1.8.7 Example
Let X1 , . . . , Xn be a random sample from the N(μ, σ 2 ) distribution. Prove
that X̄ and S 2 are independent random variables.
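A simulation sketch (not part of the notes; NumPy assumed) that is consistent with this result: across many normal samples the sample correlation between X̄ and S^2 is near zero, whereas for a skewed distribution such as the exponential it is clearly positive. (Zero correlation does not by itself prove independence, of course.)

import numpy as np

rng = np.random.default_rng(3)
reps, n = 50_000, 10
x = rng.normal(loc=2.0, scale=3.0, size=(reps, n))    # arbitrary mu and sigma

xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)
print("corr(xbar, s2) for normal samples:", np.corrcoef(xbar, s2)[0, 1])

y = rng.exponential(scale=1.0, size=(reps, n))        # a skewed comparison case
print("corr(ybar, s2) for exponential samples:",
      np.corrcoef(y.mean(axis=1), y.var(axis=1, ddof=1))[0, 1])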
1.8.8 Problem
Let X1, . . . , Xn be a random sample from the distribution with p.d.f.
f (x; β) = 2x/β^2,   0 < x ≤ β.
(a) Show that β is a scale parameter for this model.
(b) Show that T = T (X1 , . . . , Xn ) = X(n) is a complete sufficient statistic
for this model.
(c) Find the U.M.V.U.E. of β.
(d) Show that T and U = U (X) = X1 /T are independent random variables.
(e) Find E (X1 /T ).
1.8.9 Problem
Let X1 , . . . , Xn be a random sample from the GAM(α, β) distribution.
(a) Show that β is a scale parameter for this model.
(b) Suppose α is known. Show that T = T(X1, . . . , Xn) = Σ_{i=1}^n Xi is a
complete sufficient statistic for the model.
(c) Show that T and U = U (X1 , . . . , Xn ) = (X1 /T, . . . , Xn /T ) are independent random variables.
(d) Find E(X1 /T ).
1.8.10 Problem
In Problem 1.5.11 show that β̂ and Se^2 are independent random variables.
1.8.11 Problem
Let X1 , . . . , Xn be a random sample from the EXP(β, μ) distribution.
(a) Suppose β is known. Show that T1 = X(1) is a complete sufficient
statistic for the model.
(b) Show that T1 and T2 = Σ_{i=1}^n (Xi − X(1)) are independent random
variables.
(c) Find the p.d.f. of T2. Hint: Show Σ_{i=1}^n (Xi − μ) = n(T1 − μ) + T2.
(d) Show that (T1, T2) is a complete sufficient statistic for the model
{f (x1, . . . , xn; μ, β); μ ∈ ℝ, β > 0}.
(e) Find the U.M.V.U.E.’s of β and μ.
1.8.12 Problem
Let X1 , . . . , Xn be a random sample from the distribution with p.d.f.
f (x; α, β) = α x^{α−1}/β^α,   α > 0, 0 < x ≤ β.
(a) Show that if α is known then T1 = X(n) is a complete sufficient statistic
for the model.
(b) Show that T1 and T2 = Π_{i=1}^n (Xi/T1) are independent random variables.
(c) Find the p.d.f. of T2. Hint: Show
Σ_{i=1}^n log(Xi/β) = log T2 + n log(T1/β).
(d) Show that (T1 , T2 ) is a complete sufficient statistic for the model.
(e) Find the U.M.V.U.E. of α.
Chapter 2
Maximum Likelihood Estimation

2.1 Maximum Likelihood Method - One Parameter
Suppose we have collected the data x (possibly a vector) and we believe that
these data are observations from a distribution with probability function
P (X = x; θ) = f (x; θ)
where the scalar parameter θ is unknown and θ ∈ Ω. The probability of
observing the data x is equal to f (x; θ). When the observed value of x is
substituted into f (x; θ), then f (x; θ) is a function of the parameter θ only.
In the absence of any other information, it seems logical that we should
estimate the parameter θ using a value most compatible with the data. For
example we might choose the value of θ which maximizes the probability
of the observed data.
2.1.1 Definition
Suppose X is a random variable with probability function
P (X = x; θ) = f (x; θ), where θ ∈ Ω is a scalar and suppose x is the observed data. The likelihood function for θ is
L(θ) = P (observing the data x; θ)
= P (X = x; θ)
= f (x; θ), θ ∈ Ω.
If X = (X1 , . . . , Xn ) is a random sample from the probability function
P (X = x; θ) = f (x; θ) and x = (x1 , . . . , xn ) are the observed data then the
likelihood function for θ is
L(θ) = P(observing the data x; θ)
     = P(X1 = x1, . . . , Xn = xn; θ)
     = Π_{i=1}^n f (xi; θ),   θ ∈ Ω.
The value of θ which maximizes the likelihood L (θ) also maximizes
the logarithm of the likelihood function. (Why?) Since it is easier to find
the derivative of the sum of n terms rather than the product, we usually
determine the maximum of the logarithm of the likelihood function.
2.1.2 Definition
The log likelihood function is defined as
l(θ) = log L(θ),
θ∈Ω
where log is the natural logarithmic function.
2.1.3 Definition
The value of θ that maximizes the likelihood function L(θ) or equivalently
the log likelihood function l(θ) is called the maximum likelihood (M.L.)
estimate. The M.L. estimate is a function of the data x and we write
θ̂ = θ̂ (x). The corresponding M.L. estimator is denoted θ̂ = θ̂(X).
2.1.4 Example
Suppose in a sequence of n Bernoulli trials with P (Success) = θ we have
observed x successes. Find the likelihood function L (θ), the log likelihood
function l (θ), the M.L. estimate of θ and the M.L. estimator of θ.
2.1.5 Example
Suppose we have collected data x1 , . . . , xn and we believe these observations are independent observations from a POI(θ) distribution. Find the
likelihood function, the log likelihood function, the M.L. estimate of θ and
the M.L. estimator of θ.
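A numerical sketch of this example (the data values below are made up purely for illustration; NumPy assumed): for Poisson data the log likelihood is, up to an additive constant, l(θ) = (Σ xi) log θ − nθ, and a crude grid maximization lands on θ̂ = x̄.

import numpy as np

x = np.array([2, 0, 3, 1, 4, 2, 1, 0, 2, 3])   # made-up Poisson-type counts
theta = np.linspace(0.01, 10, 2000)            # grid of candidate parameter values

# log likelihood up to an additive constant: l(theta) = sum(x)*log(theta) - n*theta
log_lik = x.sum() * np.log(theta) - len(x) * theta
print("grid MLE:", theta[np.argmax(log_lik)], "  sample mean:", x.mean())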
2.1.6 Problem
Suppose we have collected data x1 , . . . , xn and we believe these observations are independent observations from the DU(θ) distribution. Find the
likelihood function, the M.L. estimate of θ and the M.L. estimator of θ.
2.1.7 Definition
The score function is defined as
S(θ) = (d/dθ) l(θ) = (d/dθ) log L(θ),   θ ∈ Ω.

2.1.8 Definition
The information function is defined as
I(θ) = −(d^2/dθ^2) l(θ) = −(d^2/dθ^2) log L(θ),   θ ∈ Ω.
I(θ̂) is called the observed information.
In Section 2.7 we will see how the observed information I(θ̂) can be used
to construct approximate confidence intervals for the unknown parameter
θ. I(θ̂) also tells us about the concavity of the log likelihood function.
Suppose in Example 2.1.5 the M.L. estimate of θ was θ̂ = 2. If n = 10
then I(θ̂) = 10/2 = 5. If n = 25 then I(θ̂) = 25/2 = 12.5. See Figure
2.1. The log likelihood function is more concave down for n = 25 than for
n = 10 which reflects the fact that as the number of observations increases
we have more “information” about the unknown parameter θ.
2.1.9
Finding M.L. Estimates
If X1 , . . . , Xn is a random sample from a distribution whose support set does
not depend on θ then we usually find θ̂ by solving S(θ) = 0. It is important
to verify that θ̂ is the value of θ which maximizes L (θ) or equivalently l (θ).
This can be done using the First Derivative Test. Note that the condition
I(θ̂) > 0 only checks for a local maximum.
Although we view the likelihood, log likelihood, score and information functions as functions of θ they are, of course, also functions of the observed
42
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
2
0
n=10
-2
R(θ)
n=25
-4
-6
-8
-10
1
1.5
2
θ
2.5
3
3.5
Figure 2.1: Poisson Log Likelihoods for n = 10 and n = 25
data x. When it is important to emphasize the dependence on the data
x we will write L(θ; x), S(θ; x), etc. Also when we wish to determine the
sampling properties of these functions as functions of the random variable
X we will write L(θ; X), S(θ; X), etc.
2.1.10
Definition
If θ is a scalar then the expected or Fisher information (function) is given
by
¸
∙
∂2
J(θ) = E [I(θ; X); θ] = E − 2 l(θ; X); θ , θ ∈ Ω.
∂θ
Note:
If X1 , . . . , Xn is a random sample from f (x; θ) then
∙
¸
∙
¸
∂2
∂2
J (θ) = E − 2 l(θ; X); θ = nE − 2 log f (X; θ); θ
∂θ
∂θ
where X has probability function f (x; θ).
2.1. MAXIMUM LIKELIHOOD METHOD- ONE PARAMETER 43
2.1.11
Example
Find the Fisher information based on a random sample X1 , . . . , Xn from the
POI(θ) distribution and compare it to the variance of the M.L. estimator
θ̂. How does the Fisher information change as n increases?
The Poisson model is used to model the number of events occurring in
time or space. Suppose it is not possible to observe the number of events
but only whether or not one or more events has occurred. In other words
it is only possible to observe the outcomes “X = 0” and “X > 0”. Let
Y be the number of times the outcome “X = 0” is observed in a sample
of size n. Find the M.L. estimator of θ for these data. Compare the
Fisher information for these data with the Fisher information based on
(X1 , . . . , Xn ). See Figure 2.2
1
0.9
0.8
0.7
Ratio of
Information
Functions0.6
0.5
0.4
0.3
0.2
0.1
0
0
1
2
3
4
θ
5
6
7
8
Figure 2.2: Ratio of Fisher Information Functions
2.1.12
Problem
Suppose X ∼ BIN(n, θ) and we observe X. Find θ̂, the M.L. estimator of
θ, the score function, the information function and the Fisher information.
Compare the Fisher information with the variance of θ̂.
44
2.1.13
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
Problem
Suppose X ∼ NB(k, θ) and we observe X. Find the M.L. estimator of θ,
the score function and the Fisher information.
2.1.14
Problem - Randomized Sampling
A professor is interested estimating the unknown quantity θ which is the
proportion of students who cheat on tests. She conducts an experiment in
which each student is asked to toss a coin secretly. If the coin comes up a
head the student is asked to toss the coin again and answer “Yes” if the
second toss is a head and “No” if the second toss is a tail. If the first toss
of the coin comes up a tail, the student is asked to answer “Yes” or “No”
to the question: Have you ever cheated on a University test? Students
are assumed to answer more honestly in this type of randomized response
survey because it is not known to the questioner whether the answer “Yes”
is a result of tossing the coin twice and obtaining two heads or because the
student obtained a tail on the toss of the coin and then answered “Yes” to
the question about cheating.
(a) Find the probability that x students answer “Yes” in a class of n students.
(b) Find the M.L. estimator of θ based on X students answering “Yes” in
a class of n students. Be sure to verify that your answer corresponds to a
maximum.
(c) Find the Fisher information for θ.
(d) In a simpler experiment n students could be asked to answer “Yes”
or “No” to the question: Have you ever cheated on a University test? If
we could assume that they answered the question honestly then we would
expect to obtain more information about θ from this simpler experiment.
Determine the amount of information lost in doing the randomized response
experiment as compared to the simpler experiment.
2.1.15
Problem
Suppose (X1 , X2 ) ∼ MULT(n, θ2 , 2θ (1 − θ)). Find the M.L. estimator of
θ, the score function and the Fisher information.
2.1.16
Likelihood Functions for Continuous Models
Suppose X is a continuous random variable with probability density function f (x; θ). We will often observe only the value of X rounded to some
2.1. MAXIMUM LIKELIHOOD METHOD- ONE PARAMETER 45
degree of precision (say one decimal place) in which case the actual observation is a discrete random variable. For example, suppose we observe X
correct to one decimal place. Then
1.15
Z
P (we observe 1.1) =
f (x; θ)dx ≈ (1.15 − 1.05) · f (1.1; θ)
1.05
assuming the function f (x; θ) is quite smooth over the interval. More generally, if we observe X rounded to the nearest ∆ (assumed small) then the
likelihood of the observation is approximately ∆f (observation; θ). Since
the precision ∆ of the observation does not depend on the parameter, then
maximizing the discrete likelihood of the observation is essentially equivalent to maximizing the probability density function f (observation; θ) over
the parameter.
Therefore if X = (X1 , . . . , Xn ) is a random sample from the probability
density function f (x; θ) and x = (x1 , . . . , xn ) are the observed data then
we define the likelihood function for θ as
L(θ) = L (θ; x) =
n
Q
i=1
See also Problem 2.8.12.
2.1.17
f (xi ; θ),
θ ∈ Ω.
Example
Suppose X1 , . . . , Xn is a random sample from the distribution with probability density function
f (x; θ) = θxθ−1 , 0 ≤ x ≤ 1, θ > 0.
Find the score function, the M.L. estimator, and the information function
of θ. Find the observed information. Find the mean and variance of θ̂.
Compare the Fisher infomation and the variance of θ̂.
2.1.18
Example
Suppose X1 , . . . , Xn is a random sample from the UNIF(0, θ) distribution.
Find the M.L. estimator of θ.
2.1.19
Problem
Suppose X1 , . . . , Xn is a random sample from the UNIF(θ, θ + 1) distribution. Show the M.L. estimator of θ is not unique.
46
2.1.20
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
Problem
Suppose X1 , . . . , Xn is a random sample from the DE(1, θ) distribution.
Find the M.L. estimator of θ.
2.1.21
Problem
Show that if θ̂ is the unique M.L. estimator of θ then θ̂ must be a function
of the minimal sufficient statistic.
2.1.22
Problem
The word information generally implies something that is additive. Suppose X has probability (density) function f (x; θ) , θ ∈ Ω and independently
Y has probability (density) function g(y; θ), θ ∈ Ω. Show that the Fisher
information in the joint observation (X, Y ) is the sum of the Fisher information in X plus the Fisher information in Y .
Often S(θ) = 0 must be solved numerically using an iterative method
such as Newton’s Method.
2.1.23
Newton’s Method
Let θ(0) be an initial estimate of θ. We may update that value as follows:
θ(i+1) = θ(i) +
S(θ(i) )
,
I(θ(i) )
i = 0, 1, . . .
Notes:
(1) The initial estimate, θ(0) , may be determined by graphing L (θ) or l (θ).
(2) The algorithm is usually run until the value of θ(i) no longer changes
to a reasonable number of decimal places. When the algorithm is stopped
it is always important to check that the value of θ obtained does indeed
maximize L (θ).
(3) This algorithm is also called the Newton-Raphson Method.
(4) I (θ) can be replaced by J (θ) for a similar algorithm which is called the
method of scoring or Fisher’s method of scoring.
(5) The value of θ̂ may also be found by maximizing L(θ) or l(θ) using
the maximization (minimization) routines available in various statistical
software packages such as Maple, S-Plus, Matlab, R etc.
(6) If the support of X depends on θ (e.g. UNIF(0, θ)) then θ̂ is not found
by solving S(θ) = 0.
2.1. MAXIMUM LIKELIHOOD METHOD- ONE PARAMETER 47
2.1.24
Example
Suppose X1 , . . . , Xn is a random sample from the WEI(1, β) distribution.
Explain how you would find the M.L. estimate of β using Newton’s Method.
How would you find the mean and variance of the M.L. estimator of β?
2.1.25
Problem - Likelihood Function for Grouped Data
Suppose X is a random variable with probability (density) function f (x; θ)
and P (X ∈ A; θ) = 1. Suppose A1 , A2 , . . . , Am is a partition of A and let
pj (θ) = P (X ∈ Aj ; θ),
j = 1, . . . , m.
Suppose n independent observations are collected from this distribution but
it is only possible to determine to which one of the m sets, A1 , A2 , . . . , Am ,
the i’th observation belongs. The observed data are:
Outcome
Frequency
A1
f1
A2
f2
...
...
Am
fm
Total
n
(a) Show that the Fisher information for these data is given by
J(θ) = n
Hint: Since
m
P
pj (θ) = 1,
j=1
m [p0 (θ)]2
P
j
.
j=1 pj (θ)
¸
∙m
d P
pj (θ) = 0.
dθ i=1
(b) Explain how you would find the M.L. estimate of θ.
2.1.26
Definition
The relative likelihood function R(θ) is defined by
R(θ) = R (θ; x) =
L(θ)
L(θ̂)
,
θ ∈ Ω.
The relative likelihood function takes on values between 0 and 1.and can
be used to rank possible parameter values according to their plausibilities
in light of the data. If R(θ1 ) = 0.1, say, then θ1 is rather an implausible
parameter value because the data are ten times more probable when θ = θ̂
than they are when θ = θ1 . However, if R(θ1 ) = 0.5, say, then θ1 is a fairly
plausible value because it gives the data 50% of the maximum possible
probability under the model.
48
2.1.27
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
Definition
The set of θ values for which R(θ) ≥ p is called a 100p% likelihood region
for θ. If the region is an interval of real values then it is called a 100p%
likelihood interval (L.I.) for θ.
Values inside the 10% L.I. are referred to as plausible and values outside
this interval as implausible. Values inside a 50% L.I. are very plausible and
outside a 1% L.I. are very implausible in light of the data.
2.1.28
Definition
The log relative likelihood function is the natural logarithm of the relative
likelihood function:
r(θ) = r (θ; x) = log[R(θ)] = log[L(θ] − log[L(θ̂)] = l(θ) − l(θ̂),
θ ∈ Ω.
Likelihood regions or intervals may be determined from a graph of R(θ)
or r(θ) and usually it is more convenient to work with r(θ). Alternatively,
they can be found by solving r(θ) − log p = 0. Usually this must be done
numerically.
2.1.29
Example
Plot the relative likelihood function for θ in Example 2.1.5 if n = 15 and
θ̂ = 1. Find the 15% L.I.’s for θ. See Figure 2.3
2.1.30
Problem
Suppose X ∼ BIN(n, θ). Plot the relative likelihood function for θ if x = 3
is observed for n = 100. On the same graph plot the relative likelihood
function for θ if x = 6 is observed for n = 200. Compare the graphs as well
as the 10% L.I. and 50% L.I. for θ.
2.1.31
Problem
Suppose X1 , . . . , Xn is a random sample from the EXP(1, θ) distribution.
Plot the relative likelihood function for θ if n = 20 and x(1) = 1. Find 10%
and 50% L.I.’s for θ.
2.1. MAXIMUM LIKELIHOOD METHOD- ONE PARAMETER 49
1
0.9
0.8
0.7
R(θ)
0.6
0.5
0.4
0.3
0.2
0.1
0
0
0.5
1
1.5
θ
2
2.5
Figure 2.3: Relative Likelihood Function for Example 2.1.29
2.1.32
Problem
The following model is proposed for the distribution of family size in a large
population:
P (k children in family; θ) = θk , for k = 1, 2, . . .
1 − 2θ
P (0 children in family; θ) =
.
1−θ
The parameter θ is unknown and 0 < θ < 12 . Fifty families were chosen at
random from the population. The observed numbers of children are given
in the following table:
No. of children
Frequency observed
0
17
1
22
2
7
3
3
4
1
Total
50
(a) Find the likelihood, log likelihood, score and information functions for
θ.
50
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
(b) Find the M.L. estimate of θ and the observed information.
(c) Find a 15% likelihood interval for θ.
(d) A large study done 20 years earlier indicated that θ = 0.45. Is this
value plausible for these data?
(e) Calculate estimated expected frequencies. Does the model give a reasonable fit to the data?
2.1.33
Problem
The probability that k different species of plant life are found in a randomly
chosen plot of specified area is
¢k+1
¡
1 − e−θ
pk (θ) =
,
(k + 1) θ
k = 0, 1, . . . ; θ > 0.
The data obtained from an examination of 200 plots are given in the table
below:
No. of species
Frequency observed
0
147
1
36
2
13
3
4
≥4
0
Total
200
(a) Find the likelihood, log likelihood, score and information functions for
θ.
(b) Find the M.L. estimate of θ and the observed information.
(c) Find a 15% likelihood interval for θ.
(d) Is θ = 1 a plausible value of θ in light of the observed data?
(e) Calculate estimated expected frequencies. Does the model give a reasonable fit to the data?
2.2
Principles of Inference
In Chapter 1 we discussed the Sufficiency Principle and the Conditionality
Principle. There is another principle which is equivalent to the Sufficiency
Principle. The likelihood ratios generate the minimal sufficient partition.
In other words, two likelihood ratios will agree
f (x1 ; θ)
f (x2 ; θ)
=
f (x1 ; θ0 )
f (x2 ; θ0 )
if and only if the values of the minimal sufficient statistic agree, that is,
T (x1 ) = T (x2 ). Thus we obtain:
2.2. PRINCIPLES OF INFERENCE
2.2.1
51
The Weak Likelihood Principle
Suppose for two different observations x1 , x2 , the likelihood ratios
f (x2 ; θ)
f (x1 ; θ)
=
f (x1 ; θ0 )
f (x2 ; θ0 )
for all values of θ, θ0 ∈ Ω. Then the two different observations x1 , x2 should
lead to the same inference about θ.
A weaker but similar principle, the Invariance Principle follows. This
can be used, for example, to argue that for independent identically distributed observations, it is only the value of the observations (the order
statistic) that should be used for inference, not the particular order in
which those observations were obtained.
2.2.2
Invariance Principle
Suppose for two different observations x1 , x2 ,
f (x1 ; θ) = f (x2 ; θ)
for all values of θ ∈ Ω. Then the two different observations x1 , x2 should
lead to the same inference about θ.
There are relationships among these and other principles. For example, Birnbaum proved that the Conditionality Principle and the Sufficiency
Principle above imply a stronger version of a Likelihood Principle. However, it is probably safe to say that while probability theory has been quite
successfully axiomatized, it seems to be difficult if not impossible to derive most sensible statistical procedures from a set of simple mathematical
axioms or principles of inference.
2.2.3
Problem
Consider the model {f (x; θ) ; θ ∈ Ω} and suppose that θ̂ is the M.L. estimator based on the observation X. We often draw conclusions about the
plausibility of a given parameter value θ based on the relative likelihood
L(θ)
. If this is very small, for example, less than or equal to 1/N , we regard
L(θ̂)
the value of the parameter θ as highly unlikely. But what happens if this
test declares every value of the parameter unlikely?
Suppose f (x; θ) = 1 if x = θ and f (x; θ) = 0 otherwise, where
θ = 1, 2, . . . N . Define f0 (x) to be the discrete uniform distribution on the
52
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
integers {1, 2, . . . , N }. In this example the parameter space is
Ω = {θ; θ = 0, 1, ..., N }. Show that the relative likelihood
1
f0 (x)
≤
f (x; θ)
N
no matter what value of x is observed. Should this be taken to mean that
the true distribution cannot be f0 ?
2.3
Properties of the Score and Information
- Regular Model
Consider the model {f (x; θ) ; θ ∈ Ω}. The following is a set of sufficient
conditions which we will use to determine the properties of the M.L. estimator of θ. These conditions are not the most general conditions but
are sufficiently general for most applications. Notable exceptions are the
UNIF(0, θ) and the EXP(1, θ) distributions which will be considered separately.
For convenience we call a family of models which satisfy the following
conditions a regular family of distributions. (See 1.7.9.)
2.3.1
Regular Model
Consider the model {f (x; θ) ; θ ∈ Ω}. Suppose that:
(R1) The parameter space Ω is an open interval in the real line.
(R2) The densities f (x; θ) have common support, so that the set
A = {x; f (x; θ) > 0} , does not depend on θ.
(R3) For all x ∈ A, f (x; θ) is a continuous, three times differentiable function of θ.
R
(R4) The integral f (x; θ) dx can be twice differentiated with respect to θ
A
under the integral sign, that is,
Z
Z
∂k
∂k
f
(x;
θ)
dx
=
f (x; θ) dx, k = 1, 2 for all θ ∈ Ω.
∂θk
∂θk
A
A
(R5) For each θ0 ∈ Ω there exist a positive number c and function M (x)
(both of which may depend on θ0 ), such that for all θ ∈ (θ0 − c, θ0 + c)
¯
¯ 3
¯ ∂ log f (x; θ) ¯
¯ < M (x)
¯
¯
¯
∂θ3
2.3. PROPERTIES OF THE SCORE AND INFORMATION- REGULAR MODEL53
holds for all x ∈ A, and
E [M (X) ; θ] < ∞ for all θ ∈ (θ0 − c, θ0 + c) .
(R6) For each θ ∈ Ω,
0<E
(∙
∂ 2 log f (X; θ)
∂θ2
¸2
;θ
)
<∞
If these conditions hold with X a discrete random variable and the
integrals replaced by sums, then we shall also call this a regular family of
distributions.
Condition (R3) insures that the function ∂ log f (x; θ) /∂θ has, for each
x ∈ A, a Taylor expansion as a function of θ.
The following lemma provides one method of determining whether differentiation under the integral sign (condition (R4)) is valid.
2.3.2
Lemma
Suppose ∂g (x; θ) /∂θ exists for all θ ∈ Ω, and all x ∈ A. Suppose also that
for each θ0 ∈ Ω there exist a positive number c, and function G (x) (both
of which may depend on θ0 ), such that for all θ ∈ (θ0 − c, θ0 + c)
¯
¯
¯ ∂g (x; θ) ¯
¯
¯
¯ ∂θ ¯ < G(x)
holds for all x ∈ A, and
Z
G (x) dx < ∞.
A
Then
∂
∂θ
Z
A
2.3.3
g(x, θ)dx =
Z
∂
g(x, θ)dx.
∂θ
A
Theorem - Expectation and Variance of the Score
Function
If X = (X1 , . . . , Xn ) is a random sample from a regular model
{f (x; θ) ; θ ∈ Ω} then
E[S(θ; X); θ] = 0
54
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
and
V ar[S(θ; X); θ] = E{[S(θ; X)]2 ; θ} = E[I(θ; X); θ] = J(θ) < ∞
for all θ ∈ Ω.
2.3.4
Problem - Invariance Property of M.L.
Estimators
Suppose X1 , . . . , Xn is a random sample from a distribution with probability (density) function f (x; θ) where {f (x; θ) ; θ ∈ Ω} is a regular family.
Let S(θ) and J(θ) be the score function and Fisher information respectively
based on X1 , . . . , Xn . Consider the reparameterization τ = h(θ) where h
is a one-to-one differentiable function with inverse function θ = g(τ ). Let
S ∗ (τ ) and J ∗ (τ ) be the score function and Fisher information respectively
under the reparameterization.
(a) Show that τ̂ = h(θ̂) is the M.L. estimator of τ where θ̂ is the M.L.
estimator of θ.
(b) Show that E[S ∗ (τ ; X); τ ] = 0 and J ∗ (τ ) = [g 0 (τ )]2 J[g(τ )].
2.3.5
Problem
It is natural to expect that if we compare the information available in the
original data X and the information available in some statistic T (X), the
latter cannot be greater than the former since T can be obtained from X.
Show that in a regular model the Fisher information calculated from the
marginal distribution of T is less than or equal to the Fisher information
for X. Show that they are equal for all values of the parameter if and only
if T is a sufficient statistic for {f (x; θ) ; θ ∈ Ω}.
2.4
Maximum Likelihood Method
- Multiparameter
The case of several parameters is exactly analogous to the one parameter
case. Suppose θ = (θ1 , . . . , θk )T . The log likelihood function l (θ1 , . . . , θk ) =
log L (θ1 , . . . , θk ) is a function of k parameters. The M.L. estimate of θ,
∂l
θ̂ = (θ̂1 , . . . , θ̂k )T is usually found by solving ∂θ
= 0, j = 1, . . . , k simultaj
neously.
The invariance property of the M.L. estimator also holds in the multiparameter case.
2.4. MAXIMUM LIKELIHOOD METHOD- MULTIPARAMETER55
2.4.1
Definition
If θ = (θ1 ,. . . , θk )T then the score vector is defined as
⎡ ∂l ⎤
⎢
⎢
S(θ) = ⎢
⎣
2.4.2
∂θ1
..
.
∂l
∂θk
⎥
⎥
⎥,
⎦
θ ∈ Ω.
Definition
If θ = (θ1 , . . . , θk )T then the information matrix I(θ) is a k × k symmetric
matrix whose (i, j) entry is given by
−
∂2
l(θ),
∂θi ∂θj
θ ∈ Ω.
I(θ̂) is called the observed information matrix.
2.4.3
Definition
If θ = (θ1 , . . . , θk )T then the expected or Fisher information matrix J(θ) is
a k × k symmetric matrix whose (i, j) entry is given by
∙
¸
∂2
E −
l(θ; X); θ , θ ∈ Ω.
∂θi ∂θj
2.4.4
Expectation and Variance of the Score Vector
For a regular family of distributions
⎤
0
⎢ ⎥
E[S(θ; X); θ] = ⎣ ... ⎦
0
⎡
and
V ar[S(θ; X); θ] = E[S(θ; X)S(θ; X)T ; θ] = E [I (θ; X) ; θ] = J(θ).
2.4.5
Likelihood Regions
The set of θ values for which R(θ) ≥ p is called a 100p% likelihood region
for θ.
56
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
2.4.6
Example:
Suppose X1 , ..., Xn is a random sample from the N(μ, σ 2 ) distribution. Find
the score vector, the information matrix, the Fisher information matrix and
¢T
¡
the
estimator of θ = μ,¡σ2 .¢Find the observed information
¡ M.L.
¢
¡
¢matrix
I μ̂, σ̂2 and thus verify that μ̂,¡σ̂ 2 is¢the M.L. estimator of μ, σ 2 . Find
the Fisher information matrix J μ, σ 2 .
¡
¢
Since X1 , ..., Xn is a random sample from the N μ, σ 2 distribution the
likelihood function is
¸
∙
n
¡
¢
Q
−1
1
2
2
√
L μ, σ
(xi − μ)
=
exp
2σ 2
2πσ
i=1
∙
¸
n
−1 P
−n/2 ¡ 2 ¢−n/2
2
exp
(xi − μ)
σ
= (2π)
2σ2 i=1
∙
¸
n ¡
¡ ¢−n/2
¢
−1 P
2
2
exp
−
2μx
−
μ
x
= (2π)−n/2 σ 2
i
2σ2 i=1 i
∙
µn
¶¸
n
¡ ¢−n/2
P
−1 P
2
2
exp
x
−
2μ
x
+
nμ
= (2π)−n/2 σ 2
i
2σ2 i=1 i
i=1
∙
¸
¡
¢
¡
¢
−1
−n/2
−n/2
2
exp
−
2μt
+
nμ
σ2
t
, μ ∈ <, σ 2 > 0
= (2π)
1
2
2σ2
where
t1 =
n
P
i=1
x2i and t2 =
The log likelihood function is
¢
¡
l μ, σ 2 =
=
=
where
−n
log (2π) −
2
−n
log (2π) −
2
−n
log (2π) −
2
¡ ¢
n
log σ 2 −
2
¡ ¢
n
log σ 2 −
2
¡ ¢
n
log σ 2 −
2
s2 =
Now
n
P
xi .
i=1
n
1 P
2
(xi − μ)
2
2σ i=1
∙n
¸
1 ¡ 2 ¢−1 P
2
2
(xi − x̄) + n (x̄ − μ)
σ
2
i=1
i
1 ¡ 2 ¢−1 h
(n − 1) s2 + n (x̄ − μ)2 , μ ∈ <, σ 2 > 0
σ
2
n
1 P
(xi − x̄)2 .
n − 1 i=1
¡ ¢−1
n
∂l
(x̄ − μ)
= 2 (x̄ − μ) = n σ 2
∂μ
σ
2.4. MAXIMUM LIKELIHOOD METHOD- MULTIPARAMETER57
and
i
n ¡ 2 ¢−1 1 ¡ 2 ¢−2 h
∂l
2
2
=
−
+
+
n
(x̄
−
μ)
(n
−
1)
s
.
σ
σ
∂σ 2
2
2
The equations
∂l
∂l
=0
= 0,
∂μ
∂σ 2
are solved simultaneously for
μ = x̄ and σ2 =
Since
∂2l
∂μ2
∂2l
−
−
∂ (σ 2 )
2
n
1 P
(n − 1) 2
(xi − x̄)2 =
s .
n i=1
n
n
∂2l
n (x̄ − μ)
,
−
=
σ2
∂σ 2 ∂μ
σ4
h
i
n 1
1
2
2
= −
+
+
n
(x̄
−
μ)
(n
−
1)
s
2 σ4 σ6
=
the information matrix is
⎡
n/σ 2
¡
¢
2
⎣
I μ, σ =
n (x̄ − μ) /σ 4
− n2 σ14
Since
⎤
n (x̄ − μ) /σ 4
h
i ⎦
, μ ∈ <, σ 2 > 0.
+ σ16 (n − 1) s2 + n (x̄ − μ)2
¡
¡
¢
¢
n
n2
I11 μ̂, σ̂ 2 = 2 > 0 and det I μ̂, σ̂ 2 = 6 > 0
σ̂
2σ̂
then by the Second Derivative Test the M.L. estimates of of μ and σ2 are
μ̂ = x̄ and σ̂ 2 =
and the M.L. estimators are
μ̂ = X̄ and σ̂2 =
The observed information is
¡
¢
I μ̂, σ̂2 =
Now
´
³n
n
2
;
μ,
σ
= 2,
E
2
σ
σ
n
1 P
(n − 1) 2
(xi − x̄)2 =
s
n i=1
n
n ¡
¢2 (n − 1) 2
1 P
Xi − X̄ =
S .
n i=1
n
"
n/σ̂ 2
0
0
¡
¢
1
4
n/σ̂
2
#
.
" ¡
#
¢
n X̄ − μ
2
E
; μ, σ = 0,
σ4
58
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
and
¾
½
¡
¢2 i
1 h
n 1
2
2
(n − 1) S + n X̄ − μ
; μ, σ
+
E −
2 σ4 σ6
h¡
io
¢2
n 1
1 n
2
2
2
= −
+
;
μ,
σ
)
+
nE
X̄
−
μ
;
μ,
σ
(n
−
1)
E(S
2 σ4
σ6
¤
n 1
1 £
= −
+ 6 (n − 1) σ 2 + σ 2
2 σ4
σ
n
=
2σ 4
since
¤
£¡
¢
E X̄ − μ ; μ, σ 2 = 0,
h¡
i
¢2
¡
¢ σ2
¢
¡
E X̄ − μ ; μ, σ2
= V ar X̄; μ, σ2 =
and E S 2 ; μ, σ 2 = σ 2 .
n
Therefore the Fisher information matrix is
#
" n
0
¡
¢
2
σ
J μ, σ2 =
0 2σn4
and the inverse of the Fisher information matrix is
⎡ 2
⎤
σ
0
¢¤
£ ¡
−1
n
⎦.
=⎣
J μ, σ 2
4
0 2σn
Now
¡ ¢
V ar X̄
=
and
σ2
n ∙
¸
n ¡
¡ 2¢
¢2
2σ 4
2(n − 1)σ 4
1 P
V ar σ̂
≈
=
= V ar
Xi − X̄
n i=1
n2
n
Cov(X̄, σ̂ 2 ) =
since X̄ and
n ¡
¢2
P
1
Xi − X̄ ) = 0
Cov(X̄,
n
i=1
n ¡
¢2
P
Xi − X̄ are independent random variables. Inferences
i=1
for μ and σ 2 are usually made using
X̄ − μ
(n − 1)S 2
√ v t (n − 1) and
v χ2 (n − 1) .
σ2
S/ n
2.4. MAXIMUM LIKELIHOOD METHOD- MULTIPARAMETER59
The relative likelihood function is
¢ µ ¶n/2
¡
nn
io
¡
¢ L μ, σ 2
σ̂ 2
n h 2
2
2
R μ, σ =
exp
+
(x̄
−
μ)
σ̂
, μ ∈ <, σ 2 > 0.
=
−
L (μ̂, σ̂ 2 )
σ2
2
2σ 2
¡
¢
See Figure 2.4 for a graph of R μ, σ 2 for n = 350, μ̂ = 160 and σ̂ 2 = 36.
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
42
161
40
38
160.5
36
160
34
159.5
32
30
σ
159
2
μ
Figure 2.4: Normal Likelihood function for n = 350, μ̂ = 160 and σ̂ 2 = 36
2.4.7
Problem - The Score Equation and the
Exponential Family
Suppose X has a regular exponential family distribution of the form
"
#
k
P
f (x; η) = C(η) exp
ηj Tj (x) h(x).
j=1
where η = (η1 , . . . , ηk )T . Show that
E[Tj (X); η] =
−∂ log C (η)
, j = 1, . . . , k
∂ηj
60
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
and
Cov(Ti (X), Tj (X); η) =
−∂ 2 log C (η)
, i, j = 1, . . . , k.
∂ηi ∂ηj
Suppose that (x1 , . . . , xn ) are the observed data for a random sample
from fη (x). Show that the score equations
∂
l (η) = 0,
∂ηj
j = 1, ..., k
can be written as
¸
∙n
n
P
P
Tj (Xi ); η =
Tj (xi ),
E
i=1
2.4.8
j = 1, . . . , k.
i=1
Problem
Suppose X1 , . . . , Xn is a random sample from the N(μ, σ 2 ) distribution.
Use the result of Problem 2.4.7 to find the score equations for μ and σ2 and
verify that these are the same equations obtained in Example 2.4.7.
2.4.9
Problem
Suppose (X1 , Y1 ) , . . . , (Xn , Yn ) is a random sample from the BVN(μ, Σ)
distribution. Find the M.L. estimators of μ1 , μ2 , σ12 , σ22 , and ρ. You do not
need to verify that your answer corresponds to a maximum. Hint: Use the
result from Problem 2.4.7.
2.4.10
Problem
Suppose (X1 , X2 ) ∼ MULT(n, θ1 , θ2 ). Find the M.L. estimators of θ1 and
θ2 , the score function and the Fisher information matrix.
2.4.11
Problem
Suppose X1 , . . . , Xn is a random sample from the UNIF(a, b) distribution.
Find the M.L. estimators of a and b. Verify that your answer corresponds
to a maximum. Find the M.L. estimator of τ (a, b) = E (Xi ).
2.4.12
Problem
Suppose X1 , . . . , Xn is a random sample from the UNIF(μ − 3σ, μ + 3σ)
distribution. Find the M.L. estimators of μ and σ.
2.4. MAXIMUM LIKELIHOOD METHOD- MULTIPARAMETER61
2.4.13
Problem
Suppose X1 , . . . , Xn is a random sample from the EXP(β, μ) distribution.
Find the M.L. estimators of β and μ. Verify that your answer corresponds
to a maximum. Find the M.L. estimator of τ (β, μ) = xα where xα is the α
percentile of the distribution.
2.4.14
Problem
In Problem 1.7.26 find the M.L. estimators of μ and σ 2 . Verify that your
answer corresponds to a maximum.
2.4.15
Problem
Suppose E(Y ) = Xβ where Y = (Y1 , . . . , Yn )T is a vector of independent
and normally distributed random variables with V ar(Yi ) = σ 2 , i = 1, . . . , n,
X is a n × k matrix of known constants of rank k and β = (β1 , . . . , βk )T is
a vector of unknown parameters. Show that the M.L. estimators of β and
σ 2 are given by
¡
¢−1 T
β̂ = X T X
X Y and σ̂ 2 = (Y − X β̂)T (Y − X β̂)/n.
2.4.16
Newton’s Method
In the multiparameter case θ = (θ1 , . . . , θk )T Newton’s method is given by:
θ(i+1) = θ(i) + [I(θ(i) )]−1 S(θ(i) ),
i = 0, 1, 2, . . .
I(θ) can also be replaced by the Fisher information J(θ).
2.4.17
Example
The following data are 30 independent observations from a BETA(a, b)
distribution:
0.2326,
0.3049,
0.5297,
0.1079,
0.0465,
0.4195,
0.1508,
0.0819,
0.2159, 0.2447, 0.0674, 0.3729, 0.3247, 0.3910, 0.3150,
0.3473, 0.2709, 0.4302, 0.3232, 0.2354, 0.4014, 0.3720,
0.4253, 0.0710, 0.3212, 0.3373, 0.1322, 0.4712, 0.4111,
0.3556
The likelihood function for observations x1 , x2 , ..., xn is
n Γ(a + b)
Q
(1 − xi )b−1 , a > 0, b > 0
xa−1
i
i=1 Γ(a)Γ(b)
¸b−1
∙
¸n ∙ n ¸a−1 ∙ n
Q
Q
Γ(a + b)
xi
(1 − xi )
.
=
Γ(a)Γ(b)
i=1
i=1
L(a, b) =
62
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
The log likelihood function is
l(a, b) = n [log Γ(a + b) − log Γ(a) − log Γ(b) + (a − 1)t1 + (b − 1)t2 ]
where
t1 =
n
n
1 P
1 P
log xi and t2 =
log(1 − xi ).
n i=1
n i=1
(T1 , T2 ) is a sufficient statistic for (a, b) where
T1 =
Why?
Let
n
n
1 P
1 P
log Xi and T2 =
log(1 − Xi ).
n i=1
n i=1
Ψ (z) =
d log Γ(z)
Γ0 (z)
=
dz
Γ(z)
which is called the digamma function. The score vector is
"
#
∙
¸
Ψ (a + b) − Ψ (a) + t1
∂l/∂a
S (a, b) =
=n
.
∂l/∂b
Ψ (a + b) − Ψ (b) + t2
S (a, b) = [0 0]T must be solved numerically to find the M.L. estimates of
a and b.
Let
d
Ψ0 (z) =
Ψ (z)
dz
which is called the trigamma function. The information matrix is
"
#
Ψ0 (a) − Ψ0 (a + b)
−Ψ0 (a + b)
I(a, b) = n
−Ψ0 (a + b)
Ψ0 (b) − Ψ0 (a + b)
which is also the Fisher or expected information matrix.
For the data above
t1 =
n
n
P
1 P
1
log xi = −1.3929 and t2 =
log(1 − xi ) = −0.3594.
log
30 i=1
30
i=1
The M.L. estimates of a and b can be found using Newton’s Method given
by
∙ (i+1) ¸ ∙ (i) ¸ h
i−1
a
a
=
+ I(a(i) , b(i) )
S(a(i) , b(i) )
(i+1)
(i)
b
b
2.4. MAXIMUM LIKELIHOOD METHOD- MULTIPARAMETER63
for i = 0, 1, ... until convergence. Newton’s Method converges after 8 interations beginning with the initial estimates a(0) = 2, b(0) = 2. The iterations
are given below:
∙
∙
∙
∙
∙
∙
∙
∙
0.6449
2.2475
1.0852
3.1413
1.6973
4.4923
2.3133
5.8674
2.6471
6.6146
2.7058
6.7461
2.7072
6.7493
2.7072
6.7493
¸
¸
¸
¸
¸
¸
¸
¸
=
=
=
=
=
=
=
=
∙
∙
∙
∙
∙
∙
∙
∙
2
2
¸
+
∙
0.6449
2.2475
1.0852
3.1413
1.6973
4.4923
2.3133
5.8674
2.6471
6.6146
2.7058
6.7461
2.7072
6.7493
¸−1 ∙
¸
10.8333 −8.5147
−16.7871
−8.5147 10.8333
14.2190
¸ ∙
¸−1 ∙
¸
84.5929 −12.3668
26.1919
+
−12.3668
4.3759
−1.5338
¸ ∙
¸−1 ∙
¸
35.8351 −8.0032
11.1198
+
−8.0032 3.2253
−0.5408
¸ ∙
¸−1 ∙
¸
18.5872 −5.2594
4.2191
+
−5.2594 2.2166
−0.1922
¸ ∙
¸−1 ∙
¸
12.2612 −3.9004
1.1779
+
−3.9004 1.6730
−0.0518
¸ ∙
¸−1 ∙
¸
10.3161 −3.4203
0.1555
+
−3.4203 1.4752
−0.0067
¸ ∙
¸−1 ∙
¸
10.0345 −3.3478
0.0035
+
−3.3478 1.4450
−0.0001
¸ ∙
¸−1 ∙
¸
10.0280 −3.3461
0.0000
+
−3.3461 1.4443
0.0000
The M.L. estimates are â = 2.7072 and b̂ = 6.7493.
The observed information matrix is
∙
¸
10.0280 −3.3461
I(â, b̂) =
−3.3461 1.4443
2
Note that since det[I(â, b̂)] = (10.0280) (1.4443) − (3.3461) > 0 and
[I(â, b̂)]11 = 10.0280 > 0 and then by the Second Derivative Test we have
found the M.L. estimates.
A graph of the relative likelihood function is given in Figure 2.5.
64
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
15
10
5
b
0
1
1.5
2.5
2
3
3.5
4
5
4.5
a
Figure 2.5: Relative Likelihood for Beta Example
A 100p% likelihood region for (a, b) is given by {(a, b) ; R (a, b) ≥ p}.
The 1%, 5% and 10% likelihood regions for (a, b) are shown in Figure 2.6.
Note that the likelihood contours are elliptical in shape and are skewed relative to the ab coordinate axes. Since this is a regular model and S(â, b̂) = 0
then by Taylor’s Theorem we have
"
#
¤
1£
L (a, b) ≈ L(â, b̂) + S(â, b̂)
+
â − a b̂ − b I(â, b̂)
2
b̂ − b
"
#
â − a
¤
1£
= L(â, b̂) +
â − a b̂ − b I(â, b̂)
2
b̂ − b
â − a
for all (a, b) sufficiently close to (â, b̂). Therefore
R (a, b) =
L (a, b)
L(â, b̂)
"
â − a
b̂ − b
#
2.4. MAXIMUM LIKELIHOOD METHOD- MULTIPARAMETER65
h
≈ 1 − 2L(â, b̂)
h
= 1 − 2L(â, b̂)
i−1 £
i−1 £
â − a b̂ − b
â − a b̂ − b
¤
¤
I(â, b̂)
∙
Iˆ11
Iˆ12
"
â − a
b̂ − b
¸"
Iˆ12
Iˆ22
#
â − a
#
b̂ − b
i−1 h
i
= 1 − 2L(â, b̂)
(a − â)2 Iˆ11 + 2 (a − â) (b − b̂)Iˆ12 + (b − b̂)2 Iˆ22 .
h
The set of points (a, b) which satisfy R (a, b) = p is approximately the set
of points (a, b) which satisfy
2
(a − â) Iˆ11 + 2 (a − â) (b − b̂)Iˆ12 + (b − b̂)2 Iˆ22 = 2 (1 − p) L(â, b̂)
which we recognize as the points on an ellipse centred at (â, b̂). The skewness of the likelihood contours relative to the ab coordinate axes is determined by the value of Iˆ12 . If this value is close to zero the skewness will be
small.
2.4.18
Problem
The following data are 30 independent observations from a GAM(α, β)
distribution:
15.1892, 19.3316, 1.6985, 2.0634, 12.5905, 6.0094,
13.6279, 14.7847, 13.8251, 19.7445, 13.4370, 18.6259,
2.7319, 8.2062, 7.3621, 1.6754, 10.1070, 3.2049,
21.2123, 4.1419, 12.2335, 9.8307, 3.6866, 0.7076,
7.9571, 3.3640, 12.9622, 12.0592, 24.7272, 12.7624
For these data t1 =
30
P
log xi = 61.1183 and t2 =
i=1
30
P
xi = 309.8601.
i=1
Find the M.L. estimates of α and β for these data, the observed information
I(α̂, β̂) and the Fisher information J(α, β). On the same graph plot the
1%, 5%, and 10% likelihood regions for (α, β). Comment.
2.4.19
Problem
Suppose X1 , . . . , Xn is a random sample from the distribution with probability density function
f (x; α, β) =
αβ
(1 + βx)α+1
x > 0; α, β > 0.
Find the Fisher information matrix J(α, β).
66
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
14
1%
5%
12
10%
10
(1.5,8)
8
b
(2.7,6.7)
6
4
2
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
5.5
a
Figure 2.6: Likelihood Regions for BETA(a,b) Example
The following data are 15 independent observations from this distribution:
9.53, 0.15, 0.77, 0.47, 4.10, 1.60, 0.42, 0.01, 2.30, 0.40,
0.80, 1.90, 5.89, 1.41, 0.11
Find the M.L. estimates of α and β for these data and the observed information I(α̂, β̂). On the same graph plot the 1%, 5%, and 10% likelihood
regions for (α, β). Comment.
2.4.20
Problem
Suppose X1 , . . . , Xn is a random sample from the CAU(β, μ) distribution.
Find the Fisher information for (β, μ).
2.5. INCOMPLETE DATA AND THE E.M. ALGORITHM
2.4.21
67
Problem
A radioactive sample emits particles at a rate which decays with time,
the rate being λ(t) = λe−βt . In other words, the number of particles
emitted in an interval (t, t + h) has a Poisson distribution with parameter
t+h
R
λe−βs ds and the number emitted in disjoint intervals are independent
t
random variables. Find the M.L. estimate of λ and β, λ > 0, β > 0 if
the actual times of the first, second, ..., n’th decay t1 < t2 < . . . tn are
observed. Show that β̂ satisfies the equation
β̂tn
eβ̂tn − 1
2.4.22
= 1 − β̂ t̄ where t̄ =
n
1 P
ti .
n i=1
Problem
In Problem 2.1.23 suppose θ = (θ1 , . . . , θk )T . Find the Fisher information
matrix and explain how you would find the M.L. estimate of θ.
2.5
Incomplete Data and The E.M. Algorithm
The E.M. algorithm, which was popularized by Dempster, Laird and Rubin
(1977), is a useful method for finding M.L. estimates when some of the
data are incomplete but can also be applied to many other contexts such
as grouped data, mixtures of distributions, variance components and factor
analysis.
The following are two examples of incomplete data:
2.5.1
Censored Exponential Data
Suppose Xi ∼ EXP(θ), i = 1, . . . , n. Suppose we only observe Xi for m
observations and the remaining n − m observations are censored at a fixed
time c. The observed data are of the form Yi = min(Xi , c), i = 1, . . . , n.
Note that Y = Y (X) is a many-to-one mapping. (X1 , . . . , Xn ) are called
the complete data and (Y1 , . . . , Yn ) are called the incomplete data.
2.5.2
“Lumped” Hardy-Weinberg Data
A gene has two forms A and B. Each individual has a pair of these genes,
one from each parent, so that there are three possible genotypes: AA, AB
and BB. Suppose that, in both male and female populations, the proportion
of A types is equal to θ and the proportion of B types is equal to 1 − θ.
68
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
Suppose further that random mating occurs with respect to this gene pair.
Then the proportion of individuals with genotypes AA, AB and BB in the
next generation are θ2 , 2θ(1 − θ) and (1 − θ)2 respectively. Furthermore, if
random mating continues, these proportions will remain nearly constant for
generation after generation. This is the famous result from genetics called
the Hardy-Weinberg Law. Suppose we have a group of n individuals and
let X1 = number with genotype AA, X2 = number with genotype AB and
X3 = number with genotype BB. Suppose however that it is not possible
to distinguish AA0 s from AB 0 s so that the observed data are (Y1 , Y2 ) where
Y1 = X1 + X2 and Y2 = X3 . The complete data are (X1, X2 , X3 ) and the
incomplete data are (Y1 , Y2 ).
2.5.3
Theorem
Suppose X, the complete data, has probability (density) function f (x; θ)
and Y = Y (X), the incomplete data, has probability (density) function
g(y; θ). Suppose further that f (x; θ) and g (x; θ) are regular models. Then
∂
log g(y; θ) = E[S(θ; X)|Y = y].
∂θ
∂
∂θ
Suppose θ̂, the value which maximizes log g(y; θ), is found by solving
log g(y; θ) = 0. By the previous theorem θ̂ is also the solution to
E[S(θ; X)|Y = y; θ] = 0.
Note that θ appears in two places in the second equation, as an argument
in the function S as well as an argument in the expectation E.
2.5.4
The E.M. Algorithm
The E.M. algorithm solves E[S(θ; X)|Y = y] = 0 using an iterative two-step
method. Let θ(i) be the estimate of θ from the ith iteration.
(1) E-Step (Expectation Step)
Calculate
h
i
E log f (X; θ) |Y = y; θ(i) = Q(θ, θ(i) ).
(2) M-step (Maximization Step)
2.5. INCOMPLETE DATA AND THE E.M. ALGORITHM
69
Find the value of θ which maximizes Q(θ, θ(i) ) and set θ(i+1) equal to
this value. θ(i+1) is found by solving
∙
¸
h
i
∂
∂
Q(θ, θ(i) ) = E
log f (X; θ) |Y = y; θ(i) = E S(θ; X)|Y = y; θ(i) = 0
∂θ
∂θ
with respect to θ.
Note that
2.5.5
h
i
E S(θ(i+1) ; X)|Y = y; θ(i) = 0
Example
Give the E.M. algorithm for the “Lumped” Hardy-Weinberg example. Find
θ̂ if n = 10 and y1 = 3. Show how θ̂ can be found explicitly by solving
∂
∂θ log g(y; θ) = 0 directly.
The complete data (X1 , X2 ) have joint p.f.
f (x1 , x2 ; θ) =
x1 , x2
h
in−x1 −x2
£ 2 ¤x1
n!
x
2
[2θ (1 − θ)] 2 (1 − θ)
θ
x1 !x2 ! (n − x1 − x2 )!
= θ2x1 +x2 (1 − θ)2n−(2x1 +x2 ) · h (x1 , x2 )
= 0, 1, . . . ; x1 + x2 ≤ n; 0 < θ < 1
where
h (x1 , x2 ) =
n!
2x2 .
x1 !x2 ! (n − x1 − x2 )!
It is easy to see (show it!) that (X1 , X2 ) has a regular exponential family
distribution with natural sufficient statistic T = T (X1 , X2 ) = 2X1 + X2 .
The incomplete data are Y = X1 + X2 .
For the E-Step we need to calculate
h
i
Q(θ, θ(i) ) = E log f (X1 , X2 ; θ) |Y = y; θ(i)
n
o
= E (2X1 + X2 ) log θ + [2n − (2X1 + X2 )] log (1 − θ) |Y = X1 + X2 = y; θ(i)
h
i
+E h (X1 , X2 ) |Y = X1 + X2 = y; θ(i) .
(2.1)
To find these expectations we note that by the properties of the multinomial distribution
µ
¶
µ
¶
θ2
θ
X1 |X1 + X2 = y v BIN y, 2
= BIN y,
θ + 2θ (1 − θ)
2−θ
70
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
and
µ
X2 |X1 + X2 = y v BIN y, 1 −
θ
2−θ
¶
.
Therefore
µ
¶
µ (i) ¶
´
³
θ(i)
θ
+
y
1
−
= 2y
E 2X1 + X2 |Y = X1 + X2 = y; θ(i)
2 − θ(i)
2 − θ(i)
µ
¶
³
´
2
= y
= yp θ(i)
(2.2)
2 − θ(i)
where
p (θ) =
2
.
2−θ
Substituting (2.2) into (2.1) gives
³ ´
h
³ ´i
h
i
Q(θ, θ(i) ) = yp θ(i) log θ+ 2n − yp θ(i) log (1 − θ)+E h (X1 , X2 ) |Y = X1 + X2 = y; θ(i) .
Note that we do not need to simplify the last term on the right hand side
since it does not involve θ.
For the M-Step we need to solve
∂
Q(θ, θ(i) ) = 0.
∂θ
Now
∂ ³ (i) ´
=
Q θ, θ
∂θ
=
=
and
if
¡ ¢
yp θ(i)
2n − yp(θ(i) )
−
θ
(1 − θ)
£
¤
(i)
yp(θ ) (1 − θ) − 2n − yp(θ(i) ) θ
θ (1 − θ)
yp(θ(i) ) − 2nθ
θ (1 − θ)
∂
Q(θ, θ(i) ) = 0
∂θ
yp(θ(i) )
y
θ=
=
2n
2n
Therefore θ(i+1) is given by
θ
(i+1)
µ
2
2 − θ(i)
y
=
n
µ
¶
y
=
n
1
2 − θ(i)
¶
µ
.
1
2 − θ(i)
¶
.
(2.3)
2.5. INCOMPLETE DATA AND THE E.M. ALGORITHM
71
Our algorithm for finding the M.L. estimate of θ is
¶
µ
y
1
(i+1)
θ
=
, i = 0, 1, . . .
n 2 − θ(i)
For the data n = 10 and y = 3 let the initial guess for θ be θ(0) = 0.1.
Note that the initial guess does not really matter in this example since the
algorithm converges rapidly for any initial guess between 0 and 1.
For the given data and initial guess we obtain:
θ(1)
=
θ(2)
=
θ(3)
=
θ(4)
=
µ
¶
3
1
= 0.1579
10 2 − 0.1
¶
µ
1
3
= 0.1629
10 2 − θ(1)
¶
µ
3
1
= 0.1633
10 2 − θ(2)
¶
µ
3
1
= 0.1633.
10 2 − θ(3)
So the M.L. estimate of θ is θ̂ = 0.1633 to four decimal places.
In this example we can find θ̂ directly since
¡
¢
Y = X1 + X2 v BIN n, θ2 + 2θ (1 − θ)
and therefore
µ ¶
in−y
¤y h
n £ 2
2
g (y; θ) =
θ + 2θ (1 − θ)
, y = 0, 1, . . . , n; 0 < θ < 1
(1 − θ)
y
µ ¶
n y
n−y
2
=
q (1 − q)
,
where q = 1 − (1 − θ)
y
which is a binomial likelihood so the M.L. estimate of q is q̂ = y/n.
By the
√ invariance property of M.L. estimates the M.L. estimate of
θ = 1 − 1 − q is
p
p
(2.4)
θ̂ = 1 − 1 − q̂ = 1 − 1 − y/n.
p
For the data n = 10 and y = 3 we obtain θ̂ = 1 − 1 − 3/10 = 0.1633
to four decimal places which is the same answer as we found using the E.M.
algorithm.
72
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
2.5.6
E.M. Algorithm and the Regular Exponential
Family
Suppose X, the complete data, has a regular exponential family distribution
with probability (density) function
#
"
k
P
qj (θ) Tj (x) h(x), θ = (θ1 , . . . , θk )T
f (x; θ) = C(θ) exp
j=1
and let Y = Y (X) be the incomplete data. Then the M-step of the E.M.
algorithm is given by
E[Tj (X); θ(i+1) ] = E[Tj (X)|Y = y; θ(i) ],
2.5.7
j = 1, . . . , k.
(2.5)
Problem
Prove (2.5) using the result from Problem 2.4.7.
2.5.8
Example
Use (2.5) to find the M-step for the “Lumped” Hardy-Weinberg example.
Since the natural sufficient statistic is T = T (X1 , X2 ) = 2X1 + X2 the
M-Step is given by
i
h
i
h
(2.6)
E 2X1 + X2 ; θ(i+1) = E 2X1 + X2 |Y = y; θ(i) .
Using (2.2) and the fact that
¡
¢
X1 v BIN n, θ2 and X2 v BIN (n, 2θ (1 − θ)) ,
then (2.6) can be written as
or
µ
h
i2
ih
i
h
2n θ(i+1) + n 2θ(i+1) 1 − θ(i+1) = y
θ(i+1) =
y
2n
µ
2
2 − θ(i)
¶
=
which is the same result as in (2.3).
If the algorithm converges and
lim θ(i) = θ̂
i→∞
y
n
µ
2
2 − θ(i)
1
2 − θ(i)
¶
¶
2.5. INCOMPLETE DATA AND THE E.M. ALGORITHM
73
(How would you prove this? Hint: Recall the Monotonic Sequence Theorem.) then
¶
µ
1
y
(i+1)
lim θ
= lim
i→∞
i→∞ n
2 − θ(i)
or
¶
µ
y
1
θ̂ =
.
n 2 − θ̂
Solving for θ̂ gives
θ̂ = 1 −
which is the same result as in (2.4).
2.5.9
p
1 − y/n
Example
Use (2.5) to give the M-step for the censored exponential data example.
Assuming the algorithm converges, find an expression for θ̂. Show that this
∂
is the same θ̂ which is obtained when ∂θ
log g(y; θ) = 0 is solved directly.
2.5.10
Problem
Suppose X1 , . . . , Xn is a random sample from the N(μ, σ 2 ) distribution.
Suppose we observe Xi , i = 1, . . . , m but for i = m + 1, . . . , n we observe
only that Xi > c.
(a) Give explicitly the M-step of the E.M. algorithm for finding the M.L.
estimate of μ in the case where σ 2 is known.
(b) Give explicitly the M-step of the E.M. algorithm for finding the M.L.
estimates of μ and σ 2 .
Hint: If Z ∼ N(0, 1) show that
E(Z|Z > b) =
φ(b)
= h(b)
1 − Φ(b)
where φ is the probability density function and Φ is the cumulative distribution function of Z and h is called the hazard function.
2.5.11
Problem
Let (X1 , Y1 ) , . . . , (Xn , Yn ) be a random sample from the BVN(μ, Σ) distribution. Suppose that some of the Xi and Yi are missing as follows: for
i = 1, . . . , n1 we observe both Xi and Yi , for i = n1 + 1, . . . , n2 we observe
only Xi and for i = n2 + 1, . . . , n we observe only Yi . Give explicitly the
M-step of the E.M. algorithm for finding the M.L. estimates of
74
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
(μ1 , μ2 , σ12 , σ22 , ρ).
Hint: Xi |Yi = yi ∼ N(μ1 + ρσ1 (yi − μ2 )/σ2 , (1 − ρ2 )σ12 ).
2.5.12
Problem
The data in the table below were obtained through the National Crime Survey conducted by the U.S. Bureau of Census (See Kadane (1985), Journal
of Econometrics, 29, 46-67.). Households were visited on two occasions, six
months apart, to determine if the occupants had been victimized by crime
in the preceding six-month period.
First visit
Crime-free (X1 = 0)
Victims (X1 = 1)
Nonrespondents
Crime-free (X2 = 0)
392
76
31
Second visit
Victims (X2 = 1)
55
38
7
Nonrespondents
33
9
115
Let X1i = 1 (0) if the occupants in household i were victimized (not victimized) by crime in the preceding six-month period on the first visit.
Let X2i = 1 (0) if the occupants in household i were victimized (not victimized) by crime during the six-month period between the first visit and
second visits.
Let θjk = P (X1i = j, X2i = k) , j = 0, 1; k = 0, 1; i = 1, . . . , N.
(a) Write down the probability of observing the complete data Xi = (X1i , X2i ) ,
i = 1, . . . , N and show that X = (X1 , . . . , XN ) has a regular exponential
family distribution.
(b) Give the M-step of the E.M. algorithm for finding the M.L. estimate
of θ = (θ00 , θ01 , θ10 , θ11 ). Find the M.L. estimate of θ for the data in the
table. Note: You may ignore the 115 households that did not respond to
the survey at either visit.
Hint:
E [(1 − X1i ) (1 − X2i ) ; θ] = P (X1i = 0, X2i = 0; θ) = θ00
E [(1 − X1i ) (1 − X2i ) |X1i = 1; θ] = 0
P (X1i = 0, X2i = 0; θ)
θ00
E [(1 − X1i ) (1 − X2i ) |X1i = 0; θ] =
=
P (X1i = 0; θ)
θ00 + θ01
etc.
2.6. THE INFORMATION INEQUALITY
75
(c) Find the M.L. estimate of the odds ratio
τ=
θ00 θ11
.
θ01 θ10
What is the significance of τ = 1?
2.6
The Information Inequality
Suppose we consider estimating a parameter τ (θ),where θ is a scalar, using
an unbiased estimator T (X). Is there any limit to how well an estimator like
this can behave? The answer for unbiased estimators is in the affirmative,
and a lower bound on the variance is given by the information inequality.
2.6.1
Information Inequality - One Parameter
Suppose T (X) is an unbiased estimator of the parameter τ (θ) in a regular
statistical model {f (x; θ) ; θ ∈ Ω}. Then
2
V ar(T ) ≥
[τ 0 (θ)]
.
J (θ)
Equality holds if and only if X has a regular exponential family with natural
sufficient statistic T (X).
2.6.2
Proof
Since T is an unbiased estimator of τ (θ),
Z
T (x)f (x; θ) dx = τ (θ),
for all θ ∈ Ω,
A
where P (X ∈ A; θ) = 1. Since f (x; θ) is a regular model we can take the
derivative with respect to θ on both sides and interchange the integral and
derivative to obtain:
Z
∂f (x; θ)
T (x)
dx = τ 0 (θ).
∂θ
A
Since E[S(θ; X)] = 0, this can be written as
Cov[T, S(θ; X)] = τ 0 (θ)
76
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
and by the covariance inequality, this implies
V ar(T )V ar[S(θ; X)] ≥ [τ 0 (θ]2
(2.7)
which, upon dividing by J(θ) = V ar[S(θ; X)], provides the desired result.
Now suppose we have equality in (2.7). Equality in the covariance inequality obtains if and only if the random variables T and S(θ; X) are linear
functions of one another. Therefore, for some (non-random) c1 (θ), c2 (θ), if
equality is achieved,
S(θ; x) = c1 (θ)T (x) + c2 (θ) all x ∈ A.
Integrating with respect to θ,
log f (x; θ) = C1 (θ)T (x) + C2 (θ) + C3 (x)
where we note that the constant of integration C3 is constant with respect
to changing θ but may depend on x. Therefore,
f (x; θ) = C(θ) exp [C1 (θ)T (x)] h(x)
where C(θ) = eC2 (θ) and h(x) = eC3 (x) which is exponential family with
natural sufficient statistic T (X).¥
The special case of the information inequality that is of most interest is
the unbiased estimation of the parameter θ. The above inequality indicates
that any unbiased estimator T of θ has variance at least 1/J(θ). The lower
bound is achieved only when f (x; θ) is regular exponential family with
natural sufficient statistic T .
Notes:
1. If equality holds then T (X) is called an efficient estimator of τ (θ).
2. The number
[τ 0 (θ)]2
J(θ)
is called the Cramér-Rao lower bound (C.R.L.B.).
3. The ratio of the C.R.L.B. to the variance of an unbiased estimator is
called the efficiency of the estimator.
2.6.3
Example
Suppose X1 , . . . , Xn is a random sample from the POI(θ) distribution.
Show that the variance of the U.M.V.U.E. of θ achieves the Cramér-Rao
2.6. THE INFORMATION INEQUALITY
77
lower bound for unbiased estimators of θ and find the lower bound. What
is the U.M.V.U.E. of τ (θ) = θ2 ? Does the variance of this estimator achieve
the Cramér-Rao lower bound for unbiased estimators of θ2 ? What is the
lower bound?
2.6.4
Example
Suppose X1 , . . . , Xn is a random sample from the distribution with probability density function
f (x; θ) = θxθ−1 , 0 < x < 1, θ > 0.
Show that the variance of the U.M.V.U.E. of θ does not achieve the CramérRao lower bound. What is the efficiency of the U.M.V.U.E.?
For some time it was believed that no estimator of θ could have
variance smaller than 1/J(θ) at any value of θ but this was demonstrated
incorrect by the following example of Hodges.
2.6.5
Problem
Let X1 , . . . , Xn is a random sample from the N(θ, 1) distribution and define
X̄
if |X̄| ≤ n−1/4 , T (X) = X̄ otherwise.
2
1
Show that E(T ) ≈ θ, V ar(T ) ≈ 1/n if θ =
6 0, and V ar(T ) ≈ 4n
if θ = 0.
Show that the Cramér-Rao lower bound for estimating θ is equal to n1 .
T (X) =
This example indicates that it is possible to achieve variance smaller
than 1/J(θ) at one or more values of θ. It has been proved that this is the
exception. In fact the set of θ for which the variance of an estimator is less
than 1/J(θ) has measure 0, which means, for example, that it may be a
finite set or perhaps a countable set, but it cannot contain a non-degenerate
interval of values of θ.
2.6.6
Problem
For each of the following determine whether the variance of the U.M.V.U.E.
of θ based on a random sample X1 , . . . , Xn achieves the Cramér-Rao lower
bound. In each case determine the Cramér-Rao lower bound and find the
efficiency of the U.M.V.U.E.
(a) N(θ, 4)
(b) Bernoulli(θ)
(c) N(0, θ2 )
(d) N(0, θ).
78
2.6.7
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
Problem
Find examples of the following phenomena in a regular statistical model.
(a) No unbiased estimator of τ (θ) exists.
(b) An unbiased estimator of τ (θ) exists but there is no U.M.V.U.E.
(c) A U.M.V.U.E. of τ (θ) exists but its variance is strictly greater than the
Cramér-Rao lower bound.
(d) A U.M.V.U.E. of τ (θ) exists and its variance equals the Cramér-Rao
lower bound.
2.6.8
Information Inequality - Multiparameter
The right hand side in the information inequality generalizes naturally to
the multiple parameter case in which θ is a vector. For example if
θ = (θ1 , . . . , θk )T , then the Fisher information J(θ) is a k × k matrix. If
τ (θ) is any real-valued function of θ then its derivative is a column vector
´T
³
∂τ
∂τ
,
.
.
.
,
. Then if T (X) is any unbiased
we will denote by D(θ) = ∂θ
∂θ
1
k
estimator of τ (θ) in a regular model,
V ar(T ) ≥ [D(θ)]T [J(θ)]−1 D(θ) for all θ ∈ Ω.
2.6.9
Example
Let X1 , . . . , Xn be a random sample from the N(μ, σ 2 ) distribution. Find
the U.M.V.U.E. of σ and determine whether the U.M.V.U.E. is an efficient
estimator of σ. What happens as n → ∞? Hint:
∙
µ ¶¸
Γ (k + a)
(a + b − 1) (a − b)
1
as k → ∞
= ka−b 1 +
+O
Γ (k + b)
2k
k2
2.6.10
Problem
Let X1 , . . . , Xn be a random sample from the N(μ, σ 2 ) distribution. Find
the U.M.V.U.E. of μ/σ and determine whether the U.M.V.U.E. is an efficient estimator. What happens as n → ∞?
2.6.11
Problem
Let X1 , . . . , Xn be a random sample from the GAM(α, β) distribution.
Find the U.M.V.U.E. of E(Xi ; α, β) = αβ and determine whether the
U.M.V.U.E. is an efficient estimator.
2.7. ASYMPTOTIC PROPERTIES OF M.L.ESTIMATORS - ONE PARAMETER 79
2.6.12
Problem
Consider the model in Problem 1.7.26.
(a) Find the M.L. estimators of μ and σ 2 using the result from Problem
2.4.7. You do not need to verify that your answer corresponds to a maximum. Compare the M.L. estimators with the U.M.V.U.E.’s of μ and σ 2 .
(b) Find the observed information matrix and the Fisher information.
(c) Determine if the U.M.V.U.E.’s of μ and σ 2 are efficient estimators.
2.7
Asymptotic Properties of M.L.
Estimators - One Parameter
One of the more successful attempts at justifying estimators and demonstrating some form of optimality has been through large sample theory or
the asymptotic behaviour of estimators as the sample size n → ∞. One
of the first properties one requires is consistency of an estimator. This
means that the estimator converges to the true value of the parameter as
the sample size (and hence the information) approaches infinity.
2.7.1
Definition
Consider a sequence of estimators Tn where the subscript n indicates that
the estimator has been obtained from data X1 , . . . , Xn with sample size n.
Then the sequence is said to be a consistent sequence of estimators of τ (θ)
if Tn →p τ (θ) for all θ ∈ Ω.
It is worth a reminder at this point that probability (density) functions
are used to produce probabilities and are only unique up to a point. For
example if two probability density functions f (x) and g(x) were such that
they produced the same probabilities, or the same cumulative distribution
function, for example,
Zx
−∞
f (z)dz =
Zx
g(z)dz
−∞
for all x, then we would not consider them distinct probability densities,
even though f (x) and g(x) may differ at one or more values of x. Now
when we parameterize a given statistical model using θ as the parameter, it
is natural to do so in such a way that different values of the parameter lead
to distinct probability (density) functions. This means, for example, that
80
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
the cumulative distribution functions associated with these densities are
distinct. Without this assumption it would be impossible to accurately estimate the parameter since two different parameters could lead to the same
cumulative distribution function and hence exactly the same behaviour of
the observations. Therefore we assume:
(R7) The probability (density) functions corresponding to different values
of the parameters are distinct, that is, θ 6= θ∗ =⇒ f (x; θ) 6= f (x; θ∗ ).
This assumption together with assumptions (R1) − (R6) (see 2.3.1) are
sufficient conditions for the theorems given in this section.
2.7.2
Theorem - Consistency of the M.L. Estimator
(Regular Model)
Suppose X1 , . . . , Xn is a random sample from a model {f (x; θ) ; θ ∈ Ω}
satisfying regularity conditions (R1) − (R7). Then with probability tending
to 1 as n → ∞, the likelihood equation or score equation
n
X
∂
logf (Xi ; θ) = 0
∂θ
i=1
has a root θ̂n such that θ̂n converges in probability to θ0 , the true value of
the parameter, as n → ∞.
The proof of this theorem is given in Section 5.4.9 of the Appendix.
The likelihood equation does not always have a unique root as the following problem illustrates.
2.7.3
Problem
Indicate whether or not the likelihood equation based on X1 , . . . , Xn has a
unique root in each of the cases below:
(a) LOG(1, θ)
(b) WEI(1, θ)
(c) CAU(1, θ)
The consistency of the M.L. estimator is one indication that it performs
reasonably well. However, it provides no reason to prefer it to some other
consistent estimator. The following result indicates that M.L. estimators
perform as well as any reasonable estimator can, at least in the limit as
n → ∞.
2.7. ASYMPTOTIC PROPERTIES OF M.L.ESTIMATORS - ONE PARAMETER 81
2.7.4
Theorem - Asymptotic Distribution of
the M.L. Estimator (Regular Model)
Suppose (R1) − (R7) hold. Suppose θ̂n is a consistent root of the likelihood
equation as in Theorem 2.7.2. Then
p
J(θ0 )(θ̂n − θ0 ) →D Z ∼ N(0, 1)
where θ0 is the true value of the parameter.
The proof of this theorem is given in Section 5.4.10 of the
Appendix.
Note: Since J(θ) is the Fisher expected information based on a random
sample from the model {f (x; θ) ; θ ∈ Ω},
¸
∙
¸
∙ n
P ∂2
∂2
logf (Xi ; θ) ; θ = nE − 2 logf (X; θ) ; θ
J(θ) = E −
2
∂θ
i=1 ∂θ
where X has probability (density) function f (x; θ).
This theorem implies that for a regular model and sufficiently large n,
θ̂n has an approximately normal distribution with mean θ0 and variance
−1
−1
[J(θ0 )] . [J(θ0 )] is called the asymptotic variance of θ̂n . This theorem
also asserts that θ̂n is asymptotically unbiased and its asymptotic variance
approaches the Cramér-Rao lower bound for unbiased estimators of θ.
By the Limiting Theorems it also follows that
τ (θ̂ ) − τ (θ0 )
q n
→D Z v N (0, 1) .
[τ 0 (θ0 )]2 /J (θ0 )
Compare this result with the Information Inequality.
2.7.5
Definition
Suppose X1 , . . . , Xn is a random sample from a regular statistical model
{f (x; θ) ; θ ∈ Ω}. Suppose also that Tn = Tn (X1 , ..., Xn ) is asymptotically
normal with mean θ and variance σT2 /n. The asymptotic efficiency of Tn is
defined to be
½
∙ 2
¸¾−1
∂ log f (X; θ)
σT2 · E −
;
θ
∂θ2
where X has probability (density) function f (x; θ).
82
CHAPTER 2. MAXIMUM LIKELIHOOD ESTIMATION
2.7.6
Problem
Suppose X1 , . . . , Xn is a random sample from a distribution with continuous
probability density function f (x; θ) and cumulative distribution function
F (x; θ) where θ is the median of the distribution. Suppose also that f (x; θ)
is continuous at x = θ. The sample median Tn = med(X1 , . . . , Xn ) is a
possible estimator of θ.
(a) Find the probability density function of the median if n = 2m + 1 is
odd.
(b) Prove
µ
¶
√
1
n(Tn − θ) →D T ∼ N 0,
4[f (0; 0)]2
(c) If X1 , ..., Xn is a random sample from the N(θ, 1) distribution find the
asymptotic efficiency of Tn .
(d) If X1 , ..., Xn is a random sample from the CAU(1, θ) distribution find
the asymptotic efficiency of Tn .
2.8
2.8.1
Interval Estimators
Definition
Suppose X is a random variable whose distribution depends on θ. Suppose
that A(x) and B(x) are functions such that A(x) ≤ B(x) for all x ∈ support
of X and θ ∈ Ω. Let x be the observed data. Then (A(x), B(x)) is an
interval estimate for θ. The interval (A(X), B(X)) is an interval estimator
for θ.
Likelihood intervals are one type of interval estimator. Confidence intervals are another type of interval estimator.
We now consider a general approach for constructing confidence intervals based on pivotal quantities.
2.8.2
Definition
Suppose X is a random variable whose distribution depends on θ. The
random variable Q(X; θ) is called a pivotal quantity if the distribution of
Q does not depend on θ. Q(X; θ) is called an asymptotic pivotal quantity if
the limiting distribution of Q as n → ∞ does not depend on θ.
For example, for a random sample X1, . . . , Xn from a N(θ, σ²) distribution where σ² is known, the statistic
T = √n (X̄ − θ)/σ
is a pivotal quantity whose distribution does not depend on θ. If X1 , . . . , Xn
is a random sample from a distribution, not necessarily normal, having
mean θ and known variance σ 2 then the asymptotic distribution of T is
N(0, 1) by the C.L.T. and T is an asymptotic pivotal quantity.
2.8.3
Definition
Suppose A(X) and B(X) are statistics. If P [A(X) < θ < B(X)] = p,
0 < p < 1 then (a(x), b(x)) is called a 100p% confidence interval (C.I.) for
θ.
Pivotal quantities can be used for constructing C.I.’s in the following
way. Since the distribution of Q(X; θ) is known we can write down a
probability statement of the form
P (q1 ≤ Q(X; θ) ≤ q2 ) = p.
If Q is a monotone function of θ then this statement can be rewritten as
P [A(X) ≤ θ ≤ B(X)] = p
and the interval [a(x), b(x)] is a 100p% C.I..
The following theorem gives the pivotal quantity in the case in which θ
is either a location parameter or a scale parameter.
2.8.4
Theorem
Let X = (X1 , ..., Xn ) be a random sample from the model {f (x; θ) ; θ ∈ Ω}
and let θ̂ = θ̂(X) be the M.L. estimator of the scalar parameter θ based on
X.
(1) If θ is a location parameter then Q = Q (X) = θ̂−θ is a pivotal quantity.
(2) If θ is a scale parameter then Q = Q (X) = θ̂/θ is a pivotal quantity.
2.8.5 Asymptotic Pivotal Quantities and Approximate Confidence Intervals
In cases in which an exact pivotal quantity cannot be constructed we can
use the limiting distribution of θ̂n to construct approximate C.I.’s. Since
[J(θ̂n )]1/2 (θ̂n − θ0 ) →D Z v N(0, 1)
then [J(θ̂n)]^{1/2}(θ̂n − θ0) is an asymptotic pivotal quantity and an approximate 100p% C.I. based on this asymptotic pivotal quantity is given by
[θ̂n − a[J(θ̂n)]^{−1/2}, θ̂n + a[J(θ̂n)]^{−1/2}]
where θ̂n = θ̂n (x1 , ..., xn ) is the M.L. estimate of θ, P (−a < Z < a) = p
and Z v N(0, 1).
Similarly since
[I(θ̂n ; X)]1/2 (θ̂n − θ0 ) →D Z v N(0, 1)
where X = (X1 , ..., Xn ) then [I(θ̂n ; X)]1/2 (θ̂n − θ0 ) is an asymptotic pivotal
quantity and an approximate 100p% C.I. based on this asymptotic pivotal
quantity is given by
[θ̂n − a[I(θ̂n)]^{−1/2}, θ̂n + a[I(θ̂n)]^{−1/2}]
where I(θ̂n ) is the observed information.
Finally since
−2 log R(θ0 ; X) →D W v χ2 (1)
then −2 log R(θ0 ; X) is an asymptotic pivotal quantity and an approximate
100p% C.I. based on this asymptotic pivotal is
{θ : −2 log R(θ; x) ≤ b}
where x = (x1 , ..., xn ) are the observed data, P (W ≤ b) = p and
W v χ2 (1). Usually this must be calculated numerically.
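As a rough illustration of this numerical calculation, the sketch below finds {θ : −2 log R(θ; x) ≤ b} with a root finder; the EXP(θ) model and the simulated data are assumptions chosen here for illustration and are not part of the notes.

```python
# Sketch: numerical likelihood-ratio C.I. {theta : -2 log R(theta; x) <= b}.
# The exponential-mean model and the data below are illustrative assumptions.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

x = np.array([0.7, 2.1, 1.3, 0.4, 3.2, 1.8, 0.9, 2.6])   # assumed EXP(theta) data
n, xbar = len(x), x.mean()

def log_lik(theta):                      # l(theta) for the EXP(theta) model
    return -n * np.log(theta) - x.sum() / theta

theta_hat = xbar                         # M.L. estimate for this model
b = chi2.ppf(0.95, df=1)                 # P(W <= b) = 0.95, W ~ chi^2(1)

def g(theta):                            # -2 log R(theta; x) - b
    return -2 * (log_lik(theta) - log_lik(theta_hat)) - b

# -2 log R is 0 at theta_hat and increases on either side; bracket each root.
lower = brentq(g, 1e-6 * theta_hat, theta_hat)
upper = brentq(g, theta_hat, 50 * theta_hat)
print(round(lower, 3), round(upper, 3))  # approximate 95% C.I. for theta
```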
Since
[τ(θ̂n) − τ(θ0)] / {[τ′(θ̂n)]²/J(θ̂n)}^{1/2} →D Z ∼ N(0, 1)
an approximate 100p% C.I. for τ(θ) is given by
[τ(θ̂n) − a{[τ′(θ̂n)]²/J(θ̂n)}^{1/2}, τ(θ̂n) + a{[τ′(θ̂n)]²/J(θ̂n)}^{1/2}]
where P (−a < Z < a) = p and Z v N(0, 1).
2.8.6 Likelihood Intervals and Approximate Confidence Intervals
A 15% L.I. for θ is given by {θ : R(θ; x) ≥ 0.15}. Since
−2 log R(θ0 ; X) →D W v χ2 (1)
we have
P[R(θ; X) ≥ 0.15] = P[−2 log R(θ; X) ≤ −2 log(0.15)]
= P[−2 log R(θ; X) ≤ 3.79]
≈ P(W ≤ 3.79) = P(Z² ≤ 3.79) where Z ∼ N(0, 1)
= P(−1.95 ≤ Z ≤ 1.95)
≈ 0.95
and therefore a 15% L.I. is an approximate 95% C.I. for θ.
2.8.7
Example
Suppose X1 , . . . , Xn is a random sample from the distribution with probability density function
f(x; θ) = θx^{θ−1}, 0 < x < 1.
The likelihood function for observations x1 , . . . , xn is
L(θ) = ∏_{i=1}^{n} θxi^{θ−1} = θ^n (∏_{i=1}^{n} xi)^{θ−1}, θ > 0.
The log likelihood and score function are
l(θ) = n log θ + (θ − 1) ∑_{i=1}^{n} log xi, θ > 0
S(θ) = n/θ + t = (1/θ)(n + tθ)
where t = ∑_{i=1}^{n} log xi. Since
S(θ) > 0 for 0 < θ < −n/t and S(θ) < 0 for θ > −n/t
therefore by the First Derivative Test, θ̂ = −n/t is the M.L. estimate of θ. The M.L. estimator of θ is θ̂ = n/T where T = −∑_{i=1}^{n} log Xi.
The information function is
I(θ) = n/θ², θ > 0
and the Fisher information is
J(θ) = E[I(θ; X)] = E(n/θ²) = n/θ².
By the W.L.L.N.
T/n = −(1/n) ∑_{i=1}^{n} log Xi →p E(−log Xi; θ0) = 1/θ0
and by the Limit Theorems
θ̂n = n/T →p θ0
and thus θ̂n is a consistent estimator of θ0.
By Theorem 2.7.4
√J(θ0) (θ̂n − θ0) = √(n/θ0²) (θ̂n − θ0) →D Z ∼ N(0, 1).   (2.8)
The asymptotic variance of θ̂n is equal to θ0²/n whereas the actual variance of θ̂n is
Var(θ̂n) = n²θ0² / [(n − 1)²(n − 2)].
This can be shown using the fact that −log Xi ∼ EXP(1/θ0), i = 1, . . . , n independently which means
T = −∑_{i=1}^{n} log Xi ∼ GAM(n, 1/θ0)   (2.9)
and then using the result from Problem 1.3.4. Therefore the asymptotic
variance and the actual variance of θ̂n are not identical but are close in
value for large n.
An approximate 95% C.I. for θ based on
√J(θ̂n) (θ̂n − θ0) →D Z ∼ N(0, 1)
is given by
[θ̂n − 1.96/√J(θ̂n), θ̂n + 1.96/√J(θ̂n)] = [θ̂n − 1.96 θ̂n/√n, θ̂n + 1.96 θ̂n/√n].
Note the width of the C.I., which is equal to 2(1.96) θ̂n/√n, decreases as 1/√n.
An exact C.I. for θ can be obtained in this case since
nθ/θ̂ = Tθ ∼ GAM(n, 1)
and therefore nθ/θ̂ is a pivotal quantity. Since
2Tθ = 2nθ/θ̂ ∼ χ²(2n)
we can use values from the chi-squared tables. From the chi-squared tables
we find values a and b such that
P (a ≤ W ≤ b) = 0.95 where W v χ2 (2n) .
Then
P(a ≤ 2nθ/θ̂ ≤ b) = 0.95
or
P(aθ̂/(2n) ≤ θ ≤ bθ̂/(2n)) = 0.95
and a 95% C.I. for θ is
[aθ̂/(2n), bθ̂/(2n)].
If we choose
P(W ≤ a) = (1 − 0.95)/2 = 0.025 = P(W ≥ b)
then we obtain an “equal-tail” C.I. for θ. This is not the narrowest C.I.
but it is easier to obtain than the narrowest C.I.. How would you obtain
the narrowest C.I.?
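A small numerical sketch of this example is given below; the sample is simulated (an illustrative assumption), and the exact interval uses 2nθ/θ̂ ∼ χ²(2n) as derived above.

```python
# Sketch for Example 2.8.7: approximate and exact 95% C.I.'s (illustrative data).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
theta_true, n = 2.5, 30
x = rng.uniform(size=n) ** (1 / theta_true)   # X = U^{1/theta} has density theta*x^(theta-1)

T = -np.log(x).sum()                 # T = -sum log X_i ~ GAM(n, 1/theta)
theta_hat = n / T                    # M.L. estimate

# Approximate 95% C.I. based on J(theta_hat) = n / theta_hat^2
approx = (theta_hat - 1.96 * theta_hat / np.sqrt(n),
          theta_hat + 1.96 * theta_hat / np.sqrt(n))

# Exact equal-tail 95% C.I. using 2*T*theta = 2*n*theta/theta_hat ~ chi^2(2n)
a, b = chi2.ppf(0.025, 2 * n), chi2.ppf(0.975, 2 * n)
exact = (a * theta_hat / (2 * n), b * theta_hat / (2 * n))
print(theta_hat, approx, exact)
```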
2.8.8
Example
Suppose X1, . . . , Xn is a random sample from the POI(θ) distribution. The parameter θ is neither a location nor a scale parameter. The M.L. estimator of θ and the Fisher information are
θ̂n = X̄n   and   J(θ) = n/θ.
By Theorem 2.7.4
√J(θ0) (θ̂n − θ0) = √(n/θ0) (θ̂n − θ0) →D Z ∼ N(0, 1).   (2.10)
By the C.L.T.
(X̄n − θ0)/√(θ0/n) →D Z ∼ N(0, 1)
which is the same result.
The asymptotic variance of θ̂n is equal to θ0/n. Since the actual variance of θ̂n is
Var(θ̂n) = Var(X̄n) = θ0/n
the asymptotic variance and the actual variance of θ̂n are identical in this case.
An approximate 95% C.I. for θ based on
√J(θ̂n) (θ̂n − θ0) →D Z ∼ N(0, 1)
is given by
[θ̂n − 1.96/√J(θ̂n), θ̂n + 1.96/√J(θ̂n)] = [θ̂n − 1.96 √(θ̂n/n), θ̂n + 1.96 √(θ̂n/n)].
An approximate 95% C.I. for τ(θ) = e^{−θ} can be based on the asymptotic pivotal
[τ(θ̂n) − τ(θ0)] / {[τ′(θ̂n)]²/J(θ̂n)}^{1/2} →D Z ∼ N(0, 1).
For τ(θ) = e^{−θ} = P(X1 = 0; θ), τ′(θ) = (d/dθ)(e^{−θ}) = −e^{−θ} and the approximate 95% C.I. is given by
[τ(θ̂n) − 1.96{[τ′(θ̂n)]²/J(θ̂n)}^{1/2}, τ(θ̂n) + 1.96{[τ′(θ̂n)]²/J(θ̂n)}^{1/2}]
= [e^{−θ̂n} − 1.96 e^{−θ̂n} √(θ̂n/n), e^{−θ̂n} + 1.96 e^{−θ̂n} √(θ̂n/n)]   (2.11)
which is symmetric about the M.L. estimate τ(θ̂n) = e^{−θ̂n}.
Alternatively since
0.95 ≈ P(θ̂n − 1.96√(θ̂n/n) ≤ θ ≤ θ̂n + 1.96√(θ̂n/n))
= P(−θ̂n + 1.96√(θ̂n/n) ≥ −θ ≥ −θ̂n − 1.96√(θ̂n/n))
= P(exp(−θ̂n + 1.96√(θ̂n/n)) ≥ e^{−θ} ≥ exp(−θ̂n − 1.96√(θ̂n/n)))
= P(exp(−θ̂n − 1.96√(θ̂n/n)) ≤ e^{−θ} ≤ exp(−θ̂n + 1.96√(θ̂n/n)))
therefore
[exp(−θ̂n − 1.96√(θ̂n/n)), exp(−θ̂n + 1.96√(θ̂n/n))]   (2.12)
is also an approximate 95% C.I. for τ (θ).
If n = 20 and θ̂n = 3 then the C.I. (2.11) is equal to [0.012, 0.0876]
while the C.I. (2.12) is equal to [0.0233, 0.1064] .
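The following sketch simply re-evaluates (2.11) and (2.12) for the values n = 20 and θ̂n = 3 quoted above.

```python
# Sketch reproducing the two approximate 95% C.I.'s for tau = exp(-theta)
# in the Poisson example, with n = 20 and theta_hat = 3 as in the text.
import numpy as np

n, theta_hat = 20, 3.0
half_width = 1.96 * np.sqrt(theta_hat / n)          # Wald half-width for theta

# (2.11): delta-method interval, symmetric about exp(-theta_hat)
tau_hat = np.exp(-theta_hat)
ci_211 = (tau_hat - tau_hat * half_width, tau_hat + tau_hat * half_width)

# (2.12): transform the endpoints of the interval for theta
ci_212 = (np.exp(-(theta_hat + half_width)), np.exp(-(theta_hat - half_width)))

print(np.round(ci_211, 4))   # approx [0.0120, 0.0876]
print(np.round(ci_212, 4))   # approx [0.0233, 0.1064]
```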
2.8.9
Example
Suppose X1, . . . , Xn is a random sample from the EXP(1, θ) distribution with probability density function
f(x; θ) = e^{−(x−θ)}, x ≥ θ.
The likelihood function for observations x1, . . . , xn is
L(θ) = ∏_{i=1}^{n} e^{−(xi−θ)} if xi ≥ θ > −∞, i = 1, . . . , n
= exp(−∑_{i=1}^{n} xi) e^{nθ} if −∞ < θ ≤ x(1)
and L (θ) is equal to 0 if θ > x(1) . To maximize this function of θ we note
that we want to make the term enθ as large as possible subject to θ ≤ x(1)
which implies that θ̂n = x(1) is the M.L. estimate and θ̂n = X(1) is the M.L.
estimator of θ.
Since the support of Xi depends on the unknown parameter θ, the model
is not a regular model. This means that Theorems 2.7.2 and 2.7.4 cannot
be used to determine the asymptotic properties of θ̂n . Since
P(θ̂n ≤ x; θ0) = 1 − ∏_{i=1}^{n} P(Xi > x; θ0) = 1 − e^{−n(x−θ0)}, x ≥ θ0
then θ̂n ∼ EXP(1/n, θ0). Therefore
lim_{n→∞} E(θ̂n) = lim_{n→∞} (θ0 + 1/n) = θ0
and
lim_{n→∞} Var(θ̂n) = lim_{n→∞} (1/n)² = 0
and therefore, by Theorem 5.3.8, θ̂n →p θ0 and θ̂n is a consistent estimator.
Since
P[n(θ̂n − θ0) ≤ t; θ0] = P(θ̂n ≤ t/n + θ0; θ0) = 1 − e^{−n(t/n+θ0−θ0)} = 1 − e^{−t}, t ≥ 0
is true for n = 1, 2, . . ., therefore n(θ̂n − θ0) ∼ EXP(1) for n = 1, 2, . . .
and therefore the asymptotic distribution of n(θ̂n − θ0 ) is also EXP (1).
Since we know the exact distribution of θ̂n for n = 1, 2, . . ., the asymptotic
distribution is not needed for obtaining C.I.’s.
For this model the parameter θ is a location parameter and therefore
(θ̂ − θ) is a pivotal quantity and in particular
P(θ̂ − θ ≤ t; θ) = P(θ̂ ≤ t + θ; θ) = 1 − e^{−nt}, t ≥ 0.
Since (θ̂ − θ) is a pivotal quantity, a C.I. for θ would take the form [θ̂ − b, θ̂ − a]
where 0 ≤ a ≤ b. Unless a = 0 this interval would not contain the M.L. estimate θ̂ and therefore a “one-tail” C.I. makes sense in this case. To obtain
a 95% “one-tail” C.I. for θ we solve
0.95 = P(θ̂ − b ≤ θ ≤ θ̂; θ) = P(0 ≤ θ̂ − θ ≤ b; θ) = 1 − e^{−nb}
which gives
b = −(1/n) log(0.05) = (log 20)/n.
Therefore
[θ̂ − (log 20)/n, θ̂]
is a 95% “one-tail” C.I. for θ.
2.8.10 Problem
Let X1 , . . . , Xn be a random sample from the distribution with p.d.f.
f(x; β) = 2x/β², 0 < x ≤ β.
(a) Find the likelihood function of β and the M.L. estimator of β.
(b) Find the M.L. estimator of E (X; β) where X has p.d.f. f (x; β).
(c) Show that the M.L. estimator of β is a consistent estimator of β.
(d) If n = 15 and x(15) = 0.99, find the M.L. estimate of β. Plot the relative
likelihood function for β and find 10% and 50% likelihood intervals for β.
(e) If n = 15 and x(15) = 0.99, construct an exact 95% one-tail C.I. for β.
2.8.11
Problem
Suppose X1 , . . . , Xn is a random sample from the UNIF(0, θ) distribution.
Show that the M.L. estimator θ̂n is a consistent estimator of θ. How would
you construct a C.I. for θ?
2.8.12
Problem
A certain type of electronic equipment is susceptible to instantaneous failure
at any time. Components do not deteriorate significantly with age and the
distribution of the lifetime is the EXP(θ) density. Ten components were
tested independently with the observed lifetimes, to the nearest days, given
by 70 11 66 5 20 4 35 40 29 8.
(a) Find the M.L. estimate of θ and verify that it corresponds to a local
maximum. Find the Fisher information and calculate an approximate 95%
C.I. for θ based on the asymptotic distribution of θ̂. Compare this with an
exact 95% C.I. for θ.
(b) The estimate in (a) ignores the fact that the data were rounded to the
nearest day. Find the exact likelihood function based on the fact that the
probability of observing a lifetime of i days is given by
g(i; θ) = ∫_{i−0.5}^{i+0.5} (1/θ) e^{−x/θ} dx, i = 1, 2, . . .   and   g(0; θ) = ∫_{0}^{0.5} (1/θ) e^{−x/θ} dx.
Obtain the M.L. estimate of θ and verify that it corresponds to a local
maximum. Find the Fisher information and calculate an approximate 95%
C.I. for θ. Compare these results with those in (a).
2.8.13
Problem
The number of calls to a switchboard per minute is thought to have a
POI(θ) distribution. However, because there are only two lines available,
we are only able to record whether the number of calls is 0, 1 , or ≥ 2. For
50 one minute intervals the observed data were: 25 intervals with 0 calls,
16 intervals with 1 call and 9 intervals with ≥ 2 calls.
(a) Find the M.L. estimate of θ.
(b) By computing the Fisher information both for this problem and for one
with full information, that is, one in which all of the values of X1 , . . . , X50
had been recorded, determine how much information was lost by the fact
that we were only able to record the number of times X > 1 rather than the
value of these X’s. How much difference does this make to the asymptotic
variance of the M.L. estimator?
2.8.14
Problem
Let X1 , . . . , Xn be a random sample from a UNIF(θ, 2θ) distribution. Show
that the M.L. estimator θ̂ is a consistent estimator of θ. What is the minimal sufficient statistic for this model? Show that θ̃ = (5/14)X(n) + (2/7)X(1) is a consistent estimator of θ which has smaller M.S.E. than θ̂.
2.9 Asymptotic Properties of M.L. Estimators - Multiparameter
Under similar regularity conditions to the univariate case, the conclusion of
Theorem 2.7.2 holds in the multiparameter case θ = (θ1 , . . . , θk )T , that is,
each component of θ̂n converges in probability to the corresponding component of θ0 . Similarly, Theorem 2.7.4 remains valid with little modification:
[J (θ0 )]1/2 (θ̂n − θ0 ) →D Z ∼ MVN(0k , Ik )
where 0k is a k × 1 vector of zeros and Ik is the k × k identity matrix. Therefore for a regular model and sufficiently large n, θ̂n has approximately a multivariate normal distribution with mean vector θ0 and variance/covariance matrix [J(θ0)]^{−1}.
Consider the reparameterization
τj = τj (θ),
j = 1, . . . , m ≤ k.
It follows that
{[D(θ0)]^T [J(θ0)]^{−1} D(θ0)}^{−1/2} [τ(θ̂n) − τ(θ0)] →D Z ∼ MVN(0m, Im)
where τ (θ) = (τ1 (θ), . . . , τm (θ))T and D(θ) is a k × m matrix with (i, j)
element equal to ∂τj /∂θi .
2.9.1
Definition
A 100p% confidence region for the vector θ based on X = (X1 , ..., Xn ) is a
region R(X) ⊂ Rk which satisfies
P (θ ∈ R(X); θ) = p.
2.9.2 Asymptotic Pivotal Quantities and Approximate Confidence Regions
Since
[J(θ̂n )]1/2 (θ̂n − θ0 ) →D Z v MVN(0k , Ik )
it follows that
(θ̂n − θ0 )T J(θ̂n )(θ̂n − θ0 ) →D W v χ2 (k)
and an approximate 100p% confidence region for θ based on this asymptotic
pivotal is the set of all θ vectors in the set
{θ : (θ̂n − θ)T J(θ̂n )(θ̂n − θ) ≤ b}
where θ̂n = θ̂(x1 , ..., xn ) is the M.L. estimate of θ and b is the value such
that P (W < b) = p where W v χ2 (k).
Similarly since
[I(θ̂n ; X)]1/2 (θ̂n − θ0 ) →D Z v MVN(0k , Ik )
it follows that
(θ̂n − θ0 )T I(θ̂n ; X)(θ̂n − θ0 ) →D W v χ2 (k)
where X = (X1 , ..., Xn ). An approximate 100p% confidence region for θ
based on this asymptotic pivotal quantity is the set of all θ vectors in the
set
{θ : (θ̂n − θ)T I(θ̂n )(θ̂n − θ) ≤ b}
where I(θ̂n ) is the observed information matrix.
Finally since
−2 log R(θ0 ; X) →D W v χ2 (k)
an approximate 100p% confidence region for θ based on this asymptotic
pivotal quantity is the set of all θ vectors in the set
{θ : −2 log R(θ; x) ≤ b}
where x = (x1 , ..., xn ) are the observed data, R(θ; x) is the relative likelihood function. Note that since
{θ : −2 log R(θ; x) ≤ b} = {θ : R(θ; x) ≥ e−b/2 }
this approximate 100p% confidence region is also a 100e−b/2 % likelihood
region for θ.
Approximate confidence intervals for a single parameter, say θi , from
the vector of parameters θ = (θ1 , ..., θi , ..., θk )T can also be obtained. Since
[J(θ̂n )]1/2 (θ̂n − θ0 ) →D Z v MVN(0k , Ik )
it follows that an approximate 100p% C.I. for θi is given by
[θ̂i − a√v̂ii, θ̂i + a√v̂ii]
where θ̂i is the M.L. estimate of θi , v̂ii is the (i, i) entry of [J(θ̂n )]−1 and a
is the value such that P (−a < Z < a) = p where Z v N(0, 1).
Similarly since
[I(θ̂n ; X)]1/2 (θ̂n − θ0 ) →D Z v MVN(0k , Ik )
it follows that an approximate 100p% C.I. for θi is given by
[θ̂i − a√v̂ii, θ̂i + a√v̂ii]
where v̂ii is the (i, i) entry of [I(θ̂n )]−1 .
If τ(θ) is a scalar function of θ then
{[D(θ̂n)]^T [J(θ̂n)]^{−1} D(θ̂n)}^{−1/2} [τ(θ̂n) − τ(θ0)] →D Z ∼ N(0, 1)
where D(θ) is a k × 1 vector with ith element equal to ∂τ/∂θi. An approximate 100p% C.I. for τ(θ) is given by
[τ(θ̂n) − a{[D(θ̂n)]^T [J(θ̂n)]^{−1} D(θ̂n)}^{1/2}, τ(θ̂n) + a{[D(θ̂n)]^T [J(θ̂n)]^{−1} D(θ̂n)}^{1/2}].   (2.13)
2.9.3
Example
Recall from Example 2.4.17 that for a random sample from the BETA(a, b) distribution the information matrix and the Fisher information matrix are given by
I(a, b) = n [Ψ′(a) − Ψ′(a + b), −Ψ′(a + b); −Ψ′(a + b), Ψ′(b) − Ψ′(a + b)] = J(a, b).
Since
[â − a0, b̂ − b0] J(â, b̂) [â − a0, b̂ − b0]^T →D W ∼ χ²(2),
an approximate 100p% confidence region for (a, b) is given by
{(a, b) : [â − a, b̂ − b] J(â, b̂) [â − a, b̂ − b]^T < c}
where P(W ≤ c) = p. Since χ²(2) = GAM(1, 2) = EXP(2), c can be determined using
p = P(W ≤ c) = ∫_{0}^{c} (1/2) e^{−x/2} dx = 1 − e^{−c/2}
which gives
c = −2 log(1 − p).
For p = 0.95, c = −2 log(0.05) = 5.99. An approximate 95% confidence region is given by
{(a, b) : [â − a, b̂ − b] J(â, b̂) [â − a, b̂ − b]^T < 5.99}.
Let
J(â, b̂) = [Ĵ11, Ĵ12; Ĵ12, Ĵ22]
then the confidence region can be written as
{(a, b) : (â − a)² Ĵ11 + 2(â − a)(b̂ − b) Ĵ12 + (b̂ − b)² Ĵ22 ≤ 5.99}
which can be seen to be the points inside and on the ellipse centred at (â, b̂).
For the data in Example 2.4.15, â = 2.7072, b̂ = 6.7493 and
J(â, b̂) = [10.0280, −3.3461; −3.3461, 1.4443].
Figure 2.7: Approximate Confidence Regions for Beta(a,b) Example
Approximate 90%, 95% and 99% confidence regions are shown in Figure
2.7.
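A sketch (using matplotlib, with grid ranges chosen here for illustration) of how the confidence regions of Figure 2.7 can be drawn from â, b̂ and J(â, b̂):

```python
# Sketch: approximate confidence regions {(a,b) : quadratic form <= c}
# for the Beta example, using the estimates and J(a_hat, b_hat) given above.
import numpy as np
import matplotlib.pyplot as plt

a_hat, b_hat = 2.7072, 6.7493
J = np.array([[10.0280, -3.3461], [-3.3461, 1.4443]])

a = np.linspace(0.5, 5.5, 300)
b = np.linspace(0.0, 14.0, 300)
A, B = np.meshgrid(a, b)
d1, d2 = a_hat - A, b_hat - B
Q = J[0, 0] * d1**2 + 2 * J[0, 1] * d1 * d2 + J[1, 1] * d2**2

# c = -2 log(1 - p) for p = 0.90, 0.95, 0.99
levels = [-2 * np.log(1 - p) for p in (0.90, 0.95, 0.99)]
plt.contour(A, B, Q, levels=levels)
plt.plot(a_hat, b_hat, "k.")
plt.xlabel("a"); plt.ylabel("b")
plt.show()
```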
A 10% likelihood region for (a, b) is given by {(a, b) : R(a, b; x) ≥ 0.1}.
Since
−2 log R(a0 , b0 ; X) →D W v χ2 (2) = EXP (2)
we have
P[R(a, b; X) ≥ 0.1] = P[−2 log R(a, b; X) ≤ −2 log(0.1)]
≈ P(W ≤ −2 log(0.1))
= 1 − e^{−[−2 log(0.1)]/2}
= 1 − 0.1 = 0.9
and therefore a 10% likelihood region corresponds to an approximate 90%
confidence region. Similarly 1% and 5% likelihood regions correspond to
approximate 99% and 95% confidence regions respectively. Compare the
likelihood regions in Figure 2.6 with the approximate confidence regions
shown in Figure 2.7. What do you notice?
Let
[J(â, b̂)]^{−1} = [v̂11, v̂12; v̂12, v̂22].
Since
[J(â, b̂)]^{1/2} [â − a0, b̂ − b0]^T →D Z ∼ BVN([0, 0]^T, [1, 0; 0, 1])
then for large n, Var(â) ≈ v̂11, Var(b̂) ≈ v̂22 and Cov(â, b̂) ≈ v̂12. Therefore an approximate 95% C.I. for a is given by
[â − 1.96√v̂11, â + 1.96√v̂11]
and an approximate 95% C.I. for b is given by
[b̂ − 1.96√v̂22, b̂ + 1.96√v̂22].
For the given data â = 2.7072, b̂ = 6.7493 and
[J(â, b̂)]^{−1} = [0.4393, 1.0178; 1.0178, 3.0503]
so the approximate 95% C.I. for a is
[2.7072 − 1.96√0.4393, 2.7072 + 1.96√0.4393] = [1.4080, 4.0063]
and the approximate 95% C.I. for b is
[6.7493 − 1.96√3.0503, 6.7493 + 1.96√3.0503] = [3.3261, 10.1725].
Note that a = 1.5 is in the approximate 95% C.I. for a and b = 8 is in
the approximate 95% C.I. for b and yet the point (1.5, 8) is not in the
approximate 95% joint confidence region for (a, b). Clearly these marginal
C.I.’s for a and b must be used with care.
To obtain an approximate 95% C.I. for
τ(a, b) = E(X; a, b) = a/(a + b)
we use (2.13) with
D(a, b) = [∂τ/∂a, ∂τ/∂b]^T = [b/(a + b)², −a/(a + b)²]^T
and
v̂ = [D(â, b̂)]^T [J(â, b̂)]^{−1} D(â, b̂) = [b̂/(â + b̂)², −â/(â + b̂)²] [v̂11, v̂12; v̂12, v̂22] [b̂/(â + b̂)², −â/(â + b̂)²]^T.
For the given data
τ(â, b̂) = â/(â + b̂) = 2.7072/(2.7072 + 6.7493) = 0.28628
and
v̂ = 0.00064706.
The approximate 95% C.I. for τ(a, b) = E(X; a, b) = a/(a + b) is
[0.28628 − 1.96√0.00064706, 0.28628 + 1.96√0.00064706] = [0.23642, 0.33614].
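A short numerical check of this delta-method calculation, using the estimates and [J(â, b̂)]^{−1} quoted above:

```python
# Sketch verifying the delta-method C.I. for tau(a, b) = a/(a + b)
# with the Beta-example numbers given above.
import numpy as np

a_hat, b_hat = 2.7072, 6.7493
J_inv = np.array([[0.4393, 1.0178], [1.0178, 3.0503]])   # [J(a_hat, b_hat)]^{-1}

s = (a_hat + b_hat) ** 2
D = np.array([b_hat / s, -a_hat / s])                     # (d tau/da, d tau/db)^T
v_hat = D @ J_inv @ D                                     # approx 0.000647
tau_hat = a_hat / (a_hat + b_hat)                         # approx 0.28628
ci = (tau_hat - 1.96 * np.sqrt(v_hat), tau_hat + 1.96 * np.sqrt(v_hat))
print(round(v_hat, 8), np.round(ci, 5))                   # approx [0.23642, 0.33614]
```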
2.9.4
Problem
In Problem 2.4.10 find an approximate 95% C.I. for Cov(X1 , X2 ; θ1 , θ2 ).
2.9.5
Problem
In Problem 2.4.18 find an approximate 95% joint confidence region for (α, β)
and approximate 95% C.I.’s for β and τ (α, β) = E(X; α, β) = αβ.
2.9.6
Problem
In Problem 2.4.19 find approximate 95% C.I.’s for β and E(X; α, β).
2.9.7
Problem
Suppose X1 , ..., Xn is a random sample from the EXP(β, μ) distribution.
Show that the M.L. estimators β̂n and μ̂n are consistent estimators. How
would you construct a joint confidence region for (β, μ)? How would you
construct a C.I. for β? How would you construct a C.I. for μ?
2.9.8
Problem
Consider the model in Problem 1.7.26. Explain clearly how you would
construct a C.I. for σ 2 and a C.I. for μ.
2.9.9
Problem
Let X1 , . . . , Xn be a random sample from the distribution with p.d.f.
f(x; α, β) = αx^{α−1}/β^α, 0 < x ≤ β, α > 0.
(a) Find the likelihood function of α and β and the M.L. estimators of α
and β.
(b) Show that the M.L. estimator of α is a consistent estimator of α. Show
that the M.L. estimator of β is a consistent estimator of β.
(c) If n = 15, x(15) = 0.99 and ∑_{i=1}^{15} log xi = −7.7685 find the M.L. estimates of α and β.
(d) If n = 15, x(15) = 0.99 and ∑_{i=1}^{15} log xi = −7.7685, construct an exact
95% equal-tail C.I. for α and an exact 95% one-tail C.I. for β.
(e) Explain how you would construct a joint likelihood region for α and β.
Explain how you would construct a joint confidence region for α and β?
2.9.10
Problem
The following are the results, in millions of revolutions to failure, of endurance tests for 23 deep-groove ball bearings:
17.88  28.92  33.00  41.52  42.12  45.60  48.48  51.84
51.96  54.12  55.56  67.80  68.64  68.64  68.88  84.12
93.12  98.64  105.12  105.84  127.92  128.04  173.40
As a result of testing thousands of ball bearings, it is known that their
lifetimes have a WEI(θ, β) distribution.
(a) Find the M.L. estimates of θ and β and the observed information I(θ̂, β̂).
(b) Plot the 1%, 5% and 10% likelihood regions for θ and β on the same
graph.
(c) Plot the approximate 99%, 95% and 90% joint confidence regions for θ
and β on the same graph. Compare these with the likelihood regions in (b)
and comment.
(d) Calculate an approximate 95% confidence interval for β.
(e) The value β = 1 is of interest since WEI(θ, 1) = EXP (θ). Is β = 1 a
plausible value of β in light of the observed data? Justify your conclusion.
(f) If X ∼ WEI(θ, β) then
P(X > 80; θ, β) = exp[−(80/θ)^β] = τ(θ, β).
Find an approximate 95% confidence interval for τ(θ, β).
2.9.11
Example - Logistic Regression
Pistons are made by casting molten aluminum into moulds and then machining the raw casting. One defect that can occur is called porosity, due
to the entrapment of bubbles of gas in the casting as the metal solidifies.
The presence or absence of porosity is thought to be a function of pouring
temperature of the aluminum.
One batch of raw aluminum is available and the pistons are cast in 8
different dies. The pouring temperature is set at one of 4 levels
750, 775, 800, 825
and at each level, 3 pistons are cast in the 8 dies available. The presence
(1) or absence (0) of porosity is recorded for each piston. The number of pistons with porosity at each pouring temperature (out of the 24 cast) is given below:

Temperature     Number with porosity (out of 24)
750             13
775             11
800             10
825             8
Total           42
In Figure 2.8, the scatter plot of the proportion of pistons with porosity
versus temperature shows that there is a general decrease in porosity as
temperature increases.
Figure 2.8: Proportion of Defects vs Temperature
A model for these data is
Yij ∼ BIN(1, pi), i = 1, . . . , 4, j = 1, . . . , 24 independently
where i indicates the level of pouring temperature, j the replication. We
would like to fit a curve, a function of the pouring temperature, to the
probabilities pi and the most common function used for this purpose is the
logistic function, ez /(1 + ez ). This function is bounded between 0 and 1
and so can be used to model probabilities. We may choose the exponent z
to depend on the explanatory variates resulting in:
pi = pi(α, β) = e^{α+β(xi−x̄)} / [1 + e^{α+β(xi−x̄)}].
In this expression, xi is the pouring temperature at level i, x̄ = 787.5 is the
average pouring temperature, and α, β are two unknown parameters. Note
also that
logit(pi) = log[pi/(1 − pi)] = α + β(xi − x̄).
The likelihood function is
L(α, β) = ∏_{i=1}^{4} ∏_{j=1}^{24} P(Yij = yij; α, β) = ∏_{i=1}^{4} ∏_{j=1}^{24} pi^{yij} (1 − pi)^{1−yij}
and the log likelihood is
l(α, β) = ∑_{i=1}^{4} ∑_{j=1}^{24} [yij log(pi) + (1 − yij) log(1 − pi)].
Note that
∂pi/∂α = pi(1 − pi)   and   ∂pi/∂β = (xi − x̄) pi(1 − pi)
so that
∂l(α, β)/∂α = ∑_{i=1}^{4} ∑_{j=1}^{24} (∂l/∂pi)(∂pi/∂α) = ∑_{i=1}^{4} ∑_{j=1}^{24} [yij/pi − (1 − yij)/(1 − pi)] pi(1 − pi)
= ∑_{i=1}^{4} ∑_{j=1}^{24} [yij(1 − pi) − (1 − yij)pi] = ∑_{i=1}^{4} ∑_{j=1}^{24} (yij − pi) = ∑_{i=1}^{4} (yi. − 24pi)
where yi. = ∑_{j=1}^{24} yij. Similarly
∂l(α, β)/∂β = ∑_{i=1}^{4} (xi − x̄)(yi. − 24pi).
The score function is
S(α, β) = [∑_{i=1}^{4} (yi. − 24pi), ∑_{i=1}^{4} (xi − x̄)(yi. − 24pi)]^T.
Since
−∂²l(α, β)/∂α² = 24 ∑_{i=1}^{4} ∂pi/∂α = 24 ∑_{i=1}^{4} pi(1 − pi),
−∂²l(α, β)/∂β² = 24 ∑_{i=1}^{4} (xi − x̄) ∂pi/∂β = 24 ∑_{i=1}^{4} (xi − x̄)² pi(1 − pi)
and
−∂²l(α, β)/∂α∂β = 24 ∑_{i=1}^{4} ∂pi/∂β = 24 ∑_{i=1}^{4} (xi − x̄) pi(1 − pi),
the information matrix and the Fisher information matrix are equal and given by
I(α, β) = J(α, β) = [24 ∑_{i=1}^{4} pi(1 − pi), 24 ∑_{i=1}^{4} (xi − x̄) pi(1 − pi); 24 ∑_{i=1}^{4} (xi − x̄) pi(1 − pi), 24 ∑_{i=1}^{4} (xi − x̄)² pi(1 − pi)].
To find the M.L. estimators of α and β we must solve
∂l(α, β)/∂α = 0 = ∂l(α, β)/∂β
simultaneously which must be done numerically using a method such as
Newton’s method. Initial estimates of α and β can be obtained by drawing
a line through the points in Figure 2.8, choosing two points on the line and
then solving for α and β. For example, suppose we require that the line
pass through the points (775, 11/24) and (825, 8/24). We obtain
−0.167 = logit(11/24) = α + β (775 − 787.5) = α + β(−12.5)
−0.693 = logit(8/24) = α + β (825 − 787.5) = α + β(37.5),
and these result in initial estimates: α(0) = −0.298, β (0) = −0.0105.
Now
J(α(0), β(0)) = [23.01, −27.22; −27.22, 17748.30]
and
S(α(0), β(0)) = [0.9533428, −9.521179]^T
and the first iteration of Newton’s method gives
[α(1), β(1)]^T = [α(0), β(0)]^T + [J(α(0), β(0))]^{−1} S(α(0), β(0)) = [−0.2571332, −0.01097377]^T.
Repeating this process does not substantially change these estimates, so we
have the M.L. estimates:
α̂ = −0.2571831896
β̂ = −0.01097623887.
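A sketch of the Newton iteration described above; it uses the grouped totals yi. = (13, 11, 10, 8) with 24 pistons per temperature and the initial estimates α(0) = −0.298, β(0) = −0.0105.

```python
# Sketch of Newton's method for the logistic regression example,
# using the grouped totals y_i. = (13, 11, 10, 8) and m = 24 pistons per level.
import numpy as np

x = np.array([750.0, 775.0, 800.0, 825.0])
y = np.array([13.0, 11.0, 10.0, 8.0])
m = 24
xc = x - x.mean()                       # x_i - xbar, xbar = 787.5

theta = np.array([-0.298, -0.0105])     # initial estimates (alpha, beta)
for _ in range(10):
    p = 1 / (1 + np.exp(-(theta[0] + theta[1] * xc)))
    S = np.array([np.sum(y - m * p),
                  np.sum(xc * (y - m * p))])              # score vector
    w = m * p * (1 - p)
    J = np.array([[np.sum(w), np.sum(xc * w)],
                  [np.sum(xc * w), np.sum(xc**2 * w)]])   # information matrix
    theta = theta + np.linalg.solve(J, S)                 # Newton update

print(theta)                   # approx (-0.2572, -0.01098)
print(np.linalg.inv(J))        # estimated asymptotic variance/covariance matrix
```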
The Fisher information matrix evaluated at the M.L. estimate is
J(α̂, β̂) = [23.09153, −24.63342; −24.63342, 17783.63646].
The inverse of this matrix gives an estimate of the asymptotic variance/covariance matrix of the estimators:
[J(α̂, β̂)]^{−1} = [0.0433700024, 0.0000600749759; 0.0000600749759, 0.00005631468312].
Figure 2.9: Fitted Model for Proportion of Defects as a Function of Temperature
A plot of
p̂(x) = exp[α̂ + β̂(x − x̄)] / {1 + exp[α̂ + β̂(x − x̄)]}
is shown in Figure 2.9. Note that the curve is very close to a straight line over the range of x.
The 1%, 5% and 10% likelihood regions for (α, β) are shown in Figure
2.10. Note that these likelihood regions are very elliptical in shape. This
follows since the (1, 2) entry in the estimated variance/covariance matrix
[J(α̂, β̂)]−1 is very close to zero which implies that the estimators α̂ and
β̂ are not highly correlated. This allows us to make inferences more easily
about β alone. Plausible values for β can be determined from the likelihood
regions in Figure 2.10. A model with no effect due to pouring temperature corresponds to β = 0. The likelihood regions indicate that the value β = 0 is a very plausible value in light of the data for all plausible values of α.
Figure 2.10: Likelihood regions for (α, β).
The probability of a defect when the pouring temperature is x = 750 is equal to
τ = τ(α, β) = e^{α+β(750−x̄)} / [1 + e^{α+β(750−x̄)}] = e^{α+β(−37.5)} / [1 + e^{α+β(−37.5)}] = 1 / [e^{−α−β(−37.5)} + 1].
By the invariance property of M.L. estimators the M.L. estimator of τ is
τ̂ = τ(α̂, β̂) = e^{α̂+β̂(−37.5)} / [1 + e^{α̂+β̂(−37.5)}] = 1 / [e^{−α̂−β̂(−37.5)} + 1]
and the M.L. estimate is
τ̂ = 1 / [e^{−α̂−β̂(−37.5)} + 1] = 1 / [e^{0.2571831896+0.01097623887(−37.5)} + 1] = 0.5385.
To construct an approximate C.I. for τ we need an estimate of Var(τ̂; α, β). Now
Var[−α̂ − β̂(−37.5); α, β] = (−1)² Var(α̂; α, β) + (37.5)² Var(β̂; α, β) + 2(−1)(37.5) Cov(α̂, β̂; α, β)
and using [J(α̂, β̂)]^{−1} we estimate this variance by
v̂ = 0.0433700024 + (37.5)²(0.00005631468312) − 2(37.5)(0.0000600749759) = 0.118057.
An approximate 95% C.I. for −α − β(−37.5) is
[−α̂ − β̂(−37.5) − 1.96√v̂, −α̂ − β̂(−37.5) + 1.96√v̂]
= [−0.154426 − 1.96√0.118057, −0.154426 + 1.96√0.118057]
= [−0.827870, 0.519019].
An approximate 95% C.I. for
τ = τ(α, β) = 1 / [e^{−α−β(−37.5)} + 1]
is
[1/(exp(0.519019) + 1), 1/(exp(−0.827870) + 1)] = [0.373082, 0.695904].
The near linearity of the fitted function as indicated in Figure 2.9 seems
to imply that we need not use the logistic function treated in this example,
but that a straight line could have been fit to these data with similar results
over the range of temperatures observed. Indeed, a simple linear regression
would provide nearly the same fit. However, if values of pi near 0 or 1 had
been observed, e.g. for temperatures well above or well below those used
here, the non-linearity of the logistic function would have been important
and provided some advantage over simple linear regression.
2.9.12
Problem - The Challenger Data
On January 28, 1986, the twenty-fifth flight of the U.S. space shuttle program ended in disaster when one of the rocket boosters of the Shuttle
Challenger exploded shortly after lift-off, killing all seven crew members.
The presidential commission on the accident concluded that it was caused
by the failure of an O-ring in a field joint on the rocket booster, and that
this failure was due to a faulty design that made the O-ring unacceptably
sensitive to a number of factors including outside temperature. Of the previous 24 flights, data were available on failures of O-rings on 23, (one was
lost at sea), and these data were discussed on the evening preceding the
Challenger launch, but unfortunately only the data corresponding to the
7 flights on which there was a damage incident were considered important
and these were thought to show no obvious trend. The data are given in
Table 1. (See Dalal, Fowlkes and Hoadley (1989), JASA, 84, 945-957.)
Table 1
Date        Temperature   Number of Damage Incidents
4/12/81     66            0
11/12/81    70            1
3/22/82     69            0
6/27/82     80            Not available
1/11/82     68            0
4/4/83      67            0
6/18/83     72            0
8/30/83     73            0
11/28/83    70            0
2/3/84      57            1
4/6/84      63            1
8/30/84     70            1
10/5/84     78            0
11/8/84     67            0
1/24/85     53            3
4/12/85     67            0
4/29/85     75            0
6/17/85     70            0
7/29/85     81            0
8/27/85     76            0
10/3/85     79            0
10/30/85    75            2
11/26/85    76            0
1/12/86     58            1
1/28/86     31            Challenger Accident
(a) Let
p(t; α, β) = P(at least one damage incident for a flight at temperature t) = e^{α+βt} / (1 + e^{α+βt}).
Using M.L. estimation fit the model
Yi ∼ BIN(1, p(ti; α, β)), i = 1, . . . , 23
to the data available from the flights prior to the Challenger accident. You
may ignore the flight for which information on damage incidents is not
available.
(b) Plot 10% and 50% likelihood regions for α and β.
(c) Find an approximate 95% C.I. for β. How plausible is the value β = 0?
(d) Find an approximate 95% C.I. for p(t) if t = 31, the temperature on
the day of the disaster. Comment.
2.10 Nuisance Parameters and M.L. Estimation
Suppose X1 , . . . , Xn is a random sample from the distribution with probability (density) function f (x; θ). Suppose also that θ = (λ, φ) where λ is a
vector of parameters of interest and φ is a vector of nuisance parameters.
The profile likelihood is one modification of the likelihood which allows
us to look at estimation methods for λ in the presence of the nuisance
parameter φ.
2.10.1
Definition
Suppose θ = (λ, φ) with likelihood function L(λ, φ). Let φ̂(λ) be the M.L.
estimator of φ for a fixed value of λ. Then the profile likelihood for λ is
given by L(λ, φ̂(λ)).
The M.L. estimator of λ based on the profile likelihood is, of course,
the same estimator obtained by maximizing the joint likelihood L(λ, φ)
simultaneously over λ and φ. If the profile likelihood is used to construct
likelihood regions for λ, care must be taken since the imprecision in the
estimation of the nuisance parameter φ is not taken into account.
Profile likelihood is one example of a group of modifications of the likelihood known as pseudo-likelihoods which are based on a derived likelihood for a subset of parameters. Marginal likelihood, conditional likelihood and
partial likelihood are also included in this class.
Suppose that θ = (λ, φ) and the data X, or some function of the data,
can be partitioned into U and V . Suppose also that
f (u, v; θ) = f (u; λ) · f (v|u; θ).
If the conditional distribution of V given U depends only on φ then
estimation of λ can be based on f (u; λ), the marginal likelihood for λ. If
f (v|u; θ) depends on both λ and φ then the marginal likelihood may still be
used for estimation of λ if, in ignoring the conditional distribution, there is
little information lost.
If there is a factorization of the form
f (u, v; θ) = f (u|v; λ) · f (v; θ)
then estimation of λ can be based on f (u|v; λ) the conditional likelihood
for λ.
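A minimal numerical sketch of Definition 2.10.1, computing a profile log likelihood by maximizing over a scalar nuisance parameter at each fixed value of the parameter of interest; the GAM(λ, φ) model and the data are illustrative assumptions only.

```python
# Sketch: numerical profile log likelihood for a parameter of interest lam,
# maximizing over a scalar nuisance parameter phi at each fixed lam.
# The gamma-model log likelihood below is an illustrative assumption.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import gamma

x = np.array([1.2, 0.8, 2.5, 1.7, 0.9, 3.1, 1.4, 2.2])   # assumed data

def log_lik(lam, phi):
    # lam = shape and phi = scale of a GAM(lam, phi) model (illustrative choice)
    return gamma.logpdf(x, a=lam, scale=phi).sum()

def profile_log_lik(lam):
    # maximize over the nuisance parameter phi for this fixed lam
    res = minimize_scalar(lambda phi: -log_lik(lam, phi),
                          bounds=(1e-6, 100.0), method="bounded")
    return -res.fun

for lam in (0.5, 1.0, 2.0, 4.0):
    print(lam, round(profile_log_lik(lam), 3))
```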
2.10.2
Problem
Suppose X1 , . . . , Xn is a random sample from a N(μ, σ 2 ) distribution and
that σ is the parameter of interest while μ is a nuisance parameter. Find the
profile likelihood of σ. Let U = S 2 and V = X̄. Find f (u; σ), the marginal
likelihood of σ and f (u|v; σ), the conditional likelihood of σ. Compare the
three likelihoods.
2.11
Problems with M.L. Estimators
2.11.1
Example
This is an example to indicate that in the presence of a large number of
nuisance parameters, it is possible for a M.L. estimator to be inconsistent.
Suppose we are interested in the effect of environment on the performance of
identical twins in some test, where these twins were separated at birth and
raised in different environments. If the vector (Xi , Yi ) denotes the scores
of the i’th pair of twins, we might assume (Xi , Yi ) are both independent
N(μi , σ 2 ) random variables. We wish to estimate the parameter σ 2 based
on a sample of n twins. Show that the M.L. estimator of σ² is
σ̂² = (1/(4n)) ∑_{i=1}^{n} (Xi − Yi)²
and this is a biased and inconsistent estimator of σ². Show, however, that a simple modification results in an unbiased and
consistent estimator.
2.11.2
Example
Recall that Theorem 2.7.2 states that under some conditions a root of the
likelihood equation exists which is consistent as the sample size approaches
infinity. One might wonder why the theorem did not simply make the same
assertion for the value of the parameter providing the global maximum of
the likelihood function. The answer is that while the consistent root of
the likelihood equation often corresponds to the global maximum of the
likelihood function, there is no guarantee of this without some additional
conditions. This somewhat unusual example shows circumstances under
which the consistent root of the likelihood equation is not the global maximizer of the likelihood function. Suppose Xi , i = 1, . . . , n are independent
observations from the mixture density of the form
f(x; θ) = [ε/(√(2π) σ)] e^{−(x−μ)²/(2σ²)} + [(1 − ε)/√(2π)] e^{−(x−μ)²/2}
2π
where θ = (μ, σ 2 ) with both parameters unknown. Notice that the likelihood function L(μ, σ) → ∞ for μ = xj , σ → 0 for any j = 1, . . . , n. This
means that the globally maximizing σ is σ = 0, which lies on the boundary
of the parameter space. However, there is a local maximum of the likelihood function at some σ̂ > 0 which provides a consistent estimator of the
parameter.
2.11.3
Unidentifiability and Singular Information Matrices
Suppose we observe two independent random variables Y1 , Y2 having normal distributions with the same variance σ2 and means θ1 + θ2 , θ2 + θ3
respectively. In this case, although the means depend on the parameter
θ = (θ1 , θ2 , θ3 ), the value of this vector parameter is unidentifiable in the
sense that, for some pairs of distinct parameter values, the probability density function of the observations are identical. For example the parameter
θ = (1, 0, 1) leads to exactly the same joint distribution of Y1 , Y2 as does
the parameter θ = (0, 1, 0). In this case, we might consider only the two parameters (φ1, φ2) = (θ1 + θ2, θ2 + θ3) and call anything derivable from this pair estimable, while parameters such as θ2 that cannot be obtained
as functions of φ1 , φ2 are consequently unidentifiable. The solution to the
original identifiability problem is the reparametrization to the new parameter (φ1 , φ2 ) in this case, and in general, unidentifiability usually means
one should seek a new, more parsimonious parametrization.
In the above example, compute the Fisher information matrix for the
parameter θ = (θ1 , θ2 , θ3 ). Notice that the Fisher information matrix is
singular. This means that if you were to attempt to compute the asymptotic
variance of the M.L. estimator of θ by inverting the Fisher information
matrix, the inversion would be impossible. Attempting to invert a singular
matrix is like attempting to invert the number zero. It results in one or
more components that you can consider to be infinite. Arguing intuitively,
the asymptotic variance of the M.L. estimator of some of the parameters
is infinite. This is an indication that asymptotically, at least, some of the
parameters may not be identifiable. When parameters are unidentifiable,
the Fisher information matrix is generally singular. However, when J(θ)
is singular for all values of θ, this may or may not mean parameters are
unidentifiable for finite sample sizes, but it does usually mean one should
take a careful look at the parameters with a possible view to adopting
another parametrization.
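A quick numerical check of this singularity, with σ² = 1 assumed for convenience:

```python
# Sketch: Fisher information for theta = (theta1, theta2, theta3) when
# Y1 ~ N(theta1 + theta2, sigma^2) and Y2 ~ N(theta2 + theta3, sigma^2).
# With sigma^2 = 1 (an assumed value), J(theta) = A^T A where A maps theta
# to the two means; the matrix is singular, reflecting the unidentifiability.
import numpy as np

A = np.array([[1.0, 1.0, 0.0],      # mean of Y1 = theta1 + theta2
              [0.0, 1.0, 1.0]])     # mean of Y2 = theta2 + theta3
J = A.T @ A                          # Fisher information (sigma^2 = 1)

print(J)
print(np.linalg.matrix_rank(J))      # rank 2 < 3
print(np.linalg.det(J))              # 0 (up to rounding): J is singular
```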
2.11.4
U.M.V.U.E.’s and M.L. Estimators: A Comparison
Should we use U.M.V.U.E.’s or M.L. estimators? There is no general consensus among statisticians.
1. If we are estimating the expectation of a natural sufficient statistic
Ti (X) in a regular exponential family both M.L. and unbiasedness
considerations lead to the use of Ti as an estimator.
2. When sample sizes are large U.M.V.U.E.’s and M.L. estimators are
essentially the same. In that case use is governed by ease of computation. Unfortunately how large “large” needs to be is usually unknown. Some studies have been carried out comparing the behaviour
of U.M.V.U.E.’s and M.L. estimators for various small fixed sample
sizes. The results are, as might be expected, inconclusive.
3. M.L. estimators exist “more frequently” and when they do they are
usually easier to compute than U.M.V.U.E.’s. This is essentially because of the appealing invariance property of M.L.E.’s.
4. Simple examples are known for which M.L. estimators behave badly even for large samples (see Examples 2.11.1 and 2.11.2 above).
5. U.M.V.U.E.’s and M.L. estimators are not necessarily robust. They
are sensitive to model misspecification.
In Chapter 3 we examine other approaches to estimation.
2.12
Historical Notes
The concept of sufficiency is due to Fisher (1920), who in his fundamental
paper of 1922 also introduced the term and stated the factorization criterion. The criterion was rediscovered by Neyman (1935) and was proved
for general dominated families by Halmos and Savage (1949). The theory of minimal sufficiency was initiated by Lehmann and Scheffé (1950)
and Dynkin (1951). For further generalizations, see Bahadur (1954) and
Landers and Rogge (1972).
One-parameter exponential families as the only (regular) families of distributions for which there exists a one-dimensional sufficient statistic were
also introduced by Fisher (1934). His result was generalized to more than
one dimension by Darmois (1935), Koopman (1936) and Pitman (1936). A
more recent discussion of this theorem with references to the literature is
given, for example, by Hipp (1974). A comprehensive treatment of exponential families is provided by Barndorff-Nielsen (1978).
The concept of unbiasedness as “lack of systematic error” in the estimator was introduced by Gauss (1821) in his work on the theory of least
squares. The first UMVU estimators were obtained by Aitken and Silverstone (1942) in the situation in which the information inequality yields the
same result. UMVU estimators as unique unbiased functions of a suitable
sufficient statistic were derived in special cases by Halmos (1946) and Kolmogorov (1950), and were pointed out as a general fact by Rao (1947). The
concept of completeness was defined, its implications for unbiased estimation developed, and the Lehmann-Scheffé Theorem obtained in Lehmann
and Scheffé (1950, 1955, 1956). Basu’s theorem is due to Basu (1955, 1958).
The origins of the concept of maximum likelihood go back to the work
of Lambert, Daniel Bernoulli, and Lagrange in the second half of the 18th
century, and of Gauss and Laplace at the beginning of the 19th. [For
details and references, see Edwards (1974).] The modern history begins
with Edgeworth (1908-09) and Fisher (1922, 1925), whose contributions
are discussed by Savage (1976) and Pratt (1976).
The amount of information that a data set contains about a parameter
was introduced by Edgeworth (1908, 1909) and was developed systematically by Fisher (1922 and later papers). The first version of the information
inequality appears to have been given by Fréchet (1943), Rao (1945), and
Cramér (1946). The designation “information inequality” which replaced
the earlier “Cramér-Rao inequality” was proposed by Savage (1954).
Fisher’s work on maximum likelihood was followed by a euphoric belief
in the universal consistence and asymptotic efficiency of maximum likelihood estimators, at least in the i.i.d. case. The true situation was sorted
out gradually. Landmarks are Wald (1949), who provided fairly general
conditions for consistency, Cramér (1946), who defined the “regular” case
in which the likelihood equation has a consistent asymptotically efficient
root, the counterexamples of Bahadur (1958) and Hodges (Le Cam, 1953),
and Le Cam’s resulting theorem on superefficiency (1953).
Chapter 3
Other Methods of Estimation
3.1 Best Linear Unbiased Estimators
The problem of finding best unbiased estimators is considerably simpler if
we limit the class in which we search. If we permit any function of the
data, then we usually require the heavy machinery of complete sufficiency
to produce U.M.V.U.E.’s. However, the situation is much simpler if we
suggest some initial random variables and then require that our estimator
be a linear combination of these. Suppose, for example we have random
variables Y1 , Y2 , Y3 with E(Y1 ) = α + θ, E(Y2 ) = α − θ, E(Y3 ) = θ where
θ is the parameter of interest and α is another parameter. What linear
combinations of the Yi ’s provide an unbiased estimator of θ and among these
possible linear combinations which one has the smallest possible variance?
To answer these questions, we need to know the covariances Cov(Yi , Yj )
(at least up to some scalar multiple). Suppose Cov(Yi , Yj ) = 0, i 6= j and
V ar(Yj ) = σ 2 . Let Y = (Y1 , Y2 , Y3 )T and β = (α, θ)T . The model can be
written as a general linear model as
Y = Xβ + ε
where
X = [1, 1; 1, −1; 0, 1],
ε = (ε1, ε2, ε3)^T, and the εi’s are uncorrelated random variables with E(εi) = 0 and Var(εi) = σ². Then the linear combination of the components of Y that has the smallest variance among all unbiased estimators of
β is given by the usual regression formula β̃ = (α̃, θ̃)^T = (X^T X)^{−1} X^T Y and θ̃ = (1/3)(Y1 − Y2 + Y3) provides the best estimator of θ in the sense of smallest
variance. In other words, the linear combination of the components of Y
which has smallest variance among all unbiased estimators of aT β is aT β̃
where aT = (0, 1). This result follows from the following theorem.
3.1.1
Gauss-Markov Theorem
Suppose Y = (Y1 , . . . , Yn )T is a vector of random variables such that
Y = Xβ + ε
where X is a n × k (design) matrix of known constants having rank k,
β = (β1 , . . . , βk )T is a vector of unknown parameters and ε = (ε1 , . . . , εn )T
is a vector of random variables such that E (εi ) = 0, i = 1, . . . , n and
V ar (ε) = σ2 B
where B is a known non-singular matrix and σ 2 is a possibly unknown scalar
parameter. Let θ = aT β, where a is a known k × 1 vector. The unbiased
estimator of θ having smallest variance among all unbiased estimators that
are linear combinations of the components of Y is
θ̃ = aT (X T B −1 X)−1 X T B −1 Y.
Note that this result does not depend on any assumed normality of the
components of Y but only on the first and second moment behaviour, that
is, the mean and the covariances. The special case when B is the identity
matrix is the least squares estimator.
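A minimal sketch of the estimator in Theorem 3.1.1 applied to the three-observation example above (with B = I and illustrative observations):

```python
# Sketch: best linear unbiased estimator a^T (X^T B^{-1} X)^{-1} X^T B^{-1} Y
# for the three-observation example above (B = I, sigma^2 arbitrary).
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, -1.0],
              [0.0, 1.0]])
B = np.eye(3)                         # uncorrelated, equal variances
a = np.array([0.0, 1.0])              # pick out theta from beta = (alpha, theta)

def blue(Y, X=X, B=B, a=a):
    Binv = np.linalg.inv(B)
    beta_tilde = np.linalg.solve(X.T @ Binv @ X, X.T @ Binv @ Y)
    return a @ beta_tilde

Y = np.array([2.3, 0.4, 1.1])         # illustrative observations
print(blue(Y), (Y[0] - Y[1] + Y[2]) / 3)   # the two values agree
```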
3.1.2
Problem
Show that if the conditions of the Gauss-Markov Theorem hold and the εi ’s
are assumed to be normally distributed then the U.M.V.U.E. of β is given
by
β̃ = (X T B −1 X)−1 X T B −1 Y
(see Problems 1.5.11 and 1.7.25). Use this result to prove the Gauss-Markov
Theorem in the case in which the εi ’s are not assumed to be normally
distributed.
3.1.3 Example
Suppose T1 , . . . , Tn are independent unbiased estimators of θ with known
variances V ar(Ti ) = σi2 , i = 1, . . . , n. Find the best linear combination of
these estimators, that is, the one that results in an unbiased estimator of θ
having the minimum variance among all linear unbiased estimators.
3.1.4
Problem
Suppose Yij , i = 1, 2; j = 1, . . . , n are independent random variables with
E(Yij ) = μ + αi and V ar(Yij ) = σ2 where α1 + α2 = 0.
(a) Find the best linear unbiased estimator of α1 .
(b) Under what additional assumptions is this estimator the U.M.V.U.E.?
Justify your answer.
3.1.5
Problem
Suppose Yij , i = 1, 2; j = 1, . . . , ni are independent random variables with
E(Yij) = αi + βi(xij − x̄i), Var(Yij) = σ² and x̄i = (1/ni) ∑_{j=1}^{ni} xij.
(a) Find the best linear unbiased estimators of α1 , α2 , β1 and β2 .
(b) Under what additional assumptions are these estimators the U.M.V.U.E.’s?
Justify your answer.
3.1.6
Problem
Suppose X1 , . . . , Xn is a random sample from the N(μ, σ 2 ) distribution.
Find the linear combination of the random variables (Xi − X̄)2 , i = 1, . . . , n
which minimizes the M.S.E. for estimating σ 2 . Compare this estimator with
the M.L. estimator and the U.M.V.U.E. of σ2 .
3.2 Equivariant Estimators
3.2.1 Definition
A model {f(x; θ); θ ∈ ℝ} such that f(x; θ) = f0(x − θ) with f0 known is
called a location invariant family and θ is called a location parameter. (See
1.2.3.)
In many examples the location of the origin is arbitrary. For example
if we record temperatures in degrees Celsius, the 0 point has been more or less arbitrarily chosen and we might wish that our inference methods do not
depend on the choice of origin. This can be ensured by requiring that the
estimator when it is applied to shifted data, is shifted by the same amount.
3.2.2
Definition
The estimator θ̃(X1 , . . . , Xn ) is location equivariant if
θ̃(x1 + a, . . . , xn + a) = θ̃(x1 , . . . , xn ) + a
for all values of (x1 , . . . , xn ) and real constants a.
3.2.3
Example
Suppose X1 , . . . , Xn is a random sample from a N(θ, 1) distribution. Show
that the U.M.V.U.E. of θ is a location equivariant estimator.
Of course, location equivariant estimators do not make much sense for
estimating variances; they are naturally connected to estimating the location parameter in a location invariant family.
We call a given estimator, θ̃(X), minimum risk equivariant (M.R.E.) if,
among all location equivariant estimators, it has the smallest M.S.E.. It is
not difficult to show that a M.R.E. estimator must be unbiased (Problem 3.2.6). Remarkably, best estimators in the class of location equivariant
estimators are known, due to the following theorem of Pitman.
3.2.4
Theorem
Suppose X1 , . . . , Xn is a random sample from a location invariant family
{f(x; θ) = f0(x − θ), θ ∈ ℝ}, with known density f0. Then among all location equivariant estimators, the one with smallest M.S.E. is the Pitman location equivariant estimator given by
θ̃(X1, ..., Xn) = ∫_{−∞}^{∞} u ∏_{i=1}^{n} f0(Xi − u) du / ∫_{−∞}^{∞} ∏_{i=1}^{n} f0(Xi − u) du.   (3.2)

3.2.5 Example
Let X1 , . . . , Xn be a random sample from the N(θ, 1) distribution. Show
that the Pitman estimator of θ is the U.M.V.U.E. of θ.
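A sketch of computing the Pitman estimator (3.2) by numerical integration; the choice f0 = standard Cauchy density and the data are illustrative assumptions.

```python
# Sketch: Pitman location equivariant estimator (3.2) by numerical integration,
# here with f0 taken to be the standard Cauchy density (an illustrative choice
# where no U.M.V.U.E. is available).
import numpy as np
from scipy.integrate import quad

x = np.array([0.3, -1.2, 2.4, 0.8, -0.5, 1.9, 0.1])   # assumed data

def f0(z):                             # standard Cauchy density
    return 1.0 / (np.pi * (1.0 + z * z))

def joint(u):                          # prod_i f0(x_i - u)
    return np.prod(f0(x - u))

lo, hi = x.min() - 20, x.max() + 20    # effectively (-inf, inf) for this integrand
num, _ = quad(lambda u: u * joint(u), lo, hi)
den, _ = quad(joint, lo, hi)
print(num / den)                       # Pitman estimate of theta
```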
3.2.6 Problem
Prove that the M.R.E. estimator is unbiased.
3.2.7
Problem
Let (X1 , X2 ) be a random sample from the distribution with probability
density function
f(x; θ) = −6(x − θ − 1/2)(x − θ + 1/2), θ − 1/2 < x < θ + 1/2.
Show that the Pitman estimator of θ is θ̃ (X1 , X2 ) = (X1 + X2 )/2.
3.2.8
Problem
Let X1 , . . . , Xn be a random sample from the EXP(1, θ) distribution. Find
the Pitman estimator of θ and compare it to the M.L. estimator of θ and
the U.M.V.U.E. of θ.
3.2.9
Problem
Let X1 , . . . , Xn be a random sample from the UNIF(θ − 1/2, θ + 1/2) distribution. Find the Pitman estimator of θ. Show that the M.L. estimator
is not unique in this case.
3.2.10
Problem
Suppose X1 , . . . , Xn is a random sample from a location invariant family
{f(x; θ) = f0(x − θ), θ ∈ ℝ}. Show that if the M.L. estimator is unique
then it is a location equivariant estimator.
Since the M.R.E. estimator is an unbiased estimator, it follows that
if there is a U.M.V.U.E. in a given problem and if that U.M.V.U.E. is
location equivariant then the M.R.E. estimator and the U.M.V.U.E. must
be identical. M.R.E. estimators are primarily used when no U.M.V.U.E.
exists. For example, the Pitman estimator of the location parameter for
a Cauchy distribution performs very well by comparison with any other
estimator, including the M.L. estimator.
3.2.11
Definition
A model {f(x; θ); θ > 0} such that f(x; θ) = (1/θ) f1(x/θ) with f1 known is called a scale invariant family and θ is called a scale parameter (See 1.2.3).
3.2.12 Definition
An estimator θ̃k = θ̃k (X1 , . . . , Xn ) is scale equivariant if
θ̃k(cx1, . . . , cxn) = c^k θ̃k(x1, . . . , xn)
for all values of (x1 , . . . , xn ) and c > 0.
3.2.13
Theorem
Suppose X1, . . . , Xn is a random sample from a scale invariant family {f(x; θ) = (1/θ) f1(x/θ), θ > 0}, with known density f1. The Pitman scale equivariant estimator of θ^k which minimizes
E[((θ̃k − θ^k)/θ^k)²; θ]
(the scaled M.S.E.) is given by
θ̃k = θ̃k(X1, . . . , Xn) = ∫_{0}^{∞} u^{n+k−1} ∏_{i=1}^{n} f1(uXi) du / ∫_{0}^{∞} u^{n+2k−1} ∏_{i=1}^{n} f1(uXi) du
for all k for which the integrals exist.

3.2.14 Problem
(a) Show that the EXP(θ) density is a scale invariant family.
(b) Show that the U.M.V.U.E. of θ based on a random sample X1 , . . . , Xn is
a scale equivariant estimator and compare it to the Pitman scale equivariant estimator of θ. How does the M.L. estimator of θ compare with these
estimators?
(c) Find the Pitman scale equivariant estimator of θ−1 .
3.2.15
Problem
(a) Show that the N(0, σ 2 ) density is a scale invariant family.
(b) Show that the U.M.V.U.E. of σ2 based on a random sample X1 , . . . , Xn
is a scale equivariant estimator and compare it to the Pitman scale equivariant estimator of σ2 . How does the M.L. estimator of σ 2 compare with these
estimators?
(c) Find the Pitman scale equivariant estimator of σ and compare it to the
M.L. estimator of σ and the U.M.V.U.E. of σ.
3.2.16 Problem
(a) Show that the UNIF(0, θ) density is a scale invariant family.
(b) Show that the U.M.V.U.E. of θ based on a random sample X1 , . . . , Xn is
a scale equivariant estimator and compare it to the Pitman scale equivariant estimator of θ. How does the M.L. estimator of θ compare with these
estimators?
(c) Find the Pitman scale equivariant estimator of θ2 .
3.3
Estimating Equations
To find the M.L. estimator, we usually solve the likelihood equation
∑_{i=1}^{n} (∂/∂θ) log f(Xi; θ) = 0.   (3.3)
Note that the function on the left hand side is a function of both the observations and the parameter. Such a function is called an estimating function.
Most sensible estimators, like the M.L. estimator, can be described easily
through an estimating function. For example, if we know V ar(Xi ) = θ
for independent identically distributed Xi , then we can use the estimating
function
Ψ(θ; X) = ∑_{i=1}^{n} (Xi − X̄)² − (n − 1)θ
to estimate the parameter θ, without any other knowledge of the distribution, its density, mean etc. The estimating function is set equal to 0
and solved for θ. The above estimating function is an unbiased estimating
function in the sense that
E[Ψ(θ; X); θ] = 0, θ ∈ Ω.   (3.4)
This allows us to conclude that the function is at least centered appropriately for the estimation of the parameter θ. Now suppose that Ψ is an
unbiased estimating function corresponding to a large sample. Often Ψ can
be written as the sum of independent components, for example
Ψ(θ; X) = ∑_{i=1}^{n} ψ(θ; Xi).   (3.5)
Now suppose θ̂ is a root of the estimating equation
Ψ(θ; X) = 0.
Then for θ sufficiently close to θ̂,
Ψ(θ; X) = Ψ(θ; X) − Ψ(θ̂; X) ≈ (θ − θ̂) (∂/∂θ) Ψ(θ; X).
Now using the Central Limit Theorem, assuming that θ is the true value of the parameter and provided Ψ is a sum as in (3.5), the left hand side is approximately normal with mean 0 and variance equal to Var[Ψ(θ; X)]. The term (∂/∂θ)Ψ(θ; X) is also a sum of similar derivatives of the individual ψ(θ; Xi). If a law of large numbers applies to these terms, then when divided by n this sum will be asymptotically equivalent to (1/n) E[(∂/∂θ)Ψ(θ; X); θ]. It follows that the root θ̂ will have an approximate normal distribution with mean θ and variance
Var[Ψ(θ; X); θ] / {E[(∂/∂θ)Ψ(θ; X); θ]}².
By analogy with the relation between asymptotic variance of the M.L. estimator and the Fisher information, we call the reciprocal of the above
asymptotic variance formula the Godambe information of the estimating
function. This information measure is
© £∂
¤ª2
E ∂θ Ψ(θ; X); θ
J(Ψ; θ) =
.
(3.1)
V ar [Ψ(θ; X); θ]
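A numerical sketch of this sandwich-type asymptotic variance; the model, data and estimating function below are illustrative assumptions, not taken from the notes.

```python
# Sketch: solve an unbiased estimating equation numerically and use the
# sandwich-type variance Var[Psi]/{E[dPsi/dtheta]}^2 for an approximate 95% C.I.
# The per-observation function psi(theta; x) = x^2 - theta is an assumed choice.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
x = rng.normal(scale=1.5, size=200)            # assumed data; here theta = E(X^2) = 2.25

psi = lambda theta: x**2 - theta               # per-observation estimating function
Psi = lambda theta: psi(theta).sum()

theta_hat = brentq(Psi, 1e-6, 100.0)           # root of the estimating equation

# empirical versions of Var[Psi] and E[dPsi/dtheta] at theta_hat
var_Psi = np.var(psi(theta_hat), ddof=1) * len(x)
dPsi = -len(x)                                 # d/dtheta of sum(x^2 - theta) = -n
avar = var_Psi / dPsi**2                       # approximate variance of theta_hat
print(theta_hat,
      theta_hat - 1.96 * np.sqrt(avar),
      theta_hat + 1.96 * np.sqrt(avar))
```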
Godambe(1960) proved the following result.
3.3.1
Theorem
Among all unbiased estimating functions satisfying the usual regularity conditions (see 2.3.1), an estimating function which maximizes the Godambe
information (3.1) is of the form c(θ)S(θ; X) where c(θ) is non-random.
3.3.2
Problem
Prove Theorem 3.3.1.
3.3.3
Example
Suppose X = (X1, . . . , Xn) is a random sample from a distribution with
E(log Xi; θ) = e^θ   and   Var(log Xi; θ) = e^{2θ},   i = 1, . . . , n.
Consider the estimating function
Ψ(θ; X) = ∑_{i=1}^{n} (log Xi − e^θ).
(a) Show that Ψ(θ; X) is an unbiased estimating function.
(b) Find the estimator θ̂ which satisfies Ψ(θ̂; X) = 0.
(c) Construct an approximate 95% C.I. for θ.
3.3.4
Problem
Suppose X1 , . . . , Xn is a random sample from the Bernoulli(θ) distribution.
Suppose also that (ε1, . . . , εn) are independent N(0, σ²) random variables independent of the Xi’s. Define Yi = θXi + εi, i = 1, . . . , n. We observe only the values (Xi, Yi), i = 1, . . . , n. The parameter θ is unknown and the εi’s are unobserved. Define the estimating function
Ψ[θ; (X, Y)] = ∑_{i=1}^{n} (Yi − θXi).
(a) Show that this is an unbiased estimating function for θ.
(b) Find the estimator θ̂ which satisfies Ψ[θ̂; (X, Y )] = 0. Is θ̂ an unbiased
estimator of θ?
(c) Construct an approximate 95% C.I. for θ.
3.3.5
Problem
Consider random variables X1 , . . . , Xn generated according to a first order
autoregressive process
Xi = θXi−1 + Zi ,
where X0 is a constant and Z1 , . . . , Zn are independent N(0, σ 2 ) random
variables.
(a) Show that
Xi = θi X0 +
i
P
θi−j Zj .
j=1
(b) Show that
Ψ(θ; X) =
n−1
P
i=0
Xi (Xi+1 − θXi )
is an unbiased estimating function for θ.
(c) Find the estimator θ̂ which satisfies Ψ(θ̂; X) = 0. Compare the asymptotic variance of this estimator with the Cramér-Rao lower bound.
122
3.3.6
CHAPTER 3. OTHER METHODS OF ESTIMATION
Problem
Let X1 , . . . , Xn be a random sample from the POI(θ) distribution. Since
V ar (Xi ; θ) = θ, we could use the sample variance S 2 rather than the sample
mean X̄ as an estimator of θ, that is, we could use the estimating function
Ψ(θ; X) =
n
1 P
(Xi − X̄)2 − θ.
n − 1 i=1
Find the asymptotic variance of the resulting estimator and hence the asymptotic efficiency of this estimation method. (Hint: The sample variance S 2 has asymptotic variance V ar(S 2 ) ≈ n1 {E[(Xi − μ)4 ] − σ 4 } where
E(Xi ) = μ and V ar(Xi ) = σ2 .)
3.3.7
Problem
Suppose Y1 , . . . , Yn are independent random variable such that E (Yi ) = μi
and V ar (Yi ) = v (μi ) , i = 1, . . . , n where v is a known function. Suppose
also that h (μi ) = xTi β where h is a known function, xi = (xi1 , . . . , xik )T ,
i = 1, . . . , n are vectors of known constants and β is a k × 1 vector of unknown parameters. The quasi-likelihood estimating equation for estimating
β is
µ ¶T
∂μ
[V (μ)]−1 (Y − μ) = 0
∂β
where Y = (Y1 , . . . , Yn )T , μ = (μ1 , . . . , μn )T , V (μ) = diag {v (μ1 ) , . . . , v (μn )},
∂μi
and ∂μ
∂β is the n × k matrix whose (i, j) entry is ∂βj .
(a) Show that this is an unbiased estimating equation for all β.
(b) Show that if Yi v POI(μi ) and log (μi ) = xTi β then the quasi-likelihood
estimating equation is also the likelihood equation for estimating β.
3.3.8
Problem
Suppose X1 , . . . , Xn be a random sample from a distribution with p.f./p.d.f.
f (x; θ). It is well known that the estimator X̄ is sensitive to extreme
observations while the sample median is not. Attempts have been made
to find robust estimators which are not unduly affected by outliers. One
such class proposed by Huber (1981) is the class of M-estimators. These
estimators are defined as the estimators which minimize
n
P
i=1
ρ (Xi ; θ)
(3.2)
3.3. ESTIMATING EQUATIONS
123
with respect to θ for some function ρ. The “M” stands for “maximum
likelihood type” since for ρ (x; θ) = −logf (x; θ) the estimator is the M.L.
estimator. Since minimizing (3.2) is usually equivalent to solving
Ψ(θ; X) =
n ∂
P
ρ (Xi ; θ) = 0,
i=1 ∂θ
M-estimators may also be defined in terms of estimating functions.
Three examples of ρ functions are:
2
(1) ρ (x; θ) = (x − θ) /2
(2) ρ (x; θ) = |x − θ|
(3)
½
2
(x − θ) /2
ρ (x; θ) =
c |x − θ| − c2 /2
if |x − θ| ≤ c
if |x − θ| > c
(a) For all three ρ functions, find a p.d.f. f (x; θ) such that
ρ (x; θ) = −logf (x; θ) + log k.
(b) For the f (x; θ) obtained for (3), graph the p.d.f. for θ = 0, c = 1 and
θ = 0, c = 1.5. On the same graph plot the N(0, 1) and t(2) p.d.f.’s. What
do you notice?
(c) The following data, ordered from smallest to largest, were randomly
generated from a t(2) distribution:
−1.75
−0.78
−0.18
0.09
1.16
−1.24
−0.61
−0.18
0.14
1.34
−1.15
−0.59
−0.17
0.25
1.61
−1.09
−0.58
−0.15
0.36
1.95
−1.02
−0.44
−0.08
0.43
2.25
−0.93
−0.35
−0.04
0.93
2.37
−0.92
−0.26
0.02
1.03
2.59
−0.91
−0.20
0.03
1.13
4.82
Construct a frequency histogram for these data.
(d) Find the M-estimate for θ for each of the ρ functions given above. Use
c = 1 for (3).
(e) Compare these estimates with the M.L. estimate obtained by assuming
that X1 , . . . , Xn is a random sample from the distribution with p.d.f.
"
#−3/2
Γ (3/2)
(x − θ)2
f (x; θ) = √
1+
, t∈<
2
2π
which is a t(2) p.d.f. if θ = 0.
124
3.4
CHAPTER 3. OTHER METHODS OF ESTIMATION
Bayes Estimation
There are two major schools of thought on the way in which statistical
inference is conducted, the frequentist and the Bayesian school. Typically,
these schools differ slightly on the actual methodology and the conclusions
that are reached, but more substantially on the philosophy underlying the
treatment of parameters. So far we have considered a parameter as an
unknown constant underlying or indexing the probability density function
of the data. It is only the data, and statistics derived from the data that
are random.
However, the Bayesian begins by asserting that the parameter θ is simply the realization of some larger random experiment. The parameter is
assumed to have been generated according to some distribution, the prior
distribution π and the observations then obtained from the corresponding
probability density function f (x; θ) interpreted as the conditional probability density of the data given the value of θ. The prior distribution π(θ)
quantifies information about θ prior to any further data being gathered.
Sometimes π(θ) can be constructed on the basis of past data. For example,
if a quality inspection program has been running for some time, the distribution of the number of defectives in past batches can be used as the prior
distribution for the number of defectives in a future batch. The prior can
also be chosen to incorporate subjective information based on an expert’s
experience and personal judgement. The purpose of the data is then to
adjust this distribution for θ in the light of the data, to result in the posterior distribution for the parameter. Any conclusions about the plausible
value of the parameter are to be drawn from the posterior distribution. For
a frequentist, statements like P (1 < θ < 2) are meaningless; all randomness lies in the data and the parameter is an unknown constant. Hence
the effort taken in earlier courses in carefully assuring students that if an
observed 95% confidence interval for the parameter is 1 ≤ θ ≤ 2 this does
not imply P (1 ≤ θ ≤ 2) = 0.95. However, a Bayesian will happily quote
such a probability, usually conditionally on some observations, for example,
P (1 ≤ θ ≤ 2|x) = 0.95.
3.4.1
Posterior Distribution
Suppose the parameter is initially chosen at random according to the prior
distribution π(θ) and then given the value of the parameter the observations
are independent identically distributed, each with conditional probability
(density) function f (x; θ). Then the posterior distribution of the parameter
3.4. BAYES ESTIMATION
125
is the conditional distribution of θ given the data x = (x1 , . . . , xn )
π(θ|x) = cπ(θ)
n
Q
f (xi ; θ) = cπ(θ)L(θ; x)
i=1
where
−1
c
=
Z∞
π(θ)L(θ; x)dθ
−∞
is independent of θ and L(θ; x) is the likelihood function. Since Bayesian
inference is based on the posterior distribution it depends only on the data
through the likelihood function.
3.4.2
Example
Suppose a coin is tossed n times with probability of heads θ. It is known
from “previous experience with coins” that the prior probability of heads is
not always identically 1/2 but follows a BETA(10, 10) distribution. If the
n tosses result in x heads, find the posterior density function for θ.
3.4.3
Definition - Conjugate Prior Distribution
If a prior distribution has the property that the posterior distribution is
in the same family of distributions as the prior then the prior is called a
conjugate prior.
3.4.4
Conjugate Prior Distribution for the Exponential Family
Suppose X1 , . . . , Xn is a random sample from the exponential family
f (x; θ) = C(θ) exp[q(θ)T (x)]h(x)
and θ is assumed to have the prior distribution with parameters a, b given
by
π(θ) = π (θ; a, b) = k[C(θ)]a exp[bq(θ)]
(3.8)
where
k−1 =
∞
R
[C (θ)]a exp [bq (θ)] dθ.
−∞
Then the posterior distribution of θ, given the data x = (x1 , . . . , xn ) is
easily seen to be given by
π(θ|x) = c[C(θ)]a+n exp{q(θ)[b +
n
P
i=1
T (xi )]}
126
CHAPTER 3. OTHER METHODS OF ESTIMATION
where
−1
c
=
∞
R
a+n
[C (θ)]
−∞
½
∙
¸¾
n
P
exp q (θ) b +
T (xi ) dθ.
i=1
Notice that the posterior distribution is in the same family of distributions
as (3.8) and thus π(θ) is a conjugate prior. The value of the parameters of
the posterior distribution reflect the choice of parameters in the prior.
3.4.5
Example
Find the conjugate prior for θ for a random sample X1 , . . . , Xn from the
distribution with probability density function
f (x; θ) = θxθ−1 ,
0 < x < 1, θ > 0.
Show that the posterior distribution of θ given the data x = (x1 , . . . , xn ) is
in the same family of distributions as the prior.
3.4.6
Problem
Find the conjugate prior distribution of the parameter θ for a random
sample X1 , . . . , Xn from each of the following distributions. In each case,
find the posterior distribution of θ given the data x = (x1 , . . . , xn ).
(a) POI(θ)
(b) N(θ, σ 2 ), σ2 known
(c) N(μ, θ), μ known
(d) GAM(α, θ), α known.
3.4.7
Problem
Suppose X1 , . . . , Xn is a random sample from the UNIF(0, θ) distribution.
Show that the prior distribution θ ∼ PAR(a, b) is a conjugate prior.
3.4.8
Problem
Suppose X1 , . . . , Xn is a random sample from the N(μ, 1θ ) where μ and θ
are unknown. Show that the joint prior given by
½
¾
¤
θ£
π(μ, θ) = cθb1 /2 exp − a1 + b2 (a2 − μ)2 , θ > 0, μ ∈ <
2
where a1 , a2 , b1 and b2 are parameters, is a conjugate prior. This prior is
called a normal-gamma prior. Why? Hint: π(μ, θ) = π1 (μ|θ)π2 (θ).
3.4. BAYES ESTIMATION
3.4.9
127
Empirical Bayes
In the conjugate prior given in (3.8) there are two parameters, a and b,
which must be specified. In an empirical Bayes approach the parameters of
the prior are assumed to be unknown constants and are estimated from the
data. Suppose the prior distribution for θ is π(θ; λ) where λ is an unknown
parameter (possibly a vector) and X1 , . . . , Xn is a random sample from
f (x; θ). The marginal distribution of X1 , . . . , Xn is given by
f (x1 , . . . , xn ; λ) =
Z∞
π(θ; λ)
n
Q
f (xi ; θ)dθ
i=1
−∞
which depends on the data X1 , . . . , Xn and λ and therefore can be used to
estimate λ.
3.4.10
Example
In Example 3.4.5 find the marginal distribution of (X1 , . . . , Xn ) and indicate how it could be used to estimate the parameters a and b of the
conjugate prior.
3.4.11
Problem
Suppose X1 , . . . , Xn is a random sample from the POI(θ) distribution.
If a conjugate prior is assumed for θ find the marginal distribution of
(X1 , . . . , Xn ) and indicate how it could be used to estimate the parameters
a and b of the conjugate prior.
3.4.12
Problem
An insurance company insures n drivers. For each driver the company
knows Xi the number of accidents driver i has had in the past three years.
To estimate each driver’s accident rate λi the company assumes (λ1 , . . . , λn )
is a random sample from the GAM(a, b) distribution where a and b are
unknown constants and Xi ∼ POI(λi ), i = 1, . . . , n independently. Find
the marginal distribution of (X1 , . . . , Xn ) and indicate how you would find
the M.L. estimates of a and b using this distribution. Another approach to
estimating a and b would be to use the estimators
X̄
ã = ,
b̃
b̃ =
n
P
i=1
n
P
i=1
Xi2
Xi
¡
¢
− 1 + X̄ .
128
CHAPTER 3. OTHER METHODS OF ESTIMATION
Show that these are consistent estimators of a and b respectively.
3.4.13
Noninformative Prior Distributions
The choice of the prior distribution to be the conjugate prior is often motivated by mathematical convenience. However, a Bayesian would also like
the prior to accurately represent the preliminary uncertainty about the
plausible values of the parameter, and this may not be easily translated
into one of the conjugate prior distributions. Noninformative priors are the
usual way of representing ignorance about θ and they are frequently used
in practice. It can be argued that they are more objective than a subjectively assessed prior distribution since the latter may contain personal bias
as well as background knowledge. Also, in some applications the amount
of prior information available is far less than the information contained in
the data. In this case there seems little point in worrying about a precise
specification of the prior distribution.
If in Example 3.4.2 there were no reason to prefer one value of θ over
any other then a noninformative or ‘flat’ prior disribution for θ that could
be used is the UNIF(0, 1) distribution. For estimating the mean θ of a
N(θ, 1) distribution the possible values for θ are (−∞, ∞). If we take the
prior distribution to be uniform on (−∞, ∞), that is,
π(θ) = c,
−∞<θ <∞
then this is not a proper density since
Z∞
−∞
π(θ)dθ = c
Z∞
−∞
dθ = ∞.
Prior densities of this type are called improper priors. In this case we could
consider a sequence of prior distributions such as the UNIF(−M, M ) which
approximates this prior as M → ∞. Suppose we call such a prior density
function πM . Then the posterior distribution of the parameter is given by
πM (θ|x) = cπM (θ)L(θ; x)
and it is easy to see that as M → ∞, this approaches a constant multiple
of the likelihood function L(θ). This provides another interpretation of the
likelihood function. We can consider it as proportional to the posterior
distribution of the parameter when using a uniform improper prior on the
whole real line. The language is somewhat sloppy here since, as we have
seen, the uniform distribution on the whole real line really makes sense only
through taking limits for uniform distributions on finite intervals.
3.4. BAYES ESTIMATION
129
In the case of a scale parameter, which must take positive values such
as the normal variance, it is usual to express ignorance of the prior distribution of the parameter by assuming that the logarithm of the parameter
is uniform on the real line.
3.4.14
Example
Let X1 , . . . , Xn be a random sample from a N(μ, σ 2 ) distribution and assume that the prior distributions of μ, and log(σ 2 ) are independent improper uniform distributions. Show that the marginal√posterior distribution of μ given the data x = (x1 , . . . , xn ) is such that n (μ − x̄) /s has a
t distribution with n − 1 degrees of freedom. Show also that the marginal
posterior distribution
of σ 2 given the data x is such that 1/σ 2 has a GAM
³
´
n−1
2
distribution.
2 , (n−1)s2
3.4.15
Jeffreys’ Prior
A problem with nonformative prior distributions is whether the prior distribution should be uniform for θ or some function of θ, such as θ2 or log(θ).
It is common to use a uniform prior for τ = h(θ) where h(θ) is the function
of θ whose Fisher information does not depend on the unknown parameter. This idea is due to Jeffreys and leads to a prior distribution which is
proportional to [J(θ)]1/2 . Such a prior is referred to as a Jeffreys’ prior.
3.4.16
Problem
h
i
∂2
Suppose {f (x; θ) ; θ ∈ Ω} is a regular model and J(θ) = E − ∂θ
2 logf (X; θ) ; θ
is the Fisher information. Consider the reparameterization
Zθ p
τ = h(θ) =
J(u)du ,
(3.9)
θ0
where θ0 is a constant. Show that the Fisher information for the reparameterization is equal to one (see Problem 2.3.4). (Note: Since the asymptotic
variance of the M.L. estimator τ̂n is equal to 1/n, which does not depend
on τ, (3.9) is called a variance stabilizing transformation.)
3.4.17
Example
Find the Jeffreys’ prior for θ if X has a BIN(n, θ) distribution. What
function of θ has a uniform prior distribution?
130
CHAPTER 3. OTHER METHODS OF ESTIMATION
3.4.18
Problem
Find the Jeffreys’ prior distribution for a random sample X1 , . . . , Xn from
each of the following distributions. In each case, find the posterior distribution of the parameter θ given the data x = (x1 , . . . , xn ). What function
of θ has a uniform prior distribution?
(a) POI(θ)
(b) N(θ, σ 2 ), σ2 known
(c) N(μ, θ), μ known
(d) GAM(α, θ), α known.
3.4.19
Problem
If θ is a vector then the Jeffreys’ prior is taken to be proportional to the
square root of the determinant of the Fisher information matrix. Suppose
(X1 , X2 ) v MULT(n, θ1 , θ2 ). Find the Jeffreys’ prior for (θ1 , θ2 ). Find the
posterior distribution of (θ1 , θ2 ) given (x1 , x2 ). Find the marginal posterior
distribution of θ1 given (x1 , x2 ) and the marginal posterior distribution of
θ2 given (x1 , x2 ).
Hint: Show
Z1 1−x
Z
Γ (a) Γ (b) Γ (c)
xa−1 y b−1 (1 − x − y)c−1 dydx =
,
Γ (a + b + c)
0
a, b, c > 0
0
3.4.20
Problem
Suppose E(Y ) = Xβ where Y = (Y1 , . . . , Yn )T is a vector of independent
and normally distributed random variables with V ar(Yi ) = σ 2 , i = 1, . . . , n,
X is a n × k matrix of known constants of rank k and β = (β1 , . . . , βk )T is
a vector of unknown parameters. Let
¢−1 T
¡
X y and s2e = (y − X β̂)T (y − X β̂)/ (n − k)
β̂ = X T X
where y = (y1 , . . . , yn )T are the observed data.
(a) Find the joint posterior distribution of β and σ 2 given the data if the
joint (improper) prior distribution of β and σ2 is assumed to be proportional
to σ −2 .
(b) Show that the marginal³posterior distribution
of σ 2 given the data y is
´
n−k
2
−2
such that σ has a GAM
distribution.
2 , (n−k)s2
e
(c) Find the marginal posterior distribution of β given the data y.
3.4. BAYES ESTIMATION
131
(d) Show that the
posterior distribution of β given σ 2 and the
³ conditional
¡ T ¢−1 ´
2
data y is MVN β̂, σ X X
.
¡ ¢
(e) Show that (β − β̂)T X T (Xβ − β̂)/ ks2e has a Fk,n−k distribution.
3.4.21
Bayes Point Estimators
One method of obtaining a point estimator of θ is to use the posterior
distribution and a suitable loss function.
3.4.22
Theorem
Suppose X has p.f./p.d.f. f (x; θ) and θ has prior distribution π(θ). The
Bayes estimator of θ for squared error loss with respect to the prior π(θ)
given X is
Z∞
θ̃ = θ̃(X) =
θπ(θ|X)dθ = E(θ|X)
−∞
which is the mean of the posterior distribution π(θ|X). This estimator
minimizes
⎡
⎤
Z∞ Z∞ ³
´2
⎣
θ̃ − θ f (x; θ) dx⎦ π(θ)dθ.
E[(θ̃ − θ)2 ] =
−∞
3.4.23
−∞
Example
Suppose X1 , . . . , Xn is a random sample from the distribution with probability density function
f (x; θ) = θxθ−1 ,
0 < x < 1, θ > 0.
Using a conjugate prior for θ find the Bayes estimator of θ for squared error
loss. What is the Bayes estimator of τ = 1/θ for squared error loss? Do
Bayes estimators satisfy an invariance property?
3.4.24
Example
In Example 3.4.14 find the Bayes estimators of μ and σ2 for squared error
loss based on their respective marginal posterior distributions.
3.4.25
Problem
Prove Theorem 3.4.22. Hint: Show that E[(X − c)2 ] is minimized by the
value c = E(X).
132
CHAPTER 3. OTHER METHODS OF ESTIMATION
3.4.26
Problem
For each case in Problems 3.4.7 and 3.4.18 find the Bayes estimator of θ
for squared error loss and compare the estimator with the U.M.V.U.E. as
n → ∞.
3.4.27
Problem
In Problem 3.4.12 find the Bayes estimators of (λ1 , . . . , λn ) for squared
error loss.
3.4.28
Problem
Let X1 , . . . , Xn be a random sample from a GAM(α, β) distribution where
α is known. Find the posterior distribution of λ = 1/β given X1 , . . . , Xn if
the improper prior distribution of λ is assumed to be proportional to 1/λ.
Find the Bayes estimator of β for squared error loss and compare it to the
U.M.V.U.E. of β.
3.4.29
Problem
In Problem 3.4.19 find the Bayes estimators of θ1 and θ2 for squared error
loss using their respective marginal posterior distributions. Compare these
to the U.M.V.U.E.’s.
3.4.30
Problem
In Problem 3.4.20 find the Bayes estimators of β and σ 2 for squared error
loss using their respective marginal posterior distributions. Compare these
to the U.M.V.U.E.’s.
3.4.31
Problem
Show that the Bayes estimator of θ for absolute error loss with respect to
the prior π(θ) given data X is the median of the posterior distribution.
Hint:
d
dy
b(y)
b(y)
Z
Z
∂g(x, y)
0
0
g(x, y)dx = g(b (y) , y) · b (y) − g(a (y) , y) · a (y) +
dx.
∂y
a(y)
a(y)
3.4. BAYES ESTIMATION
3.4.32
133
Bayesian Intervals
There remains, after many decades, a controversy between Bayesians and
frequentists about which approach to estimation is more suitable to the real
world. The Bayesian has advantages at least in the ease of interpretation
of the results. For example, a Bayesian can use the posterior distribution
given the data x = (x1 , . . . , xn ) to determine points a = a(x), b = b(x) such
that
Za
π(θ|x)dθ = 0.95
a
and then give a Bayesian confidence interval (a, b) for the parameter. If
this results in [2, 5] the Bayesian will state that (in a Bayesian model, subject to the validity of the prior) the conditional probability given the data
that the parameter falls in the interval [2, 5] is 0.95. No such probability
can be ascribed to a confidence interval for frequentists, who see no randomness in the parameter to which this probability statement is supposed
to apply. Bayesian confidence regions are also called credible regions in
order to make clear the distinction between the interpretation of Bayesian
confidence regions and frequentist confidence regions.
Suppose π(θ|x) is the posterior distribution of θ given the data x and
A is a subset of Ω. If
Z
P (θ ∈ A|x) = π(θ|x)dθ = p
A
then A is called a p credible region for θ. A credible region can be formed
in many ways. If (a, b) is an interval such that
P (θ < a|x) =
1−p
= P (θ > b|x)
2
then [a, b] is called a p equal-tailed credible region. A highest posterior density (H.P.D.) credible region is constructed in a manner similar to likelihood
regions. The p H.P.D. credible region is given by {θ : π(θ|x) ≥ c} where c
is chosen such that
Z
π(θ|x)dθ.
p=
{θ:π(θ|x)≥c}
A H.P.D. credible region is optimal in the sense that it is the shortest
interval for a given value of p.
134
3.4.33
CHAPTER 3. OTHER METHODS OF ESTIMATION
Example
Suppose X1 , . . . , Xn is a random sample from the N(μ, σ 2 ) distribution
where σ 2 is known and μ has the conjugate prior. Find the p = 0.95 H.P.D.
credible region for μ. Compare this to a 95% C.I. for μ.
3.4.34
Problem
Suppose (X1 , . . . , X10 ) is a random sample from the GAM(2, 1θ ) distribu10
P
tion. If θ has the Jeffreys’ prior and
xi = 4 then find and compare
i=1
(a) the 0.95 equal-tailed credible region for θ
(b) the 0.95 H.P.D. credible region for θ
(c) the 95% exact equal tail C.I. for θ.
Finally, although statisticians argue whether the Bayesian or the frequentist approach is better, there is really no one right way to do statistics.
Some problems are best solved using a frequentist approach while others
are best solved using a Bayesian approach. There are certainly instances in
which a Bayesian approach seems sensible— particularly for example if the
parameter is a measurement on a possibly randomly chosen individual (say
the expected total annual claim of a client of an insurance company).
Chapter 4
Hypothesis Tests
4.1
Introduction
Statistical estimation usually concerns the estimation of the value of a parameter when we know little about it except perhaps that it lies in a given
parameter space, and when we have no a priori reason to prefer one value
of the parameter over another. If, however, we are asked to decide between
two possible values of the parameter, the consequences of one choice of the
parameter value may be quite different from another choice. For example,
if we believe Yi is normally distributed with mean α + βxi and variance
σ 2 for some explanatory variables xi , then the value β = 0 means there
is no relation between Yi and xi . We need neither collect the values of xi
nor build a model around them. Thus the two choices β = 0 and β = 1
are quite different in their consequences. This is often the case. An excellent example of the complete asymmetry in the costs attached to these two
choices is Problem 4.4.17.
A hypothesis test involves a (usually natural) separation of the parameter space Ω into two disjoint regions, Ω0 and Ω − Ω0 . By the difference
between the two sets we mean those points in the former (Ω) that are not
in the latter (Ω0 ). This partition of the parameter space corresponds to
testing the null hypothesis that the parameter is in Ω0 . We usually write
this hypothesis in the form
H0 : θ ∈ Ω0 .
The null hypothesis is usually the status quo. For example in a test of a
new drug, the null hypothesis would be that the drug had no effect, or no
more of an effect than drugs already on the market. The null hypothesis
135
136
CHAPTER 4. HYPOTHESIS TESTS
is only rejected if there is reasonably strong evidence against it. The alternative hypothesis determines what departures from the null hypothesis are
anticipated. In this case, it might be simply
H1 : θ ∈ Ω − Ω0 .
Since we do not know the true value of the parameter, we must base our
decision on the observed value of X. The hypothesis test is conducted
by determining a partition of the sample space into two sets, the critical
or rejection region R and its complement R̄ which is called the acceptance
region. We declare that H0 is false (in favour of the alternative) if we
observe x ∈ R. When a test of hypothesis is conducted there are two types
of possible errors: reject the null hypothesis H0 when it is true (Type I
error) and accept H0 when it is false (Type II error).
4.1.1
Definition
The power function of a test with rejection region R is the function
β(θ) = P (X ∈ R; θ) = P (reject H0 ; θ), θ ∈ Ω.
Note that
β (θ) = 1 − P (accept H0 ; θ)
¡
¢
= 1 − P X ∈ R̄; θ
= 1 − P (type II error; θ) for θ ∈ Ω − Ω0 .
In order to minimize the two types of possible errors in a test of hypothesis, it is obviously desirable that the power function β(θ) be small for
θ ∈ Ω0 but large for θ ∈ Ω − Ω0 .
The probability of rejecting H0 when it is true determines one important
measure of the performance of a test, the level of significance.
4.1.2
Definition
A test has level of significance α if β(θ) ≤ α for all θ ∈ Ω0 .
The level of significance is simply an upper bound on the probability of
a type I error. There is no assurance that the upper bound is tight, that is,
that equality is achieved somewhere. The lowest such upper bound is often
called the size of the test.
4.1. INTRODUCTION
4.1.3
137
Definition
The size of a test is equal to sup β(θ).
θ∈Ω0
4.1.4
Example
Suppose we toss a coin 100 times to determine if the coin is fair. Let
X = number of heads observed. Then the model is
X v BIN (n, θ) , θ ∈ Ω = {θ : 0 < θ < 1} .
The null hypothesis is H0 : θ = 0.5 and Ω0 = {0.5}. This is an example of a simple hypothesis. A simple null hypothesis is one for which Ω0
contains a single point. The alternative hypothesis is H1 : θ 6= 0.5 and
Ω − Ω0 = {θ : 0 < θ < 1, θ 6= 0.5} . The alternative hypothesis is not a simple hypothesis since Ω − Ω0 contains more than one point. It is an example
of a composite hypothesis.
The sample space is S = {x : x = 0, 1, . . . , 100}. Suppose we choose the
rejection region to be
R = {x : |x − 50| ≥ 10} = {x : x ≤ 40 or x ≥ 60} .
The test of hypothesis is conducted by rejecting the null hypothesis H0 in
favour of the alternative H1 if x ∈ R. The acceptance region is
R̄ = {x : 41 ≤ x ≤ 59} .
The power function is
β (θ) = P (X ∈ R; θ)
= P (X ≤ 40 ∪ X ≥ 60; θ)
µ ¶
59
P
100 x
100−x
= 1−
θ (1 − θ)
x
x=41
A graph of the power function is given below.
For this example Ω0 = {0.5} consists of a single point and therefore
P (type I error) =
=
=
=
size of test
β (0.5)
P (X ∈ R; θ = 0.5)
P (X ≤ 40 ∪ X ≥ 60; θ = 0.5)
µ ¶
59
P
100
= 1−
(0.5)x (0.5)100−x
x
x=41
≈ 0.05689.
138
CHAPTER 4. HYPOTHESIS TESTS
1
0.9
0.8
0.7
0.6
beta(theta)
0.5
0.4
0.3
0.2
0.1
0
0.3
0.35
0.4
0.45
0.5
theta
0.55
0.6
0.65
0.7
Figure 4.1: Power Function for Binomial Example
4.2
Uniformly Most Powerful Tests
Tests are often constructed by specifying the size of the test, which in
turn determines the probability of the type I error, and then attempting
to minimize the probability that the null hypothesis is accepted when it is
false (type II error). Equivalently, we try to maximize the power function
of the test for θ ∈ Ω − Ω0 .
4.2.1
Definition
A test with power function β(θ) is a uniformly most powerful (U.M.P.) test
of size α if, for all other tests of the same size α having power function
β ∗ (θ), we have β(θ) ≥ β ∗ (θ) for all θ ∈ Ω − Ω0 .
The word “uniformly” above refers to the fact that one function dominates another, that is, β(θ) ≥ β ∗ (θ) uniformly for all θ ∈ Ω − Ω0 . When
the alternative Ω − Ω0 consists of a single point {θ1 } then the construc-
4.2. UNIFORMLY MOST POWERFUL TESTS
139
tion of a best test is particularly easy. In this case, we may drop the word
“uniformly” and refer to a “most powerful test”. The construction of a
best test, by this definition, is possible under rather special circumstances.
First, we often require a simple null hypothesis. This is the case when Ω0
consists of a single point {θ0 } and so we are testing the null hypothesis
H0 : θ = θ0 .
4.2.2
Neyman-Pearson Lemma
Let X have probability (density) function f (x; θ) , θ ∈ Ω. Consider testing a
simple null hypothesis H0 : θ = θ0 against a simple alternative H1 : θ = θ1 .
For a constant c, suppose the rejection region defined by
R = {x;
f (x; θ1 )
> c}
f (x; θ0 )
corresponds to a test of size α. Then the test with this rejection region is
a most powerful test of size α for testing H0 : θ = θ0 against H1 : θ = θ1 .
4.2.3
Proof
Consider another rejection region R1 with the same size. Then
Z
Z
P (X ∈ R; θ0 ) = P (X ∈ R1 ; θ0 ) = α or
f (x; θ0 )dx = f (x; θ0 )dx.
R
Therefore
Z
Z
f (x; θ0 )dx +
f (x; θ0 )dx =
R∩R1
R∩R̄1
and
Z
Z
f (x; θ0 )dx +
Z
f (x; θ0 )dx.
R∩R1
f (x; θ0 )dx =
R∩R̄1
R1
Z
f (x; θ0 )dx
R̄∩R1
(4.1)
R̄∩R1
For x ∈ R ∩ R̄1 ,
f (x; θ1 )
> c or f (x; θ1 ) > cf (x; θ0 )
f (x; θ0 )
and thus
Z
R∩R̄1
f (x; θ1 ) > c
Z
R∩R̄1
f (x; θ0 )dx.
(4.2)
140
CHAPTER 4. HYPOTHESIS TESTS
For x ∈ R̄ ∩ R1 , f (x; θ1 ) ≤ cf (x; θ0 ), and thus
Z
Z
f (x; θ1 )dx ≥ −c
f (x; θ0 )dx.
−
R̄∩R1
(4.3)
R̄∩R1
Now
β(θ1 ) = P (X ∈ R; θ1 ) = P (X ∈ R ∩ R1 ; θ1 ) + P (X ∈ R ∩ R̄1 ; θ1 )
Z
Z
=
f (x; θ1 )dx +
f (x; θ1 )dx
R∩R1
R∩R̄1
and
β1 (θ1 ) = P (X ∈ R1 ; θ1 )
Z
=
f (x; θ1 )dx +
R∩R1
Z
f (x; θ1 )dx.
R̄∩R1
Therefore, using (4.1), (4.2), and (4.3) we have
Z
Z
β(θ1 ) − β1 (θ1 ) =
f (x; θ1 )dx −
f (x; θ1 )dx
R̄∩R1
R∩R̄1
≥ c
Z
f (x; θ0 )dx − c
R∩R̄1
⎡
⎢
= c⎣
Z
Z
R∩R̄1
R̄∩R1
f (x; θ0 )dx −
Z
R̄∩R1
f (x; θ0 )dx
⎤
⎥
f (x; θ0 )dx⎦ = 0
and the test with rejection region R is therefore the most powerful.¥
4.2.4
Example
Suppose X1 , . . . , Xn are independent N(θ, 1) random variables. We consider only the parameter space Ω = [0, ∞). Suppose we wish to test the
hypothesis H0 : θ = 0 against H1 : θ > 0.
(a) Choose an arbitrary θ1 > 0 and obtain the rejection region for the most
powerful test of size 0.05 of H0 against H1 : θ = θ1 .
(b) Does this test depend on the value of θ1 you chose? Can you conclude
that it is uniformly most powerful?
4.2. UNIFORMLY MOST POWERFUL TESTS
141
(c) Graph the power function of the test.
(d) Find the rejection region for the uniformly most powerful test of H0 :
θ = 0 against H1 : θ < 0. Find and graph the power function of this test.
1
0.9
0.8
0.7
0.6
power
function
0.5
0.4
0.3
0.2
0.1
0
-1.5
-1
-0.5
0
theta
0.5
1
Figure 4.2: Power Functions for Examples 4.2.4 and 4.2.5:
β1 (θ), -· β2 (θ)
4.2.5
1.5
— β (θ), - -
Example
Let X1 , . . . , Xn be a random sample from the √
N(θ, 1) distribution. Consider
the rejection region {(x1 , . . . , xn ); |x̄| > 1.96/ n} for testing the hypothesis
H0 : θ = 0 against H1 : θ 6= 0. What is the size of this test? Graph the
power function of this test. Is this test uniformly most powerful?
4.2.6
Problem - Sufficient Statistics and Hypothesis
Tests
Suppose X has probability (density) function f (x; θ) , θ ∈ Ω. Suppose
also that T = T (X) is a minimal sufficient statistic for θ. Show that the
142
CHAPTER 4. HYPOTHESIS TESTS
rejection region of the most powerful test of H0 : θ = θ0 against H1 : θ = θ1
depends on the data X only through T.
4.2.7
Problem
Let X1 , . . . , X5 be a random sample from the distribution with probability
density function
θ
f (x; θ) = θ+1 , x ≥ 1, θ > 0.
x
(a) Find the rejection region for the most powerful test of size 0.05
¡ ¢ of
H0 : θ = 1 against H1 : θ = θ1 where θ1 > 1. Note: log (Xi ) v EXP θ1 .
(b) Explain why the rejection region in (a) is also the rejection region for the
uniformly most powerful test of size 0.05 of H0 : θ = 1 against H1 : θ > 1.
Sketch the power function of this test.
(c) Find the uniformly most powerful test of size 0.05 of H0 : θ = 1 against
H1 : θ < 1. On the same graph as in (b) sketch the power function of this
test.
(d) Explain why there is no uniformly most powerful test of H0 against
H1 : θ 6= 1. What reasonable test of size 0.05 might be used for testing H0
against H1 : θ 6= 1? On the same graph as in (b) sketch the power function
of this test.
4.2.8
Problem
Let X1 , . . . , X10 be a random sample from the GAM( 12 , θ) distribution.
(a) Find the rejection region for the most powerful test of size 0.05 of
H0 : θ = 2 against the alternative H1 : θ = θ1 where θ1 < 2.
(b) Explain why the rejection region in (a) is also the rejection region for
the uniformly most powerful test of size 0.05 of H0 : θ = 2 against the
alternative H1 : θ < 2. Sketch the power function of this test.
(c) Find the uniformly most powerful test of size 0.05 of H0 : θ = 2 against
the alternative H1 : θ > 2. On the same graph as in (b) sketch the power
function of this test.
(d) Explain why there is no uniformly most powerful test of H0 against
H1 : θ 6= 2. What reasonable test of size 0.05 might be used for testing H0
against H1 : θ 6= 2? On the same graph as in (b) sketch the power function
of this test.
4.2. UNIFORMLY MOST POWERFUL TESTS
4.2.9
143
Problem
Let X1 , . . . , Xn be a random sample from the UNIF(0, θ) distribution. Find
the rejection region for the uniformly most powerful test of H0 : θ = 1
against the alternative H1 : θ > 1 of size 0.01. Sketch the power function
of this test for n = 10.
4.2.10
Problem
We anticipate collecting observations (X1 , . . . , Xn ) from a N(μ, σ 2 ) distribution in order to test the hypothesis H0 : μ = 0 against the alternative
H1 : μ > 0 at level of significance 0.05. A preliminary investigation yields
σ ≈ 2. How large a sample must we take in order to have power equal to
0.95 when μ = 1?
4.2.11
Relationship Betweeen Hypothesis Tests and
Confidence Intervals
There is a close relationship between hypothesis tests and confidence intervals as the following example illustrates. Suppose X1 , . . . , Xn is a random
sample from the N(θ, 1) distribution and we wish to test the hypothesis √
H0 :
θ = θ0 against H1 : θ 6= θ0 . The rejection region {x; |x̄ − θ0 | > 1.96/ n}
is a size α = 0.05 rejection
√ region which has a corresponding acceptance
region {x; |x̄ − θ0 | ≤ 1.96/ n}. Note that the hypothesis
H0 : θ = θ0 would
√
not be rejected at the 0.05 level if |x̄ − θ0 | ≤ 1.96/ n or equivalently
√
√
x̄ − 1.96/ n ≤ θ0 ≤ x̄ + 1.96/ n
which is a 95% C.I. for θ.
4.2.12
Problem
Let (X1 , . . . , X5 ) be a random sample from the GAM(2, θ) distribution.
Show that
½ 5
¾
5
P
P
R = x;
xi < 4.7955θ0 or
xi > 17.085θ0
i=1
i=1
is a size 0.05 rejection region for testing H0 : θ = θ0 . Show how this
rejection region may be used to construct a 95% C.I. for θ.
144
4.3
CHAPTER 4. HYPOTHESIS TESTS
Locally Most Powerful Tests
It is not always possible to construct a uniformly most powerful test. For
this reason, and because alternative values of the parameter close to those
under H0 are the hardest to differentiate from H0 itself, one may wish to
develop a test that is best able to test the hypthesis H0 : θ = θ0 against
alternatives very close to θ0 . Such a test is called locally most powerful.
4.3.1
Definition
A test of H0 : θ = θ0 against H1 : θ > θ0 with power function β(θ) is
locally most powerful if, for any other test having the same size and having
power function β ∗ (θ), there exists an ² > 0 such that β(θ) ≥ β ∗ (θ) for all
θ0 < θ < θ0 + ².
This definition asserts that there is a neighbourhood of the null hypothesis in which the test is most powerful.
4.3.2
Theorem
Suppose {f (x; θ) ; θ ∈ Ω} is a regular statistical model with corresponding
score function
∂
S(θ; x) =
log f (x; θ) .
∂θ
A locally most powerful test of H0 : θ = θ0 against H1 : θ > θ0 has rejection
region
R = {x; S(θ0 ; x) > c} ,
where c is a constant determined by
P [S(θ0 ; X) > c; θ0 ] = size of test.
Since this test is based on the score function, it is also called a score
test.
4.3.3
Example
Suppose X1 , . . . , Xn is a random sample from a N(θ, 1) distribution. Show
that the locally most powerful test of H0 : θ = 0 against H1 : θ > 0 is also
the uniformly most powerful test.
4.3. LOCALLY MOST POWERFUL TESTS
4.3.4
145
Problem
Consider a single observation X from the LOG(1, θ) distribution. Find the
rejection region for the locally most powerful test of H0 : θ = 0 against
H1 : θ > 0. Is this test also uniformly most powerful? What is the power
function of the test?
Suppose X = (X1 , . . . , Xn ) is a random sample from a regular statistical
model {f (x; θ) ; θ ∈ Ω} and the exact distribution of
S(θ0 ; X) =
n ∂
P
log f (Xi ; θ0 )
i=1 ∂θ
is difficult to obtain. Since, under H0 : θ = θ0 ,
S(θ0 ; X)
p
→D Z v N(0, 1)
J (θ0 )
by the C.L.T., an approximate size α rejection region for testing H0 : θ = θ0
against H1 : θ > θ0 is given by
)
(
S(θ0 ; x)
≥a
x; p
J (θ0 )
where P (Z ≥ a) = α and Z v N(0, 1). J(θ0 ) may be replaced by I(θ0 ; x).
4.3.5
Example
Suppose X1 , . . . , Xn is a random sample from the CAU(1, θ) distribution.
Find an approximate rejection region for a locally most powerful size 0.05
test of H0 : θ = θ0 against H1 : θ < θ0 . Hint: Show J(θ) = n/2.
4.3.6
Problem
Suppose X1 , . . . , Xn is a random sample from the WEI(1, θ) distribution.
Find an approximate rejection region for a locally most powerful size 0.01
test of H0 : θ = θ0 against H1 : θ > θ0 .
Hint: Show that
n
π2
J(θ) = 2 (1 +
+ γ 2 − 2γ)
θ
6
where
Z∞
γ = − (log y)e−y dy ≈ 0.5772
0
is Euler’s constant.
146
4.4
CHAPTER 4. HYPOTHESIS TESTS
Likelihood Ratio Tests
Consider a test of the hypothesis H0 : θ ∈ Ω0 against H1 : θ ∈ Ω − Ω0 .
We have seen that for prescribed θ0 ∈ Ω0 , θ1 ∈ Ω − Ω0 , the most powerful
test of the simple null hypothesis H0 : θ = θ0 against a simple alternative
H1 : θ = θ1 is based on the likelihood ratio fθ1 (x)/fθ0 (x). By the NeymanPearson Lemma it has rejection region
½
¾
f (x; θ1 )
R = x;
>c
f (x; θ0 )
where c is a constant determined by the size of the test. When either the
null or the alternative hypothesis are composite (i.e. contain more than one
point) and there is no uniformly most powerful test, it seems reasonable to
use a test with rejection region R for some choice of θ1 , θ0 . The likelihood
ratio test does this with θ1 replaced by θ̂, the M.L. estimator over all possible
values of the parameter, and θ0 replaced by the M.L. estimator of the
parameter when it is restricted to Ω0 . Thus, the likelihood ratio test of
H0 : θ ∈ Ω0 versus H1 : θ ∈ Ω − Ω0 has rejection region R = {x; Λ(x) > c}
where
Λ(x) =
sup f (x; θ)
sup L (θ; x)
θ∈Ω
θ∈Ω
sup f (x; θ)
θ∈Ω0
=
sup L (θ; x)
θ∈Ω0
and c is determined by the size of the test. In general, the distribution of
the test statistic Λ(X) may be difficult to find. Fortunately, however, the
asymptotic distribution is known under fairly general conditions. In a few
cases, we can show that the likelihood ratio test is equivalent to the use of
a statistic with known distribution. However, in many cases, we need to
rely on the asymptotic chi-squared distribution of Theorem 4.4.7.
4.4.1
Example
Let X1 , . . . , Xn be a random sample from the N(μ, σ 2 ) distribution where
μ and σ 2 are unknown. Consider a test of
H0 : μ = 0, 0 < σ 2 < ∞
against the alternative
H1 : μ 6= 0, 0 < σ 2 < ∞.
(a) Show that the likelihood ratio test of H0 against H1 has rejection region
4.4. LIKELIHOOD RATIO TESTS
147
R = {x; nx̄2 /s2 > c}.
(b) Show under H0 that the statistic T = nX̄ 2 /S 2 has a F(1, n − 1) distribution and thus find a size 0.05 test for n = 20.
(c) What rejection region would you use for testing H0 : μ = 0, 0 < σ 2 < ∞
against the one-sided alternative H1 : μ > 0, 0 < σ 2 < ∞?
4.4.2
Problem
Suppose X ∼ GAM(2, β1 ) and Y ∼ GAM(2, β2 ) independently.
(a) Show that the likelihood ratio statistic for testing the hypothesis H0 :
β1 = β2 against the alternative H1 : β1 6= β2 is a function of the statistic
T = X/(X + Y ).
(b) Find the distribution of T under H0 .
(b) Find the rejection region for a size 0.01 test. What rejection region
would you use for testing H0 : β1 = β2 against the one-sided alternative
H1 : β1 > β2 ?
4.4.3
Problem
Let (X1 , . . . , Xn ) be a random sample from the N(μ, σ2 ) distribution and
independently let (Y1 , . . . , Yn ) be a random sample from the N(θ, σ 2 ) distribution where σ 2 is known.
(a) Show that the likelihood ratio statistic for testing the hypothesis H0 :
μ = θ against the alternative H1 : μ 6= θ is a function of T = |X̄ − Ȳ |.
(b) Find the rejection region for a size 0.05 test. Is this test U.M.P.? Why?
4.4.4
Problem
Suppose X1 , . . . , Xn are independent EXP(λ) random variables and independently Y1 , . . . , Ym are independent EXP(μ) random variables.
(a) Show that the likelihood ratio statistic for testing the hypothesis H0 :
λ = μ against the alternative H1 : λ 6= μ is a function of
T =∙
n
P
Xi
i=1
n
P
i=1
Xi +
m
P
i=1
Yi
¸.
(b) Find the distribution of T under H0 . Explain clearly how you would
find a size α = 0.05 rejection region.
148
CHAPTER 4. HYPOTHESIS TESTS
(c) For n = 20 find the rejection region for the one-sided alternative H1 :
λ > μ for a size 0.05 test.
4.4.5
Problem
Suppose X1 , . . . , Xn is a random sample from the EXP(β, μ) distribution
where β and μ are unknown.
(a) Show that the likelihood ratio statistic for testing the hypothesis
H0 : β = 1 against the alternative H1 : β 6= 1 is a function of the statistic
T =
n ¡
¢
P
Xi − X(1) .
i=1
(b) Show that under H0 , 2T has a chi-squared distribution (see Problem
1.8.11).
(c) For n = 12 find the rejection region for the one-sided alternative
H1 : β > 1 for a size 0.05 test.
4.4.6
Problem
Let X1 , . . . , Xn be a random sample from the distribution with p.d.f.
f (x; α, β) =
αxα−1
,
βα
0 < x ≤ β.
(a) Show that the likelihood ratio statistic for testing the hypothesis
H0 : α = 1 against the alternative H1 : α 6= 1 is a function of the statistic
T =
n ¡
¢
Q
Xi /X(n)
i=1
(b) Show that under H0 , −2 log T has a chi-squared distribution (see Problem 1.8.12).
(c) For n = 14 find the rejection region for the one-sided alternative
H1 : α > 1 for a size 0.05 test.
4.4.7
Problem
Suppose Yi ∼ N(α+βxi , σ 2 ), i = 1, 2, . . . , n independently where x1 , . . . , xn
are known constants and α, β and σ 2 are unknown parameters.
4.4. LIKELIHOOD RATIO TESTS
149
(a) Show that the likelihood ratio statistic for testing H0 : β = 0 against
the alternative H1 : β 6= 0 is a function of
β̂ 2
T =
where
Se2 =
n
P
i=1
(xi − x̄)
2
Se2
n
1 P
(Yi − α̂ − β̂xi )2 .
n − 2 i=1
(b) What is the distribution of T under H0 ?
4.4.8
Theorem - Asymptotic Distribution of the Likelihood Ratio Statistic (Regular Model)
Suppose X = (X1 , . . . , Xn ) is a random sample from a regular statistical
model {f (x; θ) ; θ ∈ Ω} with Ω an open set in k−dimensional Euclidean
space. Consider a subset of Ω defined by Ω0 = {θ(η); η ∈ open subset of
q-dimensional Euclidean space }. Then the likelihood ratio statistic defined
by
n
Q
f (X; θ) sup L (θ; X)
sup
θ∈Ω i=1
θ∈Ω
Λn (X)=
=
n
Q
sup L (θ; X)
sup
f (X; θ) θ∈Ω0
θ∈Ω0 i=1
is such that, under the hypothesis H0 : θ ∈ Ω0 ,
2 log Λn (X) →D W v χ2 (k − q).
Note: The number of degrees of freedom is the difference between the
number of parameters that need to be estimated in the general model,
and the number of parameters left to be estimated under the restrictions
imposed by H0 .
4.4.9
Example
Suppose X1 , . . . , Xn are independent POI(λ) random variables and independently Y1 , . . . , Yn are independent POI(μ) random variables.
(a) Find the likelihood ratio test statistic for testing H0 : λ = μ against the
alternative H1 : λ 6= μ.
(b) Find the approximate rejection region for a size α = 0.05 test. Be sure
to justify the approximation.
(c) Find the rejection region for the one-sided alternative H1 : λ < μ for a
size 0.05 test.
150
4.4.10
CHAPTER 4. HYPOTHESIS TESTS
Problem
Suppose (X1 , X2 ) ∼ MULT(n, θ1 , θ2 ).
(a) Find the likelihood ratio statistic for testing H0 : θ1 = θ2 = θ3 against
all alternatives.
(b) Find the approximate rejection region for a size 0.05 test. Be sure to
justify the approximation.
4.4.11
Problem
Suppose (X1 , X2 ) ∼ MULT(n, θ1 , θ2 ).
(a) Find the likelihood ratio statistic for testing H0 : θ1 = θ2 , θ2 = 2θ(1−θ)
against all alternatives.
(a) Find the approximate rejection region for a size 0.05 test. Be sure to
justify the approximation.
4.4.12
Problem
Suppose (X1 , Y1 ), . . . , (Xn , Yn ) is a random sample from the BVN(μ, Σ)
distribution with (μ, Σ) unknown.
(a) Find the likelihood ratio statistic for testing H0 : ρ = 0 against the
alternative H1 : ρ 6= 0.
(b) Find the approximate size 0.05 rejection region. Be sure to justify the
approximation.
4.4.13
Problem
Suppose in Problem 2.1.25 we wish to test the hypothesis that the data
arise from the assumed model. Show that the likelihood ratio statistic is
given by
µ ¶
k
P
Fi
Fi log
Λ=2
E
i
i=1
where Ei = npi (θ̂) and θ̂ is the M.L. estimator of θ. What is the asymptotic
distribution of Λ? Another test statistic which is commonly used is the
Pearson goodness of fit statistic given by
k (F − E )2
P
i
i
E
i
i=1
which also has an approximate χ2 distribution.
4.4. LIKELIHOOD RATIO TESTS
4.4.14
151
Problem
In Example 2.1.32 test the hypothesis that the data arise from the assumed
model using the likelihood ratio statistic. Compare this with the answer
that you obtain using the Pearson goodness of fit statistic.
4.4.15
Problem
In Example 2.1.33 test the hypothesis that the data arise from the assumed
model using the likelihood ratio statistic. Compare this with the answer
that you obtain using the Pearson goodness of fit statistic.
4.4.16
Problem
In Example 2.9.11 test the hypothesis that the data arise from the assumed
model using the likelihood ratio statistic. Compare this with the answer
that you obtain using the Pearson goodness of fit statistic.
4.4.17
Problem
Suppose we have n independent repetitions of an experiment in which each
outcome is classified according to whether event A occurred or not as well
as whether event B occurred or not. The observed data can be arranged
in a 2 × 2 contingency table as follows:
A
A
Total
B
f11
f21
c1
B
f12
f22
c2
Total
r1
r2
n
Find the likelihood ratio statistic for testing the hypothesis that the events
A and B are independent, that is, H0 : P (A ∩ B) = P (A)P (B).
4.4.18
Problem
Suppose E(Y ) = Xβ where Y = (Y1 , . . . , Yn )T is a vector of independent
and normally distributed random variables with V ar(Yi ) = σ 2 , i = 1, . . . , n,
X is a n × k matrix of known constants of rank k and β = (β1 , . . . , βk )T
is a vector of unknown parameters. Find the likelihood ratio statistic for
testing the hypothesis H0 : βi = 0 against the alternative H1 : βi 6= 0 where
βi is the ith element of β.
152
4.4.19
CHAPTER 4. HYPOTHESIS TESTS
Signed Square-root Likelihood Ratio Statistic
Suppose X = (X1 , . . . , Xn ) is a random sample from a regular statistical
T
model {f (x; θ) ; θ ∈ Ω} where θ = (θ1 , θ2 ) , θ1 is a scalar and Ω is an
k
open set in < . Suppose also that the null hypothesis is H0 : θ1 = θ10 .
Let θ̂ = (θ̂1 , θ̂2 ) be the maximum likelihood of θ and let θ̃ = (θ10 , θ̃2 (θ10 ))
where θ̃2 (θ10 ) is the maximum likelihood value of θ2 assuming θ1 = θ10 .
Then by Theorem 4.4.8
2 log Λn (X) = 2l(θ̂; X) − 2l(θ̃; X) →D W v χ2 (1)
under H0 or equivalently
h
i1/2
→D Z v N (0, 1)
2l(θ̂; X) − 2l(θ̃; X)
under H0 . The signed square-root likelihood ratio statistic defined by
h
i1/2
sign(θ̂1 − θ10 ) 2l(θ̂; X) − 2l(θ̃; X)
can be used to test one-sided alternatives such as H1 : θ1 > θ10 or H1 :
θ1 < θ10 . For example if the alternative hypothesis were H1 : θ1 > θ10 then
the rejection region for an approximate size 0.05 test would be given by
¾
½
h
i1/2
> 1.645 .
x; sign(θ̂1 − θ10 ) 2l(θ̂; X) − 2l(θ̃; X)
4.4.20
Problem - The Challenger Data
In Problem 2.8.9 test the hypothesis that β = 0. What would a sensible
alternative be? Describe in detail the null and the alternative hypotheses
that you have in mind and the relative costs of the two different kinds of
errors.
4.4.21
Significance Tests and p-values
We have seen that a test of hypothesis is a rule which allows us to decide
whether to accept the null hypothesis H0 or to reject it in favour of the
alternative hypothesis H1 based on the observed data. A test of significance
can be used in situations in which H1 is difficult to specify. A (pure) test
of significance is a procedure for measuring the strength of the evidence
provided by the observed data against H0 . This method usually involves
looking at the distribution of a test statistic or discrepancy measure T
under H0 . The p-value or significance level for the test is the probability,
4.5. SCORE AND MAXIMUM LIKELIHOOD TESTS
153
computed under H0 , of observing a T value at least as extreme as the value
observed. The smaller the observed p-value, the stronger the evidence
against H0 . The difficulty with this approach is how to find a statistic
with ‘good properties’. The likelihood ratio statistic provides a general test
statistic which may be used.
4.5
Score and Maximum Likelihood Tests
4.5.1
Score or Rao Tests
In Section 4.3 we saw that the locally most powerful test was a score test.
Score tests can be viewed as a more general class of tests of H0 : θ = θ0
against H1 : θ ∈ Ω−{θ0 }. If the usual regularity conditions hold then under
H0 : θ = θ0 we have
S(θ0 ; X)[J(θ0 )]−1/2 →D Z v N(0, 1).
and thus
R(X; θ0 ) = [S(θ0 ; X)]2 [J(θ0 )]−1 →D Y v χ2 (1).
For a vector θ = (θ1 , . . . , θk )T we have
R(X; θ0 ) = [S(θ0 ; X)]T [J(θ0 )]−1 S(θ0 ; X) →D Y v χ2 (k).
(4.4)
The corresponding rejection region is
R = {x; R(x; θ0 ) > c}
where c is determined by the size of the test, that is, c satisfies
P [R(X; θ0 ) > c; θ0 ] = α. An approximate value for c can be determined
using P (Y > c) = α where Y v χ2 (k).
The test based on R(X; θ0 ) is asymptotically equivalent to the likelihood
ratio test. In (4.4) J(θ0 ) may be replaced by I(θ0 ) for an asymptotically
equivalent test.
Such test statistics are called score or Rao test statistics.
4.5.2
Maximum Likelihood or Wald Tests
Suppose that θ̂ is the M.L. estimator of θ over all θ ∈ Ω and we wish to test
H0 : θ = θ0 against H1 : θ ∈ Ω − {θ0 }. If the usual regularity conditions
hold then under H0 : θ = θ0
W (X; θ0 ) = (θ̂ − θ0 )T J(θ0 )(θ̂ − θ0 ) →D Y v χ2 (k).
(4.5)
154
CHAPTER 4. HYPOTHESIS TESTS
The corresponding rejection region is
R = {x; W (x; θ0 ) > c}
where c is determined by the size of the test, that is, c satisfies
P [W (X; θ0 ) > c; θ0 ] = α. An approximate value for c can be determined
using P (Y > c) = α where Y v χ2 (k).
The test based on W (X; θ0 ) is asymptotically equivalent to the likelihood ratio test. In (4.5) J(θ0 ) may also be replaced by J(θ̂), I(θ0 ) or I(θ̂)
to obtain an asymptotically equivalent test statistic.
Such statistics are called maximum likelihood or Wald test statistics.
4.5.3
Example
Suppose X v POI (θ). Find the score test statistic (4.4) and the maximum
likelihood test statistic (4.5) for testing H0 : θ = θ0 against H1 : θ 6= θ0 .
4.5.4
Problem
Find the score test statistic (4.4) and the Wald test statistic (4.5) for testing
H0 : θ = θ0 against H1 : θ 6= θ0 based on a random sample (X1 , . . . , Xn )
from each of the following distributions:
(a) EXP(θ)
(b) BIN(n, θ)
(c) N(θ, σ 2 ), σ 2 known
(d) EXP(θ, μ), μ known
(e) GAM(α, θ), α known
4.5.5
Problem
Let (X1 , . . . , Xn ) be a random sample from the PAR(1, θ) distribution.
Find the score test statistic (4.4) and the maximum likelihood test statistic
(4.5) for testing H0 : θ = θ0 against H1 : θ 6= θ0 .
4.5.6
Problem
Suppose (X1 , . . . , Xn ) is a random sample from an exponential family model
{f (x; θ) ; θ ∈ Ω}. Show that the score test statistic (4.4) and the maximum
likelihood test statistic (4.5) for testing H0 : θ = θ0 against H1 : θ 6= θ0 are
identical if the maximum likelihood estimator of θ is a linear function of
the natural sufficient statistic.
4.6. BAYESIAN HYPOTHESIS TESTS
4.6
155
Bayesian Hypothesis Tests
Suppose we have two simple hypotheses H0 : θ = θ0 and H1 : θ = θ1 .
The prior probability that H0 is true is denoted by P (H0 ) and the prior
probability that H1 is true is P (H1 ) = 1 − P (H0 ). P (H0 )/P (H1 ) are the
prior odds. Suppose also that the data x have probability (density) function
f (x; θ). The posterior probability that Hi is true is denoted by P (Hi |x), i =
0, 1. The Bayesian aim in hypothesis testing is to determine the posterior
odds based on the data x given by
P (H0 |x)
P (H0 ) f (x; θ0 )
=
×
.
P (H1 |x)
P (H1 ) f (x; θ1 )
The ratio f (x; θ0 )/f (x; θ1 ) is called the Bayes factor. If P (H0 ) = P (H1 )
then the posterior odds are just a likelihood ratio. The Bayes factor measures how the data have changed the odds as to which hypothesis is true.
If the posterior odds were equal to q then a Bayesian would conclude that
H0 is q times more likely to be true than H1 . A Bayesian may also decide
to accept H0 rather than H1 if q is suitably large.
If we have two composite hypotheses H0 : θ ∈ Ω0 and H1 : θ ∈ Ω − Ω0
then a prior distribution for θ must be specified for each hypothesis. We
denote these by π0 (θ|H0 ) and π1 (θ|H1 ). In this case the posterior odds are
P (H0 |x)
P (H0 )
=
·B
P (H1 |x)
P (H1 )
where B is the Bayes factor given by
R
f (x; θ) π0 (θ|H0 )dθ
Ω0
.
B= R
f (x; θ) π1 (θ|H1 )dθ
Ω−Ω0
For the hypotheses H0 : θ = θ0 and H1 : θ 6= θ0 the Bayes factor is
B= R
f (x; θ0 )
.
f (x; θ) π1 (θ|H1 )dθ
θ6=θ0
4.6.1
Problem
Suppose (X1 , . . . , Xn ) is a random sample from a POI(θ) distribution and
we wish to test H0 : θ = θ0 against H1 : θ 6= θ0 . Find the Bayes factor if
under H1 the prior distribution for θ is the conjugate prior.
156
CHAPTER 4. HYPOTHESIS TESTS
Chapter 5
Appendix
5.1
5.1.1
Inequalities and Useful Results
Hölder’s Inequality
Suppose X and Y are random variables and p and q are positive numbers
satisfying
1 1
+ = 1.
p q
Then
1/p
|E (XY )| ≤ E (|XY |) ≤ [E (|X|p )]
[E (|Y |q )]
1/q
.
Letting Y = 1 we have
1/p
E (|X|) ≤ [E (|X|p )]
5.1.2
,
p > 1.
Covariance Inequality
If X and Y are random variables with variances σ12 and σ22 respectively
then
[Cov (X, Y )]2 ≤ σ12 σ22 .
5.1.3
Chebyshev’s Inequality
If X is a random variable with E(X) = μ and V ar(X) = σ 2 < ∞ then
P (|X − μ| ≥ k) ≤
for any k > 0.
157
σ2
.
k2
158
5.1.4
CHAPTER 5. APPENDIX
Jensen’s Inequality
If X is a random variable and g (x) is a convex function then
E [g (X)] ≥ g [E (X)] .
5.1.5
Corollary
If X is a non-degenerate random variable and g (x) is a strictly convex
function. Then
E [g (X)] > g [E (X)] .
5.1.6
Stirling’s Formula
For large n
Γ (n + 1) ≈
5.1.7
√
2πnn+1/2 e−n .
Matrix Differentiation
T
T
Suppose x = (x1 , . . . , xk ) , b = (b1 , . . . , bk ) and A is a k × k symmetric
matrix. Then
∙
¸T
∂ ¡ T ¢
∂ ¡ T ¢
∂ ¡ T ¢
=b
x b ,...,
x b
x b =
∂x
∂x1
∂xk
and
∙
¸T
∂ ¡ T ¢
∂ ¡ T ¢
∂ ¡ T ¢
= 2Ax.
x Ax , . . . ,
x Ax
x Ax =
∂x
∂x1
∂xk
5.2. DISTRIBUTIONAL RESULTS
5.2
5.2.1
159
Distributional Results
Functions of Random Variables
Univariate One-to-One Transformation
Suppose X is a continuous random variable with p.d.f. f (x) and support
set A. Let Y = h (X) be a real-valued, one-to-one function from A to B.
Then the probability density function of Y is
¯
¯
¯
¡ −1
¢ ¯ d −1
¯
g (y) = f h (y) · ¯ h (y)¯¯ , y ∈ B.
dy
Multivariate One-to-One Transformation
Suppose (X1 , . . . , Xn ) is a vector of random variables with joint p.d.f.
f (x1 , . . . , xn ) and support set RX . Suppose the transformation S defined
by
Ui = hi (X1 , . . . , Xn ), i = 1, . . . , n
is a one-to-one, real-valued transformation with inverse transformation
Xi = wi (U1 , . . . , Un ),
i = 1, . . . , n.
Suppose also that S maps RX into RU . Then g(u1 , . . . , un ), the joint p.d.f.
of (U1 , . . . , Un ) , is given by
¯
¯
¯ ∂(x1 , . . . , xn ) ¯
¯ , (u1 , . . . , un ) ∈ RU
g(u) = f (w1 (u), . . . , wn (u)) ¯¯
∂(u1 , . . . , un ) ¯
where
¯
¯
∂(x1 , . . . , xn ) ¯¯
=¯
∂(u1 , . . . , un ) ¯
¯
∂x1
∂u1
···
∂x1
∂un
∂xn
∂u1
···
∂xn
∂un
..
.
is the Jacobian of the transformation.
..
.
¯
¯ ∙
¸−1
¯
∂(u1 , . . . , un )
¯
¯=
¯
∂(x1 , . . . , xn )
¯
160
5.2.2
CHAPTER 5. APPENDIX
Order Statistic
The following results are derived in Casella and Berger, Section 5.4.
Joint Distribution of the Order Statistic
Suppose X1 , . . . , Xn is a random sample from a continuous distribution with
probability density function
probability density function
¡ f (x). The joint
¢
of the order statistic T = X(1) , . . . , X(n) = (T1 , . . . , Tn ) is
g (t1 , . . . , tn ) = n!
n
Q
i=1
f (ti ) ,
− ∞ < t1 < · · · < tn < ∞.
Distribution of the Maximum and the Minimum of a Vector of
Random Variables
Suppose X1 , . . . , Xn is a random sample from a continuous distribution
with probability density function f (x), support set A, and cumulative distribution function F (x).
The probability density function of U = X(i), i = 1, . . . , n, is

[n!/((i − 1)!(n − i)!)] f(u) [F(u)]^{i−1} [1 − F(u)]^{n−i},    u ∈ A.

In particular the probability density function of T = X(n) = max(X1, . . . , Xn) is

g1(t) = n f(t) [F(t)]^{n−1},    t ∈ A

and the probability density function of S = X(1) = min(X1, . . . , Xn) is

g2(s) = n f(s) [1 − F(s)]^{n−1},    s ∈ A.
The joint p.d.f. of U = X(i) and V = X(j) , 1 ≤ i < j ≤ n is given by
[n!/((i − 1)!(j − 1 − i)!(n − j)!)] f(u) f(v) [F(u)]^{i−1} [F(v) − F(u)]^{j−1−i} [1 − F(v)]^{n−j},    u < v, u ∈ A, v ∈ A.

In particular the joint probability density function of S = X(1) and T = X(n) is

g(s, t) = n(n − 1) f(s) f(t) [F(t) − F(s)]^{n−2},    s < t, s ∈ A, t ∈ A.

5.2.3 Problem
If Xi ∼ UNIF(a, b), i = 1, . . . , n independently, then show
(X(1) − a)/(b − a) ∼ BETA(1, n)    and    (X(n) − a)/(b − a) ∼ BETA(n, 1).
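A simulation sketch (added, not part of the notes) for the case a = 0, b = 1: the c.d.f. of BETA(1, n) is 1 − (1 − x)^n and the c.d.f. of BETA(n, 1) is x^n, so the empirical c.d.f.'s of X(1) and X(n) should be close to these values.

import numpy as np

rng = np.random.default_rng(2)
n, reps = 5, 100_000
u = rng.uniform(size=(reps, n))
x_min = u.min(axis=1)                 # X_(1) for each simulated sample of size n
x_max = u.max(axis=1)                 # X_(n)

# BETA(1, n) c.d.f. is 1 - (1 - x)^n; BETA(n, 1) c.d.f. is x^n.
for x in [0.1, 0.3, 0.5]:
    print(x, (x_min <= x).mean(), 1 - (1 - x) ** n, (x_max <= x).mean(), x ** n)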
5.2.4 Distribution of Sums of Random Variables
(1) If Xi ∼ POI(μi), i = 1, . . . , n independently, then Σ_{i=1}^n Xi ∼ POI(Σ_{i=1}^n μi).

(2) If Xi ∼ BIN(ni, p), i = 1, . . . , n independently, then Σ_{i=1}^n Xi ∼ BIN(Σ_{i=1}^n ni, p).

(3) If Xi ∼ NB(ki, p), i = 1, . . . , n independently, then Σ_{i=1}^n Xi ∼ NB(Σ_{i=1}^n ki, p).

(4) If Xi ∼ N(μi, σi^2), i = 1, . . . , n independently, then Σ_{i=1}^n ai Xi ∼ N(Σ_{i=1}^n ai μi, Σ_{i=1}^n ai^2 σi^2).

(5) If Xi ∼ N(μ, σ^2), i = 1, . . . , n independently, then Σ_{i=1}^n Xi ∼ N(nμ, nσ^2) and X̄ ∼ N(μ, σ^2/n).

(6) If Xi ∼ GAM(αi, β), i = 1, . . . , n independently, then Σ_{i=1}^n Xi ∼ GAM(Σ_{i=1}^n αi, β).

(7) If Xi ∼ GAM(1, β) = EXP(β), i = 1, . . . , n independently, then Σ_{i=1}^n Xi ∼ GAM(n, β).

(8) If Xi ∼ χ2(ki), i = 1, . . . , n independently, then Σ_{i=1}^n Xi ∼ χ2(Σ_{i=1}^n ki).

(9) If Xi ∼ GAM(αi, β), i = 1, . . . , n independently, where αi is a positive integer, then (2/β) Σ_{i=1}^n Xi ∼ χ2(2 Σ_{i=1}^n αi).

(10) If Xi ∼ N(μ, σ^2), i = 1, . . . , n independently, then Σ_{i=1}^n ((Xi − μ)/σ)^2 ∼ χ2(n).
5.2.5 Theorem - Properties of the Multinomial Distribution
Suppose (X1 , . . . , Xk ) ∼ MULT (n, p1 , . . . , pk ) with joint p.f.
f(x1, . . . , xk) = [n!/(x1! x2! · · · xk+1!)] p1^{x1} p2^{x2} · · · pk+1^{xk+1},

xi = 0, . . . , n, i = 1, . . . , k + 1, xk+1 = n − Σ_{i=1}^k xi, 0 < pi < 1, i = 1, . . . , k + 1, Σ_{i=1}^{k+1} pi = 1. Then

(1) (X1, . . . , Xk) has joint m.g.f.

M(t1, . . . , tk) = (p1 e^{t1} + · · · + pk e^{tk} + pk+1)^n,    (t1, . . . , tk) ∈ R^k.

(2) Any subset of X1, . . . , Xk+1 also has a multinomial distribution. In particular

Xi ∼ BIN(n, pi),    i = 1, . . . , k + 1.

(3) If T = Xi + Xj, i ≠ j, then

T ∼ BIN(n, pi + pj).

(4) Cov(Xi, Xj) = −n pi pj,    i ≠ j.

(5) The conditional distribution of any subset of (X1, . . . , Xk+1) given the rest of the coordinates is a multinomial distribution. In particular the conditional p.f. of Xi given Xj = xj, i ≠ j, is

Xi | Xj = xj ∼ BIN(n − xj, pi/(1 − pj)).

(6) The conditional distribution of Xi given T = Xi + Xj = t, i ≠ j, is

Xi | Xi + Xj = t ∼ BIN(t, pi/(pi + pj)).
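A simulation sketch (added, not part of the notes) of property (4): for (X1, X2, X3) ∼ MULT(n, p1, p2, p3), the sample covariance of X1 and X2 should be close to −n p1 p2.

import numpy as np

rng = np.random.default_rng(4)
n = 20
p = np.array([0.2, 0.3, 0.5])                   # k + 1 = 3 categories
counts = rng.multinomial(n, p, size=100_000)    # each row is (X1, X2, X3)

# Property (4): Cov(X1, X2) = -n * p1 * p2 = -1.2 here.
print(np.cov(counts[:, 0], counts[:, 1])[0, 1], -n * p[0] * p[1])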
5.2.6 Definition - Multivariate Normal Distribution
Let X = (X1, . . . , Xk)^T be a k × 1 random vector with E(Xi) = μi and Cov(Xi, Xj) = σij, i, j = 1, . . . , k. (Note: Cov(Xi, Xi) = σii = Var(Xi) = σi^2.) Let μ = (μ1, . . . , μk)^T be the mean vector and Σ be the k × k symmetric covariance matrix whose (i, j) entry is σij. Suppose also that Σ^{−1} exists. If the joint p.d.f. of (X1, . . . , Xk) is given by

f(x1, . . . , xk) = [1/((2π)^{k/2} |Σ|^{1/2})] exp[−(1/2)(x − μ)^T Σ^{−1}(x − μ)],    x ∈ R^k

where x = (x1, . . . , xk)^T, then X is said to have a multivariate normal distribution. We write X ∼ MVN(μ, Σ).
5.2.7 Theorem - Properties of the MVN Distribution
Suppose X = (X1, . . . , Xk)^T ∼ MVN(μ, Σ). Then

(1) X has joint m.g.f.

M(t) = exp(μ^T t + (1/2) t^T Σ t),    t = (t1, . . . , tk)^T ∈ R^k.

(2) Any subset of X1, . . . , Xk also has a MVN distribution and in particular Xi ∼ N(μi, σi^2), i = 1, . . . , k.

(3) (X − μ)^T Σ^{−1}(X − μ) ∼ χ2(k).

(4) Let c = (c1, . . . , ck)^T be a nonzero vector of constants; then

c^T X = Σ_{i=1}^k ci Xi ∼ N(c^T μ, c^T Σ c).

(5) Let A be a k × p matrix of constants of rank p; then

A^T X ∼ MVN(A^T μ, A^T Σ A).

(6) The conditional distribution of any subset of (X1, . . . , Xk) given the rest of the coordinates is a multivariate normal distribution. In particular the conditional p.d.f. of Xi given Xj = xj, i ≠ j, is

Xi | Xj = xj ∼ N(μi + ρij σi (xj − μj)/σj, (1 − ρij^2) σi^2).
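The following Python sketch (an added illustration, not from the notes) draws from a MVN(μ, Σ) distribution with an arbitrarily chosen μ and Σ and checks property (3): the quadratic form (X − μ)^T Σ^{−1}(X − μ) should behave like a χ2(3) variable, with mean 3 and 0.95 quantile about 7.81.

import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
x = rng.multivariate_normal(mu, Sigma, size=100_000)

# Property (3): (X - mu)^T Sigma^{-1} (X - mu) ~ chi-squared(k) with k = 3.
d = x - mu
q = np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma), d)
print(q.mean())               # should be close to 3
print((q <= 7.815).mean())    # should be close to 0.95 (7.815 is the chi-squared(3) 0.95 quantile)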
In the following figures the BVN joint p.d.f. is graphed. The graphs all have the same mean vector μ = [0 0]^T but different variance/covariance matrices Σ. The axes all have the same scale.
[Figure 5.1: Graph of BVN p.d.f. with μ = [0 0]^T and Σ = [1 0; 0 1].]
[Figure: Graph of BVN p.d.f. with μ = [0 0]^T and Σ = [1 0.5; 0.5 1].]
[Figure: Graph of BVN p.d.f. with μ = [0 0]^T and Σ = [0.6 0.5; 0.5 1].]
5.3 Limiting Distributions

5.3.1 Definition - Convergence in Probability to a Constant

The sequence of random variables X1, X2, . . . , Xn, . . . converges in probability to the constant c if for each ε > 0
lim_{n→∞} P(|Xn − c| ≥ ε) = 0.

We write Xn →p c.
5.3.2 Theorem

If X1, X2, . . . , Xn, . . . is a sequence of random variables such that

lim_{n→∞} P(Xn ≤ x) = 0 for x < b    and    lim_{n→∞} P(Xn ≤ x) = 1 for x > b

then Xn →p b.
5.3.3 Theorem - Weak Law of Large Numbers

If X1, . . . , Xn is a random sample from a distribution with E(Xi) = μ and Var(Xi) = σ^2 < ∞ then

X̄n = (1/n) Σ_{i=1}^n Xi →p μ.

5.3.4 Problem
Suppose X1, X2, . . . , Xn, . . . is a sequence of random variables such that E(Xn) = c and lim_{n→∞} Var(Xn) = 0. Show that Xn →p c.
5.3.5 Problem
Suppose X1, X2, . . . , Xn, . . . is a sequence of random variables such that Xn/n →p b < 0. Show that lim_{n→∞} P(Xn < 0) = 1.
5.3.6 Problem
Show that if Yn →p a and

lim_{n→∞} P(|Xn| ≤ Yn) = 1

then Xn is bounded in probability, that is, there exists b > 0 such that

lim_{n→∞} P(|Xn| ≤ b) = 1.
5.3.7 Definition - Convergence in Distribution
The sequence of random variables X1, X2, . . . , Xn, . . . converges in distribution to a random variable X if

lim_{n→∞} P(Xn ≤ x) = P(X ≤ x) = F(x)

for all values of x at which F(x) is continuous. We write Xn →D X.
5.3.8 Theorem

Suppose X1, . . . , Xn, . . . is a sequence of random variables with E(Xn) = μn and Var(Xn) = σn^2. If lim_{n→∞} μn = μ and lim_{n→∞} σn^2 = 0 then Xn →p μ.

5.3.9 Central Limit Theorem
If X1, . . . , Xn is a random sample from a distribution with E(Xi) = μ and Var(Xi) = σ^2 < ∞ then

Yn = (Σ_{i=1}^n Xi − nμ)/(√n σ) = √n(X̄n − μ)/σ →D Z ∼ N(0, 1).
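A simulation sketch (added, not part of the notes) for an EXP(1) sample, for which μ = σ = 1: the standardized mean √n(X̄n − μ)/σ should be approximately N(0, 1) for moderately large n.

import numpy as np

rng = np.random.default_rng(6)
n, reps = 30, 100_000
xbar = rng.exponential(size=(reps, n)).mean(axis=1)   # sample means of EXP(1) samples
z = np.sqrt(n) * (xbar - 1.0) / 1.0                   # EXP(1) has mu = sigma = 1

print(z.mean(), z.std())              # roughly 0 and 1
print((np.abs(z) <= 1.96).mean())     # roughly 0.95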
5.3.10 Limit Theorems
1. If Xn →p a and g is continuous at a, then g(Xn ) →p g(a).
2. If Xn →p a, Yn →p b and g(x, y) is continuous at (a, b) then
g(Xn , Yn ) →p g(a, b).
3. (Slutsky) If Xn →D X, Yn →p b and g(x, b) is continuous for all
x ∈ support of X then g(Xn , Yn ) →D g(X, b).
4. (Delta Method) If X1, X2, . . . , Xn, . . . is a sequence of random variables such that

   n^b (Xn − a) →D X

   for some b > 0 and if the function g(x) is differentiable at a with g′(a) ≠ 0 then

   n^b [g(Xn) − g(a)] →D g′(a) X.    (A numerical sketch of this result follows the list.)
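A numerical sketch (added here, not part of the notes) of the Delta Method with Xn = X̄n from an EXP(1) sample, a = 1, b = 1/2 and g(x) = log x: since √n(X̄n − 1) →D Z ∼ N(0, 1) and g′(1) = 1, √n log X̄n should be approximately N(0, 1).

import numpy as np

rng = np.random.default_rng(7)
n, reps = 200, 100_000
xbar = rng.exponential(size=(reps, n)).mean(axis=1)

# sqrt(n)*(Xbar - 1) ->D Z ~ N(0,1) for an EXP(1) sample, so with g = log and g'(1) = 1
# the Delta Method gives sqrt(n)*(log Xbar - log 1) ->D Z as well.
w = np.sqrt(n) * np.log(xbar)
print(w.mean(), w.std())              # roughly 0 and 1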
5.3.11 Problem
If Xn →p a > 0, Yn →p b ≠ 0 and Zn →D Z ∼ N(0, 1) then find the limiting distributions of

(1) Xn^2  (2) √Xn  (3) Xn Yn  (4) Xn + Yn  (5) Xn/Yn  (6) 2Zn  (7) Zn + Yn  (8) Xn Zn  (9) Zn^2  (10) 1/Zn
5.4 Proofs

5.4.1 Theorem
Suppose the model is {f (x; θ) ; θ ∈ Ω} and let A = support of X. Partition
A into the equivalence classes defined by
Ay = {x : f(x; θ)/f(y; θ) = H(x, y) for all θ ∈ Ω},    y ∈ A.    (5.1)
This is a minimal sufficient partition. The statistic T (X) which induces
this partition is a minimal sufficient statistic.
5.4.2 Proof
We give the proof for the case in which A does not depend on θ.
Let T (X) be the statistic which induces the partition in (5.1). To show
that T (X) is sufficient we define
B = {t : t = T (x) for some x ∈ A} .
Then the set A can be written as
A = ∪_{t∈B} At
where
At = {x : T (x) = t} , t ∈ B.
The statistic T (X) induces the partition defined by At , t ∈ B. For each
At we can choose and fix one element, xt ∈ A. Obviously T (xt ) = t. Let
g (t; θ) be a function defined on B such that
g (t; θ) = f (xt ; θ) , t ∈ B.
Consider any x ∈ A. For this x we can calculate T (x) = t and thus
determine the set At to which x belongs as well as the value xt which was
chosen for this set. Obviously T (x) = T (xt ). By the definition of the
partition induced by T (X), we know that for all x ∈ At , f (x; θ) /f (xt ; θ) is
a constant function of θ. Therefore for any x ∈ A we can define a function
h(x) = f(x; θ)/f(xt; θ)

where T(x) = T(xt) = t.
Therefore for all x ∈ A and θ ∈ Ω we have
f(x; θ) = f(xt; θ) [f(x; θ)/f(xt; θ)]
        = g(t; θ) h(x)
        = g(T(xt); θ) h(x)
        = g(T(x); θ) h(x)

and by the Factorization Criterion for Sufficiency, T(X) is a sufficient statistic.
To show that T (X) is a minimal sufficient statistic suppose that T1 (X)
is any other sufficient statistic. By the Factorization Criterion for Sufficiency,
there exist functions h1 (x) and g1 (t; θ) such that
f (x; θ) = g1 (T1 (x) ; θ) h1 (x)
for all x ∈ A and θ ∈ Ω. Let x and y be any two points in A with
T1 (x) = T1 (y). Then
f(x; θ)/f(y; θ) = [g1(T1(x); θ) h1(x)]/[g1(T1(y); θ) h1(y)]
               = h1(x)/h1(y)
               = a function of x and y which does not depend on θ
and therefore by the definition of T (X) this implies T (x) = T (y). This
implies that T1 induces either the same partition of A as T (X) or it induces
a finer partition of A than T (X) and therefore T (X) is a function of T1 (X).
Since T(X) is a function of every other sufficient statistic, T(X) is a minimal sufficient statistic.
5.4.3 Theorem
If T (X) is a complete sufficient statistic for the model {f (x; θ) ; θ ∈ Ω}
then T (X) is a minimal sufficient statistic for {f (x; θ) ; θ ∈ Ω}.
5.4.4 Proof
Suppose U = U (X) is a minimal sufficient statistic for the model
{f (x; θ) ; θ ∈ Ω}. The function E (T |U ) is a function of U which does not
depend on θ since U is a sufficient statistic. Also by Definition 1.5.2, U is
a function of the sufficient statistic T which implies E (T |U ) is a function
of T .
Let
h (T ) = T − E (T |U ) .
Now
E[h(T); θ] = E[T − E(T|U); θ]
           = E(T; θ) − E[E(T|U); θ]
           = E(T; θ) − E(T; θ)
           = 0,    for all θ ∈ Ω.
Since T is complete this implies
P [h (T ) = 0; θ] = 1,
for all θ ∈ Ω
P [T = E (T |U )] = 1,
for all θ ∈ Ω
or
and therefore T is a function of U. This can only be true if T is also a minimal sufficient statistic for the model.
The regularity conditions are repeated here since they are used in the
proofs that follow.
5.4.5 Regularity Conditions
Consider the model {f (x; θ) ; θ ∈ Ω}. Suppose that:
(R1) The parameter space Ω is an open interval in the real line.
(R2) The densities f (x; θ) have common support, so that the set
A = {x; f (x; θ) > 0} , does not depend on θ.
(R3) For all x ∈ A, f (x; θ) is a continuous, three times differentiable function of θ.
(R4) The integral ∫_A f(x; θ) dx can be twice differentiated with respect to θ under the integral sign, that is,

∂^k/∂θ^k ∫_A f(x; θ) dx = ∫_A ∂^k/∂θ^k f(x; θ) dx,    k = 1, 2, for all θ ∈ Ω.
(R5) For each θ0 ∈ Ω there exist a positive number c and a function M(x) (both of which may depend on θ0) such that

|∂^3 log f(x; θ)/∂θ^3| < M(x)

holds for all x ∈ A and all θ ∈ (θ0 − c, θ0 + c), and

E[M(X); θ] < ∞ for all θ ∈ (θ0 − c, θ0 + c).
(R6) For each θ ∈ Ω,

0 < E{[∂^2 log f(X; θ)/∂θ^2]^2; θ} < ∞.
(R7) The probability (density) functions corresponding to different values of the parameters are distinct, that is, θ ≠ θ* =⇒ f(x; θ) ≠ f(x; θ*).
The following lemma is required for the proof of consistency of the M.L.
estimator.
5.4.6 Lemma

If X is a non-degenerate random variable with model {f(x; θ); θ ∈ Ω} satisfying (R1)−(R7) then

E[log f(X; θ) − log f(X; θ0); θ0] < 0    for all θ, θ0 ∈ Ω, and θ ≠ θ0.

5.4.7 Proof
Since g (x) = − log x is strictly convex and X is a non-degenerate random
variable then by the corollary to Jensen’s inequality
E[log f(X; θ) − log f(X; θ0); θ0] = E{log[f(X; θ)/f(X; θ0)]; θ0}
                                  < log E[f(X; θ)/f(X; θ0); θ0]

for all θ, θ0 ∈ Ω, and θ ≠ θ0.
Since
E[f(X; θ)/f(X; θ0); θ0] = ∫_A [f(x; θ)/f(x; θ0)] f(x; θ0) dx = ∫_A f(x; θ) dx = 1,    for all θ ∈ Ω

therefore

E[l(θ; X) − l(θ0; X); θ0] < log(1) = 0,    for all θ, θ0 ∈ Ω, and θ ≠ θ0.
5.4.8 Theorem
Suppose (X1 , . . . , Xn ) is a random sample from a model {f (x; θ) ; θ ∈ Ω}
satisfying regularity conditions (R1) − (R7). Then with probability tending
to 1 as n → ∞, the likelihood equation or score equation
Σ_{i=1}^n ∂/∂θ log f(Xi; θ) = 0
has a root θ̂n such that θ̂n converges in probability to θ0 , the true value of
the parameter, as n → ∞.
5.4.9 Proof
Let
ln(θ; X) = ln(θ; X1, . . . , Xn) = log[∏_{i=1}^n f(Xi; θ)] = Σ_{i=1}^n log f(Xi; θ),    θ ∈ Ω.
Since f (x; θ) is differentiable with respect to θ for all θ ∈ Ω this implies
ln (θ; x) is differentiable with respect to θ for all θ ∈ Ω and also ln (θ; x) is
a continuous function of θ for all θ ∈ Ω.
By the above lemma we have for any δ > 0 such that θ0 ± δ ∈ Ω,
E[ln(θ0 + δ; X) − ln(θ0; X); θ0] < 0    (5.2)

and

E[ln(θ0 − δ; X) − ln(θ0; X); θ0] < 0.    (5.3)
By (5.2) and the WLLN

(1/n)[ln(θ0 + δ; X) − ln(θ0; X)] →p b < 0
which implies
lim_{n→∞} P[ln(θ0 + δ; X) − ln(θ0; X) < 0] = 1
(see Problem 5.3.5). Therefore there exists a sequence of constants {an }
such that 0 < an < 1, lim_{n→∞} an = 0 and
P [ln (θ0 + δ; X) − ln (θ0 ; X) < 0] = 1 − an .
Let
An = An (δ) = {x : ln (θ0 + δ; x) − ln (θ0 ; x) < 0}
where x = (x1 , . . . , xn ). Then
lim_{n→∞} P(An; θ0) = lim_{n→∞}(1 − an) = 1.
Let
Bn = Bn (δ) = {x : ln (θ0 − δ; x) − ln (θ0 ; x) < 0} .
Then by the same argument as above there exists a sequence of constants {bn} such that 0 < bn < 1, lim_{n→∞} bn = 0 and

lim_{n→∞} P(Bn; θ0) = lim_{n→∞}(1 − bn) = 1.
Now
P(An ∩ Bn; θ0) = P(An; θ0) + P(Bn; θ0) − P(An ∪ Bn; θ0)
              = 1 − an + 1 − bn − P(An ∪ Bn; θ0)
              = 1 − an − bn + [1 − P(An ∪ Bn; θ0)]
              ≥ 1 − an − bn

since 1 − P(An ∪ Bn; θ0) ≥ 0. Therefore

lim_{n→∞} P(An ∩ Bn; θ0) = lim_{n→∞}(1 − an − bn) = 1.    (5.4)
Continuity of ln (θ; x) for all θ ∈ Ω implies that for any x ∈ An ∩ Bn ,
there exists a value θ̂n (δ) = θ̂n (δ; x) ∈ (θ0 − δ, θ0 + δ) such that ln (θ; x)
has a local maximum at θ = θ̂n (δ). Since ln (θ; x) is differentiable with
respect to θ this implies (Fermat’s theorem)
∂ln(θ; x)/∂θ = Σ_{i=1}^n ∂/∂θ log f(xi; θ) = 0    for θ = θ̂n(δ).
Note that ln(θ; x) may have more than one local maximum on the interval (θ0 − δ, θ0 + δ) and therefore θ̂n(δ) may not be unique. If x ∉ An ∩ Bn, then θ̂n(δ) may not exist, in which case we define θ̂n(δ) to be a fixed arbitrary value. Note also that the sequence of roots {θ̂n(δ)} depends on δ.
Let θ̂n = θ̂n (x) be the value of θ closest to θ0 such that
∂ln (θ; x) /∂θ = 0. If such a root does not exist we define θ̂n to be a fixed
arbitrary value. Since
1 ≥ P[θ̂n ∈ (θ0 − δ, θ0 + δ); θ0] ≥ P[θ̂n(δ) ∈ (θ0 − δ, θ0 + δ); θ0] ≥ P(An ∩ Bn; θ0)    (5.5)
then by (5.4) and the Squeeze Theorem we have
lim_{n→∞} P[θ̂n ∈ (θ0 − δ, θ0 + δ); θ0] = 1.
Since this is true for all δ > 0, θ̂n →p θ0 .
5.4.10 Theorem
Suppose (R1) − (R7) hold. Suppose θ̂n is a consistent root of the likelihood
equation as in Theorem 5.4.8. Then
√(J(θ0)) (θ̂n − θ0) →D Z ∼ N(0, 1)
where θ0 is the true value of the parameter.
5.4.11 Proof
Let
S1(θ; x) = ∂/∂θ log f(x; θ)

and

I1(θ; x) = −∂/∂θ S1(θ; x) = −∂^2/∂θ^2 log f(x; θ)
be the score and information functions respectively for one observation from
{f (x; θ) ; θ ∈ Ω}. Since {f (x; θ) ; θ ∈ Ω} is a regular model
E[S1(θ; X); θ] = 0,    θ ∈ Ω    (5.6)

and

Var[S1(θ; X); θ] = E[I1(θ; X); θ] = J1(θ) < ∞,    θ ∈ Ω.    (5.7)
Let
An = {(x1, . . . , xn) : the equation Σ_{i=1}^n ∂/∂θ log f(xi; θ) = Σ_{i=1}^n S1(θ; xi) = 0 has a solution}

and for (x1, . . . , xn) ∈ An, let θ̂n = θ̂n(x1, . . . , xn) be the value of θ such that Σ_{i=1}^n S1(θ̂n; xi) = 0.
Expand Σ_{i=1}^n S1(θ̂n; xi) as a function of θ̂n about θ0 to obtain

Σ_{i=1}^n S1(θ̂n; xi) = Σ_{i=1}^n S1(θ0; xi) − (θ̂n − θ0) Σ_{i=1}^n I1(θ0; xi) + (1/2)(θ̂n − θ0)^2 Σ_{i=1}^n ∂^3/∂θ^3 log f(xi; θ)|_{θ=θn*}    (5.8)

where θn* = θn*(x1, . . . , xn) lies between θ0 and θ̂n by Taylor's Theorem.
Suppose (x1 , . . . , xn ) ∈ An . Then the left side of (5.8) equals zero and
thus
Σ_{i=1}^n S1(θ0; xi) = (θ̂n − θ0) Σ_{i=1}^n I1(θ0; xi) − (1/2)(θ̂n − θ0)^2 Σ_{i=1}^n ∂^3/∂θ^3 log f(xi; θ)|_{θ=θn*}

or

Σ_{i=1}^n S1(θ0; xi)/√(nJ1(θ0))
    = [(θ̂n − θ0)/√(nJ1(θ0))] [Σ_{i=1}^n I1(θ0; xi) − (1/2)(θ̂n − θ0)^2 Σ_{i=1}^n ∂^3/∂θ^3 log f(xi; θ)|_{θ=θn*}]
    = √(J(θ0)) (θ̂n − θ0) [ ((1/n) Σ_{i=1}^n I1(θ0; xi))/J1(θ0) − ((θ̂n − θ0)^2/2) ((1/n) Σ_{i=1}^n ∂^3/∂θ^3 log f(xi; θ)|_{θ=θn*})/J1(θ0) ]
Therefore for (X1 , . . . , Xn ) we have
[Σ_{i=1}^n S1(θ0; Xi)/√(nJ1(θ0))] I{(X1, . . . , Xn) ∈ An}
    = √(J(θ0)) (θ̂n − θ0) [ ((1/n) Σ_{i=1}^n I1(θ0; Xi))/J1(θ0) − ((θ̂n − θ0)^2/(2J1(θ0))) (1/n) Σ_{i=1}^n ∂^3/∂θ^3 log f(Xi; θ)|_{θ=θn*} ] I{(X1, . . . , Xn) ∈ An}    (5.9)
where θn* = θn*(X1, . . . , Xn). By an argument similar to that used in Proof 5.4.9,

lim_{n→∞} P[(X1, . . . , Xn) ∈ An; θ0] = 1.    (5.10)
Since S1 (θ0 ; Xi ), i = 1, . . . , n are i.i.d. random variables with mean and
variance given by (5.6) and (5.7) then by the CLT
Σ_{i=1}^n S1(θ0; Xi)/√(nJ1(θ0)) →D Z ∼ N(0, 1).    (5.11)
Since I1(θ0; Xi), i = 1, . . . , n are i.i.d. random variables with mean J1(θ0) then by the WLLN

(1/n) Σ_{i=1}^n I1(θ0; Xi) →p J1(θ0)

and thus

((1/n) Σ_{i=1}^n I1(θ0; Xi))/J1(θ0) →p 1    (5.12)

by the Limit Theorems.

To complete the proof we need to show

(θ̂n − θ0)^2 [(1/n) Σ_{i=1}^n ∂^3/∂θ^3 log f(Xi; θ)|_{θ=θn*}] →p 0.    (5.13)
Since θ̂n →p θ0 we only need to show that
(1/n) Σ_{i=1}^n ∂^3/∂θ^3 log f(Xi; θ)|_{θ=θn*}    (5.14)
is bounded in probability. Since θ̂n →p θ0 implies θn∗ →p θ0 then by (R5)
lim_{n→∞} P{ |(1/n) Σ_{i=1}^n ∂^3/∂θ^3 log f(Xi; θ)|_{θ=θn*}| ≤ (1/n) Σ_{i=1}^n M(Xi); θ0 } = 1.
Also by (R5) and the WLLN
(1/n) Σ_{i=1}^n M(Xi) →p E[M(X); θ0] < ∞.
It follows that (5.14) is bounded in probability (see Problem 5.3.6).
Therefore
√(J(θ0)) (θ̂n − θ0) →D Z ∼ N(0, 1)
follows from (5.9), (5.11)-(5.13) and Slutsky’s Theorem.
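To illustrate Theorem 5.4.10 numerically (an added sketch, not part of the proof), take X1, . . . , Xn i.i.d. EXP(θ) with true value θ0 = 2. The score equation gives θ̂n = X̄n and J(θ0) = n/θ0^2, so √(J(θ0))(θ̂n − θ0) = √n(X̄n − θ0)/θ0 should be approximately N(0, 1).

import numpy as np

rng = np.random.default_rng(8)
theta0, n, reps = 2.0, 100, 100_000
theta_hat = rng.exponential(scale=theta0, size=(reps, n)).mean(axis=1)   # M.L. estimate is the sample mean

# For EXP(theta), J1(theta) = 1/theta^2, so sqrt(J(theta0))*(theta_hat - theta0)
# = sqrt(n)*(Xbar - theta0)/theta0 should be approximately N(0, 1).
z = np.sqrt(n) * (theta_hat - theta0) / theta0
print(z.mean(), z.std())              # roughly 0 and 1
print((np.abs(z) <= 1.96).mean())     # roughly 0.95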
Special Discrete Distributions

Binomial: X ∼ BIN(n, p), 0 < p < 1, q = 1 − p
  p.f.: (n choose x) p^x q^{n−x}, x = 0, 1, . . . , n
  Mean: np    Variance: npq    m.g.f.: (pe^t + q)^n

Bernoulli: X ∼ Bernoulli(p), 0 < p < 1, q = 1 − p
  p.f.: p^x q^{1−x}, x = 0, 1
  Mean: p    Variance: pq    m.g.f.: pe^t + q

Negative Binomial: X ∼ NB(k, p), 0 < p < 1, q = 1 − p
  p.f.: (−k choose x) p^k (−q)^x, x = 0, 1, . . .
  Mean: kq/p    Variance: kq/p^2    m.g.f.: [p/(1 − qe^t)]^k, t < − log q

Geometric: X ∼ GEO(p), 0 < p < 1, q = 1 − p
  p.f.: p q^x, x = 0, 1, . . .
  Mean: q/p    Variance: q/p^2    m.g.f.: p/(1 − qe^t), t < − log q
Hypergeometric: X ∼ HYP(n, M, N), n = 1, 2, . . . , N, M = 0, 1, . . . , N
  p.f.: (M choose x)(N − M choose n − x)/(N choose n), x = 0, 1, . . . , n
  Mean: nM/N    Variance: n(M/N)(1 − M/N)(N − n)/(N − 1)    m.g.f.: * (not tractable)

Poisson: X ∼ POI(μ), μ > 0
  p.f.: e^{−μ} μ^x/x!, x = 0, 1, . . .
  Mean: μ    Variance: μ    m.g.f.: e^{μ(e^t − 1)}

Discrete Uniform: X ∼ DU(N), N = 1, 2, . . .
  p.f.: 1/N, x = 1, 2, . . . , N
  Mean: (N + 1)/2    Variance: (N^2 − 1)/12    m.g.f.: (1/N)(e^t − e^{(N+1)t})/(1 − e^t), t ≠ 0

* Not tractable.
Special Continuous Distributions

Uniform: X ∼ UNIF(a, b), a < b
  p.d.f.: 1/(b − a), a ≤ x ≤ b
  Mean: (a + b)/2    Variance: (b − a)^2/12    m.g.f.: (e^{bt} − e^{at})/[(b − a)t], t ≠ 0

Normal: X ∼ N(μ, σ^2), σ^2 > 0
  p.d.f.: [1/(√(2π) σ)] e^{−(x−μ)^2/(2σ^2)}
  Mean: μ    Variance: σ^2    m.g.f.: e^{μt + σ^2 t^2/2}

Gamma: X ∼ GAM(α, β), α > 0, β > 0
  p.d.f.: [1/(β^α Γ(α))] x^{α−1} e^{−x/β}, x > 0
  Mean: αβ    Variance: αβ^2    m.g.f.: (1 − βt)^{−α}, t < 1/β

Inverted Gamma: X ∼ IG(α, β), α > 0, β > 0
  p.d.f.: [1/(β^α Γ(α))] x^{−α−1} e^{−1/(βx)}, x > 0
  Mean: 1/[β(α − 1)], α > 1    Variance: 1/[β^2 (α − 1)^2 (α − 2)], α > 2    m.g.f.: * (not tractable)
Exponential: X ∼ EXP(θ), θ > 0
  p.d.f.: (1/θ) e^{−x/θ}, x ≥ 0
  Mean: θ    Variance: θ^2    m.g.f.: (1 − θt)^{−1}, t < 1/θ

Two-Parameter Exponential: X ∼ EXP(θ, μ), θ > 0
  p.d.f.: (1/θ) e^{−(x−μ)/θ}, x ≥ μ
  Mean: μ + θ    Variance: θ^2    m.g.f.: e^{μt}(1 − θt)^{−1}, t < 1/θ

Double Exponential: X ∼ DE(θ, μ), θ > 0
  p.d.f.: [1/(2θ)] e^{−|x−μ|/θ}
  Mean: μ    Variance: 2θ^2    m.g.f.: e^{μt}(1 − θ^2 t^2)^{−1}, |t| < 1/θ

Weibull: X ∼ WEI(θ, β), θ > 0, β > 0
  p.d.f.: (β/θ^β) x^{β−1} e^{−(x/θ)^β}, x > 0
  Mean: θ Γ(1 + 1/β)    Variance: θ^2 [Γ(1 + 2/β) − Γ^2(1 + 1/β)]    m.g.f.: * (not tractable)

* Not tractable.
Extreme Value: X ∼ EV(θ, μ), θ > 0
  p.d.f.: (1/θ) e^{(x−μ)/θ − e^{(x−μ)/θ}}
  Mean: μ − γθ, where γ ≈ 0.5772 is Euler's constant    Variance: π^2 θ^2/6    m.g.f.: e^{μt} Γ(1 + θt), t > −1/θ

Cauchy: X ∼ CAU(θ, μ), θ > 0
  p.d.f.: 1/{πθ[1 + ((x − μ)/θ)^2]}
  Mean: **    Variance: **    m.g.f.: **

Pareto: X ∼ PAR(θ, α), θ > 0, α > 0
  p.d.f.: αθ^α/x^{α+1}, x ≥ θ
  Mean: αθ/(α − 1), α > 1    Variance: αθ^2/[(α − 1)^2 (α − 2)], α > 2    m.g.f.: **

Logistic: X ∼ LOG(θ, μ), θ > 0
  p.d.f.: (1/θ) e^{−(x−μ)/θ}/[1 + e^{−(x−μ)/θ}]^2
  Mean: μ    Variance: π^2 θ^2/3    m.g.f.: e^{μt} Γ(1 − θt) Γ(1 + θt), |t| < 1/θ

** Does not exist.
Chi-Squared: X ∼ χ^2(ν), ν = 1, 2, . . .
  p.d.f.: x^{ν/2−1} e^{−x/2}/[2^{ν/2} Γ(ν/2)], x > 0
  Mean: ν    Variance: 2ν    m.g.f.: (1 − 2t)^{−ν/2}, t < 1/2

Student's t: X ∼ t(ν), ν = 1, 2, . . .
  p.d.f.: {Γ((ν + 1)/2)/[√(νπ) Γ(ν/2)]} (1 + x^2/ν)^{−(ν+1)/2}
  Mean: 0, ν > 1    Variance: ν/(ν − 2), ν > 2    m.g.f.: **

Snedecor's F: X ∼ F(ν1, ν2), ν1 = 1, 2, . . . , ν2 = 1, 2, . . .
  p.d.f.: {Γ((ν1 + ν2)/2)/[Γ(ν1/2) Γ(ν2/2)]} (ν1/ν2)^{ν1/2} x^{ν1/2−1} (1 + ν1 x/ν2)^{−(ν1+ν2)/2}, x > 0
  Mean: ν2/(ν2 − 2), ν2 > 2    Variance: 2ν2^2 (ν1 + ν2 − 2)/[ν1 (ν2 − 2)^2 (ν2 − 4)], ν2 > 4    m.g.f.: **

Beta: X ∼ BETA(a, b), a > 0, b > 0
  p.d.f.: [Γ(a + b)/(Γ(a)Γ(b))] x^{a−1} (1 − x)^{b−1}, 0 < x < 1
  Mean: a/(a + b)    Variance: ab/[(a + b + 1)(a + b)^2]    m.g.f.: * (not tractable)

* Not tractable.    ** Does not exist.
Special Multivariate Distributions

Multinomial: X = (X1, X2, . . . , Xk) ∼ MULT(n, p1, . . . , pk)
  p.f.: f(x1, . . . , xk) = [n!/(x1! x2! · · · xk+1!)] p1^{x1} p2^{x2} · · · pk+1^{xk+1},
        0 ≤ xi ≤ n, xk+1 = n − Σ_{i=1}^k xi, 0 < pi < 1, Σ_{i=1}^{k+1} pi = 1
  m.g.f.: (p1 e^{t1} + · · · + pk e^{tk} + pk+1)^n

Bivariate Normal: X = (X1, X2)^T ∼ BVN(μ, Σ), σ1 > 0, σ2 > 0, −1 < ρ < 1, with
        μ = (μ1, μ2)^T and Σ = [σ1^2  ρσ1σ2; ρσ1σ2  σ2^2]
  p.d.f.: f(x1, x2) = [1/(2πσ1σ2 √(1 − ρ^2))] exp{−[1/(2(1 − ρ^2))] [((x1 − μ1)/σ1)^2 − 2ρ((x1 − μ1)/σ1)((x2 − μ2)/σ2) + ((x2 − μ2)/σ2)^2]}
        = [1/(2π|Σ|^{1/2})] exp[−(1/2)(x − μ)^T Σ^{−1}(x − μ)]
  m.g.f.: exp(μ^T t + (1/2) t^T Σ t), t = (t1, t2)^T