Download lecture notes on statistical inference

Document related concepts

Probability wikipedia , lookup

History of statistics wikipedia , lookup

Statistics wikipedia , lookup

Transcript
L ECTURE N OTES ON
S TATISTICAL I NFERENCE
K RZYSZTOF P ODG ÓRSKI
Department of Mathematics and Statistics
University of Limerick, Ireland
November 23, 2009
Contents
1
Introduction
4
1.1
Models of Randomness and Statistical Inference . . . . . . . . . . . .
4
1.2
Motivating Example
. . . . . . . . . . . . . . . . . . . . . . . . . .
6
1.2.1
Probability vs. likelihood . . . . . . . . . . . . . . . . . . . .
8
1.2.2
More data . . . . . . . . . . . . . . . . . . . . . . . . . . . .
12
1.3
Likelihood and theory of statistics . . . . . . . . . . . . . . . . . . .
15
1.4
Computationally intensive methods of statistics . . . . . . . . . . . .
15
1.4.1
1.4.2
2
Monte Carlo methods – studying statistical methods using computer generated random samples . . . . . . . . . . . . . . . .
16
Bootstrap – performing statistical inference using computers .
18
Review of Probability
21
2.1
Expectation and Variance . . . . . . . . . . . . . . . . . . . . . . . .
21
2.2
Distribution of a Function of a Random Variable . . . . . . . . . . . .
22
2.3
Transforms Method Characteristic, Probability Generating and Mo-
2.4
2.5
ment Generating Functions . . . . . . . . . . . . . . . . . . . . . . .
24
Random Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
26
2.4.1
Sums of Independent Random Variables . . . . . . . . . . . .
26
2.4.2
Covariance and Correlation . . . . . . . . . . . . . . . . . .
27
2.4.3
The Bivariate Change of Variables Formula . . . . . . . . . .
28
Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . .
29
2.5.1
29
Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . .
1
2.6
2.7
3
4
5
2.5.2
Binomial Distribution . . . . . . . . . . . . . . . . . . . . .
29
2.5.3
Negative Binomial and Geometric Distribution . . . . . . . .
30
2.5.4
Hypergeometric Distribution . . . . . . . . . . . . . . . . .
31
2.5.5
Poisson Distribution . . . . . . . . . . . . . . . . . . . . . .
32
2.5.6
Discrete Uniform Distribution . . . . . . . . . . . . . . . . .
33
2.5.7
The Multinomial Distribution . . . . . . . . . . . . . . . . .
33
Continuous Random Variables . . . . . . . . . . . . . . . . . . . . .
34
2.6.1
Uniform Distribution . . . . . . . . . . . . . . . . . . . . . .
34
2.6.2
Exponential Distribution . . . . . . . . . . . . . . . . . . . .
35
2.6.3
Gamma Distribution . . . . . . . . . . . . . . . . . . . . . .
35
2.6.4
Gaussian (Normal) Distribution . . . . . . . . . . . . . . . .
36
2.6.5
Weibull Distribution . . . . . . . . . . . . . . . . . . . . . .
38
2.6.6
Beta Distribution . . . . . . . . . . . . . . . . . . . . . . . .
38
2.6.7
Chi-square Distribution . . . . . . . . . . . . . . . . . . . . .
39
2.6.8
The Bivariate Normal Distribution . . . . . . . . . . . . . . .
39
2.6.9
The Multivariate Normal Distribution . . . . . . . . . . . . .
40
Distributions – further properties . . . . . . . . . . . . . . . . . . . .
42
2.7.1
Sum of Independent Random Variables – special cases . . . .
42
2.7.2
Common Distributions – Summarizing Tables
45
. . . . . . . .
Likelihood
48
3.1
Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . .
48
3.2
Multi-parameter Estimation . . . . . . . . . . . . . . . . . . . . . . .
55
3.3
The Invariance Principle . . . . . . . . . . . . . . . . . . . . . . . .
59
Estimation
61
4.1
General properties of estimators . . . . . . . . . . . . . . . . . . . .
61
4.2
Minimum-Variance Unbiased Estimation . . . . . . . . . . . . . . . .
64
4.3
Optimality Properties of the MLE . . . . . . . . . . . . . . . . . . .
69
The Theory of Confidence Intervals
71
5.1
71
Exact Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . .
2
6
5.2
Pivotal Quantities for Use with Normal Data . . . . . . . . . . . . . .
75
5.3
Approximate Confidence Intervals . . . . . . . . . . . . . . . . . . .
80
The Theory of Hypothesis Testing
87
6.1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
87
6.2
Hypothesis Testing for Normal Data . . . . . . . . . . . . . . . . . .
92
6.3
Generally Applicable Test Procedures . . . . . . . . . . . . . . . . .
97
6.4
The Neyman-Pearson Lemma . . . . . . . . . . . . . . . . . . . . .
101
6.5
Goodness of Fit Tests . . . . . . . . . . . . . . . . . . . . . . . . . .
106
6.6
The χ2 Test for Contingency Tables . . . . . . . . . . . . . . . . . .
109
3
Chapter 1
Introduction
Everything existing in the universe is the fruit of chance.
Democritus, the 5th Century BC
1.1
Models of Randomness and Statistical Inference
Statistics is a discipline that provides with a methodology allowing to make an inference from real random data on parameters of probabilistic models that are believed to
generate such data. The position of statistics with relation to real world data and corresponding mathematical models of the probability theory is presented in the following
diagram.
The following is the list of few from plenty phenomena to which randomness is
attributed.
• Games of chance
– Tossing a coin
– Rolling a die
– Playing Poker
• Natural Sciences
4
Real World -
Science & Mathematics
?
?
Random Phenomena Probability Theory
-
?
?
Data – Samples
H
HH
H
-
HH
H
HH
H
j
Models
?
Statistics
?
Prediction and Discovery Statistical Inference
Figure 1.1: Position of statistics in the context of real world phenomena and mathematical models representing them.
5
– Physics (notable Quantum Physics)
– Genetics
– Climate
• Engineering
– Risk and safety analysis
– Ocean engineering
• Economics and Social Sciences
– Currency exchange rates
– Stock market fluctations
– Insurance claims
– Polls and election results
• etc.
1.2
Motivating Example
Let X denote the number of particles that will be emitted from a radioactive source
in the next one minute period. We know that X will turn out to be equal to one of
the non-negative integers but, apart from that, we know nothing about which of the
possible values are more or less likely to occur. The quantity X is said to be a random
variable.
Suppose we are told that the random variable X has a Poisson distribution with
parameter θ = 2. Then, if x is some non-negative integer, we know that the probability
that the random variable X takes the value x is given by the formula
P (X = x) =
θx exp (−θ)
x!
where θ = 2. So, for instance, the probability that X takes the value x = 4 is
P (X = 4) =
24 exp (−2)
= 0.0902 .
4!
6
We have here a probability model for the random variable X. Note that we are using
upper case letters for random variables and lower case letters for the values taken by
random variables. We shall persist with this convention throughout the course.
Let us still assume that the random variable X has a Poisson distribution with
parameter θ but where θ is some unspecified positive number. Then, if x is some nonnegative integer, we know that the probability that the random variable X takes the
value x is given by the formula
P (X = x|θ) =
θx exp (−θ)
,
x!
(1.1)
for θ ∈ R+ . However, we cannot calculate probabilities such as the probability that X
takes the value x = 4 without knowing the value of θ.
Suppose that, in order to learn something about the value of θ, we decide to measure
the value of X for each of the next 5 one minute time periods. Let us use the notation
X1 to denote the number of particles emitted in the first period, X2 to denote the
number emitted in the second period and so forth. We shall end up with data consisting
of a random vector X = (X1 , X2 , . . . , X5 ). Consider x = (x1 , x2 , x3 , x4 , x5 ) =
(2, 1, 0, 3, 4). Then x is a possible value for the random vector X. We know that the
probability that X1 takes the value x1 = 2 is given by the formula
P (X = 2|θ) =
θ2 exp (−θ)
2!
and similarly that the probability that X2 takes the value x2 = 1 is given by
P (X = 1|θ) =
θ exp (−θ)
1!
and so on. However, what about the probability that X takes the value x? In order for
this probability to be specified we need to know something about the joint distribution
of the random variables X1 , X2 , . . . , X5 . A simple assumption to make is that the random variables X1 , X2 , . . . , X5 are mutually independent. (Note that this assumption
may not be correct since X2 may tend to be more similar to X1 that it would be to X5 .)
However, with this assumption we can say that the probability that X takes the value x
7
is given by
P (X = x|θ)
=
5
Y
θxi exp (−θ)
i=1
2
=
=
xi !
,
θ exp (−θ) θ1 exp (−θ) θ0 exp (−θ)
×
×
2!
1!
0!
θ3 exp (−θ) θ4 exp (−θ)
×
,
×
3!
4!
θ10 exp (−5θ)
.
288
In general, if X = (x1 , x2 , x3 , x4 , x5 ) is any vector of 5 non-negative integers, then
the probability that X takes the value x is given by
P (X = x|θ)
=
5
Y
θxi exp (−θ)
i=1
P5
=
θ
i=1
xi !
,
xi
exp (−5θ)
.
5
Q
xi !
i=1
We have here a probability model for the random vector X.
Our plan is to use the value x of X that we actually observe to learn something
about the value of θ. The ways and means to accomplish this task make up the subject
matter of this course. The central tool for various statistical inference techniques is
the likelihood method. Below we present a simple introduction to it using the Poisson
model for radioactive decay.
1.2.1
Probability vs. likelihood
. In the introduced Poisson model for a given θ, say θ = 2, we can observe a function
p(x) of probabilities of observing values x = 0, 1, 2, . . . . This function is referred to
as probability mass function . The graph of it is presented below
The usage of such function can be utilized in bidding for a recorded result of future
experiments. If one wants to optimize chances of correctly predicting the future, the
choice of the number of recorded particles would be either on 1 or 2.
So far, we have been told that the random variable X has a Poisson distribution with
parameter θ where θ is some positive number and there are physical reason to assume
8
0.25
0.20
0.15
0.10
0.00
0.05
Probability
0
2
4
6
8
10
Number of particles
Figure 1.2: Probability mass function for Poisson model with θ = 2.
that such a model is correct. However, we have arbitrarily set θ = 2 and this is more
questionable. How can we know that it is correct a correct value of the parameter? Let
us analyze this issue in detail.
If x is some non-negative integer, we know that the probability that the random
variable X takes the value x is given by the formula
P (X = x|θ) =
θx e−θ
,
x!
for θ > 0. But without knowing the true value of θ, we cannot calculate probabilities
such as the probability that X takes the value x = 1.
Suppose that, in order to learn something about the value of θ, an experiment is
performed and a value of X = 5 is recorded. Let us take a look at the probability mass
function for θ = 2 in Figure 1.2. What is the probability of X to take value 2? Do we
like what we see? Why? Would you bet 1 or 2 in the next experiment?
We certainly have some serious doubt about our choice of θ = 2 which was arbitrary anyway. One can consider, for example, θ = 7 as an alternative to θ = 2. Here
are graphs of the pmf for the two cases. Which of the two choices do we like? Since it
9
0.15
0.25
Probability
0.10
0.20
0.15
0.05
Probability
0.10
0.00
0.05
0.00
0
2
4
6
8
0
10
2
4
6
8
10
Number of particles
Number of particles
Figure 1.3: The probability mass function for Poisson model with θ = 2 vs. the one
with θ = 7.
was more probable to get X = 5 under the assumption θ = 7 than when θ = 2, we say
θ = 7 is more likely to produce X = 5 than θ = 2. Based on this observation we can
develop a general strategy for chosing θ.
Let us summarize our position. So far we know (or assume) about the radioactive
emission that it follows Poisson model with some unknown θ > 0 and the value x = 5
has been once observed. Our goal is somehow to utilized this knowledge. First, we
note that the Poisson model is in fact not only a function of x but also of θ
p(x|θ) =
θx e−θ
.
x!
Let us plug in the observed x = 5, so that we get a function of θ that is called
likelihood function
l(θ) =
θ5 e−θ
.
120
The graph of it is presented on the next figure. Can you localize on this graph the values
of probabilities that were used to chose θ = 7 over θ = 2? What value of θ appears to
be the most preferable if the same argument is extended to all possible values of θ? We
observe that the value of θ = 5 is most likely to produce value x = 5. In the result of
our likelihood approach we have used the data x = 5 and the Poisson model to make
inference - an example of statistical inference .
10
0.15
0.10
0.00
0.05
Likelihood
0
5
10
15
theta
Figure 1.4: Likelihood function for the Poisson model when the observed value is
x = 5.
Likelihood – Poisson model backward
Poisson model can be stated as a probability mass function that maps possible values
x into probabilities p(x) or if we emphasize the dependence on θ into p(x|θ) that is
given below
p(x|θ) = l(θ|x) =
θx e−θ
,
x!
• With the Poisson model with given θ one can compute probabilities that various
possible numbers x of emitted particles can be recorded, i.e. we consider
x 7→ p(x|θ)
with θ fixed. We get the answer how probable are various outcomes x.
• With the Poisson model where x is observed and thus fixed one can evaluate how
likely it would be to get x under various values of θ, i.e. we consider
θ 7→ l(θ|x)
with θ fixed. We get the answer how likely various θ could produced the observed
x.
11
Exercise 1. For the general Poisson model
p(x|θ) = l(θ|x) =
θx e−θ
,
x!
1. for a given θ > find the most probable value of the observation x.
2. for a given observation x find the most likely value of θ.
Give a mathematical argument for your claims.
1.2.2
More data
Suppose that we perform another measurement of the number of emitted particles. Let
us use the notation X1 to denote the number of particles emitted in the first period, X2
to denote the number emitted in the second period. We shall end up with data consisting
of a random vector X = (X1 , X2 ). The second measurement yielded x2 = 2, so that
x = (x1 , x2 ) = (5, 2).
We know that the probability that X1 takes the value x1 = 5 is given by the formula
P (X = 5|θ) =
θ5 e−θ
5!
and similarly that the probability that X2 takes the value x2 = 2 is given by
P (X = 2|θ) =
θ2 e−θ
.
2!
However, what about the probability that X takes the value x = (5, 2)? In order for
this probability to be specified we need to know something about the joint distribution
of the random variables X1 , X2 . A simple assumption to make is that the random
variables X1 , X2 are mutually independent. In such a case the probability that X takes
the value x = (x1 , x2 ) is given by
P (X = (x1 , x2 )|θ) =
θx1 e−θ θx2 e−θ
θx1 +x2
·
= e−2θ
.
x1 !
x2 !
x1 !x2 !
After little of algebra we easily find the likelihood function of observing X = (5, 2)
as
l(θ|(5, 2)) = e−2θ
12
θ7
240
0.025
0.020
0.015
0.010
0.000
0.005
Likelihood
0
5
10
15
0.10
0.00
0.05
Likelihood
0.15
theta
0
5
10
15
theta
Figure 1.5: Likelihood of observing (5, 2) (top) vs. the one of observing 5 (bottom).
and its graph is presented in Figure 1.5 in comparison with the previous likelihood for
a single observation.
Two important effects of adding an extra information should be noted
• We observe that the location of the maximum shifted from 5 to 3 compared to
single observation.
• We also note that the range of likely values for θ has diminished.
Let us suppose that eventually we decide to measure three more values of X.
Let us use the vector notation X = (X1 , X2 , . . . , X5 ) to denote observable random
13
vector. Assume that three extra measurements yielded 3, 7, 7 so that we have x =
(x1 , x2 , x3 , x4 , x5 ) = (5, 2, 3, 7, 7). Under the assumption of independence the probability that X takes the value x is given by
P (X = x|θ) =
5
Y
θxi e−θ
xi !
i=1
.
The likelihood function of observing X = (5, 2, 3, 7, 7) under independence can
be easily derived to be
θ24 e−5θ
.
14515200
In general, if X = (x1 , . . . , xn ) is any vector of 5 non-negative integers, then the
likelihood is given by
l(θ|(x1 , . . . , xn )
=
θ
Pn
i=1
xi −nθ
e
n
Q
.
xi !
i=1
The value θb that maximizes this likelihood is called the maximum likelihood estimator
of θ.
In order to find values that effectively maximize likelihood, the method of calculus
can be implemented. We note that in our example we deal only with one variable θ and
computation of derivative is rather straightforward.
Exercise 2. For the general case of likelihood based on Poisson model
l(θ|x1 , . . . , xn )
=
θ
Pn
i=1
xi −nθ
n
Q
e
xi !
i=1
using methods of calculus derive a general formula for the maximum likelihood estimator of θ. Using the result find θb for (x1 , x2 , x3 , x4 , x5 ) = (5, 2, 3, 7, 7).
Exercise 3. It is generally believed that time X that passes until there is half of the
original radioactive material follow exponential distribution f (x|θ) = θe−θx , x > 0.
For beryllium 11 five experiments has been performed and values 13.21, 13.12, 13.95,
13.54, 13.88 seconds has been obtained. Find and plot the likelihood function for θ and
based on this determine the most likely θ.
14
1.3
Likelihood and theory of statistics
The strategy of making statistical inference based on the likelihood function as described above is the recurrent theme in mathematical statistics and thus in our lecture.
Using mathematical argument we would compare various strategies to infering about
the parameters and often we will demonstrate that the likelihood based methods are
optimal. It will show its strength also as a criterium deciding between various claims
about parameters of the model which is the leading story of testing hypotheses.
In the modern days, the role of computers has increased in statistical methodology.
New computationally intense methods of data explorations become one of the central
areas of modern statistcs. Even there, methods that refer to likelihood play dominant
roles, in particular, in Bayesian methodology.
Despite this extensive penetration of statistical methodology by likelihood techinques, by no means statistics can be reduced to analysis of likelihood. In every area of
statistics, there are important aspects that require reaching beyond likelihood, in many
cases, likelihood is not even a focus of studies and development. The purpose of this
course is to present both the importance of likelihood approach across statistics but also
presentation of topics for which likelihood plays a secondary role if any.
1.4
Computationally intensive methods of statistics
The second part of our presentation of modern statistical inference is devoted to computationally intensive statistical methods. The area of data explorations is rapidly growing
in importance due to
• common access to inexpensive but advance computing tools,
• emerging of new challenges associated with massive highly dimensional data far
exceeding traditional assumptions on which traditional methods of statistics have
been based.
In this introduction we give two examples that illustrate the power of modern computers
and computing software both in analysis of statistical models and in performing actual
15
statistical inference. We start with analyzing a performance of a statistical procedure
using random sample generation.
1.4.1
Monte Carlo methods – studying statistical methods using
computer generated random samples
Randomness can be used to study properties of a mathematical model. The model itself
may be probabilistic or not but here we focus on the probabilistic ones. Essentially, it
is based on repetitive simulations of random samples corresponding to the model and
observing behavior of objects of interests. An example of Monte Carlo method is approximate the area of circle by tossing randomly a point (typically computer generated)
on the paper where a circle is drawn. The percentage of points that fall inside the circle
represents (approximately) percentage of the area covered by the circle, as illustrated
in Figure 1.6.
Exercise 4. Write an R code that would explore the area of an elipsoid using Monte
Carlo method.
Below we present an application of Monte Carlo approach to studying fitting methods for the Poisson model.
Deciding for Poisson model
Recall that the Poisson model is given by
P (X = x|θ) =
θx e−θ
.
x!
It is relatively easy to demonstrate that the mean value of this distribution is equal to θ
and standard deviation is also equal to θ.
Exercise 5. Present a formal argument showing that for a Poisson random variable X
with parameter θ, EX = θ and VarX = θ.
Thus for a sample of observations x = (x1 , . . . , xn ) it is reasonable to consider
16
Figure 1.6: Monte Carlo study of the circle area – approximation for sample size of
10000 is 3.1248 which compares to the true value of π = 3.141593.
both
θb1 = x̄,
θb2 = x¯2 − x̄2
as estimators of θ.
We want to employ Monte Carlo method to decide which one is better. In the
process we run many samples from the Poisson distribution and check which of the
17
100
0
Frequency
Histogram of means
2.5
3.0
3.5
4.0
4.5
5.0
5.5
means
300
150
0
Frequency
Histogram of vars
0
5
10
15
vars
Figure 1.7: Monte Carlo results of comparing estimation of θ = 4 by the sample mean
(left) vs. estimation using the sample standard deviation right.
estimates performs better. The resulting histograms of the values of estimator are presented in Figure 1.8. It is quite clear from the graphs that the estimator based on the
mean is better than the one based on the variance.
1.4.2
Bootstrap – performing statistical inference using computers
Bootstrap (resampling) methods are one of the examples of Monte Carlo based statistical analysis. The methodology can be summarized as follows
• Collect statistical sample, i.e. the same type of data as in classical statistics.
• Used a properly chosen Monte Carlo based resampling from the data using RNG
– create so called bootstrap samples.
• Analyze bootstrap samples to draw conclusions about the random mechanism
18
that produced the original statistical data.
This way randomness is used to analyze statistical samples that, by the way, are also a
result of randomness. An example illustrating the approach is presented next.
Estimating nitrate ion concentration
Nitrate ion concentration measurements in a certain chemical lab has been collected
and their results are given in the following table. The goal is to estimate, based on
0.51
0.51
0.51
0.50
0.51
0.49
0.52
0.53
0.50
0.47
0.51
0.52
0.53
0.48
0.49
0.50
0.52
0.49
0.49
0.50
0.49
0.48
0.46
0.49
0.49
0.48
0.49
0.49
0.51
0.47
0.51
0.51
0.51
0.48
0.50
0.47
0.50
0.51
0.49
0.48
0.51
0.50
0.50
0.53
0.52
0.52
0.50
0.50
0.51
0.51
Table 1.1: Results of 50 determinations of nitrate ion concentration in µg per ml.
these values, the actual nitrate ion concentration. The overall mean of all observations
is 0.4998. It is natural to ask what is the error of this determination of the nitrate
concentration. If we would repeat our experiment of collecting 50 samples of nitrate
concentrations many times we would see the range of error that is made. However,
it would be a waste of resources and not a viable method at all. Instead we resample
‘new’ data from our data and use so obtained new samples for assessment of the error
and compare the obtained means (bootstrap means) with the original one. The differences of these represent the bootstrap “estimation” errors their distribution is viewed
as a good representation of the distribution of the true error. In Figure ??, we see the
bootstrap counterpart of the distribution of the estimation error.
Based on this we can safely say that the nitrate concentration is 49.99 ± 0.005.
Exercise 6. Consider a sample of daily number of buyers in a furniture store
8, 5, 2, 3, 1, 3, 9, 5, 5, 2, 3, 3, 8, 4, 7, 11, 7, 5, 12, 5
Consider the two estimators of θ for a Poisson distribution as discussed in the previous
section. Describe formally the procedure (in steps) of obtaining a bootstrap confidence
19
60
40
0
20
Frequency
80
Histogram of bootstrap
-0.006
-0.004
-0.002
0.000
0.002
0.004
0.006
0.008
bootstrap
Figure 1.8: Boostrap estimation error distribution.
interval for θ using each of the discussed estimatoand provide with 95% bootstrap
confidence intervals for each of them.
20
Chapter 2
Review of Probability
2.1
Expectation and Variance
The expected value E[Y ] of a random variable Y is defined as
∞
X
E[Y ] =
yi P (yi );
i=0
if Y is discrete, and
Z
∞
E[Y ] =
yf (y)dy;
−∞
if Y is continuous, where f (y) is the probability density function. The variance Var[Y ]
of a random variable Y is defined as
Var[Y ] = E(Y − E[Y ])2 ;
or
Var[Y ] =
∞
X
(yi − E[Y ])2 P (yi );
i=0
if Y is discrete, and
Z
∞
V ar[Y ] =
(y − E[Y ])2 f (y)dy;
−∞
if Y is continuous. When there is no ambiguity we often write EY for E[Y ], and VarY
for Var[Y ].
21
A function of a random variable is itself a random variable. If h(Y ) is function of
the random variable Y , then the expected value of h(Y ) is given by
E[h(Y )] =
∞
X
h(yi )P (yi );
i=0
if Y is discrete, and if Y is continuous
Z
∞
E[h(Y )] =
h(y)f (y) dy.
−∞
It is relatively straightforward to derive the following results for the expectation
and variance of a linear function of Y .
E[aY + b] = aE[Y ] + b,
V ar[aY + b] = a2 Var[Y ],
where a and b are constants. Also
Var[Y ] = E[Y 2] − (E[Y ])2
(2.1)
For expectations, it can be shown more generally that
E
k
X
i=1
ai hi (Y ) =
k
X
ai E[hi (Y )],
i=1
where ai , i = 1, 2, . . . , k are constants and hi (Y ), i = 1, 2, . . . , k are functions of the
random variable Y .
2.2
Distribution of a Function of a Random Variable
If Y is a random variable than for any regular function X = g(Y ) is also a random
variable. The cumulative distribution function of X is given as
FX (x) = P (X ≤ x) = P (Y ∈ g −1 (−∞, x]).
The density function of X if exists can be found by differentiating the right hand side
of the above equality.
22
Example 1. Let Y has a density fY and X = Y 2 . Then
√
√
√
√
FX (x) = P (Y 2 < x) = P (− x ≤ Y ≤ x) = FY ( x) − FY (− x).
By taking a derivative in x we obtain
√
√ 1
fX (x) = √ fY ( x) + fY (− x) .
2 x
If additionally the distribution of Y is symmetric around zero, i.e. fY (y) = fY (−y),
then
√
1
fX (x) = √ fY ( x).
x
Exercise 7. Let Z be a random variable with the density fZ (z) = e−z
2
/2
√
/ 2π, so
called the standard normal (Gaussian) random variable. Show that Z 2 is a Gamma(1/2, 1/2)
random variable, i.e. that it has the density given by
1
√ x−1/2 e−x/2 .
2π
The distribution of Z 2 is also called chi-square distribution with one degree of freedom.
Exercise 8. Let FY (y) be a cumulative distribution function of some random variable
Y that with probability one takes values in a set RY . Assume that there is an inverse
function FY−1 [0, 1] 7→ RY so that FY FY−1 (u) = u for u ∈ [0, 1]. Check that for U ∼
U nif (0, 1) the random variable Ỹ = FY−1 (U ) has FY as its cumulative distribution
function.
The densities of g(Y ) are particularly easy to express if g is a strictly monotone as
shown in the next result
Theorem 2.2.1. Let Y be a continuous random variable with probability density function fY . Suppose that g(y) is a strictly monotone (increasing or decreasing) differentiable (and hence continuous) function of y. The random variable Z defined by
Z = g(Y ) has probability density function given by
d −a −1
fZ (z) = fY g (z) g (z)
dz
where g −1 (z) is defined to be the inverse function of g(y).
23
(2.2)
Proof. Let g(y) be a monotone increasing (decreasing) function and let FY (y) and
FZ (z) denote the probability distribution functions of the random variables Y and Z.
Then
FZ (z) = P (Z ≤ z) = P (g(Y ) ≤ z) = P (Y ≤ (≥)g −1 (z)) = (1−)FY (g −1 (z))
By the chain rule,
−1 dg
d
d
−1
fZ (z) =
FZ (z) = (−) FY (g (z)) = fY (g−1 (z)) (z) .
dz
dz
dz
Exercise 9. (The Log-Normal Distribution) Suppose Z is a standard normal distribution and g(z) = eaz+b . Then Y = g(Z) is called a log-normal random variable.
Demonstrate that the density of Y is given by
fY (y) = √
2.3
log2 (y/eb )
y −1 exp −
.
2a2
2πa2
1
Transforms Method Characteristic, Probability Generating and Moment Generating Functions
The probability generating function of a random variable Y is a function denoted by
GY (t) and defined by
GY (t) = E(tY ),
for those t ∈ R for which the above expectation is convergent. The expectation defining
GY (t) converges absolutely if |t| ≤ 1. As the name implies, the p.g.f generates the
probabilities associated with a discrete distribution P (Y = j) = pj , j = 0, 1, 2, . . . .
GY (0) = p0 , G0Y (0) = p1 , G”Y (0) = 2!p2 .
In general the kth derivative of the p.g.f of Y satisfies
G( k)Y (0) = k!pk .
24
The p.g.f can be used to calculate the mean and variance of a random variable Y . Note
P∞
that in the discrete case G0Y (t) = j=1 jpj tj−1 for −1 < t < 1. Let t approach one
from the left, t → 1− , to obtain
G0Y (1) =
∞
X
jpj = E(Y ) = µY .
j=1
The second derivative of GY (t) satisfies
G”Y (t) =
∞
X
j(j − 1)pj tj−2 ,
j=1
and consequently
G”Y (1) =
∞
X
j = 1j(j − 1)pj = E(Y 2 ) − E2 (Y ).
The variance of Y satisfies
2
σY2 = EY 2 − EY + EY − E2 Y = G”Y (1) + G0Y (1) − G0 Y (1).
The moment generating function (m.g.f) of a random variable Y is denoted by
MY (t) and defined as
MY (t) = E etY ,
for some t ∈ R. The moment generating function generates the moments EY k
MY (0) = 1, MY0 (0) = µY = E(Y ), M ”Y (0) = EY 2 ,
and, in general,
M ( k)Y (0) = EY k .
The characteristic function (ch.f) of a random variable Y is defined by
φY (t) = EeitY ,
where i =
√
−1.
A very important result concerning generating functions states that the moment
generating function uniquely defines the probability distribution (provided it exists in
an open interval around zero). The characteristic function also uniquely defines the
probability distribution.
25
Property 1. If Y has the characteristic function φY (t) and the moment generating
function MY (t), then for X = a + bY
φX (t) =eait φY (bt)
MX (t) =eat MY (bt).
2.4
2.4.1
Random Vectors
Sums of Independent Random Variables
Suppose that Y1 , Y2 , . . . , Yn are independent random variables. Then the moment genPn
erating function of the linear combination Z = i=1 ai Yi is the product of the individual moment generating functions.
MZ (t) =Eet
P
a i Yi
=Eea1 tY1 Eea2 tY2 · · · Eean tYn
=
n
Y
MYi (ai Yi ).
i=1
The same argument gives also that φZ (t) =
Qn
i=1
φYi (aiY i).
When X and Y are discrete random variables, the condition of independence is
equiva- lent to pX,Y (x, y) = pX (x)pY (y) for all x, y. In the jointly continuous case
the condition of independence is equivalent to fX,Y (x, y) = fX (x)fY (y) for all x, y.
Consider random variables X and Y with probability densities fX (x) and fY (y) respectively. We seek the probability density of the random variable X + Y . Our general
result follows from
FX+Y (a) =P (X + Y < a)
Z Z
=
fX (x)fY (y) dxdy
Z
X+Y <a
∞ Z a−y
=
fX (x)fY (y) dxdy
−∞
Z ∞
−∞
Z a
−∞
Z a
−∞
Z ∞
−∞
−∞
fX (z − y) dz fY (y) dy
=
fX (z − y)fY (y) dy dz
=
26
(2.3)
Thus the density function fX+Y (z) =
R∞
−∞
fX (z − y)fY (y) dy which is called the
convolution of the densities fX and fY .
2.4.2
Covariance and Correlation
Suppose that X and Y are real-valued random variables for some random experiment.
The covariance of X and Y is defined by
Cov(X, Y ) = E[(X − EX)(Y − EY )]
and (assuming the variances are positive) the correlation of X and Y is defined by
Cov(X, Y )
p
.
ρ(X, Y ) = p
Var(X) Var(Y )
Note that the covariance and correlation always have the same sign (positive, negative, or 0). When the sign is positive, the variables are said to be positively correlated,
when the sign is negative, the variables are said to be negatively correlated, and when
the sign is 0, the variables are said to be uncorrelated. For an intuitive understanding
of correlation, suppose that we run the experiment a large number of times and that
for each run, we plot the values (X, Y ) in a scatterplot. The scatterplot for positively
correlated variables shows a linear trend with positive slope, while the scatterplot for
negatively correlated variables shows a linear trend with negative slope. For uncorrelated variables, the scatterplot should look like an amorphous blob of points with no
discernible linear trend.
Property 2. You should satisfy yourself that the following are true
Cov(X, Y ) =EXY − EXEY
Cov(X, Y ) =Cov(Y, X)
Cov(Y, Y ) =Var(Y )
Cov(aX + bY + c, Z) =aCov(X, Z) + bCov(Y, Z)
!
n
n
X
X
Var
Yi =
Cov(Yi , Yj )
i=1
i,j=1
If X and Y are independent, then they are uncorrelated. The converse is not true
however.
27
2.4.3
The Bivariate Change of Variables Formula
Suppose that (X, Y ) is a random vector taking values in a subset S of R2 with probability density function f . Suppose that U and V are random variables that are functions
of X and Y
U = U (X, Y ), V = V (X, Y ).
If these functions have derivatives, there is a simple way to get the joint probability
density function g of (U, V ). First, we will assume that the transformation (x, y) 7→
(u, v) is one-to-one and maps S onto a subset T of R2 . Thus, the inverse transformation
(u, v) 7→ (x, y) is well defined and maps T onto S. We will assume that the inverse
transformation is “smooth”, in the sense that the partial derivatives
∂x ∂x ∂y ∂y
,
,
,
,
∂u ∂v ∂u ∂v
exist on T , and the Jacobian
∂(x, y) =
∂(u, v) ∂x
∂u
∂y
∂u
∂x
∂v
∂y
∂v
∂x ∂y ∂x ∂y
=
∂u ∂v − ∂v ∂u
is nonzero on T . Now, let B be an arbitrary subset of T . The inverse transformation
maps B onto a subset A of S. Therefore,
Z Z
P ((U, V ) ∈ B) = P ((X, Y ) ∈ A) =
f (x, y) dxdy.
A
But, by the change of variables formula for double integrals, this can be written as
Z Z
∂(x, y) dudv.
P ((U, V ) ∈ B) =
f (x(u, v), y(u, y)) ∂(u, v) B
By the very meaning of density, it follows that the probability density function of
(U, V ) is
∂(x, y) , (u, v) ∈ T.
g(u, v) = f (x(u, v), y(u, v)) ∂(u, v) The change of variables formula generalizes to Rn .
Exercise 10. Let U1 and U2 be independent random variables with the density equal
to one over [0, 1], i.e. standard uniform random variables. Find the density of the
following vector of variables
p
p
(Z1 , Z2 ) = ( −2 log U1 cos(2πU2 ), −2 log U1 sin(2πU2 )).
28
2.5
Discrete Random Variables
2.5.1
Bernoulli Distribution
A Bernoulli trial is a probabilistic experiment which can have one of two outcomes,
success (Y = 1) or failure (Y = 0) and in which the probability of success is θ. We
refer to θ as the Bernoulli probability parameter. The value of the random variable Y is
used as an indicator of the outcome, which may also be interpreted as the presence or
absence of a particular characteristic. A Bernoulli random variable Y has probability
mass function
P (Y = y|θ) = θy (1 − θ)1 − y
(2.4)
for y = 0, 1 and some θ ∈ (0, 1). The notation Y ∼ Ber(θ) should be read as the
random variable Y follows a Bernoulli distribution with parameter θ.
A Bernoulli random variable Y has expected value E[Y ] = 0 · P (Y = 0) + 1 ·
P (Y = 1) = 0·(1−θ)+1·θ = θ, and variance Var[Y ] = (0−θ)2·(1−θ)+(1−θ)2 ·θ =
θ(1 − θ).
2.5.2
Binomial Distribution
Consider independent repetitions of Bernoulli experiments, each with a probability of
success θ. Next consider the random variable Y , defined as the number of successes in
a fixed number of independent Bernoulli trials, n . That is,
Y =
n
X
Xi ,
i=1
where Xi ∼ Bernoulli(θ) for i = 1, . . . , n. Each sequence of length n containing y
“ones” and (n − y) “zeros” occurs with probability θy(1 − θ)( n − y). The number of
sequences with y successes, and consequently (n − y) fails, is
n!
n
=
.
y!(n − y)!
y
The random variable Y can take on values y = 0, 1, 2, . . . , n with probabilities
n y
P (Y = y|θ) =
θ (1 − θ)n−y .
(2.5)
y
29
The notation Y ∼ Bin(n, θ) should be read as “the random variable Y follows a binomial distribution with parameters n and θ.” Finally using the fact that Y is the sum
of n independent Bernoulli random variables we can calculate the expected value as
P
P
P
E[Y ] = E[ Xi ] = P E[Xi ] =
θ = nθ and variance as Var[Y ] = V ar[ Xi ] =
P
P
Var[Xi ] = θ(1 − θ) = nθ(1 − θ).
2.5.3
Negative Binomial and Geometric Distribution
Instead of fixing the number of trials, suppose now that the number of successes, r,
is fixed, and that the sample size required in order to reach this fixed number is the
random variable N . This is sometimes called inverse sampling. In the case of r = 1,
using the independence argument again, leads to geomoetric distribution
P (N = n|θ) = θ(1 − θ)n−1 , n = 1, 2, . . .
(2.6)
for n = 1, 2, . . . which is the geometric probability function with parameter θ. The
distribution is so named as successive probabilities form a geometric series. The notation N ∼ Geo(θ) should be read as “the random variable N follows a geometric
distribution with parameter θ.” Write (1 − θ) = q. Then
∞
X
d n
d
nq n θ = θ
E[N ] =
(q ) = θ
dq
dq
n=0
n=1
d
1
1
θ
=θ
= .
=
dq 1 − q
(1 − q)2
θ
∞
X
∞
X
!
qn
n=0
Also,
∞
∞
X
d
d X n
E[N ] =
n q
θ=θ
(nq n ) = θ
nq
dq
dq n=1
n=1
n=1
d
q
d
=θ θ
E(N ) = θ
q(1 − q)−2
dq
1−q
dq
1
2(1 − θ)
2
1
=θ
+
= 2− .
θ2
θ3
θ
θ
2
∞
X
!
2 n−1
Using Var[N ] = E[N 2 ] − (E[N ])2 , we get Var[N ] = (1 − θ)/θ2 .
Consider now sampling continues until a total of r successes are observed. Again,
let the random variable N denote number of trial required. If the rth success occurs
30
on the nth trial, then this implies that a total of (r − 1) successes are observed by the
(n − 1)th trial. The probability of this happening can be calculated using the binomial
distribution as
n − 1 r−1
θ (1 − θ)n−r .
r−1
The probability that the nth trial is a success is θ. As these two events are independent we have that
P (N = n|r, θ) =
n−1 r
θ (1 − θ)n−r
r−1
(2.7)
for n = r, r + 1, . . . . The notation N ∼ N egBin(r, θ) should be read as “the random
variable N follows a negative binomial distribution with parameters r and θ.” This is
also known as the Pascal distribution.
∞
X
k
k n−1
θr (1 − θ)n−r
E[N ] =
n
r−1
n=r
∞
r X k−1 n r+1
n−1
n
n−r
=
n
θ (1 − θ)
since n
=r
θ n=r
r
r−1
r
∞
m − 1 r+1
r X
(m − 1)k−1
θ (1 − θ)m−(r+1)
=
r
θ m=r+1
=
r E (X − 1)k−1 ,
θ
where X ∼ N egativebinomial(r + 1, θ). Setting k = 1 we get E(N ) = r/θ. Setting
k = 2 gives
r
r
E[N 2] = E(X − 1) =
θ
θ
r+1
−1 .
θ
Therefore Var[N ] = r(1 − θ)/θ2 .
2.5.4
Hypergeometric Distribution
The hypergeometric distribution is used to describe sampling without replacement.
Consider an urn containing b balls, of which w are white and b − w are red. We
intend to draw a sample of size n from the urn. Let Y denote the number of white balls
selected. Then, for y = 0, 1, 2, . . . , n we have
P (Y = y|b, w, n) =
31
w
y
b−w
n−y
b
n
.
(2.8)
The expected value of the jth moment of a hypergeometric random variable is
w b−w
n
n
X
X
j
j y n−y
E[Y ] =
y P (Y = y) =
y
.
b
y=0
n
y=1
The identities
w
w−1
=w
y
y−1
b
b−1
n
=b
n
n−1
y
can be used to obtain
n
nw X j−1
E[Y ] =
y
b y=1
j
w−1 b−w
y−1 n−1
b−1
n−1
n−1
nw X
(x + 1)j−1
=
b x=0
=
w−1
x
b−w
n−1−x
b−1
n−1
nw
E[(X + 1)j−1 ]
b
where X is a hypergeometric random variable with parameters n−1, b−1, w−1. From
this it is easy to establish that E[Y ] = nθ and Var[Y ] = nθ(1 − θ)(b − n)/(b − 1),
where θ = w/b is the fraction of white balls in the population.
2.5.5
Poisson Distribution
Certain problems involve counting the number of events that have occurred in a fixed
time period. A random variable Y , taking on one of the values 0, 1, 2, . . . , is said to be
a Poisson random variable with parameter θ if for some θ > 0,
P (Y = y|θ) =
θy −θ
e , y = 0, 1, 2, . . .
y!
(2.9)
The notation Y ∼ P ois(θ) should be read as “random variable Y follows a Poisson
distribution with parameter θ.” Equation 2.9 defines a probability mass function, since
∞
X
θy
y=0
y!
e−θ = e−θ
∞
X
θy
y=0
y!
= e−θ eθ = 1.
The expected value of a Poisson random variable is
E[Y ] =
∞
X
y=0
ye−θ
∞
∞
X
X
θy
θy−1
θj
= θe−θ
= θe−θ
= θ.
y!
(y − 1)!
(j)!
y=1
j=0
32
To get the variance we first compute the second moment
E[Y 2 ] =
∞
X
y=0
y 2 e−θ
∞
∞
X
X
θy
θy−1
θj
=θ
=θ
= θ(θ + 1).
ye−θ
(j + 1)e−θ
y!
y − 1!
j!
y=1
j=0
Since we already have E[Y ] = θ, we obtain Var[Y ] = E[Y 2 ] − (E[Y ])2 = θ.
Suppose that Y ∼ Binomial(n, p), and let θ = np. Then
n y
P (Y = y|np) =
p (1 − p)n−y
y
n−y
y n
θ
θ
=
1−
y
n
n
n(n − 1) · · · (n − y + 1) θy (1 − θ/n)n
.
=
ny
y! (1 − θ/n)y
For n large and θ “moderate”, we have that
n
y
θ
n(n − 1) · · · (n − y + 1)
θ
1−
≈ e−θ ,
≈
1,
1
−
≈ 1.
n
ny
n
Our result is that a binomial random variable Bin(n, p) is well approximated by a
Poisson random variable P ois(θ = np) when n is large and p is small. That is
P (Y = y|n, p) ≈ e−np
2.5.6
(np)y
.
y!
Discrete Uniform Distribution
The discrete uniform distribution with integer parameter N has a random variable Y
that can take the vales y = 1, 2, . . . , N with equal probability 1/N . It is easy to show
that the mean and variance of Y are E[Y ] = (N + 1)/2, and Var[Y ] = (N 2 − 1)/12.
2.5.7
The Multinomial Distribution
Suppose that we perform n independent and identical experiments, where each experiment can result in any one of r possible outcomes, with respective probabilities
Pr
p1 , p2 , . . . , pr , where i=1 pi = 1. If we denote by Yi , the number of the n experiments that result in outcome number i, then
P (Y1 = n1 , Y2 = n2 , . . . , Yr = nr ) =
33
n!
pn1 pn2 · · · pn5 r
n1 !n2 ! · · · nr ! 1 2
(2.10)
where
Pr
i=1
ni = n. Equation 2.10 is justified by noting that any sequence of out-
comes that leads to outcome i occurring ni times for i = 1, 2, . . . , r, will, by the
assumption of independence of experiments, have probability pn1 1 pn2 2 · · · pnr r of occurring. As there are n! = (n1 !n2 ! · · · nr !) such sequence of outcomes equation 2.10 is
established.
2.6
2.6.1
Continuous Random Variables
Uniform Distribution
A random variable Y is said to be uniformly distributed over the interval (a, b) if its
probability density function is given by
1
, if a < y < b
b−a
Ru
and equals 0 for all other values of y. Since F (u) = −∞ f (y)dy, the distribution
f (y|a, b) =
function of a uniform random variable on the interval (a, b) is



0;
u ≤ a,


F (u) =
(u − a)/(b − a); a < u ≤ b,




1;
u>b
The expected value of a uniform random turns out to be the mid-point of the interval,
that is
Z
∞
E[Y ] =
Z
yf (y)dy =
−∞
a
b
y
b2 − a2
b+a
dy =
=
.
b−a
2(b − a)
2
The second moment is calculated as
Z b 2
y
b3 − a3
1
2
dy =
= (b2 + ab + a2 ),
E[Y ] =
3(b − a)
3
a b−a
hence the variance is
Var[Y ] = E[Y 2 ] − (E[Y ])2 =
1
(b − a)2 .
12
The notation Y ∼ U (a, b) should be read as “the random variable Y follows a uniform
distribution on the interval (a, b)”.
34
2.6.2
Exponential Distribution
A random variable Y is said to be an exponential random variable if its probability
density function is given by
f (y|θ) = θe−θy , y > 0, θ > 0.
The cumulative distribution of an exponential random variable is given by
Z a
F (a) =
θe−θy dy = −e−θy |a0 = 1 − e−θa , a > 0.
0
The expected value E[Y ] =
R∞
0
yθe−θy dy requires integration by parts, yielding
E[Y ] = −ye−θy |∞
0 +
∞
Z
e−θy dy =
0
−e−θy ∞
1
|0 = .
θ
θ
Integration by parts can be used to verify that E[Y 2 ] = 2θ−2 . Hence Var[Y ] = 1/θ2 .
The notation Y ∼ Exp(θ) should be read as “the random variable Y follows an exponential distribution with parameter θ”.
Exercise 11. Let Y ∼ U [0, 1]. Find the distribution of Y = − log U . Can you identify
it as a one of the common distributions?
2.6.3
Gamma Distribution
A random variable Y is said to have a gamma distribution if its density function is
given by
f (y|αθ) = θα e−θy y α−1 /Γ(α), 0 < y, λ > 0, θ > 0
where Γ(α), is called the gamma function and is defined by
Z ∞
Γ(α) =
e−u uα−1 du.
0
The integration by parts of Γ(α) yields the recursive relationship
Z ∞
Γ(α) = −e−u uα−1 |∞
+
e−u (α − 1)uα−2 du
0
0
Z ∞
= (α − 1)
e−u uα−2 du = (α − 1)Γ(α − 1).
0
35
(2.11)
(2.12)
For integer values α = n, this recursive relationship reduces to Γ(n + 1) = n!. Note,
by setting α = 1 the gamma distribution reduces to an exponential distribution. The
expected value of a gamma random variable is given by
Z ∞
Z ∞
θα
θα
E[Y ] =
y α e−θy dy = θ!
uα e−u du,
Γ(α) 0
Γ(α) 0
after the change of variable u = θy. Hence E[Y ] = Γ(α + 1)/(Γ(α)θ) = α/θ. Using
the same substitution
θα
E[Y ] =
Γ(α)
2
Z
∞
y ω+1 e−θy dy =
0
(α + 1)α
,
θ2
2
so that Var[Y ] = α/θ . The notation Y ∼ Gamma(α, θ) should be read as “the
random variable Y follows a gamma distribution with parameters α and θ”.
Exercise 12. Let Y ∼ Gamma(α, θ). Show that the moment generating function for
Y is given for t ∈ (−θ, θ) by
MY (t) =
2.6.4
1
.
(1 − t/θ)α
Gaussian (Normal) Distribution
A random variable Z is a standard normal (or Gaussian) random variable if the density
of Z is specified by
2
1
f (z) = √ e−z /2 .
2π
(2.13)
It is not immediately obvious that (2.13) specifies a probability density. To show that
this is the case we need to prove
Z
∞
2
1
√ e−z /2 dy = 1
2π
−∞
√
R ∞ −z2 /2
or, equivalently, that I = −∞ e
dz = 2π. This is a “classic” results and so is
well worth confirming. Consider
Z ∞
Z ∞
Z
e − z 2 /2 dz
e−w2 /2 dw =
I2 =
−∞
−∞
∞
−∞
Z
∞
e−(z
2
+w2 )/2
dzdw.
−∞
The double integral can be evaluated by a change of variables to polar coordinates.
Substituting z = r cos θ, w = r sin θ, and dzdw = rdθdr, we get
Z ∞Z π
Z ∞
2
2
−r 2 /2
I =
e
rdθdr = 2π
re−r2 /2 dr = −2πe−r /2 |10 = 2π.
0
0
0
36
√
√
Taking the square root we get I = 2π. The result I = 2π can also be used to
√
establish the result Γ(1/2) = π. To prove that this is the case note that
Z ∞
Z ∞
√
2
−u 1/2
e−z dz = π.
e u
du = 2
Γ(1/2) =
0
0
The expected value of Z equals zero because ze−z
2
/2
is integrable and asymmetric
around zero. The variance of Z is given by
Z ∞
2
1
√
z 2 e−z /2 dz.
Var[Z] =
2π −∞
Thus
Z ∞
2
1
√
Var[Z] =
z 2 e−z /2 dz
2π −∞
Z ∞
1
−z 2 /2 ∞
2
−ze
|−∞ + +
e − z /2 dz
=√
2π
−∞
Z ∞
2
1
=√
e−z /2 dz
2π −∞
=1.
If Z is a standard normal distribution then Y = µ + σZ is called general normal
(Gaussian distribution) with parameters µ and σ. The density of Y is given by
f (y|µ, σ) = √
1
2πσ 2
e−
(y−µ)2
2σ 2
.
We have obviously E[Y ] = µ and Var[Y ] = σ 2 . The notation Y ∼ N (µ, σ 2 ) should
be read as “the random variable Y follows a normal distribution with mean parameter
µ and variance parameter σ 2 ”. From the definition of Y it follows immediately that
a + bY , where a and b are known constants, is again normal distribution.
Exercise 13. Let Y ∼ N (µ, σ 2 ). What is the distribution of X = a + bY ?
Exercise 14. Let Y ∼ N (µ, σ 2 ). Show that the moment generating function if Y is
given by
MY (t) = eµt+σ
2 2
t /2
.
Hint Consider first the standard normal variable and then apply Property 1.
37
2.6.5
Weibull Distribution
The Weibull distribution function has the form
h y a i
, y > 0.
F (y) = 1 − exp −
b
The Weibull density can be obtained by differentiation as
a y a−1
h y a i
f (y|a, b) =
exp −
.
b
b
b
To calculate the expected value
Z ∞ a
h y a i
1
y a−1 exp −
dy
E[Y ] =
ya
b
b
0
we use the substitutions u = (y/b)a , and du = ab−a y a−1 dy. These yield
Z ∞
a+1
1/a −u
E[Y ] = b
u e du = bΓ
.
a
0
In a similar manner, it is straightforward to verify that
a+2
2
2
,
E[Y ] = b Γ
a
and thus
a+2
a+1
Var[Y ] = b2 Γ
− Γ2
.
a
a
2.6.6
Beta Distribution
A random variable is said to have a beta distribution if its density is given by
f (y|a, b) =
1
y a−1 (1 − y)b−1 , 0 < y < 1.
B(a, b)
Here the function
1
Z
ua−1 (1 − u)b−1 du
B(a, b) =
0
is the “beta” function, and is related to the gamma function through
B(a, b) =
Γ(a)Γ(b)
.
Γ(a + b)
Proceeding in the usual manner, we can show that
E[Y ] =
Var[Y ] =
a
a+b
ab
.
(a + b)2 (a + b + 1)
38
2.6.7
Chi-square Distribution
Let Z ∼ N (0, 1), and let Y = Z 2 . Then the cumulative distribution function
√
√
√
√
FY (y) = P (Y ≤ y) = P (Z 2 ≤ y) = P (− y ≤ Z ≤ y) = FZ ( y) − FZ (− y)
so that by differentiating in y we arrive to the density
1
1
√
√
fY (y) = √ [fz ( y) + fz (− y)] = √
e−y/2 ,
2 y
2πy
Pn
in which we recognize Gamma(1/2, 1/2). Suppose that Y = i=1 Zi2 , where the
Zi ∼ N (0, 1) for i = 1, . . . , n are independent. From results on the sum of independent Gamma random variables, Y ∼ Gamma(n/2, 1/2). This density has the form
fY (y|n) =
e−y/2 y n/2−1
, y>0
2n/2 Γ(n/2)
(2.14)
and is referred to as a chi-squared distribution on n degrees of freedom. The notation
Y ∼ Chi(n) should be read as “the random variable Y follows a chi-squared distribution with n degrees of freedom”. Later we will show that if X ∼ Chi(u) and
Y ∼ Chi(v), it follows that X + Y ∼ Chi(u + v).
2.6.8
The Bivariate Normal Distribution
Suppose that U and V are independent random variables each, with the standard normal
distribution. We will need the following parameters µX , µY , σX > 0, σY > 0,
ρ ∈ [−1, 1]. Now let X and Y be new random variables defined by
X =µX + σX U,
V =µY + ρσY U + σY
p
1 − ρ2 V.
Using basic properties of mean, variance, covariance, and the normal distribution, satisfy yourself of the following.
Property 3. The following properties hold
1. X is normally distributed with mean µX and standard deviation σX ,
2. Y is normally distributed with mean µY and standard deviation σY ,
39
3. Corr(X, Y ) = ρ,
4. X and Y are independent if and only if ρ = 0.
The inverse transformation is
x − µX
σX
y − µY
ρ(x − µX )
p
v= p
−
σY 1 − ρ2
σX 1 − ρ2
u=
so that the Jacobian of the transformation is
∂(x, y)
1
p
=
.
∂(u, v)
σX σY 1 − ρ2
Since U and V are independent standard normal variables, their joint probability density function is
g(u, v) =
1 − u2 +v2
2
e
.
2π
Using the bivariate change of variables formula, the joint density of (X, Y ) is
ρ(x − µX )(y − µY ) (y − µY )2
1
(x − µX )2
p
+
f (x, y) =
exp − 2
2σX (1 − ρ2 )
σX σY (1 − ρ2 ) 2σY2 (1 − ρ2 )
2πσX σY 1 − ρ2
Bivariate Normal Conditional Distributions
In the last section we derived the joint probability density function f of the bivariate
normal random variables X and Y . The marginal densities are known. Then,
(y − (µY + ρσY (x − µX )/σX ))2
fY,X (y, x)
1
exp
−
fY |X (y|x) =
=p
.
fX (x)
2σY2 (1 − ρ2 )
2πσY2 (1 − ρ2 )
Then the conditional distribution of Y given X = x is also Gaussian, with
E(Y |X = x) =µY + ρσY
Var(Y |X = x) = σY2 (1 − ρ2 )
2.6.9
The Multivariate Normal Distribution
Let Σ denote the 2 × 2 symmetric matrix


2
σX
σX σY ρ


σY σX ρ
σY2
40
x − µX
σX
Then
2 2
2 2
det|Σ| = σX
σY − (σX σY ρ)2 = σX
σY (1 − ρ2 )
and
Σ−1


2
1/σX
−ρ/(σX σY )
1 
.
=
1 − ρ2
−ρ/(σX σY )
1/σY2
Hence the bivariate normal distribution (X, Y ) can be written in matrix notation as





T
x − µX 
 1 x − µX 
 .
p
Σ−1 
f(X,Y ) (x, y) =
exp − 
2
2π det|Σ|
y − µY
y − µY
1
Let Y = (Y 1, . . . , Y p) be a random vector. Let E(Yi ) = µi , i = 1, . . . , p, and define
the p-length vector µ = (µ1 , . . . , µp ). Define the p × p matrix Σ through its elements
Cov(Yi , Yj ) for i, j = 1, . . . p. Then, the random vector Y has a p-dimensional multivariate Gaussian distribution if its density function is specified by
1
1
T −1
fY (y) =
exp
−
(y
−
µ)
Σ
(y
−
µ)
.
2
(2π)p/2 |Σ|1/2
(2.15)
The notation Y ∼ M V Np (µ, Σ) should be read as “the random variable Y follows a
multivariate Gaussian (normal) distribution with p-vector mean µ and p × p variancecovariance matrix Σ.”
41
2.7
2.7.1
Distributions – further properties
Sum of Independent Random Variables – special cases
Poisson variables
Suppose X ∼ P ois(θ) and Y ∼ P ois(λ). Assume that X and Y are independent.
Then
P (X + Y = n) =
=
=
n
X
k=0
n
X
k=0
n
X
P (X = k, Y = n − k)
P (X = k)P (Y = n − k)
e−θ
k=0
=
θk −λ λn−k
e
k!
(n − k)!
n
e−(θ+λ) X
n!
θk λn−k
n!
k!(n − k)!
k=0
=e − (θ + λ)
(θ + λ)n
.
n!
That is, X + Y ∼ P ois(θ + λ).
Binomial Random Variables
We seek the distribution of Y + X, where Y ∼ Bin(n, θ) and X ∼ Bin(m, θ).
Since X + Y is modelling the situation where the total number of trials is fixed at
n + m and the probability of a success in a single trial equals θ. Without performing
a calculations, we expect to find that X + Y ∼ Bin(n + m, θ). To verify that note
that X = X1 + · · · + Xn where Xi are independent Bernoulli variables with parameter
θ while Y = Y1 + · · · + Ym where Yi are also independent Bernoulli variables with
parameter θ. Assuming that Xi ’s are independent of Yi ’s we obtain that X + Y is the
sum of n + m indpendent Bernoulli random variables with parameter θ, i.e. X + Y
has Bin(n + m, θ) distribution.
42
Gamma, Chi-square, and Exponential Random Variables
Let X ∼ Gamma(α, θ) and Y ∼ Gamma(βθ) are independent. Then the moment
generating function of X + Y is given as
MX+Y (t) = MX (t)MY (t) =
1
1
1
=
α
β
(1 + t/θ) (1 + t/θ)
(1 + t/θ)α+β
But this is the moment generating function of a Gamma random variable distributed
as Gamma(α + β, θ). The result X + Y ∼ Chi(u + v) where X ∼ Chi(u) and
Y ∼ Chi(v), follows as a corollary.
Let Y1 , . . . , Yn be n independent exponential random variables each with parameter
θ. Then Z = Y1 + Y2 + · · · + Yn is a Gamma(n, θ) random variable. To see that
this is indeed the case, write Yi ∼ Exp(θ), or alternatively, Yi ∼ Gamma(1, θ). Then
Pn
Y1 + Y2 ∼ Gamma(2, θ), and by induction i=1 Yi ∼ Gamma(n, θ).
Gaussian Random Variables
2
Let X ∼ N (µX , σX
) and Y ∼ N (µY , σY2 ). Then the moment generating function of
X + Y is given by
2
MX+Y (t) = MX (t)MY (t) = eµX t+σX t
2
2 2
/2 µY t+σY
t /2
e
2
+ σY2 ).
which proves that X + Y ∼ N (µX + µY , σX
43
2
2
= e(µX +µY )t+(σX +σY )t
2
/2
44
2.7.2
Common Distributions – Summarizing Tables
Discrete Distributions
Bernoulli(θ)
pmf
P (Y = y|θ) = θy (1 − θ)1−y , y = 0, 1, 0 ≤ θ ≤ 1
mean/variance
E[Y ] = θ, Var[Y ] = θ(1 − θ)
mgf
MY (t) = θet + (1 − θ)
Binomial(n, θ)
n
y
y
θ (1 − θ)n−y , y = 0, 1, . . . , n, 0 ≤ θ ≤ 1
pmf
P (Y = y|θ) =
mean/variance
E[Y ] = nθ, Var[Y ] = nθ(1 − θ)
mgf
MY (t) = [θet + (1 − θ)]n
Discrete uniform(N )
pmf
P (Y = y|N ) = 1/N, y = 1, 2, . . . , N
mean/variance
E[Y ] = (N + 1)/2, Var[Y ] = (N + 1)(N − 1)/12
mgf
MY (t) =
1
N
Nt
et 1−e
1−et
Geometric(θ)
pmf
P (Y = y|N ) = θ(1 − θ)y−1 , y = 1, . . . , 0 ≤ θ ≤ 1
mean/variance
E[Y ] = 1/θ, Var[Y ] = (1 − θ)/θ2
mgf
MY (t) = θet /[1 − (1 − θ)et ], t < − log(1 − θ)
notes
The random variable X = Y − 1 is NegBin(1, θ).
Hypergeometric(b, w, n)
pmf
P (Y = y|b, w, n) =
w
y
b−w
n−y
b
/ n , y = 0, 1, . . . , n,
b − (b − w) ≤ y ≤ b, b, w, n ≥ 0
mean/variance
E[Y ] = nw/b, Var[Y ] = nw(b − w)(b − n)/(b2 (b − 1))
Negative binomial(r, θ)
pmf
P (Y = y|r, θ) =
r+y−1
y
r
θ (1 − θ)y , y = 0, 1, . . . , n,
b − (b − w) ≤ y ∈ N, 0 < θ ≤ 1
mean/variance
E[Y ] = r(1 − θ)/θ, Var[Y ] = r(1 − θ)/θ2
mgf
MY (t) = θ/(1 − (1 − θ)et )r , t < − log(1 − θ)
An alternative form of the pmf, used
in our notes, is
rin the derivation
given by P (N = n|r, θ) = n−1
θ (1 − θ)n−r , n = r, r + 1, . . .
r−1
where the random variable N = Y + r. The negative binomial can also
be derived as a mixture of Poisson random variables.
notes
Poisson(θ)
pmf
P (Y = y|θ) = θy e−θ /y!, y = 0, 1, 2, . . . , 0 < θ
mean/variance
E[Y ] = θ, Var[Y ] = θ,
mgf
MY (t) = eθ(e
t
−1)
45
Continuous Distributions
Uniform U(a, b)
pmf
f (y|a, b) = 1/(b − a), a < y < b
mean/variance
E[Y ] = (b + a)/2, Var[Y ] = (b − a)2 /12,
mgf
MY (t) = (ebt − eat )/((b − a)t)
A uniform distribution with a = 0 and b = 1 is a special case of the
beta distribution where (α = β = 1).
notes
Exponential E(θ)
pmf
f (y|θ) = θe−θy , y > 0, θ > 0
mean/variance
E[Y ] = 1/θ, Var[Y ] = 1/θ2 ,
mgf
MY (t) = 1/(1 − t/θ)
Special
case of the gamma distribution. X = Y 1/γ is Weibull, X =
√
2θY is Rayleigh, X = α − γ log(Y /β) is Gumbel.
notes
Gamma G(λ, θ)
pmf
f (y|λθ) = θλ e−θy y λ−1 /Γ(λ), y > 0, λ, θ > 0
mean/variance
E[Y ] = λ/θ, Var[Y ] = λ/θ2 ,
mgf
MY (t) = 1/(1 − t/θ)λ
notes
Includes the exponential (λ = 1) and chi squared (λ = n/2, θ = 1/2).
Normal N( µ, σ 2 )
2
2
√ 1
e−(y−µ) /(2σ ) , σ
2πσ 2
2
pmf
f (y|µ, σ 2 ) =
mean/variance
E[Y ] = µ, Var[Y ] = σ ,
mgf
MY (t) = eµt+σ
notes
Often called the Gaussian distribution.
>0
2 2
t /2
Transforms
The generating functions of the discrete and continuous random variables discussed
thus far are given in Table 2.7.2.
46
Distrib.
p.g.f.
m.g.f.
(θeit + θ̄)n
θt/(1 − θ̄t)
θ/(e−t − θ̄)
θ/(e−it − θ̄)
N egBin(r, θ)
θr (1 − θ̄t)−r
θr (1 − θ̄et )−r
θr (1 − θ̄eit )−r
P oi(θ)
e−θ(1−t)
(θt + θ̄)
Geo(θ)
t
ch.f.
(θe + θ̄)
Bi(n, θ)
n
eθ(e
n
t
−1)
eθ(e
it
−1)
U nif (α, β)
eαt (eβt − 1)/(βt)
eiαt (eiβt − 1)/(iβt)
Exp(θ)
(1 − t/θ)−1
(1 − it/θ)−1
Ga(c, λ)
(1 − t/θ)−c
N (µ, σ 2 )
exp −µt + σ 2 t2 /2
(1 − it/θ)−c
exp −iµt − σ 2 t2 /2
Table 2.1: Transforms of distributions. In the formulas θ̄ = 1 − θ.
47
Chapter 3
Likelihood
3.1
Maximum Likelihood Estimation
Let x be a realization of the random variable X with probability density fX (x|θ) where
θ = (θ1 , θ2 , . . . , θm )T is a vector of m unknown parameters to be estimated. The set
of allowable values for θ, denoted by Ω, or sometimes by Ωθ , is called the parameter
space. Define the likelihood function
l(θ|x) = fX (x|θ).
(3.1)
It is crucial to stress that the argument of fX (x|θ) is x, but the argument of l(θ|x) is θ.
It is therefore convenient to view the likelihood function l(θ) as the probability of the
observed data x considered as a function of θ. Usually it is convenient to work with the
natural logarithm of the likelihood, called the log-likelihood and denoted by log l(θ|x).
When θ ∈ R^1 we can define the score function as the first derivative of the log-likelihood,
S(θ) = ∂/∂θ log l(θ).
The maximum likelihood estimate (MLE) θ̂ of θ is the solution to the score equation
S(θ) = 0.
At the maximum, the second partial derivative of the log-likelihood is negative, so we define the curvature at θ̂ as I(θ̂) where
I(θ) = − ∂^2/∂θ^2 log l(θ).
We can check that a solution θ̂ of the equation S(θ) = 0 is actually a maximum by
checking that I(θ̂) > 0. A large curvature I(θ̂) is associated with a tight or strong
peak, intuitively indicating less uncertainty about θ.
The likelihood function l(θ|x) supplies an order of preference or plausibility among possible values of θ based on the observed x. It ranks the plausibility of possible values of θ by how probable they make the observed x. If P(x|θ = θ1) > P(x|θ = θ2) then the observed x makes θ = θ1 more plausible than θ = θ2, and consequently from (3.1), l(θ1|x) > l(θ2|x). The likelihood ratio l(θ1|x)/l(θ2|x) = f(x|θ1)/f(x|θ2) is a measure of the plausibility of θ1 relative to θ2 based on the observed x. The relative likelihood l(θ1|x)/l(θ2|x) = k means that the observed value x will occur k times more frequently in repeated samples from the population defined by the value θ1 than from the population defined by θ2. Since only ratios of likelihoods are meaningful, it is convenient to standardize the likelihood with respect to its maximum.
When the random variables X1 , . . . , Xn are mutually independent we can write the
joint density as
f_X(x) = ∏_{j=1}^n f_{X_j}(x_j)
where x = (x1, . . . , xn)' is a realization of the random vector X = (X1, . . . , Xn)', and the likelihood function becomes
L_X(θ|x) = ∏_{j=1}^n f_{X_j}(x_j|θ).
When the densities fXj (xj ) are identical, we unambiguously write f (xj ).
Example 2 (Bernoulli Trials). Consider n independent Bernoulli trials. The jth observation is either a “success” or “failure” coded xj = 1 and xj = 0 respectively,
and
P(Xj = xj) = θ^{x_j} (1 − θ)^{1−x_j}
for j = 1, . . . , n. The vector of observations y = (x1 , x2 , . . . , xn )T is a sequence of
ones and zeros, and is a realization of the random vector Y = (X1 , X2 , . . . , Xn )T . As
the Bernoulli outcomes are assumed to be independent we can write the joint probability mass function of Y as the product of the marginal probabilities, that is
l(θ) = ∏_{j=1}^n P(Xj = xj) = ∏_{j=1}^n θ^{x_j}(1 − θ)^{1−x_j} = θ^{Σ x_j} (1 − θ)^{n − Σ x_j} = θ^r (1 − θ)^{n−r}
where r = Σ_{j=1}^n x_j is the number of observed successes (1's) in the vector y. The
log-likelihood function is then
log l(θ) = r log θ + (n − r) log(1 − θ),
and the score function is
S(θ) = ∂/∂θ log l(θ) = r/θ − (n − r)/(1 − θ).
Solving S(θ̂) = 0 we get θ̂ = r/n. We also have
I(θ) = r/θ^2 + (n − r)/(1 − θ)^2 > 0  ∀ θ,
guaranteeing that θ̂ is the MLE. Each Xi is a Bernoulli random variable and has expected value E(Xi ) = θ, and variance Var(Xi ) = θ(1 − θ). The MLE θ̂(y) is itself a
random variable and has expected value
E(θ̂) = E((1/n) Σ_{i=1}^n X_i) = (1/n) Σ_{i=1}^n E(X_i) = (1/n) Σ_{i=1}^n θ = θ.
If an estimator has on average the value of the parameter that it is intended to estimate then we call it unbiased, i.e. if E θ̂ = θ. From the above calculation it follows that θ̂(y) is an unbiased estimator of θ. The variance of θ̂(y) is
Var(θ̂) = Var((1/n) Σ_{i=1}^n X_i) = (1/n^2) Σ_{i=1}^n Var(X_i) = (1/n^2) Σ_{i=1}^n (1 − θ)θ = (1 − θ)θ/n.
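As a quick numerical check of the closed form θ̂ = r/n, the following R sketch maximizes the Bernoulli log-likelihood directly; the simulated data and the chosen success probability are of course only illustrative and not part of the example.
set.seed(1)
n <- 50
x <- rbinom(n, size = 1, prob = 0.3)   # hypothetical Bernoulli data
r <- sum(x)
# log-likelihood r*log(theta) + (n-r)*log(1-theta)
loglik <- function(theta) r * log(theta) + (n - r) * log(1 - theta)
theta.hat <- r / n                                                # closed-form MLE
num.fit <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)
c(closed.form = theta.hat, numerical = num.fit$maximum)           # the two agree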
Example 3 (Binomial sampling). The number of successes in n Bernoulli trials is a
random variable R taking on values r = 0, 1, . . . , n with probability mass function
P(R = r) = C(n, r) θ^r (1 − θ)^{n−r}.
This is the exact same sampling scheme as in the previous example except that instead of observing the sequence y we only observe the total number of successes r. Hence the likelihood function has the form
l_R(θ|r) = C(n, r) θ^r (1 − θ)^{n−r}.
The relevant mathematical calculations are as follows:
log l_R(θ|r) = log C(n, r) + r log(θ) + (n − r) log(1 − θ)
S(θ) = r/θ − (n − r)/(1 − θ)  ⇒  θ̂ = r/n
I(θ) = r/θ^2 + (n − r)/(1 − θ)^2 > 0  ∀ θ
E(θ̂) = E(R)/n = nθ/n = θ  ⇒  θ̂ unbiased
Var(θ̂) = Var(R)/n^2 = nθ(1 − θ)/n^2 = θ(1 − θ)/n.
Example 4 (Prevalence of a Genotype). Geneticists interested in the prevalence of a
certain genotype, observe that the genotype makes its first appearance in the 22nd subject analysed. If we assume that the subjects are independent, the likelihood function
can be computed based on the geometric distribution, as l(θ) = (1 − θ)^{n−1} θ. The score function is then S(θ) = θ^{−1} − (n − 1)(1 − θ)^{−1}. Setting S(θ̂) = 0 we get θ̂ = n^{−1} = 22^{−1}. Moreover I(θ) = θ^{−2} + (n − 1)(1 − θ)^{−2} and is greater than zero for all θ, implying that θ̂ is the MLE.
Suppose that the geneticists had planned to stop sampling once they observed r =
10 subjects with the specified genotype, and the tenth subject with the genotype was
the 100th subject anaylsed overall. The likelihood of θ can be computed based on the
negative binomial distribution, as
l(θ) = C(n − 1, r − 1) θ^r (1 − θ)^{n−r}
for n = 100, r = 10. The usual calculation will confirm that θ̂ = r/n is the MLE.
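A small R sketch (purely illustrative) confirms that the two score equations above give θ̂ = 1/22 for the geometric observation and θ̂ = r/n = 10/100 for the negative binomial stopping rule.
# Geometric case: first appearance of the genotype on subject n = 22
n <- 22
loglik.geo <- function(theta) (n - 1) * log(1 - theta) + log(theta)
optimize(loglik.geo, c(0.001, 0.999), maximum = TRUE)$maximum   # approx 1/22
# Negative binomial case: r = 10 successes, the 10th observed on subject n = 100
r <- 10; n <- 100
loglik.nb <- function(theta) r * log(theta) + (n - r) * log(1 - theta)
optimize(loglik.nb, c(0.001, 0.999), maximum = TRUE)$maximum    # approx 0.1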
Example 5 (Radioactive Decay). In this classic set of data Rutherford and Geiger
counted the number of scintillations in 72 second intervals caused by radioactive decay of a quantity of the element polonium. Altogether there were 10097 scintillations
during 2608 such intervals
Count:     0    1    2    3    4    5    6    7
Observed: 57  203  383  525  532  408  273  139
Count:     8    9   10   11   12   13   14
Observed: 45   27   10    4    1    0    1
The Poisson probability mass function with mean parameter θ is
f_X(x|θ) = θ^x exp(−θ)/x!.
The likelihood function equals
l(θ) = ∏_i θ^{x_i} exp(−θ)/x_i! = θ^{Σ x_i} exp(−nθ) / ∏_i x_i!.
The relevant mathematical calculations are
log l(θ) = (Σ x_i) log(θ) − nθ − log[∏(x_i!)]
S(θ) = (Σ x_i)/θ − n  ⇒  θ̂ = (Σ x_i)/n = x̄
I(θ) = (Σ x_i)/θ^2 > 0,  ∀ θ,
implying θ̂ is the MLE. Also E(θ̂) = (1/n) Σ E(x_i) = (1/n) Σ θ = θ, so θ̂ is an unbiased estimator. Next, Var(θ̂) = (1/n^2) Σ Var(x_i) = θ/n. It is always useful to compare the fitted values from a model against the observed values.
i:     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
O_i:  57  203  383  525  532  408  273  139   45   27   10    4    1    0    1
E_i:  54  211  407  525  508  393  254  140   68   29   11    4    1    0    0
O−E:  +3   −8  −24    0  +24  +15  +19   −1  −23   −2   −1    0   −1   +1   +1
The Poisson law agrees with the observed variation within about one-twentieth of its range.
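The fitted values above can be reproduced in R; the sketch below (with the counts entered by hand from the table) computes θ̂ = x̄ and the expected counts 2608 · f(i|θ̂).
# Rutherford-Geiger data: observed number of intervals for each count 0..14
counts   <- 0:14
observed <- c(57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 1, 0, 1)
n <- sum(observed)                        # 2608 intervals in total
theta.hat <- sum(counts * observed) / n   # MLE = sample mean, about 3.87
expected <- n * dpois(counts, lambda = theta.hat)
round(cbind(count = counts, O = observed, E = expected, diff = observed - expected))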
Example 6 (Exponential distribution). Suppose random variables X1 , . . . , Xn are i.i.d.
as Exp(θ). Then
l(θ) = ∏_{i=1}^n θ exp(−θ x_i) = θ^n exp(−θ Σ x_i)
log l(θ) = n log θ − θ Σ x_i
S(θ) = n/θ − Σ_{i=1}^n x_i  ⇒  θ̂ = n / Σ x_i
I(θ) = n/θ^2 > 0  ∀ θ.
Exercise 15. Demonstrate that the expectation and variance of θ̂ are given as follows:
E[θ̂] = n θ/(n − 1),
Var[θ̂] = n^2 θ^2/((n − 1)^2(n − 2)).
Hint: Find the probability distribution of Z = Σ_{i=1}^n X_i, where X_i ∼ Exp(θ).
Exercise 16. Propose the alternative estimator θ̃ = (n − 1) θ̂ / n. Show that θ̃ is an unbiased estimator of θ with the variance
Var[θ̃] = θ^2/(n − 2).
As this example demonstrates, maximum likelihood estimation does not automatically produce unbiased estimates. If it is thought that this property is (in some sense)
desirable, then some adjustments to the MLEs, usually in the form of scaling, may be
required.
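A short simulation (with purely illustrative values of θ and n) makes the bias of Exercise 15 visible: for θ = 2 and n = 5 the average of θ̂ = n/Σx_i over many samples is close to nθ/(n − 1) = 2.5, while the rescaled estimator θ̃ = (n − 1)θ̂/n averages close to θ.
set.seed(2)
theta <- 2; n <- 5; B <- 100000
theta.hat   <- replicate(B, { x <- rexp(n, rate = theta); n / sum(x) })
theta.tilde <- (n - 1) / n * theta.hat
c(mean.hat   = mean(theta.hat),     # close to n*theta/(n-1) = 2.5
  mean.tilde = mean(theta.tilde),   # close to theta = 2
  var.tilde  = var(theta.tilde))    # close to theta^2/(n-2) = 4/3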
Example 7 (Gaussian Distribution). Consider data X1 , X2 . . . , Xn distributed as N(µ, υ).
Then the likelihood function is
l(µ, υ) = (1/√(2πυ))^n exp( − Σ_{i=1}^n (x_i − µ)^2 / (2υ) )
and the log-likelihood function is
log l(µ, υ) = −(n/2) log(2π) − (n/2) log(υ) − (1/(2υ)) Σ_{i=1}^n (x_i − µ)^2   (3.2)
Unknown mean and known variance As υ is known we treat this parameter as a constant when differentiating wrt µ. Then
S(µ) = (1/υ) Σ_{i=1}^n (x_i − µ),   µ̂ = (1/n) Σ_{i=1}^n x_i,
and
I(µ) = n/υ > 0  ∀ µ.
Also, E[µ̂] = nµ/n = µ, and so the MLE of µ is unbiased. Finally
Var[µ̂] = (1/n^2) Var[Σ_{i=1}^n x_i] = υ/n = (E[I(µ)])^{−1}.
Known mean and unknown variance Differentiating (3.2) wrt υ returns
S(υ) = −n/(2υ) + (1/(2υ^2)) Σ_{i=1}^n (x_i − µ)^2,
and setting S(υ) = 0 implies
υ̂ = (1/n) Σ_{i=1}^n (x_i − µ)^2.
Differentiating again, and multiplying by −1, yields
I(υ) = −n/(2υ^2) + (1/υ^3) Σ_{i=1}^n (x_i − µ)^2.
Clearly υ̂ is the MLE since
I(υ̂) = n/(2υ̂^2) > 0.
Define
Z_i = (X_i − µ)/√υ,
so that Z_i ∼ N(0, 1). From the appendix on probability
Σ_{i=1}^n Z_i^2 ∼ χ²_n,
implying E[Σ Z_i^2] = n and Var[Σ Z_i^2] = 2n. The MLE is
υ̂ = (υ/n) Σ_{i=1}^n Z_i^2.
Then
E[υ̂] = E[(υ/n) Σ_{i=1}^n Z_i^2] = υ,
and
Var[υ̂] = (υ/n)^2 Var[Σ_{i=1}^n Z_i^2] = 2υ^2/n.
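A simulation sketch (with arbitrarily chosen µ, υ and n) checks the last two results: with µ known, the MLE υ̂ = n^{-1} Σ(x_i − µ)^2 has mean approximately υ and variance approximately 2υ²/n.
set.seed(3)
mu <- 1; v <- 4; n <- 20; B <- 50000
v.hat <- replicate(B, { x <- rnorm(n, mean = mu, sd = sqrt(v)); mean((x - mu)^2) })
c(mean = mean(v.hat), theory.mean = v,
  var  = var(v.hat),  theory.var  = 2 * v^2 / n)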
Our treatment of the two parameters of the Gaussian distribution in the last example
was to (i) fix the variance and estimate the mean using maximum likelihood; and then
(ii) fix the mean and estimate the variance using maximum likelihood. In practice we
would like to consider the simultaneous estimation of these parameters. In the next
section of these notes we extend MLE to multiple parameter estimation.
3.2 Multi-parameter Estimation
Suppose that a statistical model specifies that the data y has a probability distribution f(y; α, β) depending on two unknown parameters α and β. In this case the likelihood function is a function of the two variables α and β and, having observed the value y, is defined as l(α, β) = f(y; α, β), with log-likelihood log l(α, β). The MLE of (α, β) is a value (α̂, β̂) for which l(α, β), or equivalently log l(α, β), attains its maximum value.
Define S1 (α, β) = ∂ log l/∂α and S2 (α, β) = ∂ log l/∂β. The MLEs (α̂, β̂) can
be obtained by solving the pair of simultaneous equations
S1(α, β) = 0,  S2(α, β) = 0.
Let us consider the matrix I(α, β),
I(α, β) = [ I11(α, β)  I12(α, β) ; I21(α, β)  I22(α, β) ] = − [ ∂^2 log l/∂α^2   ∂^2 log l/∂α∂β ; ∂^2 log l/∂β∂α   ∂^2 log l/∂β^2 ].
The conditions for a value (α0 , β0 ) satisfying S1 (α0 , β0 ) = 0 and S2 (α0 , β0 ) = 0
to be a MLE are that
I11(α0, β0) > 0,  I22(α0, β0) > 0,
and
det(I(α0, β0)) = I11(α0, β0) I22(α0, β0) − I12(α0, β0)^2 > 0.
This is equivalent to requiring that both eigenvalues of the matrix I(α0 , β0 ) be positive.
Example 8 (Gaussian distribution). Let X1 , X2 . . . , Xn be iid observations from a
N (µ, σ 2 ) density in which both µ and σ 2 are unknown. The log likelihood is
log l(µ, σ^2) = Σ_{i=1}^n log( (1/√(2πσ^2)) exp[−(x_i − µ)^2/(2σ^2)] )
= Σ_{i=1}^n ( −(1/2) log[2π] − (1/2) log[σ^2] − (x_i − µ)^2/(2σ^2) )
= −(n/2) log[2π] − (n/2) log[σ^2] − (1/(2σ^2)) Σ_{i=1}^n (x_i − µ)^2.
Hence, writing v = σ^2,
S1(µ, v) = ∂ log l/∂µ = (1/v) Σ_{i=1}^n (x_i − µ) = 0
which implies that
µ̂ = (1/n) Σ_{i=1}^n x_i = x̄.   (3.3)
Also
S2(µ, v) = ∂ log l/∂v = −n/(2v) + (1/(2v^2)) Σ_{i=1}^n (x_i − µ)^2 = 0
implies that
σ̂^2 = v̂ = (1/n) Σ_{i=1}^n (x_i − µ̂)^2 = (1/n) Σ_{i=1}^n (x_i − x̄)^2.
Calculating second derivatives and multiplying by −1 gives that I(µ, v) equals
I(µ, v) = [ n/v   (1/v^2) Σ_{i=1}^n (x_i − µ) ; (1/v^2) Σ_{i=1}^n (x_i − µ)   −n/(2v^2) + (1/v^3) Σ_{i=1}^n (x_i − µ)^2 ].
Hence I(µ̂, v̂) is given by
I(µ̂, v̂) = [ n/v̂   0 ; 0   n/(2v̂^2) ].   (3.4)
Clearly both diagonal terms are positive and the determinant is positive and so (µ̂, v̂) are, indeed, the MLEs of (µ, v).
Go back to equation (3.3): X̄ ∼ N(µ, v/n), so E(X̄) = µ (unbiased) and Var(X̄) = v/n. Now go back to equation (3.4). From Lemma 1, which is proven below, we have
n v̂ / v ∼ χ²_{n−1}
so that
E(n v̂/v) = n − 1  ⇒  E(v̂) = ((n − 1)/n) v.
Instead, propose the (unbiased) estimator of σ^2
S^2 = ṽ = (n/(n − 1)) v̂ = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)^2.   (3.5)
Observe that
E(ṽ) = (n/(n − 1)) E(v̂) = (n/(n − 1)) ((n − 1)/n) v = v
and ṽ is unbiased as suggested. We can also show that
Var(ṽ) = 2v^2/(n − 1).
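In R the two MLEs and the bias-corrected S² can be written directly; the sketch below (with made-up data) mirrors equations (3.3)-(3.5).
x <- c(14.2, 15.8, 13.9, 16.4, 15.1, 14.7)   # hypothetical sample
n <- length(x)
mu.hat <- mean(x)                 # MLE of mu, equation (3.3)
v.hat  <- mean((x - mu.hat)^2)    # MLE of sigma^2
S2     <- n / (n - 1) * v.hat     # unbiased estimator, equation (3.5)
c(mu.hat = mu.hat, v.hat = v.hat, S2 = S2, var.builtin = var(x))  # var() returns S2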
Lemma 1 (Joint distribution of the sample mean and sample variance). If X1 , . . . , Xn
are iid N (µ, v) then the sample mean X̄ and sample variance S 2 are independent.
Also X̄ is distributed N (µ, v/n) and (n − 1)S 2 /v is a chi-squared random variable
with n − 1 degrees of freedom.
Proof. Define
W = Σ_{i=1}^n (X_i − X̄)^2 = Σ_{i=1}^n (X_i − µ)^2 − n(X̄ − µ)^2
⇒  W/v + (X̄ − µ)^2/(v/n) = Σ_{i=1}^n (X_i − µ)^2 / v.
The RHS is the sum of n independent standard normal random variables squared, and so is distributed χ²_n. Also, X̄ ∼ N(µ, v/n), therefore (X̄ − µ)^2/(v/n) is the square of a standard normal and so is distributed χ²_1. These chi-squared random variables have moment generating functions (1 − 2t)^{−n/2} and (1 − 2t)^{−1/2} respectively. Next, W/v and (X̄ − µ)^2/(v/n) are independent, since
Cov(X_i − X̄, X̄) = Cov(X_i, X̄) − Cov(X̄, X̄)
= Cov(X_i, (1/n) Σ_j X_j) − Var(X̄)
= (1/n) Σ_j Cov(X_i, X_j) − v/n
= v/n − v/n = 0.
But Cov(X_i − X̄, X̄ − µ) = Cov(X_i − X̄, X̄) = 0, hence
Cov( Σ_i (X_i − X̄), X̄ − µ ) = Σ_i Cov(X_i − X̄, X̄ − µ) = 0.
As the moment generating function of the sum of independent random variables is equal to the product of their individual moment generating functions, we see
E[e^{t(W/v)}] (1 − 2t)^{−1/2} = (1 − 2t)^{−n/2}
⇒  E[e^{t(W/v)}] = (1 − 2t)^{−(n−1)/2}.
But (1 − 2t)^{−(n−1)/2} is the moment generating function of a χ² random variable with (n − 1) degrees of freedom, and the moment generating function uniquely characterizes the distribution of W/v.
Suppose that a statistical model specifies that the data x has a probability distribution f(x; θ) depending on a vector of m unknown parameters θ = (θ1, . . . , θm). In this case the likelihood function is a function of the m parameters θ1, . . . , θm and, having observed the value of x, is defined as l(θ) = f(x; θ), with log-likelihood log l(θ).
The MLE of θ is a value θ̂ for which l(θ), or equivalently log l(θ), attains its
maximum value. For r = 1, . . . , m define Sr (θ) = ∂ log l/∂θr . Then we can (usually)
find the MLE θ̂ by solving the set of m simultaneous equations Sr(θ) = 0 for r = 1, . . . , m. The matrix I(θ) is defined to be the m × m matrix whose (r, s) element is
given by Irs where Irs = −∂ 2 log l/∂θr ∂θs . The conditions for a value θ̂ satisfying
Sr (θ̂) = 0 for r = 1, . . . , m to be a MLE are that all the eigenvalues of the matrix I(θ̂)
are positive.
3.3 The Invariance Principle
How do we deal with parameter transformation? We will assume a one-to-one transformation, but the idea applies more generally. Consider a binomial sample with n = 10
independent trials resulting in data x = 8 successes. The likelihood ratio of θ1 = 0.8 versus θ2 = 0.3 is
l(θ1 = 0.8)/l(θ2 = 0.3) = θ1^8 (1 − θ1)^2 / (θ2^8 (1 − θ2)^2) = 208.7,
that is, given the data θ = 0.8 is about 200 times more likely than θ = 0.3.
Suppose we are interested in expressing θ on the logit scale as
ψ ≡ log{θ/(1 − θ)} ,
then 'intuitively' our relative information about ψ1 = log(0.8/0.2) = 1.39 versus ψ2 = log(0.3/0.7) = −0.85 should be
L*(ψ1)/L*(ψ2) = l(θ1)/l(θ2) = 208.7.
That is, our information should be invariant to the choice of parameterization. ( For
the purposes of this example we are not too concerned about how to calculate L∗ (ψ). )
Theorem 3.3.1 (Invariance of the MLE). If g is a one-to-one function, and θ̂ is the
MLE of θ then g(θ̂) is the MLE of g(θ).
Proof. This is trivially true: if we let θ = g^{−1}(µ) then f{y|g^{−1}(µ)} is maximized in µ exactly when µ = g(θ̂). When g is not one-to-one the discussion becomes more subtle, but we simply choose to define ĝ_MLE(θ) = g(θ̂).
It seems intuitive that if θ̂ is most likely for θ and our knowledge (data) remains
unchanged then g(θ̂) is most likely for g(θ). In fact, we would find it strange if θ̂ is an estimate of θ, but θ̂^2 is not an estimate of θ^2. In the binomial example with n = 10 and x = 8 we get θ̂ = 0.8, so the MLE of g(θ) = θ/(1 − θ) is
g(θ̂) = θ̂/(1 − θ̂) = 0.8/0.2 = 4.
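In R the invariance principle is just function application to the MLE; the short sketch below (using n = 10, x = 8 as in the example) reports θ̂, the odds g(θ̂) and the logit ψ̂.
n <- 10; x <- 8
theta.hat <- x / n                        # MLE of theta
odds.hat  <- theta.hat / (1 - theta.hat)  # MLE of theta/(1-theta) by invariance
psi.hat   <- log(odds.hat)                # MLE of the logit
c(theta = theta.hat, odds = odds.hat, logit = psi.hat)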
Chapter 4
Estimation
In the previous chapter we have seen an approach to estimation that is based on the
likelihood of observed results. Next we study general theory of estimation that is used
to compare between different estimators and to decide on the most efficient one.
4.1 General properties of estimators
Suppose that we are going to observe a value of a random vector X. Let X denote the
set of possible values X can take and, for x ∈ X , let f (x|θ) denote the probability that
X takes the value x where the parameter θ is some unknown element of the set Θ.
The problem we face is that of estimating θ. An estimator θ̂ is a procedure which
for each possible value x ∈ X specifies which element of Θ we should quote as an
estimate of θ. When we observe X = x we quote θ̂(x) as our estimate of θ. Thus θ̂ is
a function of the random vector X. Sometimes we write θ̂(X) to emphasise this point.
Given any estimator θ̂ we can calculate its expected value for each possible value
of θ ∈ Θ. As we have already mentioned when discussing the maximum likelihood
estimation, an estimator is said to be unbiased if this expected value is identically equal
to θ. If an estimator is unbiased then we can conclude that if we repeat the experiment
an infinite number of times with θ fixed and calculate the value of the estimator each
time then the average of the estimator values will be exactly equal to θ. To evaluate
the usefulness of an estimator θ̂ = θ̂(x) of θ, examine the properties of the random
variable θ̂ = θ̂(X).
Definition 1 (Unbiased estimators). An estimator θ̂ = θ̂(X) is said to be unbiased for
a parameter θ if it equals θ in expectation
E[θ̂(X)] = E(θ̂) = θ.
Intuitively, an unbiased estimator is ‘right on target’.
Definition 2 (Bias of an estimator). The bias of an estimator θ̂ = θ̂(X) of θ is defined as bias(θ̂) = E[θ̂(X) − θ].
Note that even if θ̂ is an unbiased estimator of θ, g(θ̂) will generally not be an
unbiased estimator of g(θ) unless g is linear or affine. This limits the importance of the
notion of unbiasedness. It might be at least as important that an estimator is accurate
in the sense that its distribution is highly concentrated around θ.
Exercise 17. Show that for an arbitrary distribution the estimator S 2 as defined in (3.5)
is an unbiased estimator of the variance of this distribution.
Exercise 18. Consider the estimator S 2 of variance σ 2 in the case of the normal distribution. Demonstrate that although S 2 is an unbiased estimator of σ 2 , S is not an
unbiased estimator of σ. Compute its bias.
Definition 3 (Mean squared error). The mean squared error of the estimator θ̂ is defined as MSE(θ̂) = E(θ̂ − θ)2 . Given the same set of data, θ̂1 is “better” than θ̂2 if
M SE(θ̂1 ) ≤ MSE(θ̂2 ) (uniformly better if true ∀ θ).
Lemma 2 (The MSE variance-bias tradeoff). The MSE decomposes as
MSE(θ̂) = Var(θ̂) + bias(θ̂)^2.
Proof. We have
MSE(θ̂) = E(θ̂ − θ)^2
= E{ [θ̂ − E(θ̂)] + [E(θ̂) − θ] }^2
= E[θ̂ − E(θ̂)]^2 + [E(θ̂) − θ]^2 + 2 E{ [θ̂ − E(θ̂)][E(θ̂) − θ] },
where the cross term vanishes because E[θ̂ − E(θ̂)] = 0, so that
MSE(θ̂) = Var(θ̂) + [E(θ̂) − θ]^2 = Var(θ̂) + bias(θ̂)^2.
NOTE This lemma implies that the mean squared error of an unbiased estimator is
equal to the variance of the estimator.
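The decomposition can also be verified numerically; the sketch below (an arbitrary, deliberately biased estimator of a normal mean, chosen only for illustration) checks that the simulated MSE matches Var + bias².
set.seed(4)
theta <- 5; n <- 10; B <- 100000
est <- replicate(B, mean(rnorm(n, mean = theta, sd = 1)) + 0.3)  # bias of 0.3 added on purpose
mse    <- mean((est - theta)^2)
decomp <- var(est) + (mean(est) - theta)^2
c(mse = mse, var.plus.bias2 = decomp)   # the two agree up to Monte Carlo error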
Exercise 19. Consider X1, . . . , Xn where Xi ∼ N(θ, σ^2) and σ is known. Three estimators of θ are θ̂1 = X̄ = (1/n) Σ_{i=1}^n X_i, θ̂2 = X1, and θ̂3 = (X1 + X̄)/2. Discuss their properties; which one would you recommend and why?
Example 9. Consider X1 , . . . , Xn to be independent random variables with means
E(Xi ) = µ and variances Var(Xi ) = σi2 . Consider pooling the estimators of µ into a
common estimator using the linear combination µ̂ = w1 X1 + w2 X2 + · · · + wn Xn .
We will see that the following is true:
(i) The estimator µ̂ is unbiased if and only if Σ w_i = 1.
(ii) The estimator µ̂ has minimum variance among this class of estimators when the weights are inversely proportional to the variances σ_i^2.
(iii) The variance of µ̂ for optimal weights w_i is Var(µ̂) = 1/Σ_i σ_i^{−2}.
Indeed, we have E(µ̂) = E(w1 X1 + · · · + wn Xn) = Σ_i w_i E(X_i) = Σ_i w_i µ = µ Σ_i w_i, so µ̂ is unbiased if and only if Σ_i w_i = 1. The variance of our estimator is Var(µ̂) = Σ_i w_i^2 σ_i^2, which should be minimized subject to the constraint Σ_i w_i = 1. Differentiating the Lagrangian L = Σ_i w_i^2 σ_i^2 − λ(Σ_i w_i − 1) with respect to w_i and setting equal to zero yields 2 w_i σ_i^2 = λ  ⇒  w_i ∝ σ_i^{−2}, so that w_i = σ_i^{−2}/(Σ_j σ_j^{−2}). Then, for the optimal weights, we get Var(µ̂) = Σ_i w_i^2 σ_i^2 = (Σ_i σ_i^{−4} σ_i^2)/(Σ_i σ_i^{−2})^2 = 1/(Σ_i σ_i^{−2}).
Assume now that instead of Xi we observe the biased variable X̂i = Xi + β for some β ≠ 0. When σ_i^2 = σ^2 we have that Var(µ̂) = σ^2/n, which tends to zero as n → ∞, whereas bias(µ̂) = β and MSE(µ̂) = σ^2/n + β^2. Thus in general, when bias is present it tends to dominate the variance as n gets larger, which is very unfortunate.
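A small R sketch (with invented standard deviations, purely for illustration) computes the optimal inverse-variance weights and the resulting variance 1/Σ σ_i^{−2}.
sigma <- c(1, 2, 4)                    # hypothetical standard deviations of X1, X2, X3
w <- (1 / sigma^2) / sum(1 / sigma^2)  # optimal weights, summing to one
var.pooled <- 1 / sum(1 / sigma^2)     # variance of the optimally weighted mean
list(weights = w, var.pooled = var.pooled, check = sum(w^2 * sigma^2))  # check equals var.pooled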
Exercise 20. Let X1 , . . . , Xn be an independent sample of size n from the uniform
distribution on the interval (0, θ), with density for a single observation being f(x|θ) = 1/θ for 0 < x < θ and 0 otherwise, and consider θ > 0 unknown.
(i) Find the expected value and variance of the estimator θ̂ = 2X̄.
(ii) Find the expected value of the estimator θ̃ = X(n) , i.e. the largest observation.
(iii) Find an unbiased estimator of the form θ̌ = cX(n) and calculate its variance.
(iv) Compare the mean square error of θ̂ and θ̌.
4.2 Minimum-Variance Unbiased Estimation
Getting a small MSE often involves a tradeoff between variance and bias. For unbiased
estimators, the MSE obviously equals the variance, MSE(θ̂) = Var(θ̂), so no tradeoff
can be made. One approach is to restrict ourselves to the subclass of estimators that are
unbiased and minimum variance.
Definition 4 (Minimum-variance unbiased estimator). If an unbiased estimator of g(θ) has minimum variance among all unbiased estimators of g(θ) it is called a minimum variance unbiased estimator (MVUE).
We will develop a method of finding the MVUE when it exists. When such an
estimator does not exist we will be able to find a lower bound for the variance of an
unbiased estimator in the class of unbiased estimators, and compare the variance of our
unbiased estimator with this lower bound.
Definition 5 (Score function). For the (possibly vector valued) observation X = x to
be informative about θ, the density must vary with θ. If f (x|θ) is smooth and differentiable, then for finding MLE we have used the score function
S(θ) = S(θ|x) = ∂/∂θ log f(x|θ) ≡ (∂f(x|θ)/∂θ) / f(x|θ).
Under suitable regularity conditions (differentiation wrt θ and integration wrt x can
be interchanged), we have for X distributed according to f (x|θ):
E{S(θ|X)} = ∫ ( (∂f(x|θ)/∂θ) / f(x|θ) ) f(x|θ) dx = ∫ ∂f(x|θ)/∂θ dx = ∂/∂θ ∫ f(x|θ) dx = ∂/∂θ 1 = 0.
Thus the score function has expectation zero. The score function S(θ|x) is a random
variable if for x we substitute X – a random variable with f (x|θ) distribution. In this
case we often drop explicit dependence on X from the notation by simply writing S(θ).
The negative derivative of the score function measures how concave the log-likelihood is around the value θ.
Definition 6 (Fisher information). The Fisher information is defined as the average value of the negative derivative of the score function,
I(θ) ≡ −E[ ∂S(θ)/∂θ ].
The negative derivative of the score function I(θ), which is a random variable dependent on X, is sometimes referred to as empirical or observed information about θ.
Lemma 3. The variance of S(θ) is equal to the Fisher information about θ:
I(θ) = E{S(θ)^2} ≡ E{ (∂/∂θ log f(X|θ))^2 }.
Proof. Using the chain rule,
∂^2/∂θ^2 log f = ∂/∂θ ( (1/f) ∂f/∂θ ) = −(1/f^2)(∂f/∂θ)^2 + (1/f) ∂^2f/∂θ^2 = −(∂ log f/∂θ)^2 + (1/f) ∂^2f/∂θ^2.
If integration and differentiation can be interchanged,
E[(1/f) ∂^2f/∂θ^2] = ∫_X (∂^2f/∂θ^2) dx = ∂^2/∂θ^2 ∫_X f dx = ∂^2/∂θ^2 1 = 0,
thus
−E[ ∂^2/∂θ^2 log f(X|θ) ] = E[ (∂/∂θ log f(X|θ))^2 ] = I(θ).   (4.1)
Theorem 4.2.1 (Cramér-Rao lower bound). Let θ̂ be an unbiased estimator of θ. Then
Var(θ̂) ≥ {I(θ)}^{−1}.
Proof. Unbiasedness, E(θ̂) = θ, implies
∫ θ̂(x) f(x|θ) dx = θ.
Assume we can differentiate wrt θ under the integral; then
∫ ∂/∂θ { θ̂(x) f(x|θ) } dx = 1.
The estimator θ̂(x) cannot depend on θ, so
∫ θ̂(x) ∂f(x|θ)/∂θ dx = 1.
Since
∂f/∂θ = f ∂(log f)/∂θ,
we now have
∫ θ̂(x) f ∂(log f)/∂θ dx = 1.
Thus
E[ θ̂(X) ∂(log f)/∂θ ] = 1.
Define the random variables
U = θ̂(X)  and  S = ∂(log f)/∂θ.
Then E(US) = 1. We already know that the score function has expectation zero, E(S) = 0. Consequently Cov(U, S) = E(US) − E(U)E(S) = E(US) = 1. By the well-known property of correlations (which follows from the Schwarz inequality) we have
{Corr(U, S)}^2 = {Cov(U, S)}^2 / (Var(U) Var(S)) ≤ 1.
Since, as we mentioned, Cov(U, S) = 1, we get
Var(U) Var(S) ≥ 1.
This implies
Var(θ̂) ≥ 1/I(θ),
which is our main result. We call {I(θ)}^{−1} the Cramér-Rao lower bound (CRLB).
Why information? Variance measures lack of knowledge, so it is reasonable that the reciprocal of the variance should be regarded as the amount of information carried by the (possibly vector valued) random observation X about θ.
Sufficient conditions for the proof of the CRLB are that all the integrands are finite within the range of x. We also require that the limits of the integrals do not depend on θ; that is, the range of x over which f(x|θ) is positive cannot depend on θ. This second condition is violated for many density functions, e.g. the CRLB is not valid for the uniform distribution. We can make an absolute assessment of unbiased estimators by comparing their variances to the CRLB. We can also assess biased estimators: if the variance of a biased estimator is lower than the CRLB then it can indeed be a very good estimator, although it is biased.
Example 10. Consider IID random variables X_i, i = 1, . . . , n, with
f_{X_i}(x_i|µ) = (1/µ) exp(−x_i/µ).
Denote the joint distribution of X1, . . . , Xn by
f = ∏_{i=1}^n f_{X_i}(x_i|µ) = (1/µ)^n exp( −(1/µ) Σ_{i=1}^n x_i ),
so that
log f = −n log(µ) − (1/µ) Σ_{i=1}^n x_i.
The score function is the partial derivative of log f wrt the unknown parameter µ,
S(µ) = ∂/∂µ log f = −n/µ + (1/µ^2) Σ_{i=1}^n x_i,
and
E{S(µ)} = E{ −n/µ + (1/µ^2) Σ_{i=1}^n X_i } = −n/µ + (1/µ^2) E{ Σ_{i=1}^n X_i }.
For X ∼ Exp(1/µ), we have E(X) = µ, implying E(X1 + · · · + Xn) = E(X1) + · · · + E(Xn) = nµ and E{S(µ)} = 0 as required. Next,
I(µ) = −E{ ∂/∂µ ( −n/µ + (1/µ^2) Σ_{i=1}^n X_i ) } = −E{ n/µ^2 − (2/µ^3) Σ_{i=1}^n X_i } = −n/µ^2 + 2nµ/µ^3 = n/µ^2.
Hence
CRLB = µ^2/n.
Let us propose µ̂ = X̄ as an estimator of µ. Then
E(µ̂) = E( (1/n) Σ_{i=1}^n X_i ) = (1/n) E( Σ_{i=1}^n X_i ) = µ,
verifying that µ̂ = X̄ is indeed an unbiased estimator of µ. For X ∼ Exp(1/µ), we have E(X) = µ = √Var(X), implying
Var(µ̂) = (1/n^2) Σ_{i=1}^n Var(X_i) = nµ^2/n^2 = µ^2/n.
We have thus shown that Var(µ̂) = {I(µ)}^{−1}, and therefore conclude that the unbiased estimator µ̂ = x̄ achieves its CRLB.
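A simulation sketch (with an arbitrarily chosen µ and n, for illustration only) shows Var(X̄) sitting right at the CRLB µ²/n for this model.
set.seed(5)
mu <- 3; n <- 25; B <- 100000
mu.hat <- replicate(B, mean(rexp(n, rate = 1 / mu)))   # X-bar for Exp(mean = mu) samples
c(var.sim = var(mu.hat), crlb = mu^2 / n)              # both approximately 0.36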
Definition 7 (Efficiency). Define the efficiency of the unbiased estimator θ̂ as
eff(θ̂) = CRLB / Var(θ̂),
where CRLB = {I(θ)}^{−1}. Clearly 0 < eff(θ̂) ≤ 1. An unbiased estimator θ̂ is said to be efficient if eff(θ̂) = 1.
Exercise 21. Consider the MLE θ̂ = r/n for the binomial distribution that was considered in Example 3. Show that for this estimator efficiency is 100%, i.e. its variance
attains CRLB.
Exercise 22. Consider the MLE for the Poisson distribution that was considered in
Example 5. Show that also in this case the MLE is 100% efficient.
Definition 8 (Asymptotic efficiency ). The asymptotic efficiency of an unbiased estimator θ̂ is the limit of the efficiency as n → ∞. An unbiased estimator θ̂ is said to be
asymptotically efficient if its asymptotic efficiency is equal to 1.
2
Exercise 23. Consider the MLE θ̂ for the exponential distribution with parameter θ that
was considered in Exercise 16. Find its variance, and its mean square error. Consider
also θ̃ that was considered in this example. Which of the two has smaller variance and
which has smaller mean square error? Is θ̃ asymptotically efficient?
Exercise 24. Discuss efficiency of the estimator of variance in the normal distribution
in the case when the mean is known (see Example 7).
4.3 Optimality Properties of the MLE
Suppose that an experiment consists of measuring random variables X1 , X2 , . . . , Xn
which are iid with probability distribution depending on a parameter θ. Let θ̂ be the
MLE of θ. Define
W1 = √(I(θ)) (θ̂ − θ),  W2 = √(I(θ)) (θ̂ − θ),  W3 = √(I(θ̂)) (θ̂ − θ),  W4 = √(I(θ̂)) (θ̂ − θ),
where the information used is either the expected (Fisher) information I(θ) or the observed information I(θ), evaluated either at θ or at the MLE θ̂ (this is what distinguishes the four quantities).
Then, W1 , W2 , W3 , and W4 are all random variables and, as n → ∞, the probabilistic
behaviour of each of W1 , W2 , W3 , and W4 is well approximated by that of a N (0, 1)
random variable.
Since E[W1] ≈ 0, we have that E[θ̂] ≈ θ and so θ̂ is approximately unbiased. Also, Var[W1] ≈ 1 implies that Var[θ̂] ≈ (I(θ))^{−1} and so θ̂ is asymptotically efficient. The above properties of the MLE carry over to the multivariate case. Here is a brief account of these properties.
Let the data X have probability distribution g(X; θ) where θ = (θ1, θ2, . . . , θm) is a vector of m unknown parameters. Let I(θ) be the m × m observed information matrix and let I(θ) be the m × m Fisher information matrix obtained by replacing the elements of I(θ) by their expected values. Let θ̂ be the MLE of θ. Let CRLB_r be the rth diagonal element of the inverse of the Fisher information matrix. For r = 1, 2, . . . , m, define W1r = (θ̂_r − θ_r)/√(CRLB_r). Then, as n → ∞, W1r behaves like a standard normal random variable.
Suppose we define W2r, W3r and W4r by replacing CRLB_r by the rth diagonal element of the matrix I(θ)^{−1}, I(θ̂)^{−1} or I(θ̂)^{−1} respectively, i.e. by using the observed or the expected information evaluated at θ or at θ̂. Then it can be shown that as n → ∞, W2r, W3r, and W4r all behave like standard normal random variables.
Chapter 5
The Theory of Confidence
Intervals
5.1 Exact Confidence Intervals
Suppose that we are going to observe the value of a random vector X. Let X denote the
set of possible values that X can take and, for x ∈ X , let g(x|θ) denote the probability
that X takes the value x where the parameter θ is some unknown element of the set Θ.
Consider the problem of quoting a subset of θ values which are in some sense plausible
in the light of the data x. We need a procedure which for each possible value x ∈ X
specifies a subset C(x) of Θ which we should quote as a set of plausible values for θ.
Definition 9. Let X1, . . . , Xn be a sample from a distribution that is parameterized by some parameter θ. A random set C(X1, . . . , Xn) of possible values for θ that is computable from the sample is called a confidence region at confidence level 1 − α if
P(θ ∈ C(X1, . . . , Xn)) = 1 − α.
If the set C(X1 , . . . , Xn ) has the form of an interval, then we call it a confidence
interval.
Example 11. Suppose we are going to observe data x where x = (x1 , x2 , . . . , xn ), and
x1 , x2 , . . . , xn are the observed values of random variables X1 , X2 , . . . , Xn which are
thought to be iid N (θ, 1) for some unknown parameter θ ∈ (−∞, ∞) = Θ. Consider
√
√
the subset C(x) = [x̄ − 1.96/ n, x̄ + 1.96/ n]. If we carry out an infinite sequence
of independent repetitions of the experiment then we will get an infinite sequence of x
values and thereby an infinite sequence of subsets C(x). We might ask what proportion
of this infinite sequence of subsets actually contain the fixed but unknown value of θ?
Since C(x) depends on x only through the value of x̄ we need to know how x̄
behaves in the infinite sequence of repetitions. This follows from the fact that X̄ has a
N(θ, 1/n) density and so Z = (X̄ − θ)/(1/√n) = √n(X̄ − θ) has a N(0, 1) density. Thus even though θ is unknown we can calculate the probability that the value of Z will exceed
2.78, for example, using the standard normal tables. Remember that the probability is
the proportion of experiments in the infinite sequence of repetitions which produce a
value of Z greater than 2.78.
In particular we have that P [|Z| ≤ 1.96] = 0.95. Thus 95% of the time Z will lie
between −1.96 and +1.96. But
−1.96 ≤ Z ≤ +1.96
⇒ −1.96 ≤ √n(X̄ − θ) ≤ +1.96
⇒ −1.96/√n ≤ X̄ − θ ≤ +1.96/√n
⇒ X̄ − 1.96/√n ≤ θ ≤ X̄ + 1.96/√n
⇒ θ ∈ C(X)
Thus we have answered the question we started with. The proportion of the infinite sequence of subsets given by the formula C(X) which will actually include the fixed but
unknown value of θ is 0.95. For this reason the set C(X) is called a 95% confidence
set or confidence interval for the parameter θ.
2
It is well to bear in mind that once we have actually carried out the experiment and
observed our value of x, the resulting interval C(x) either does or does not contain
the unknown parameter θ. We do not know which is the case. All we know is that the
procedure we used in constructing C(x) is one which 95% of the time produces an
interval which contains the unknown parameter.
Figure 5.1: One hundred confidence intervals for the mean of a normal variable with
“unknown” mean and variance for sample size of ten. In fact the samples have been
drawn from the normal distribution with the mean 15 and standard deviation 6.
The crucial step in the last example was finding the quantity Z = √n(X̄ − θ) whose value depended on the parameter of interest θ but whose distribution was known to be that of a standard normal variable. This leads to the following definition.
Definition 10 (Pivotal Quantity). A pivotal quantity for a parameter θ is a random
variable Q(X|θ) whose value depends both on (the data) X and on the value of the
unknown parameter θ but whose distribution is known.
2
The quantity Z in the example above is a pivotal quantity for θ. The following
lemma provides a method of finding pivotal quantities in general.
Lemma 4. Let X be a random variable and define F(a) = P[X ≤ a]. Consider the random variable U = −2 log[F(X)]. Then U has a χ²_2 density. Consider the random variable V = −2 log[1 − F(X)]. Then V has a χ²_2 density.
Proof. Observe that, for a ≥ 0,
P[U ≤ a] = P[F(X) ≥ exp(−a/2)]
= 1 − P[F(X) ≤ exp(−a/2)]
= 1 − P[X ≤ F^{−1}(exp(−a/2))]
= 1 − F[F^{−1}(exp(−a/2))]
= 1 − exp(−a/2).
Hence U has density (1/2) exp(−a/2), which is the density of a χ²_2 variable as required. The corresponding proof for V is left as an exercise.
This lemma has an immediate, and very important, application.
Suppose that we have data X1, X2, . . . , Xn which are iid with density f(x|θ). Define F(a|θ) = ∫_{−∞}^a f(x|θ) dx and, for i = 1, 2, . . . , n, define U_i = −2 log[F(X_i|θ)]. Then U1, U2, . . . , Un are iid, each having a χ²_2 density. Hence Q1(X, θ) = Σ_{i=1}^n U_i has a χ²_{2n} density and so is a pivotal quantity for θ. Another pivotal quantity (also having a χ²_{2n} density) is given by Q2(X, θ) = Σ_{i=1}^n V_i where V_i = −2 log[1 − F(X_i|θ)].
Example 12. Suppose that we have data X1 , X2 , . . . , Xn which are iid with density
f (x|θ) = θ exp (−θx)
for x ≥ 0 and suppose that we want to construct a 95% confidence interval for θ. We
need to find a pivotal quantity for θ. Observe that
F(a|θ) = ∫_{−∞}^a f(x|θ) dx = ∫_0^a θ exp(−θx) dx = 1 − exp(−θa).
Hence
Q1(X, θ) = −2 Σ_{i=1}^n log[1 − exp(−θX_i)]
is a pivotal quantity for θ having a χ²_{2n} density. Also
Q2(X, θ) = −2 Σ_{i=1}^n log[exp(−θX_i)] = 2θ Σ_{i=1}^n X_i
is another pivotal quantity for θ having a χ²_{2n} density.
Using the tables, find A < B such that P[χ²_{2n} < A] = P[χ²_{2n} > B] = 0.025. Then
0.95 = P[A ≤ Q2(X, θ) ≤ B] = P[A ≤ 2θ Σ_{i=1}^n X_i ≤ B] = P[ A/(2 Σ_{i=1}^n X_i) ≤ θ ≤ B/(2 Σ_{i=1}^n X_i) ]
and so the interval
[ A/(2 Σ_{i=1}^n X_i), B/(2 Σ_{i=1}^n X_i) ]
is a 95% confidence interval for θ.
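The interval is easy to compute in R; the sketch below uses qchisq for the two cut-off points A and B, with simulated data whose true rate (0.5) is chosen only for illustration.
set.seed(6)
n <- 20
x <- rexp(n, rate = 0.5)                  # hypothetical sample
AB <- qchisq(c(0.025, 0.975), df = 2 * n) # A and B, 2.5% in each tail
AB / (2 * sum(x))                         # 95% confidence interval for theta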
5.2 Pivotal Quantities for Use with Normal Data
Many exact pivotal quantities have been developed for use with Gaussian data.
Exercise 25. Suppose that we have data X1, X2, . . . , Xn which are iid observations from a N(θ, σ^2) density where σ is known. Define
Q = √n(X̄ − θ)/σ.
Show that the defined random variable is pivotal for θ. Construct confidence intervals for θ based on this pivotal quantity.
Example 13. Suppose that we have data X1 , X2 , . . . , Xn which are iid observations
from a N (θ, σ 2 ) density where θ is known. Define
Q = Σ_{i=1}^n (X_i − θ)^2 / σ^2.
We can write Q = Σ_{i=1}^n Z_i^2 where Z_i = (X_i − θ)/σ. If Z_i has a N(0, 1) density then Z_i^2 has a χ²_1 density. Hence, Q has a χ²_n density and so is a pivotal quantity for σ. If n = 20 then we can be 95% sure that
9.591 ≤ Σ_{i=1}^n (X_i − θ)^2 / σ^2 ≤ 34.170
which is equivalent to
√( (1/34.170) Σ_{i=1}^n (X_i − θ)^2 ) ≤ σ ≤ √( (1/9.591) Σ_{i=1}^n (X_i − θ)^2 ).
The R command qchisq(p=c(.025,.975),df=20) returns the values 9.590777 and 34.169607 as the 2.5% and 97.5% quantiles from a Chi-squared distribution on 20 degrees of freedom.
Lemma 5 (The Student t-distribution). Suppose the random variables X and Y are independent, X ∼ N(0, 1) and Y ∼ χ²_n. Then the ratio
T = X/√(Y/n)
has pdf
f_T(t|n) = (1/√(πn)) (Γ([n + 1]/2)/Γ(n/2)) (1 + t^2/n)^{−(n+1)/2},
and is known as Student's t-distribution on n degrees of freedom.
Proof. The random variables X and Y are independent and have joint density
f_{X,Y}(x, y) = (1/√(2π)) (2^{−n/2}/Γ(n/2)) e^{−x^2/2} y^{n/2−1} e^{−y/2}
for y > 0. The Jacobian ∂(t, u)/∂(x, y) of the change of variables
t = x/√(y/n)  and  u = y
equals
∂(t, u)/∂(x, y) = det[ ∂t/∂x  ∂t/∂y ; ∂u/∂x  ∂u/∂y ] = det[ (n/y)^{1/2}  −x√n y^{−3/2}/2 ; 0  1 ] = (n/y)^{1/2}
and the inverse Jacobian is ∂(x, y)/∂(t, u) = (u/n)^{1/2}. Then
f_T(t) = ∫_0^∞ f_{X,Y}( t(u/n)^{1/2}, u ) (u/n)^{1/2} du
= (1/√(2π)) (2^{−n/2}/Γ(n/2)) ∫_0^∞ e^{−t^2 u/(2n)} u^{n/2−1} e^{−u/2} (u/n)^{1/2} du
= (2^{−n/2}/(√(2π) Γ(n/2) n^{1/2})) ∫_0^∞ e^{−(1 + t^2/n)u/2} u^{(n+1)/2−1} du.
The last integrand comes from the pdf of a Gam((n + 1)/2, 1/2 + t^2/(2n)) random variable. Hence
f_T(t) = (1/√(πn)) (Γ([n + 1]/2)/Γ(n/2)) ( 1/(1 + t^2/n) )^{(n+1)/2},
which gives the above formula.
Example 14. Suppose that we have data X1 , X2 , . . . , Xn which are iid observations
from a N (θ, σ 2 ) density where both θ and σ are unknown. Define
Q = √n(X̄ − θ)/s
where
s^2 = Σ_{i=1}^n (X_i − X̄)^2 / (n − 1).
We can write
Q = Z / √(W/(n − 1))
where
Z = √n(X̄ − θ)/σ
has a N(0, 1) density and
W = Σ_{i=1}^n (X_i − X̄)^2 / σ^2
has a χ²_{n−1} density (see Lemma 1). It follows immediately that W is a pivotal quantity for σ. If n = 31 then we can be 95% sure that
16.79077 ≤ Σ_{i=1}^n (X_i − X̄)^2 / σ^2 ≤ 46.97924
which is equivalent to
√( (1/46.97924) Σ_{i=1}^n (X_i − X̄)^2 ) ≤ σ ≤ √( (1/16.79077) Σ_{i=1}^n (X_i − X̄)^2 ).   (5.1)
The R command qchisq(p=c(.025,.975),df=30) returns the values 16.79077 and 46.97924 as the 2.5% and 97.5% quantiles from a Chi-squared distribution on 30 degrees of freedom. In Lemma 5 we show that Q has a t_{n−1} density, and so it is a pivotal quantity for θ. If n = 31 then we can be 95% sure that
−2.042 ≤ √n(X̄ − θ)/s ≤ +2.042
which is equivalent to
X̄ − 2.042 s/√n ≤ θ ≤ X̄ + 2.042 s/√n.   (5.2)
The R command qt(p=.975,df=30) returns the value 2.042272 as the 97.5% quantile from a Student t-distribution on 30 degrees of freedom. (It is important to point out that although a probability statement involving 95% confidence has been attached to each of the two intervals (5.2) and (5.1) separately, this does not imply that both intervals simultaneously hold with 95% confidence.)
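Both intervals (5.1) and (5.2) are easy to evaluate in R; the sketch below, with simulated data of size n = 31 (the true mean and standard deviation are arbitrary), follows the example step by step.
set.seed(7)
n <- 31
x <- rnorm(n, mean = 10, sd = 2)              # hypothetical sample
xbar <- mean(x); s <- sd(x); ss <- sum((x - xbar)^2)
# interval (5.2) for the mean, using the t quantile
t.cut <- qt(0.975, df = n - 1)
c(xbar - t.cut * s / sqrt(n), xbar + t.cut * s / sqrt(n))
# interval (5.1) for sigma, using chi-squared quantiles
chi.cut <- qchisq(c(0.025, 0.975), df = n - 1)
sqrt(ss / rev(chi.cut))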
Example 15. Suppose that we have data X1 , X2 , . . . , Xn which are iid observations
from a N (θ1 , σ 2 ) density and data Y1 , Y2 , . . . , Ym which are iid observations from a
N (θ2 , σ 2 ) density where θ1 , θ2 , and σ are unknown. Let δ = θ1 − θ2 and define
Q = ( (X̄ − Ȳ) − δ ) / √( s^2 (1/n + 1/m) )
where
s^2 = ( Σ_{i=1}^n (X_i − X̄)^2 + Σ_{j=1}^m (Y_j − Ȳ)^2 ) / (n + m − 2).
We know that X̄ has a N(θ1, σ^2/n) density and that Ȳ has a N(θ2, σ^2/m) density. Then the difference X̄ − Ȳ has a N(δ, σ^2[1/n + 1/m]) density. Hence
Z = (X̄ − Ȳ − δ) / √( σ^2[1/n + 1/m] )
has a N(0, 1) density. Let W1 = Σ_{i=1}^n (X_i − X̄)^2/σ^2 and let W2 = Σ_{j=1}^m (Y_j − Ȳ)^2/σ^2. Then W1 has a χ²_{n−1} density and W2 has a χ²_{m−1} density, and W = W1 + W2 has a χ²_{n+m−2} density. We can write
Q1 = Z / √(W/(n + m − 2))
and so Q1 has a t_{n+m−2} density and so is a pivotal quantity for δ. Define
Q2 = ( Σ_{i=1}^n (X_i − X̄)^2 + Σ_{j=1}^m (Y_j − Ȳ)^2 ) / σ^2.
Then Q2 has a χ²_{n+m−2} density and so is a pivotal quantity for σ.
Lemma 6 (The Fisher F-distribution). Let X1, X2, . . . , Xn and Y1, Y2, . . . , Ym be iid N(0, 1) random variables. The ratio
Z = ( Σ_{i=1}^n X_i^2/n ) / ( Σ_{i=1}^m Y_i^2/m )
has the distribution called the Fisher, or F, distribution with parameters (degrees of freedom) n, m, or the F_{n,m} distribution for short. The corresponding pdf f_{F_{n,m}} is concentrated on the positive half axis:
f_{F_{n,m}}(z) = ( Γ((n + m)/2)/(Γ(n/2)Γ(m/2)) ) (n/m)^{n/2} z^{n/2−1} ( 1 + (n/m)z )^{−(n+m)/2}
for z > 0.
Observe that if T ∼ t_m, then T^2 = Z ∼ F_{1,m}, and if Z ∼ F_{n,m}, then Z^{−1} ∼ F_{m,n}. If W1 ∼ χ²_n and W2 ∼ χ²_m, then Z = (mW1)/(nW2) ∼ F_{n,m}.
Example 16. Suppose that we have data X1, X2, . . . , Xn which are iid observations from a N(θ_X, σ_X^2) density and data Y1, Y2, . . . , Ym which are iid observations from a N(θ_Y, σ_Y^2) density where θ_X, θ_Y, σ_X, and σ_Y are all unknown. Let
λ = σ_X/σ_Y
and define
F* = ŝ_X^2/ŝ_Y^2 = ( Σ_{i=1}^n (X_i − X̄)^2/(n − 1) ) / ( Σ_{j=1}^m (Y_j − Ȳ)^2/(m − 1) ).
Let
W_X = Σ_{i=1}^n (X_i − X̄)^2/σ_X^2  and  W_Y = Σ_{j=1}^m (Y_j − Ȳ)^2/σ_Y^2.
Then W_X has a χ²_{n−1} density and W_Y has a χ²_{m−1} density. Hence, by Lemma 6,
Q = ( W_X/(n − 1) ) / ( W_Y/(m − 1) ) ≡ F*/λ^2
has an F density with n − 1 and m − 1 degrees of freedom and so is a pivotal quantity for λ. Suppose that n = 25 and m = 13. Then we can be 95% sure that 0.39 ≤ Q ≤ 3.02, which is equivalent to
√(F*/3.02) ≤ λ ≤ √(F*/0.39).
To see how this might work in practice try the following R commands one at a time
x = rnorm(25, mean = 0, sd = 2)
y = rnorm(13, mean = 1, sd = 1)
Fstar = var(x)/var(y); Fstar
CutOffs = qf(p=c(.025,.975), df1=24, df2=12)
CutOffs; rev(CutOffs)
Fstar / rev(CutOffs)
var.test(x, y)
2
The search for a nice pivotal quantity for δ = θ1 − θ2 when the two variances are not assumed equal continues and is one of the unsolved problems in Statistics, referred to as the Behrens-Fisher Problem.
5.3 Approximate Confidence Intervals
Let X1, X2, . . . , Xn be iid with density f(x|θ). Let θ̂ be the MLE of θ. We saw before that the quantities W1 = √(I(θ))(θ̂ − θ), W2 = √(I(θ))(θ̂ − θ), W3 = √(I(θ̂))(θ̂ − θ), and W4 = √(I(θ̂))(θ̂ − θ), based on the expected or the observed information evaluated at θ or at θ̂, all had densities which were approximately N(0, 1). Hence they are all approximate pivotal quantities for θ. W3 and W4 are the simplest to use in general. For either of them the approximate 95% confidence interval is given by
[ θ̂ − 1.96/√(I(θ̂)),  θ̂ + 1.96/√(I(θ̂)) ],
with I(θ̂) denoting whichever information (expected or observed) is used. The quantity 1/√(I(θ̂)) is often referred to as the approximate standard error of the MLE θ̂.
Let X1, X2, . . . , Xn be iid with density f(x|θ) where θ = (θ1, θ2, . . . , θm) consists of m unknown parameters. Let θ̂ = (θ̂1, θ̂2, . . . , θ̂m) be the MLE of θ. We saw before that for r = 1, 2, . . . , m the quantities W1r = (θ̂_r − θ_r)/√(CRLB_r), where CRLB_r is the lower bound for Var(θ̂_r) given in the generalisation of the Cramér-Rao theorem, had a density which was approximately N(0, 1). Recall that CRLB_r is the rth diagonal element of the matrix [I(θ)]^{−1}. In certain cases CRLB_r may depend on the values of unknown parameters other than θ_r, and in those cases W1r will not be an approximate pivotal quantity for θ_r.
We also saw that if we define W2r, W3r and W4r by replacing CRLB_r by the rth diagonal element of the matrix [I(θ)]^{−1}, [I(θ̂)]^{−1} or [I(θ̂)]^{−1} respectively (the observed or the expected information, evaluated at θ or at θ̂), we get three more quantities, all of which have a density which is approximately N(0, 1). W3r and W4r only depend on the unknown parameter θ_r and so are approximate pivotal quantities for θ_r. However, in certain cases the rth diagonal element of the matrix [I(θ)]^{−1} may depend on the values of unknown parameters other than θ_r, and in those cases W2r will not be an approximate pivotal quantity for θ_r. Generally W3r and W4r are most commonly used.
We now examine the use of approximate pivotal quantities based on the MLE in a
series of examples
Example 17 (Poisson sampling continued). Recall that θ̂ = x̄ and the observed information is I(θ) = Σ_{i=1}^n x_i/θ^2 = nθ̂/θ^2, with expected value E[I(θ)] = n/θ. Evaluated at θ̂, both versions equal n/θ̂, and the usual approximate 95% confidence interval is given by
[ θ̂ − 1.96 √(θ̂/n),  θ̂ + 1.96 √(θ̂/n) ].
Example 18 (Bernoulli trials continued). Recall that θ̂ = x̄ and the observed information is
I(θ) = Σ_{i=1}^n x_i/θ^2 + (n − Σ_{i=1}^n x_i)/(1 − θ)^2
with expected value
E[I(θ)] = n/(θ(1 − θ)).
Evaluated at θ̂, both versions equal n/(θ̂(1 − θ̂)), and the usual approximate 95% confidence interval is given by
[ θ̂ − 1.96 √(θ̂(1 − θ̂)/n),  θ̂ + 1.96 √(θ̂(1 − θ̂)/n) ].
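In R the approximate interval is one line; the sketch below uses hypothetical counts (r = 37 successes out of n = 120 trials, invented for illustration) and, for comparison, also shows the related score-based interval returned by prop.test.
r <- 37; n <- 120                        # hypothetical data
theta.hat <- r / n
se <- sqrt(theta.hat * (1 - theta.hat) / n)
c(theta.hat - 1.96 * se, theta.hat + 1.96 * se)   # approximate 95% interval from Example 18
prop.test(r, n)$conf.int                          # a different (score-based) 95% interval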
Example 19. Let X1, X2, . . . , Xn be iid observations from the density
f(x|α, β) = αβ x^{β−1} exp[−αx^β]
for x ≥ 0, where both α and β are unknown. It can be verified by straightforward calculations that the information matrix I(α, β) is given by
I(α, β) = [ n/α^2   Σ_{i=1}^n x_i^β log[x_i] ; Σ_{i=1}^n x_i^β log[x_i]   n/β^2 + α Σ_{i=1}^n x_i^β (log[x_i])^2 ].
Let V11 and V22 be the diagonal elements of the matrix [I(α̂, β̂)]^{−1}. Then the approximate 95% confidence interval for α is
[ α̂ − 1.96 √V11,  α̂ + 1.96 √V11 ]
and the approximate 95% confidence interval for β is
[ β̂ − 1.96 √V22,  β̂ + 1.96 √V22 ].
Finding α̂ and β̂ is an interesting exercise that you can try to do on your own.
Exercise 26. Components are produced in an industrial process and the numbers of flaws in different components are independent and identically distributed with probability mass function p(x) = θ(1 − θ)^x, x = 0, 1, 2, . . ., where 0 < θ < 1. A random sample of n components is inspected; n0 components are found to have no flaws, n1 components are found to have exactly one flaw, and the remaining n − n0 − n1 components have two or more flaws.
1. Show that the likelihood function is l(θ) = θ^{n0+n1}(1 − θ)^{2n−2n0−n1}.
2. Find the MLE of θ and the sample information in terms of n, n0 and n1.
3. Hence calculate an approximate 90% confidence interval for θ when 90 out of 100 components have no flaws, and seven have exactly one flaw.
Exercise 27. Suppose that X1, X2, . . . , Xn is a random sample from the shifted exponential distribution with probability density function
f(x|θ, µ) = (1/θ) e^{−(x−µ)/θ},   µ < x < ∞,
where θ > 0 and −∞ < µ < ∞. Both θ and µ are unknown, and n > 1.
1. The sample range W is defined as W = X(n) − X(1) , where X(n) = maxi Xi
and X(1) = mini Xi . It can be shown that the joint probability density function
of X(1) and W is given by
fX(1) ,W (x(1) , w) = n(n − 1)θ−2 e−n(x(1) −µ)/θ e−w/θ (1 − e−w/θ )n−2 ,
for x(1) > µ and w > 0. Hence obtain the marginal density function of W and
show that W has distribution function P (W ≤ w) = (1 − e−w/θ )n−1 , w > 0.
2. Show that W/θ is a pivotal quantity for θ. Without carrying out any calculations, explain how this result may be used to construct a 100(1 − α)% confidence interval for θ for 0 < α < 1.
Exercise 28. Let X have the logistic distribution with probability density function
f(x) = e^{x−θ}/(1 + e^{x−θ})^2,   −∞ < x < ∞,
where −∞ < θ < ∞ is an unknown parameter.
1. Show that X − θ is a pivotal quantity and hence, given a single observation X,
construct an exact 100(1 − α)% confidence interval for θ. Evaluate the interval
when α = 0.05 and X = 10.
2. Given a random sample X1 , X2 , . . . , Xn from the above distribution, briefly explain how you would use the central limit theorem to construct an approximate
95% confidence interval for θ. Hint E(X) = θ and Var(X) = π 2 /3.
Exercise 29. Let X1 , . . . , Xn be iid with density fX (x|θ) = θ exp (−θx) for x ≥ 0.
1. Show that ∫_0^x f(u|θ) du = 1 − exp(−θx).
2. Use the result in (a) to establish that Q = 2θ Σ_{i=1}^n X_i is a pivotal quantity for θ
and explain how to use Q to find a 95% confidence interval for θ.
3. Derive the information I(θ). Suggest an approximate pivotal quantity for θ
involving I(θ) and another approximate pivotal quantity involving I(θ̂) where
θ̂ = 1/x̄ is the maximum likelihood estimate of θ. Show how both approximate
pivotal quantities may be used to find approximate 95% confidence intervals
for θ. Prove that the approximate confidence interval calculated using the approximate pivotal quantity involving I(θ̂) is always shorter than the approximate
confidence interval calculated using the approximate pivotal quantity involving
I(θ) but that the ratio of the lengths converges to 1 as n → ∞.
4. Suppose n = 25 and Σ_{i=1}^{25} x_i = 250. Use the method explained in (b) to
calculate a 95% confidence interval for θ and the two methods explained in (c)
to calculate approximate 95% confidence intervals for θ. Compare the three
intervals obtained.
Exercise 30. Let X1, X2, . . . , Xn be iid with density
f(x|θ) = θ/(x + 1)^{θ+1}
for x ≥ 0.
1. Derive an exact pivotal quantity for θ and explain how it may be used to find a
95% confidence interval for θ.
2. Derive the information I(θ). Suggest an approximate pivotal quantity for θ involving I(θ) and another approximate pivotal quantity involving I(θ̂) where θ̂
is the maximum likelihood estimate of θ. Show how both approximate pivotal
quantities may be used to find approximate 95% confidence intervals for θ.
3. Suppose n = 25 and Σ_{i=1}^{25} log[x_i + 1] = 250. Use the method explained in (a)
to calculate a 95% confidence interval for θ and the two methods explained in
(b) to calculate approximate 95% confidence intervals for θ. Compare the three
intervals obtained.
Exercise 31. Let X1 , X2 , . . . , Xn be iid with density
f (x|θ) = θ2 x exp (−θx)
for x ≥ 0.
1. Show that ∫_0^x f(u|θ) du = 1 − exp(−θx)[1 + θx].
2. Describe how the result from (a) can be used to construct an exact pivotal quantity for θ.
3. Construct FOUR approximate pivotal quantities for θ based on the MLE θ̂.
4. Suppose that n = 10 and the data values are 1.6, 2.5, 2.7, 3.5, 4.6, 5.2, 5.6, 6.4,
7.7, 9.2. Evaluate the 95% confidence interval corresponding to ONE of the
exact pivotal quantities ( you may need to use a computer to do this ). Compare
your answer to the 95% confidence intervals corresponding to each of the FOUR
approximate pivotal quantities derived in (c).
Exercise 32. Let X1, X2, . . . , Xn be iid, each having a Poisson density
f(x|θ) = θ^x exp(−θ)/x!
for x = 0, 1, 2, . . . . Construct FOUR approximate pivotal quantities for θ based on
the MLE θ̂. Show how each may be used to construct an approximate 95% confidence
interval for θ. Evaluate the four confidence intervals in the case where the data consist
of n = 64 observations with an average value of x̄ = 4.5.
Exercise 33. Let X1, X2, . . . , Xn be iid with density
f1(x|θ) = (1/θ) exp[−x/θ]
for 0 ≤ x < ∞. Let Y1, Y2, . . . , Ym be iid with density
f2(y|θ, λ) = (λ/θ) exp[−λy/θ]
for 0 ≤ y < ∞.
1. Derive approximate pivotal quantities for each of the parameters θ and λ.
2. Suppose that n = 10 and the average of the 10 x values is 24.0. Suppose that
m = 40 and the average of the 40 y values is 12.0. Calculate approximate 95%
confidence intervals for both θ and λ.
Chapter 6
The Theory of Hypothesis
Testing
6.1 Introduction
Suppose that we are going to observe the value of a random vector X. Let X denote
the set of possible values that X can take and, for x ∈ X , let f (x, θ) denote the density
(or probability mass function) of X where the parameter θ is some unknown element
of the set Θ.
A hypothesis specifies that θ belongs to some subset Θ0 of Θ. The question arises
as to whether the observed data x is consistent with the hypothesis that θ ∈ Θ0 , often
written as H0 : θ ∈ Θ0 . The hypothesis H0 is usually referred to as the null hypothesis.
The null hypothesis is contrasted with the so-called alternative hypothesis H1 : θ ∈
Θ1 , where Θ0 ∩ Θ1 = ∅.
Hypothesis testing aims at finding in the data x enough evidence to reject the null hypothesis
H0 : θ ∈ Θ0
in favor of the alternative hypothesis
H1 : θ ∈ Θ1 .
Due to the focus on rejecting H0 and controlling the error rate of such a decision, the roles of the two hypotheses are not exchangeable.
In a hypothesis testing situation, two types of error are possible.
• The first type of error is to reject the null hypothesis H0 : θ ∈ Θ0 as being
inconsistent with the observed data x when, in fact, θ ∈ Θ0 i.e. when, in fact,
the null hypothesis happens to be true. This is referred to as Type I Error.
• The second type of error is to fail to reject the null hypothesis H0 : θ ∈ Θ0 as
being inconsistent with the observed data x when, in fact, θ ∈ Θ1 i.e. when, in
fact, the null hypothesis happens to be false. This is referred to as Type II Error.
The goal is to propose a procedure that for given data X = x would automatically point to which of the hypotheses is more favorable, and in such a way that the chances of making a Type I Error are some prescribed small α ∈ (0, 1), referred to as the significance level of a test. More precisely, for given data x we evaluate a certain numerical characteristic T(x), called a test statistic, and if it falls in a certain critical region Rα (often also called the rejection region), we reject H0 in favor of H1. We demand that T(x) and Rα are chosen in such a way that for θ ∈ Θ0
P(T(X) ∈ Rα |θ) ≤ α,
i.e. the probability of a Type I Error is at most α.
Therefore the test procedure can be identified with a test statistic T(x) and a rejection region Rα. It is quite natural to expect that Rα shrinks as α decreases (it should be harder to reject H0 if the allowed Type I Error is smaller). Thus for a given sample x, there should be an α̂ such that for α > α̂ we have T(x) ∈ Rα and for α < α̂ the test statistic T(x) is outside Rα. The value α̂ is called the p-value for a given test.
While the focus in setting a testing hypothesis problem is on Type I Error so it
is controlled by the significance level, it is also important to have chances of Type II
Error as small as possible. For a given testing procedure smaller chances of Type I
Error are at the cost of bigger chances of Type II Error. However, the chances of Type
II Error can serve for comparison of testing procedures for which the significance level
is the same. For this reason, the concept of the power of a test has been introduced. In
general, the power of a test is a function p(θ) of θ ∈ Θ1 and equals the probability of
rejecting H0 while the true parameter is θ, i.e. under the alternative hypothesis. Among
two tests in the same problem and at the same significance level, the one with larger
power for all θ ∈ Θ1 is considered better. The power of a given procedure is increasing
with the sample size of data, therefore it is often used to determine a sample size so
that not rejecting H0 will represent a strong support for H0 not only a lack of evidence
for the alternative.
Example 20. Suppose the data consist of a random sample X1 , X2 , . . . , Xn from a
N (θ, 1) density. Let Θ = (−∞, ∞) and Θ0 = (−∞, 0] and consider testing H0 : θ ∈
Θ0 , i.e. H0 : θ ≤ 0.
The standard estimate of θ for this example is X̄. It would seem rational to consider
that the bigger the value of X̄ that we observe the stronger is the evidence against the
null hypothesis that θ ≤ 0, in favor of the alternative θ > 0. Thus we decide to use
T (X) = X̄ as our test statistics. How big does X̄ have to be in order for us to reject
H0 ? In other words we want to determine the rejection region Rα . It is quite natural to
consider Rα = [aα , ∞), so we reject H0 if X̄ is too large, i.e. X̄ ≥ aα . To determine
aα we recall that controlling Type I Error means that
P(X̄ ≥ aα |θ) ≤ α,
where θ ≤ 0. For such θ, we clearly have
P(X̄ ≥ aα |θ) ≤ P(X̄ ≥ aα |θ = 0) = 1 − Φ(aα √n),
from which we get that aα = z_{1−α}/√n assures that the Type I Error is controlled at level α.
Suppose that n = 25 and we observe x̄ = 0.32. Finding the p-value is then equivalent to determining the chances of getting such a large value by a random variable that has the distribution of X̄ at the boundary θ = 0, i.e. N(0, 1/n). In our particular case this is a N(0, 0.04) distribution, and the probability of getting a value for X̄ as large as 0.32 is the area under a N(0, 0.04) curve between 0.32 and ∞, which is the area under a N(0, 1) curve between 0.32/0.20 = 1.60 and ∞, or 0.0548. This quantity is called the p-value. The p-value is used to measure the strength of the evidence against H0 : θ ≤ 0, and H0 is rejected if the p-value is less than some small number such as 0.05. You might like to try the R commands 1-pnorm(q=0.32,mean=0,sd=sqrt(.04)) and 1-pnorm(1.6).
Consider the test statistic T(X) = √n X̄ and suppose we observe T(x) = t. A rejection region that results in the significance level α can be defined as
Rα = [z_{1−α}, ∞).
In order to calculate the p-value we need to find α̂ such that t = z_{1−α̂}, which is equivalent to α̂ = P(T > t).
Exercise 34. Since the images on the two sides of coins are made of raised metal, the
toss may slightly favor one face or the other, if the coin is allowed to roll on one edge
upon landing. For the same reason coin spinning is much more likely to be biased than
flipping. Conjurers trim the edges of coins so that when spun they usually land on a
particular face.
To investigate this issue a strict method of coin spinning has been designed and
the results of it recorded for various coins. We assume that the number of considered
tosses if fairly large (bigger than 100). Formulate a testing hypothesis problem for this
situation and in the process answer the following questions.
1. Formulate the null and alternative hypotheses.
2. Propose a test statistic used to decide for one of the hypotheses.
3. Derive a rejection region that guarantees the chances of Type I Error to be at
most α.
4. Explain how, for an observed proportion p̂ of “Heads”, one could obtain the p-value
for the proposed test.
5. Derive a formula for the power of the test.
6. Study how the power depends on the sample size. In particular, it is believed
that a certain coin is biased toward “Heads” and that the true chances of landing
“Heads” are at least 0.51. Design an experiment in which the chances of making
a correct decision using your procedure are 95%.
7. If one hundred thousand spins of a coin are made, what are the chances that the
procedure will lead to the correct decision?
8. Suppose that one hundred thousand spins of a coin have been made and the coin
landed “Heads” 50877 times. Find the p-value and report a conclusion.
Example 21 (The power function). Suppose our rule is to reject H0 : θ ≤ 0 in favor of
H1 : θ > 0 if the p-value is less than 0.05. In order for the p-value to be less than 0.05
we require √n x̄ > 1.65, and so we reject H0 if x̄ > 1.65/√n. What are the chances of
rejecting H0 if θ = 0.2? If θ = 0.2 then X̄ has the N(0.2, 1/n) distribution and so the
probability of rejecting H0 is

P( N(0.2, 1/n) ≥ 1.65/√n ) = P( N(0, 1) ≥ 1.65 − 0.2√n ).
For n = 25 this is given by P {N (0, 1) ≥ 0.65} = 0.2578. This calculation can be
verified using the R command 1-pnorm(1.65-0.2*sqrt(25)). The following
table gives the results of this calculation for n = 25 and various values of θ ∈ Θ1 .
θ:     0.00   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
Prob:  .049   .125   .258   .440   .637   .802   .912   .968   .991   .998   .999
This is called the power function of the test. The R command
Ns=seq(from=(-1),to=1, by=0.1)
generates and stores the sequence −1.0, −0.9, . . . , +1.0 and the probabilities in the
table were calculated using 1-pnorm(1.65-Ns*sqrt(25)). The graph of the power
function is presented in Figure 6.1.
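Putting the quoted commands together, the following short R sketch recomputes the table and draws a version of Figure 6.1; the object name power is of course arbitrary.

    Ns <- seq(from = -1, to = 1, by = 0.1)       # grid of values of theta
    power <- 1 - pnorm(1.65 - Ns * sqrt(25))     # P(reject H0 | theta) for n = 25
    round(power, 3)                              # reproduces the table above (for theta >= 0)
    plot(Ns, power, type = "l", xlab = "theta", ylab = "power")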
Example 22 (Sample size). How large would n have to be so that the probability of
rejecting H0 when θ = 0.2 is 0.90? We would require 1.65 − 0.2√n = −1.28, which
implies that √n = (1.65 + 1.28)/0.2, i.e. n ≈ 215.
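The same calculation can be done in R; the sketch below (not from the original notes) solves 1.65 − 0.2√n = −1.28 for n and checks the achieved power.

    theta <- 0.2
    n <- ceiling(((qnorm(0.95) + qnorm(0.90)) / theta)^2)   # smallest n with power >= 0.90
    n                                                       # 215
    1 - pnorm(qnorm(0.95) - theta * sqrt(n))                # achieved power, about 0.90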
[Figure 6.1: The power function of the test that a normal sample of size 25 has mean
value bigger than zero; power (y) plotted against θ (Ns) for θ between −1 and 1.]
So the general plan for testing a hypothesis is clear: choose a test statistic T , observe the data, calculate the observed value t of the test statistic T , calculate the p-value
as the maximum over all values of θ in Θ0 of the probability of getting a value for T
as large as t, and reject H0 : θ ∈ Θ0 if the p-value so obtained is too small.
6.2 Hypothesis Testing for Normal Data
Many standard test statistics have been developed for use with normally distributed
data.
Example 23. Suppose that we have data X1 , X2 , . . . , Xn which are iid observations
from a N (µ, σ 2 ) density where both µ and σ are unknown. Here θ = (µ, σ) and
Θ = {(µ, σ) : −∞ < µ < ∞, 0 < σ < ∞}. Define
X̄ = (1/n) Σ_{i=1}^n X_i   and   s² = Σ_{i=1}^n (X_i − X̄)²/(n − 1).
(a) Suppose that for a fixed value µ0 we consider Θ0 = {(µ, σ) : −∞ < µ ≤ µ0, 0 < σ < ∞},
which can be simply reported as H0 : µ ≤ µ0. Define T = √n (X̄ − µ0)/s. Let t denote
the observed value of T. Then the rejection region at the level α is defined as

Rα = [t_{1−α,n−1}, ∞),

where t_{p,k} is, as usual, the p-quantile of the Student t-distribution with k degrees of
freedom. It is clear that the p-value is the α̂ determined from the equality t_{1−α̂,n−1} = t,
which is equivalent to α̂ = P(T > t) = 1 − F(t), where F is the cdf of the Student
t-distribution with n − 1 degrees of freedom.
(b) Suppose H0 : µ ≥ µ0 . Let T be as before and t denote the observed value of T .
By analogy with the previous case,
Rα = (−∞, tα,n−1 ],
and the p-value is given by α̂ = P(T < t) = F (t).
(c) Suppose H0 : µ = µ0. Define T as before and let t denote the observed value of T.
Then a rejection region at the level α can be defined as

Rα = (−∞, t_{α/2,n−1}] ∪ [t_{1−α/2,n−1}, ∞).

It is clear that the p-value can be obtained as the α̂ solving |t| = t_{1−α̂/2,n−1}, or equivalently

α̂ = 2P(T > |t|) = 2(1 − F(|t|)).
(d) Suppose H0 : σ ≤ σ0. Define T = Σ_{i=1}^n (X_i − X̄)²/σ0². Let t denote the
observed value of T. Then the rejection region can be set as

Rα = [χ²_{1−α,n−1}, ∞),

where χ²_{p,k} is, as usual, the p-quantile of the chi-squared distribution with k degrees
of freedom. Let us verify that this test statistic with this rejection region indeed gives
significance at the level α. For σ ≤ σ0 we have

P(T ≥ χ²_{1−α,n−1} | σ) = P( Σ_{i=1}^n (X_i − X̄)²/σ² ≥ χ²_{1−α,n−1} σ0²/σ² | σ )
                        ≤ P( Σ_{i=1}^n (X_i − X̄)²/σ² ≥ χ²_{1−α,n−1} | σ ) = α.

The p-value is the α̂ obtained from t = χ²_{1−α̂,n−1}, which is equivalent to α̂ = P(T >
t) = 1 − F(t), where F is the cdf of the chi-squared distribution with n − 1 degrees
of freedom.
(e) The case H0 : σ² ≥ σ0² can be treated analogously, so that

Rα = [0, χ²_{α,n−1}],

and the p-value is obtained as α̂ = P(T < t) = F(t).
(f) Finally, for H0 : σ = σ0 and T defined as before we consider the rejection region

Rα = [0, χ²_{α/2,n−1}] ∪ [χ²_{1−α/2,n−1}, ∞).

It is easy to see that Rα and T give the significance level α. Moreover, the p-value is
determined by

α̂ = 2 (P(T < t) ∧ P(T > t)) = 2 (F(t) ∧ (1 − F(t))),

where ∧ stands for the minimum operator and F is the cdf of the chi-squared
distribution with n − 1 degrees of freedom.
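A minimal R sketch of the procedures in parts (c) and (d) is given below; the data vector x and the null values mu0 and sigma0 are purely hypothetical and only illustrate the computations.

    x <- c(5.1, 4.3, 6.2, 5.8, 4.9, 5.5, 6.0, 4.7, 5.2, 5.9)   # hypothetical sample
    mu0 <- 5; sigma0 <- 0.5; n <- length(x)
    # (c) two-sided t-test of H0: mu = mu0; t.test gives the same p-value
    T.mu <- sqrt(n) * (mean(x) - mu0) / sd(x)
    2 * (1 - pt(abs(T.mu), df = n - 1))
    t.test(x, mu = mu0)
    # (d) test of H0: sigma <= sigma0 based on T = sum((x - xbar)^2) / sigma0^2
    T.sigma <- sum((x - mean(x))^2) / sigma0^2
    1 - pchisq(T.sigma, df = n - 1)     # p-value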
Exercise 35. The following data have been obtained by measuring the temperature at
noon on the first of December, for ten consecutive years, at a certain location in Ireland:

3.7, 6.6, 8.0, 2.5, 4.5, 4.5, 3.5, 7.7, 4.0, 5.7.

Using this data set perform tests for the following problems:

• H0 : µ = 6 vs. H1 : µ ≠ 6,
• H0 : σ ≤ 1 vs. H1 : σ > 1.

Report the p-values and write conclusions.
In the following we examine two-sample problems under the normal distribution
assumption.
Example 24. Suppose that we have data X1 , X2 , . . . , Xn which are iid observations
from a N (µ1 , σ 2 ) density and data y1 , y2 , . . . , ym which are iid observations from a
N (µ2 , σ 2 ) density where µ1 , µ2 , and σ are unknown.
Here
θ = (µ1 , µ2 , σ)
and
Θ = {(µ1 , µ2 , σ) : −∞ < µ1 < ∞, −∞ < µ2 < ∞, 0 < σ < ∞}.
Recall the pooled estimator of the common variance

s² = [ Σ_{i=1}^n (x_i − x̄)² + Σ_{j=1}^m (y_j − ȳ)² ] / (n + m − 2).
(a) Suppose Θ0 = {(µ1 , µ2 , σ) : −∞ < µ1 < ∞, µ1 ≤ µ2 < ∞, 0 < σ < ∞},
which can be simply expressed by H0 : µ1 ≤ µ2 vs. H1 : µ1 > µ2 . Define
T = (x̄ − ȳ)/√( s² (1/n + 1/m) ). Let t denote the observed value of T. Then the
following rejection region

Rα = [t_{1−α,n+m−2}, ∞)

for T defines a test at significance level α. It is clear, by the same arguments as
before, that α̂ = P(T > t) = 1 − F(t) is the p-value for the discussed procedure.
Here F is the cdf of the Student t-distribution with n + m − 2 degrees of freedom.
(b) The symmetric case to the previous one, H0 : µ1 ≥ µ2, can be treated by taking

Rα = (−∞, t_{α,n+m−2}],

and the p-value is α̂ = P(T < t).
(c) Two-sided testing for H0 : µ1 = µ2 vs. H1 : µ1 ≠ µ2 is addressed by

Rα = (−∞, t_{α/2,n+m−2}] ∪ [t_{1−α/2,n+m−2}, ∞),

with the p-value given by α̂ = P(|T| > |t|) = 2P(T > |t|) = 2(1 − F(|t|)).
(d) Suppose that we have data X1, X2, . . . , Xn which are iid observations from
a N(µ1, σ1²) density and data y1, y2, . . . , ym which are iid observations from
a N(µ2, σ2²) density, where µ1, µ2, σ1, and σ2 are all unknown. Here θ =
(µ1, µ2, σ1, σ2) and Θ = {(µ1, µ2, σ1, σ2) : −∞ < µ1 < ∞, −∞ < µ2 < ∞,
0 < σ1 < ∞, 0 < σ2 < ∞}. Define

s1² = Σ_{i=1}^n (x_i − x̄)²/(n − 1)   and   s2² = Σ_{j=1}^m (y_j − ȳ)²/(m − 1).

Suppose Θ0 = {(µ1, µ2, σ, σ) : −∞ < µ1 < ∞, −∞ < µ2 < ∞, 0 < σ < ∞},
or simply H0 : σ1 = σ2 vs. H1 : σ1 ≠ σ2. Define

T = s1²/s2²,

which under H0 has the F distribution with n − 1 and m − 1 degrees of freedom.
Let t denote the observed value of T. Define a rejection region by

Rα = [0, F_{n−1,m−1}(α/2)] ∪ [F_{n−1,m−1}(1 − α/2), ∞),

where F_{k,l}(p) is the p-quantile of the Fisher F distribution with k and l degrees of
freedom. We note that F_{n−1,m−1}(α/2) = 1/F_{m−1,n−1}(1 − α/2). The p-value
can be obtained by taking α̂ = 2 (P(T < t) ∧ P(T > t)) = 2 (F(t) ∧ F̃(1/t)),
where F is the cdf of the F distribution with n − 1 and m − 1 degrees of freedom
and F̃ is the one with m − 1 and n − 1 degrees of freedom.
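In R the two-sample procedures of this example correspond to var.test (part (d)) and t.test with var.equal = TRUE (parts (a)–(c)). The sketch below uses hypothetical data vectors x and y only to illustrate the calls.

    x <- c(5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.9, 4.9)   # hypothetical first sample
    y <- c(4.6, 5.2, 4.4, 4.9)                        # hypothetical second sample
    var.test(x, y)                  # F-test of H0: sigma1 = sigma2 based on s1^2/s2^2
    t.test(x, y, var.equal = TRUE)  # pooled two-sample t-test of H0: mu1 = mu2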
Exercise 36. The following table gives the concentration of norepinephrine (µmol per
gram creatinine) in the urine of healthy volunteers in their early twenties.
Male 0.48 0.36 0.20 0.55 0.45 0.46 0.47 0.23
Female 0.35 0.37 0.27 0.29
The problem is to determine if there is evidence that concentration of norepinephrine
differs between genders.
1. Testing for the difference between means in the two-sample normal problem is the
main procedure here. However, it requires verifying that the variances in the two
samples are the same. Carry out a test that checks whether there is a significant
difference between the variances. Evaluate the p-value and make a conclusion.

2. If the above procedure did not reject the equal-variance assumption, carry out a
procedure that examines the equality of concentrations between genders. Report the
p-value and write down a conclusion.
6.3 Generally Applicable Test Procedures
Suppose that we observe the value of a random vector X whose probability density
function is g(x|θ) for x ∈ X , where the parameter θ = (θ1, θ2, . . . , θp) is some
unknown element of the set Θ ⊆ Rp . Let Θ0 be a specified subset of Θ. Consider the
hypothesis H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1 . In this section we consider three ways in
which good test statistics may be found for this general problem.
The Likelihood Ratio Test: This test statistic is based on the idea that the maximum
of the log likelihood over the subset Θ0 should not be too much less than the
maximum over the whole set Θ if, in fact, the parameter θ actually does lie in
the subset Θ0. Let log l(θ) denote the log likelihood function. The test statistic is

T1(x) = 2 log [ l(θ̂) / l(θ̂0) ] = 2[ log l(θ̂) − log l(θ̂0) ],

where θ̂ is the value of θ in the set Θ for which log l(θ) is a maximum and θ̂0
is the value of θ in the set Θ0 for which log l(θ) is a maximum.
The Maximum Likelihood Test Statistic: This test statistic is based on the idea
that θ̂ and θ̂0 should be close to one another. Let I(θ) be the p × p information
matrix and let B = I(θ̂). The test statistic is

T2(x) = (θ̂ − θ̂0)ᵀ B (θ̂ − θ̂0).

Other forms of this test statistic follow by choosing B to be I(θ̂0) instead of I(θ̂).
The Score Test Statistic: This test statistic is based on the idea that θ̂0 should
almost solve the likelihood equations. Let S(θ) be the p × 1 vector whose r-th
element is given by ∂ log l/∂θr. Let C be the inverse of I(θ̂0), i.e. C = I(θ̂0)⁻¹.
The test statistic is

T3(x) = S(θ̂0)ᵀ C S(θ̂0).
In order to calculate p-values we need to know the probability distribution of the test
statistic under the null hypothesis. Deriving the exact probability distribution may be
difficult but approximations suitable for situations in which the sample size is large are
available in the special case where Θ is a p-dimensional set and Θ0 is a q-dimensional
subset of Θ for q < p, whence it can be shown that, when H0 is true, the probability
distributions of T1(x), T2(x) and T3(x) are all approximated by a χ²_{p−q} density.
Example 25. Let X1, X2, . . . , Xn be iid, each having a Poisson distribution with
parameter θ. Consider testing H0 : θ = θ0 where θ0 is some specified constant. Recall
that

log l(θ) = [ Σ_{i=1}^n x_i ] log θ − nθ − log [ Π_{i=1}^n x_i! ].

Here Θ = [0, ∞) and the value of θ ∈ Θ for which log l(θ) is a maximum is θ̂ = x̄.
Also Θ0 = {θ0} and so trivially θ̂0 = θ0. We saw also that

S(θ) = Σ_{i=1}^n x_i / θ − n   and   I(θ) = Σ_{i=1}^n x_i / θ².

Suppose that θ0 = 2, n = 40 and that when we observe the data we get x̄ = 2.50.
Hence Σ_{i=1}^n x_i = 100. Then

T1 = 2[ log l(2.5) − log l(2.0) ] = 200 log(2.5) − 200 − 200 log(2.0) + 160 = 4.62.

The information is B = I(θ̂) = 100/2.5² = 16. Hence

T2 = (θ̂ − θ̂0)² B = 0.25 × 16 = 4.

We have S(θ0) = S(2.0) = 10 and I(θ0) = 25 and so

T3 = 10²/25 = 4.

Here p = 1, q = 0, implying p − q = 1. Since P[χ²_1 ≥ 3.84] = 0.05, all three test
statistics produce a p-value less than 0.05 and lead to the rejection of H0 : θ = 2.
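The three statistics are easy to reproduce in R from the summary values n = 40 and Σ x_i = 100; the following sketch is only a restatement of the arithmetic above (the term log Π x_i! cancels in T1).

    n <- 40; sumx <- 100; theta0 <- 2
    theta.hat <- sumx / n                                     # MLE, xbar = 2.5
    T1 <- 2 * ((sumx * log(theta.hat) - n * theta.hat) -
               (sumx * log(theta0)    - n * theta0))          # likelihood ratio, about 4.6
    T2 <- (theta.hat - theta0)^2 * sumx / theta.hat^2         # maximum likelihood statistic, 4
    T3 <- (sumx / theta0 - n)^2 / (sumx / theta0^2)           # score statistic, 4
    pchisq(c(T1, T2, T3), df = 1, lower.tail = FALSE)         # p-values, all below 0.05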
Example 26. Let X1, X2, . . . , Xn be iid with density f(x|α, β) = αβ x^{β−1} exp(−α x^β)
for x ≥ 0. Consider testing H0 : β = 1. Here Θ = {(α, β) : 0 < α < ∞, 0 < β < ∞}
and Θ0 = {(α, 1) : 0 < α < ∞} is a one-dimensional subset of the two-dimensional
set Θ. Recall that

log l(α, β) = n log α + n log β + (β − 1) Σ_{i=1}^n log x_i − α Σ_{i=1}^n x_i^β.

Hence the vector S(α, β) is given by

( n/α − Σ_{i=1}^n x_i^β ,   n/β + Σ_{i=1}^n log x_i − α Σ_{i=1}^n x_i^β log x_i )

and the matrix I(α, β) is given by

[ n/α²                          Σ_{i=1}^n x_i^β log x_i
  Σ_{i=1}^n x_i^β log x_i       n/β² + α Σ_{i=1}^n x_i^β (log x_i)² ].

We have that θ̂ = (α̂, β̂), whose calculation requires numerical methods; this is
discussed in the sample of exam problems. Also θ̂0 = (α̂0, 1), where α̂0 = 1/x̄.
Suppose that the observed value of T1(x) is 3.20. Then the p-value is P[T1(x) ≥
3.20] ≈ P[χ²_1 ≥ 3.20] = 0.0736. In order to get the maximum likelihood test statistic,
plug in the values α̂, β̂ for α, β in the formula for I(α, β) to get the matrix B, then
calculate T2(x) = (θ̂ − θ̂0)ᵀ B (θ̂ − θ̂0) and use the χ²_1 tables to calculate the p-value.

Finally, to calculate the score test statistic, note that the vector S(θ̂0) is given by

( 0 ,   n + Σ_{i=1}^n log x_i − Σ_{i=1}^n x_i log x_i / x̄ )

and the matrix I(θ̂0) is given by

[ n x̄²                      Σ_{i=1}^n x_i log x_i
  Σ_{i=1}^n x_i log x_i     n + Σ_{i=1}^n x_i (log x_i)² / x̄ ].

Since T3(x) = S(θ̂0)ᵀ C S(θ̂0), where C = I(θ̂0)⁻¹, we have that T3(x) is

[ n + Σ_{i=1}^n log x_i − Σ_{i=1}^n x_i log x_i / x̄ ]²

multiplied by the lower diagonal element of C, which is given by

n x̄² / ( n x̄² [ n + Σ_{i=1}^n x_i (log x_i)² / x̄ ] − [ Σ_{i=1}^n x_i log x_i ]² ).

Hence we get that

T3(x) = [ n + Σ_{i=1}^n log x_i − Σ_{i=1}^n x_i log x_i / x̄ ]² n x̄²
        / ( n x̄² [ n + Σ_{i=1}^n x_i (log x_i)² / x̄ ] − [ Σ_{i=1}^n x_i log x_i ]² ).

No numerical techniques are needed to calculate the value of T3(x), and for this reason
the score test is often preferred to the other two. However, there is some evidence that
the likelihood ratio test is more powerful, in the sense that it has a better chance of
detecting departures from the null hypothesis.
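As a check, the closed-form score statistic T3(x) can be coded directly; the R sketch below uses a small hypothetical data vector x, since the example does not supply an actual sample.

    x <- c(0.8, 1.3, 0.4, 2.1, 1.7, 0.9, 0.5, 1.1)    # hypothetical sample
    n <- length(x); xbar <- mean(x)
    S2 <- n + sum(log(x)) - sum(x * log(x)) / xbar    # second component of S(theta0-hat)
    T3 <- S2^2 * n * xbar^2 /
          (n * xbar^2 * (n + sum(x * log(x)^2) / xbar) - sum(x * log(x))^2)
    pchisq(T3, df = 1, lower.tail = FALSE)            # approximate p-value for H0: beta = 1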
Exercise 37. Suppose that household incomes in a certain country have a Pareto
distribution with probability density function

f(x) = θ v^θ / x^{θ+1},   v ≤ x < ∞,

where θ > 0 is unknown and v > 0 is known. Let x1, x2, . . . , xn denote the incomes
for a random sample of n such households. We wish to test the null hypothesis θ = 1
against the alternative that θ ≠ 1.

1. Derive an expression for θ̂, the MLE of θ.

2. Show that the generalised likelihood ratio test statistic, λ(x), satisfies

ln{λ(x)} = n − n ln(θ̂) − n/θ̂.

3. Show that the test accepts the null hypothesis if

k1 < Σ_{i=1}^n ln(x_i) < k2,

and state how the values of k1 and k2 may be determined. Hint: Find the distribution
of ln(X), where X has a Pareto distribution.
Exercise 38. A Geiger counter (radioactivity meter) is calibrated using a source of
known radioactivity. The counts recorded by the counter, xi , over 200 one second
intervals are recorded:
8 12 6 11 3 9 9 8 5 4 6 11 6 14 3 5 15 11 7 6 9 9 14 13
6 11 . . . . . . . . . . . . . . . . . . . . . . 9 8 5 8 9 14 14
The sum of the counts is Σ_{i=1}^{200} x_i = 1800. The counts can be treated as
observations of iid Poisson random variables with parameter µ with p.m.f.

f(x_i; µ) = µ^{x_i} e^{−µ}/x_i!,   x_i = 0, 1, . . . ;   µ > 0.
If the Geiger counter is functioning correctly then µ = 10, and to check this we would
test H0 : µ = 10 versus H1 : µ 6= 10. Suppose that we choose to test at a significance
level of 5%. The test can be performed using a generalized likelihood ratio test. Carry
out such a test. What does this imply about the Geiger counter? Finally, given the form
of the MLE, what was the point of recording the counts in 200 one-second intervals
rather than recording the count in one 200-second interval?
6.4 The Neyman-Pearson Lemma
Suppose we are testing a simple null hypothesis H0 : θ = θ′ against a simple alternative
H1 : θ = θ′′, where θ is the parameter of interest, and θ′, θ′′ are particular values
of θ. Observed values of the i.i.d. random variables X1, X2, . . . , Xn, each with p.d.f.
fX(x|θ), are available. We are going to reject H0 if (x1, x2, . . . , xn) ∈ Rα, where
Rα is a region of the n-dimensional space called the critical or rejection region. The
critical region Rα is determined so that the probability of a Type I error is α:
P[ (X1 , X2 , . . . , Xn ) ∈ Rα |H0 ] = α.
Definition 11. We call a test defined through Rα the most powerful at the significance
level α in the testing problem H0 : θ = θ′ against the alternative H1 : θ = θ′′ if no
other test of this problem at the same significance level has larger power.
The Neyman-Pearson lemma provides us with a way of finding most powerful tests
in the above problem. It demonstrates that the likelihood ratio test is the most powerful
test for this problem. To avoid the distracting technicalities of the non-continuous case,
we formulate and prove it for the continuous distribution case.
Lemma 7 (The Neyman-Pearson lemma). Let Rα be a subset of the sample space
defined by

Rα = {x : l(θ′|x)/l(θ′′|x) ≤ k},

where k is uniquely determined from the equality

α = P[X ∈ Rα |H0].

Then Rα defines the most powerful test at the significance level α for testing the simple
hypothesis H0 : θ = θ′ against the alternative simple hypothesis H1 : θ = θ′′.
Proof. For any region R of n-dimensional space, we will denote the probability that
X ∈ R by ∫_R l(θ), where θ is the true value of the parameter. The full notation,
omitted to save space, would be

P(X ∈ R|θ) = ∫ · · · ∫_R l(θ|x1, . . . , xn) dx1 . . . dxn.

We need to prove that if A is another critical region of size α, then the power of the
test associated with Rα is at least as great as the power of the test associated with A,
or in the present notation, that

∫_A l(θ′′) ≤ ∫_{Rα} l(θ′′).     (6.1)

By the definition of Rα we have (here A′ denotes the complement of A and Rα′ the
complement of Rα)

∫_{A′∩Rα} l(θ′′) ≥ (1/k) ∫_{A′∩Rα} l(θ′).     (6.2)

On the other hand,

∫_{A∩Rα′} l(θ′′) ≤ (1/k) ∫_{A∩Rα′} l(θ′).     (6.3)

We now establish (6.1), thereby completing the proof:

∫_A l(θ′′) = ∫_{A∩Rα} l(θ′′) + ∫_{A∩Rα′} l(θ′′)
           ≤ ∫_{A∩Rα} l(θ′′) + (1/k) ∫_{A∩Rα′} l(θ′)                         (see (6.3))
           = ∫_{Rα} l(θ′′) − ∫_{A′∩Rα} l(θ′′) + (1/k) ∫_A l(θ′) − (1/k) ∫_{A∩Rα} l(θ′)
           ≤ ∫_{Rα} l(θ′′) − (1/k) ∫_{A′∩Rα} l(θ′) + (1/k) ∫_A l(θ′) − (1/k) ∫_{A∩Rα} l(θ′)   (see (6.2))
           = ∫_{Rα} l(θ′′) − (1/k) ∫_{Rα} l(θ′) + (1/k) ∫_A l(θ′)
           = ∫_{Rα} l(θ′′) − α/k + α/k
           = ∫_{Rα} l(θ′′),

since both Rα and A have size α.
Example 27. Suppose X1, . . . , Xn are iid N(θ, 1), and we want to test H0 : θ = θ′
versus H1 : θ = θ′′, where θ′′ > θ′. According to the Z-test, we should reject H0 if
Z = √n(X̄ − θ′) is large, or equivalently if X̄ is large. We can now use the Neyman-
Pearson lemma to show that the Z-test is “best”. The likelihood function is

L(θ) = (2π)^{−n/2} exp{ − Σ_{i=1}^n (x_i − θ)²/2 }.

According to the Neyman-Pearson lemma, a best critical region is given by the set of
(x1, . . . , xn) such that L(θ′)/L(θ′′) ≤ k1, or equivalently, such that

(1/n) ln[ L(θ′′)/L(θ′) ] ≥ k2.

But

(1/n) ln[ L(θ′′)/L(θ′) ] = (1/n) Σ_{i=1}^n [ (x_i − θ′)²/2 − (x_i − θ′′)²/2 ]
                         = (1/(2n)) Σ_{i=1}^n [ (x_i² − 2θ′x_i + θ′²) − (x_i² − 2θ′′x_i + θ′′²) ]
                         = (1/(2n)) Σ_{i=1}^n [ 2(θ′′ − θ′)x_i + θ′² − θ′′² ]
                         = (θ′′ − θ′)x̄ + (θ′² − θ′′²)/2.

So the best test rejects H0 when x̄ ≥ k, where k is a constant. But this is exactly the
form of the rejection region for the Z-test. Therefore, the Z-test is “best”.
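A small simulation sketch (not part of the original notes) illustrates the conclusion: with the critical value chosen to give size α, the rejection rule x̄ ≥ k attains the power predicted by the normal calculation. All numerical values below are arbitrary; theta1 and theta2 stand for θ′ and θ′′.

    set.seed(1)
    n <- 25; theta1 <- 0; theta2 <- 0.5; alpha <- 0.05
    k <- theta1 + qnorm(1 - alpha) / sqrt(n)              # critical value for xbar
    xbar.H0 <- replicate(10000, mean(rnorm(n, theta1, 1)))
    xbar.H1 <- replicate(10000, mean(rnorm(n, theta2, 1)))
    mean(xbar.H0 >= k)   # estimated size, close to 0.05
    mean(xbar.H1 >= k)   # estimated power, close to 1 - pnorm(qnorm(0.95) - 0.5 * sqrt(n))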
Exercise 39. A random sample of n flowers is taken from a colony and the numbers
X, Y and Z of the three genotypes AA, Aa and aa are observed, where X + Y + Z =
n. Under the hypothesis of random cross-fertilisation, each flower has probabilities
θ², 2θ(1 − θ) and (1 − θ)² of belonging to the respective genotypes, where 0 < θ < 1
is an unknown parameter.
1. Show that the MLE of θ is θ̂ = (2X + Y )/(2n).
2. Consider the test statistic T = 2X + Y. Given that T has a binomial distribution
with parameters 2n and θ, obtain a critical region of approximate size α based on
T for testing the null hypothesis that θ = θ0 against the alternative that θ = θ1 ,
where θ1 < θ0 and 0 < α < 1.
3. Show that the above test is the most powerful of size α.
4. Deduce approximately how large n must be to ensure that the power is at least
0.9 when α = 0.05, θ0 = 0.4 and θ1 = 0.3.
Definition 12. Consider a general testing problem H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1. A
test at significance level α is called uniformly most powerful if, at each θ ∈ Θ1, its
power is at least as large as the power of any other test in the same problem and at the
same significance level.

It is easy to note that if the test (rejection region) derived from the Neyman-Pearson
lemma does not depend on θ′′ ∈ Θ1, then it is uniformly most powerful for the problem
H0 : θ = θ′ vs. H1 : θ ∈ Θ1.
Exercise 40. Let X1, X2, . . . , Xn be a random sample from the Weibull distribution
with probability density function f(x) = θλ x^{λ−1} exp(−θ x^λ), for x > 0, where θ > 0
is unknown and λ > 0 is known.
1. Find the form of the most powerful test of the null hypothesis that θ = θ0 against
the alternative hypothesis that θ = θ1 , where θ0 > θ1 .
2. Find the distribution function of X^λ and deduce that this random variable has an
exponential distribution.
3. Find the critical region of the most powerful test at the 1% level when n =
50, θ0 = 0.05 and θ1 = 0.025. Evaluate the power of this test.
4. Explain what is meant by the power of a test and describe how the power may
be used to determine the most appropriate size of a sample. Use this approach in
the situation described in the previous item to determine the minimal sample size
for a test that would have chances of either kind of error smaller than 1%.
Exercise 41. In a particular set of Bernoulli trials, it is widely believed that the
probability of a success is θ = 3/4. However, an alternative view is that θ = 2/3. In
order to test H0 : θ = 3/4 against H1 : θ = 2/3, n independent trials are to be observed.
Let θ̂ denote the proportion of successes in these trials.

1. Show that the likelihood ratio approach leads to a size α test in which H0 is
rejected in favour of H1 when θ̂ < k for some suitable k.
2. By applying the central limit theorem, write down the large sample distributions
of θ̂ when H0 is true and when H1 is true.
3. Hence find an expression for k in terms of n when α = 0.05.
4. Find n so that this test has power 0.95.
6.5 Goodness of Fit Tests
Suppose that we have a random experiment with a random variable Y of interest. Assume additionally that Y is discrete with density function f on a finite set S. We repeat
the experiment n times to generate a random sample Y1 , Y2 , . . . , Yn from the distribution of Y . These are independent variables, each with the distribution of Y .
In this section, we assume that the distribution of Y is unknown. For a given
probability mass function f0 , we will test the hypotheses H0 : f = f0 versus H1 :
f 6= f0 . The test that we will construct is known as the goodness of fit test for the
conjectured density f0 . As usual, our challenge in developing the test is to find an
appropriate test statistic – one that gives us information about the hypotheses and whose
distribution, under the null hypothesis, is known, at least approximately.
Suppose that S = {y1, y2, . . . , yk}. To simplify the notation, let pj = f0(yj) for
j = 1, 2, . . . , k. Now let Nj = #{i ∈ {1, 2, . . . , n} : Yi = yj} for j = 1, 2, . . . , k.
Under the null hypothesis, (N1, N2, . . . , Nk) has the multinomial distribution with
parameters n and p1, p2, . . . , pk, with E(Nj) = npj and Var(Nj) = npj(1 − pj). This
result indicates how we might begin to construct our test: for each j we can compare
the observed frequency of yj (namely Nj) with the expected frequency of yj (namely
npj) under the null hypothesis. Specifically, our test statistic will be

X² = (N1 − np1)²/(np1) + (N2 − np2)²/(np2) + · · · + (Nk − npk)²/(npk).
Note that the test statistic is based on the squared errors (the differences between the
expected frequencies and the observed frequencies). The reason that the squared errors
are scaled as they are is the following crucial fact, which we will accept without proof:
under the null hypothesis, as n increases to infinity, the distribution of X 2 converges to
the chi-square distribution with k − 1 degrees of freedom.
For m > 0 and r in (0, 1), we will let χ²_{m,r} denote the quantile of order r for
the chi-square distribution with m degrees of freedom. Then the following test has
approximate significance level α: reject H0 : f = f0 versus H1 : f ≠ f0 if and only
if X² > χ²_{k−1,1−α}. The test is an approximate one and works best when n is large.
Just how large n needs to be depends on the pj . One popular rule of thumb proposes
that the test will work well if all the expected frequencies satisfy npj ≥ 1 and at least
80% of the expected frequencies satisfy npj ≥ 5.
Example 28 (Genetical inheritance). In crosses between two types of maize four
distinct types of plants were found in the second generation. In a sample of 1301 plants
there were 773 green, 231 golden, 238 green-striped, and 59 golden-green-striped.
According to a simple theory of genetical inheritance the probabilities of obtaining
these four types of plants are 9/16, 3/16, 3/16 and 1/16 respectively. Is the theory
acceptable as a model for this experiment?

Formally we will consider the hypotheses:

H0 : p1 = 9/16, p2 = 3/16, p3 = 3/16 and p4 = 1/16;
H1 : not all the above probabilities are correct.

The expected frequency for each type of plant under H0 is npi = 1301 pi. We therefore
calculate the following table:

Observed Counts Oi   Expected Counts Ei   Contributions to X², (Oi − Ei)²/Ei
773                  731.8125             2.318
231                  243.9375             0.686
238                  243.9375             0.145
 59                   81.3125             6.123
                                          X² = 9.272
Since X² embodies the differences between the observed and expected values, we
can say that if X² is large then there is a big difference between what we observe and
what we expect, so the theory does not seem to be supported by the observations. If X²
is small the observations apparently conform to the theory and act as support for it.
Under H0 the test statistic X² is approximately distributed as χ² with 3 degrees of
freedom. In order to define what we would consider to be an unusually large value of
X² we will choose a significance level of α = 0.05. The R command
qchisq(p=0.05,df=3,lower.tail=FALSE) calculates the 5% critical value for the
test as 7.815. Since our value of X² is greater than the critical value 7.815 we reject H0
and conclude that the theory is not a good model for these data. The R command
pchisq(q=9.272,df=3,lower.tail=FALSE) calculates the p-value for the test,
equal to 0.026. (These data are examined further in chapter 9 of Snedecor and Cochran.)
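The whole calculation is also available through the built-in R function chisq.test; the following call reproduces X² = 9.272, 3 degrees of freedom and the p-value 0.026.

    observed <- c(773, 231, 238, 59)
    p0 <- c(9, 3, 3, 1) / 16
    chisq.test(x = observed, p = p0)   # X-squared = 9.272, df = 3, p-value = 0.026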
Very often we do not have a list of probabilities to specify our hypothesis as we had
in the above example. Rather our hypothesis relates to the probability distribution of the
counts without necessarily specifying the parameters of the distribution. For instance,
we might want to test that the number of male babies born on successive days in a
maternity hospital followed a binomial distribution, without specifying the probability
that any given baby will be male. Or, we might want to test that the number of defective
items in large consignments of spare parts for cars, follows a Poisson distribution, again
without specifying the parameter of the distribution.
The X² test is still applicable when the probabilities depend on unknown parameters,
provided that the unknown parameters are replaced by their maximum likelihood
estimates and provided that one degree of freedom is deducted for each parameter
estimated.
Example 29. Feller reports an analysis of flying-bomb hits in the south of London
during World War II. Investigators partitioned the area into 576 sectors, each being
1/4 km². The following table gives the resulting data:

No. of hits (x)               0     1    2    3   4   5
No. of sectors with x hits   229   211   93   35   7   1

If the hit pattern is random, in the sense that the probability that a bomb will land in
any particular sector is constant irrespective of the landing place of previous bombs, a
Poisson distribution might be expected to model the data.
x   P(x) = θ̂^x e^{−θ̂}/x!   Expected 576 × P(x)   Observed   Contribution to X², (Oi − Ei)²/Ei
0   0.395                   227.53                229        0.0095
1   0.367                   211.34                211        0.0005
2   0.170                    98.15                 93        0.2702
3   0.053                    30.39                 35        0.6993
4   0.012                     7.06                  7        0.0005
5   0.002                     1.31                  1        0.0734
                                                             X² = 1.0534
The MLE of θ was calculated as θ̂ = 535/576 = 0.9288, that is, the total number
of observed hits divided by the number of sectors. We carry out the chi-squared test
as before, except that we now subtract one additional degree of freedom because we
had to estimate θ; the test statistic X² is thus approximately distributed as χ² with 4
degrees of freedom. The R command qchisq(p=0.05,df=4,lower.tail=FALSE)
calculates the 5% critical value for the test as 9.488. Alternatively, the R command
pchisq(q=1.0534,df=4,lower.tail=FALSE) calculates the p-value for the test,
equal to 0.90. The result of the chi-squared test is not statistically significant, indicating
that the divergence between the observed and expected counts can be regarded as
random fluctuation about the expected values. Feller comments: “It is interesting to
note that most people believed in a tendency of the points of impact to cluster. If this
were true, there would be a higher frequency of sectors with either many hits or no hits
and a deficiency in the intermediate classes. The above table indicates perfect
randomness and homogeneity of the area; we have here an instructive illustration of the
established fact that to the untrained eye randomness appears as regularity or tendency
to cluster.”
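The table can be reproduced in R as sketched below. Note that a direct call to chisq.test would use k − 1 = 5 degrees of freedom, so the p-value is recomputed with df = 4 to account for the estimated parameter.

    obs <- c(229, 211, 93, 35, 7, 1)            # sectors with 0, 1, ..., 5 hits
    hits <- 0:5
    theta.hat <- sum(hits * obs) / sum(obs)     # MLE, 535/576 = 0.9288
    expd <- sum(obs) * dpois(hits, theta.hat)   # expected counts, as in the table
    X2 <- sum((obs - expd)^2 / expd)            # about 1.05
    pchisq(X2, df = length(obs) - 2, lower.tail = FALSE)   # df = 6 - 1 - 1 = 4, p about 0.9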
6.6 The χ² Test for Contingency Tables
Let X and Y be a pair of categorical variables and suppose there are r possible values
for X and c possible values for Y . Examples of categorical variables are Religion,
109
Race, Social Class, Blood Group, Wind Direction, Fertiliser Type etc. The random
variables X and Y are said to be independent if P [X = a, Y = b] = P [X = a]P [Y =
b] for all possible values a of X and b of Y . In this section we consider how to test
the null hypothesis of independence using data consisting of a random sample of N
observations from the joint distribution of X and Y .
Example 30. A study was carried out to investigate whether hair colour (columns)
and eye colour (rows) were genetically linked. A genetic link would be supported if
the proportions of people having various eye colourings varied from one hair colour
grouping to another. 1000 people were chosen at random and their hair colour and eye
colour recorded. The data are summarised in the following table:
Oij      Black   Brown   Fair   Red   Total
Brown      60     110     42    30     242
Green      67     142     28    35     272
Blue      123     248     90    25     486
Total     250     500    160    90    1000
The proportion of people with red hair is 90/1000 = 0.09 and the proportion having
blue eyes is 486/1000 = 0.486. So if eye colour and hair colour were truly independent
we would expect the proportion of people having both red hair and blue eyes to be
approximately equal to (0.090)(0.486) = 0.04374, or equivalently we would expect the
number of people having both red hair and blue eyes to be close to (1000)(0.04374) =
43.74. The observed number of people having both red hair and blue eyes is 25. We
can do similar calculations for all other combinations of hair colour and eye colour to
derive the following table of expected counts:
Eij      Black   Brown    Fair     Red   Total
Brown     60.5     121   38.72   21.78     242
Green     68.0     136   43.52   24.48     272
Blue     121.5     243   77.76   43.74     486
Total    250.0     500  160.00   90.00    1000
In order to test the null hypothesis of independence we need a test statistic which
measures the magnitude of the discrepancy between the observed table and the table
that would be expected if independence were in fact true. In the early part of the
twentieth century, long before the invention of maximum likelihood or the formal
theory of hypothesis testing, Karl Pearson (one of the founding fathers of Statistics)
proposed the following method of constructing such a measure of discrepancy:
(Oij − Eij)²/Eij   Black   Brown    Fair     Red
Brown              0.004   1.000   0.278   3.102
Green              0.015   0.265   5.535   4.521
Blue               0.019   0.103   1.927   8.029
For each cell in the table calculate (Oij − Eij )2 /Eij where Oij is the observed count
and Eij is the expected count and add the resulting values across all cells of the table.
The resulting total is called the χ2 test statistic which we will denote by W . The null
hypothesis of independence is rejected if the observed value of W is surprisingly large.
In the hair and eye colour example the discrepancies are as follows :
W = Σ_{i=1}^r Σ_{j=1}^c (Oij − Eij)²/Eij = 24.796.
What we would now like to calculate is the p-value, which is the probability of getting
a value for W as large as 24.796 if the hypothesis of independence were in fact true.
Fisher showed that, when the hypothesis of independence is true, W behaves somewhat
like a χ² random variable with degrees of freedom given by (r − 1)(c − 1), where r
is the number of rows in the table and c is the number of columns. In our example
r = 3, c = 4 and so (r − 1)(c − 1) = 6, and so the p-value is P[W ≥ 24.796] ≈
P[χ²_6 ≥ 24.796] = 0.0004. Hence we reject the independence hypothesis.
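The same analysis is carried out by chisq.test applied to the 3 × 4 table of observed counts; the sketch below reproduces W = 24.796 with 6 degrees of freedom.

    counts <- matrix(c( 60, 110, 42, 30,
                        67, 142, 28, 35,
                       123, 248, 90, 25),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(Eyes = c("Brown", "Green", "Blue"),
                                     Hair = c("Black", "Brown", "Fair", "Red")))
    chisq.test(counts)   # X-squared = 24.796, df = 6, p-value about 0.0004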
Exercise 42. It is believed that the number of breakages in a damaged chromosome,
X, follows a truncated Poisson distribution with probability mass function

P(X = k) = e^{−λ} λ^k / ((1 − e^{−λ}) k!),   k = 1, 2, . . . ,

where λ > 0 is an unknown parameter. The frequency distribution of the number of
breakages in a random sample of 33 damaged chromosomes was as follows:
Breakages      1   2   3   4   5   6   7   8   9   10   11   12   13   Total
Chromosomes   11   6   4   5   0   1   0   2   1    0    1    1    1      33
1. Find an equation satisfied by λ̂, the MLE of λ.
2. Discuss approximations of λ̂. Show that the observed data give the estimate
λ̂ = 3.6.
3. Using this value for λ̂, test the null hypothesis that the number of breakages in a
damaged chromosome follows a truncated Poisson distribution. The categories
6 to 13 should be combined into a single category in the goodness-of-fit test.