University of Sheffield
School of Mathematics and Statistics
MAS113 Introduction to Probability and Statistics: Semester 2
Spring 2016
Chapter 5
Sums of Random Variables
In this chapter we continue the development of probability theory, which we
started in Semester 1. We will work towards an understanding of the central
limit theorem which will be vital for our later work on statistics.
5.1 Sums of Independent and Identically Distributed Random Variables
Suppose we have n random variables X1, X2, . . . , Xn. If these each have the same probability distribution, i.e.
P(X1 ≤ a) = P(X2 ≤ a) = · · · = P(Xn ≤ a) for all a,
then we say that they are identically distributed.
We are particularly interested in the case where X1, X2, . . . , Xn are not only identically distributed, but also independent, so that for all 1 ≤ i, j ≤ n with i ≠ j:
P{(Xi ≤ a) ∩ (Xj ≤ b)} = P(Xi ≤ a) P(Xj ≤ b).    (5.1)
We then say that X1, X2, . . . , Xn are independent and identically distributed, or i.i.d. for short.
We can, if we like, regard X1 , X2 , . . . , Xn as independent copies of some
given random variable. I.i.d random variables are very important in applications as they describe repeated experiments that are carried out under
identical conditions, in which the outcome of each experiment does not affect
the others.
Now define S(n) to be the sum and X̄(n) to be the mean:
S(n) = Σ_{i=1}^{n} Xi,    (5.2)
and
X̄(n) = S(n)/n.    (5.3)
Both S(n) and X̄(n) are also random variables, as they are functions of the
random variables X1 , . . . , Xn . If we write E(Xi ) = µ and Var(Xi ) = σ 2
(for all i, as the variables have the same distribution), it is straightforward
to derive the mean and variance of S(n) and X̄(n) in terms of µ and σ 2 .
Firstly,
E(S(n)) = E(X1 + X2 + · · · + Xn)
        = E(X1) + E(X2) + · · · + E(Xn)
        = µ + µ + · · · + µ
        = nµ.
Also, since Var(X + Y) = Var(X) + Var(Y) if X and Y are independent (this is not true in general without independence), we also have
Var(S(n)) = Var(X1 + X2 + · · · + Xn)
          = Var(X1) + Var(X2) + · · · + Var(Xn)
          = σ² + σ² + · · · + σ²
          = nσ².
We then have
E(X̄(n)) = E(S(n)/n)
         = (1/n) E(S(n))
         = (1/n) × nµ
         = µ,    (5.4)
and also
Var(X̄(n)) = Var(S(n)/n)
           = (1/n²) Var(S(n))
           = (1/n²) × nσ²
           = σ²/n.    (5.5)
The standard deviation of X̄(n) plays an important role. It is called the standard error and we denote it by SE(X̄(n)), so that
SE(X̄(n)) = σ/√n.    (5.6)
These results have important applications in statistics. Suppose we are
able to observe i.i.d. random variables X1 , . . . , Xn , but we don’t know the
value of E(Xi ) = µ. Equation (5.5) tells us that as n increases, the variance
of X̄(n) gets smaller, and the smaller the variance is, the closer we expect
X̄(n) to be to its mean value. Equation (5.4) tells us that the mean value of
X̄(n) is µ (for any value of n). In other words, as n gets larger, we expect
X̄(n) to be increasingly close to the unknown quantity µ, so we can use the
observed value of X̄(n) to estimate µ.
We illustrate this in Figure 5.1. The four plots show the density functions
of X̄(n) for n = 1, 10, 20 and 100. In each case, X1 , . . . , Xn ∼ N (0, 1), so
E(Xi ) = µ = 0. We can see that the density function of X̄(n) becomes
more tightly concentrated about the value 0 as we increase n, and that the
observed value of X̄(n) is increasingly likely to be close to 0.
[Figure 5.1 appears here: four panels, for n = 1, n = 10, n = 20 and n = 100, each showing the density function of X̄(n) plotted against x̄.]
Figure 5.1: The density function of X̄(n), when Xi ∼ N (0, 1), for four choices
of n. If we didn’t know that µ = E(Xi ) = 0, but we could observe X̄(n),
then using the observed value of X̄(n) for large n is likely to give us a good
estimate of µ, as X̄(n) is likely to be close to µ.
Here are two key examples of sums of i.i.d. random variables:
• If Xi ∼ Bernoulli(p) for 1 ≤ i ≤ n, then S(n) ∼ Bin(n, p).
• If Xi ∼ N (µ, σ 2 ) for 1 ≤ i ≤ n, then S(n) ∼ N (nµ, nσ 2 ).
We will prove these two facts later on, using moment generating functions.
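As an informal check of the first of these facts, the short R sketch below (an added illustration, not part of the original notes; the values of n, p, the number of replicates and the seed are arbitrary choices) simulates many values of S(n) for Bernoulli Xi and compares the simulated probabilities with dbinom.

# Illustration: a sum of n Bernoulli(p) variables behaves like Bin(n, p)
set.seed(1)                       # arbitrary seed, for reproducibility
n <- 10; p <- 0.3; reps <- 100000
S <- replicate(reps, sum(rbinom(n, size = 1, prob = p)))   # one value of S(n) per replicate
round(table(S) / reps, 3)                                   # simulated P(S(n) = k)
round(dbinom(0:n, size = n, prob = p), 3)                   # exact Bin(n, p) probabilities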
5.2 Laws of Large Numbers
We will now derive an important result regarding the behaviour of X̄(n) for
large n. We first prove a useful inequality. It is true if X is discrete or
continuous. We’ll just prove the continuous case.
5.2.1 Chebyshev's inequality
Let X be a random variable for which E(X) = µ and Var(X) = σ². Then for any c > 0,
P(|X − µ| ≥ c) ≤ σ²/c².    (5.7)
Proof. Let A = {x ∈ R : |x − µ| ≥ c}. From the definition of variance, and using the fact that R = A ∪ Ā,
σ² = ∫_{−∞}^{∞} (x − µ)² fX(x) dx
   = ∫_A (x − µ)² fX(x) dx + ∫_{Ā} (x − µ)² fX(x) dx
   ≥ ∫_A (x − µ)² fX(x) dx
   ≥ c² ∫_A fX(x) dx
   = c² P(X ∈ A)
   = c² P(|X − µ| ≥ c),
and the result follows.
The same inequality holds if P (|X −µ| ≥ c) is replaced by P (|X −µ| > c)
in (5.7), as P (|X − µ| > c) ≤ P (|X − µ| ≥ c).
What does Chebyshev's inequality tell us? We expect the probability of finding the value of a random variable in a given region to be smaller, the further that region is from the mean. But a large variance may counteract this a little, as it tells us that the values which have high probability are more spread out. Chebyshev's inequality makes this precise.
Example. A random variable X has mean 1 and variance 0.5. What can you say about P(X > 6)?
Solution.
P(X > 6) = P(X − 1 > 5)
         ≤ P(|X − 1| > 5)
         ≤ 0.5/25 = 1/50, by Chebyshev's inequality.
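As a quick numerical illustration (added here; it is not part of the original notes), the R sketch below simulates from one particular distribution with mean 1 and variance 0.5 — a normal, chosen arbitrarily, since Chebyshev's inequality holds whatever the distribution — and checks that the observed tail probability respects the bound 1/50.

# Illustration only: checking the Chebyshev bound for one example distribution
set.seed(2)                                 # arbitrary seed
x <- rnorm(1e6, mean = 1, sd = sqrt(0.5))   # mean 1, variance 0.5
mean(abs(x - 1) > 5)                        # simulated P(|X - 1| > 5), which should not exceed
0.5 / 5^2                                   # the Chebyshev bound 1/50 = 0.02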
5.2.2 The Weak Law of Large Numbers
Let X1, X2, . . . be a sequence of i.i.d. random variables, each with mean µ and variance σ². Then for all ε > 0,
P(|X̄(n) − µ| > ε) ≤ σ²/(ε²n).    (5.8)
This follows from Chebyshev's inequality, and the fact that E(X̄(n)) = µ and Var(X̄(n)) = σ²/n.
From (5.8), we have
lim_{n→∞} P(|X̄(n) − µ| > ε) = 0.    (5.9)
We can choose ε to be as small as we like, so as n increases, it becomes increasingly unlikely that X̄(n) will differ from µ by any amount ε. Equation (5.9) is known as the weak law of large numbers.
It is possible to prove a stronger result, which is that
P( lim_{n→∞} X̄(n) = µ ) = 1;    (5.10)
but the proof is outside the scope of this module. This result is known as the strong law of large numbers.
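The law of large numbers is easy to visualise by simulation. The following R sketch (an added illustration; the distribution, sample size and seed are arbitrary) plots the running sample mean X̄(n) for i.i.d. Exp(1) observations, which settles down towards µ = 1.

# Illustration: running sample means approaching mu = 1
set.seed(3)                          # arbitrary seed
x <- rexp(5000, rate = 1)            # i.i.d. Exp(1) observations, so mu = 1
running.mean <- cumsum(x) / seq_along(x)
plot(running.mean, type = "l", xlab = "n", ylab = "Running mean")
abline(h = 1, lty = 2)               # the true mean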
5.3 Moment Generating Functions
Let X be a random variable. Two important associated numerical quantities
are the expected values E(X) and E(X 2 ), from which we can compute the
variance, using
Var(X) = E(X²) − (E(X))².
More generally we might try to find the nth moment E(Xⁿ) for n ∈ N. For example E(X³) and E(X⁴) give information about the shape of the distribution and are used to calculate quantities called the skewness and kurtosis. We can try to calculate moments directly, but there is also a useful shortcut. Define the moment generating function (or mgf) for all t ∈ R by:
MX(t) = E(e^{tX}).    (5.11)
So MX(t) = Σ_{i=1}^{N} e^{t xi} pX(xi), if X is discrete, and
MX(t) = ∫_{−∞}^{∞} e^{tx} fX(x) dx, if X is continuous.
e.g. If X ∼ Bernoulli(p), it only takes two values: 1 with probability p, and 0 with probability 1 − p, and so
MX(t) = p e^{t·1} + (1 − p) e^{t·0} = p e^t + 1 − p.
Now differentiate (5.11) to get
(d/dt) MX(t) = E(X e^{tX}),
and so
(d/dt) MX(t) |_{t=0} = E(X).
Similarly
(d²/dt²) MX(t) = E(X² e^{tX}),
and hence
(d²/dt²) MX(t) |_{t=0} = E(X²).
In fact you can find all the moments by this procedure:
(dⁿ/dtⁿ) MX(t) |_{t=0} = E(Xⁿ).
You can explore this result for some known distributions in the exercises.
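As a rough numerical check (added here for illustration, not part of the exercises), we can approximate the derivative of an mgf at t = 0 by a finite difference and compare it with the known mean. Here the Bernoulli mgf p e^t + 1 − p derived above is used, with an arbitrary choice of p.

# Illustration: M'_X(0) is approximately E(X) = p for X ~ Bernoulli(p)
p <- 0.3                               # arbitrary parameter value
M <- function(t) p * exp(t) + 1 - p    # the Bernoulli mgf from above
h <- 1e-6
(M(h) - M(-h)) / (2 * h)               # numerical derivative at t = 0; compare with
p                                      # E(X) = p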
Another way of seeing how the mgf of X contains information about all the moments comes from writing the series expansion for the exponential function
e^{tX} = Σ_{n=0}^{∞} (tⁿ/n!) Xⁿ.
It then follows from (5.11) that:
MX(t) = Σ_{n=0}^{∞} (tⁿ/n!) E(Xⁿ).    (5.12)
Our main purpose in introducing mgfs at this stage is not to use them to
calculate moments, but to give a rough idea of the proof of the central limit
theorem. The next result is very useful for that.
Theorem 1. If X and Y are independent, then for all t ∈ R
M_{X+Y}(t) = MX(t) MY(t).
Proof.
M_{X+Y}(t) = E(e^{t(X+Y)})
           = E(e^{tX} e^{tY})
           = E(e^{tX}) E(e^{tY}), by independence
           = MX(t) MY(t).
It follows (by using mathematical induction) that if S(n) = X1 + X2 + · · · + Xn is a sum of i.i.d. random variables having common mgf MX then
M_{S(n)}(t) = [MX(t)]ⁿ.    (5.13)
Here is the key example that we need.
Example. If X ∼ N(µ, σ²) then
MX(t) = e^{µt + σ²t²/2}.    (5.14)
To establish this we use the definition of MX(t) and the formula for the normal density:
MX(t) = (1/(σ√(2π))) ∫_{−∞}^{∞} e^{xt} exp{ −(1/2)((x − µ)/σ)² } dx.
Substitute z = (x − µ)/σ to obtain
MX(t) = (1/√(2π)) e^{µt} ∫_{−∞}^{∞} e^{σtz} exp(−z²/2) dz
      = (1/√(2π)) exp(µt + σ²t²/2) ∫_{−∞}^{∞} exp{ −(1/2)(z − σt)² } dz,
after completing the square. Now substitute y = z − σt and use the fact that (1/√(2π)) ∫_{−∞}^{∞} e^{−x²/2} dx = 1 to find
MX(t) = exp(µt + σ²t²/2).
An important special case is when X is the standard normal Z ∼ N(0, 1):
MZ(t) = e^{t²/2}.    (5.15)
In fact it can be shown that the mgf uniquely determines the distribution of a random variable, so X ∼ N(µ, σ²) is the only random variable with the mgf (5.14). Using this fact we can show that the sum of n i.i.d. normal random variables is itself normal. From (5.13) and (5.14), we have
M_{S(n)}(t) = [e^{µt + σ²t²/2}]ⁿ = e^{nµt + nσ²t²/2},
so S(n) ∼ N(nµ, nσ²), and X̄(n) ∼ N(µ, σ²/n).
5.4 The Central Limit Theorem (CLT)
We finish with another important result, which tells us about the distribution of X̄(n) for large n. It could be argued that this is the most important result
in the whole of probability (and statistics). The law of large numbers tells
us that X̄(n) tends to µ as n → ∞. But the central limit theorem gives
far more information. It tells us about the behaviour of the distribution of
the fluctuations of X̄(n) around µ, as n → ∞. These are always normally
distributed!
The central limit theorem
Let X1, X2, . . . be a sequence of i.i.d. random variables, each with mean µ and variance σ². For any −∞ ≤ a < b ≤ ∞,
lim_{n→∞} P( a ≤ (X̄(n) − µ)/(σ/√n) ≤ b ) = (1/√(2π)) ∫_a^b exp(−z²/2) dz.    (5.16)
In other words, the distribution of X̄(n) tends to a normal distribution with mean µ and variance σ²/n, as n → ∞. So for large n, we have, approximately,
X̄(n) ∼ N(µ, σ²/n),    S(n) ∼ N(nµ, nσ²).
Notice that the right hand side of (5.16) is P(a ≤ Z ≤ b) where Z ∼ N(0, 1) is the standard normal.
We can also rewrite the left hand side in terms of S(n) by multiplying top and bottom by n. Then we get another equivalent form of the central limit theorem (CLT) which is often seen in books:
lim_{n→∞} P( a ≤ (S(n) − nµ)/(σ√n) ≤ b ) = P(a ≤ Z ≤ b).    (5.17)
As long as X1 , X2 , . . . have the same distribution (and are independent)
(5.17) is valid; it doesn’t matter what that distribution is. It can be discrete,
or continuous – uniform, Poisson, Bernoulli, exponential, etc etc.
Proof (Outline only). We will aim to establish (5.17). Define
Y(n) = (S(n) − nµ)/(σ√n).
Then E(Y(n)) = 0 and Var(Y(n)) = 1. (You check this.)
Now consider i.i.d. random variables defined by
Tj = Xj − µ for j = 1, 2, . . .
Clearly, we have
E(Tj) = 0 and E(Tj²) = Var(Tj) = σ²;    (5.18)
Y(n) = (1/(σ√n)) (T1 + T2 + · · · + Tn).
By (5.13),
M_{Y(n)}(t) = [ M( t/(σ√n) ) ]ⁿ for all t ∈ R,
where M on the right-hand side denotes the common moment generating function of the Ti's.
Then expanding (5.12) using (5.18):
M( t/(σ√n) ) = 1 + ( t/(σ√n) ) × 0 + (1/2) ( t²/(σ²n) ) σ² + α(n)/n
             = 1 + (1/2)(t²/n) + α(n)/n,
where α(n) is a remainder term that satisfies lim_{n→∞} α(n) = 0. Thus
M_{Y(n)}(t) = [ 1 + (1/2)(t²/n) + (1/n) α(n) ]ⁿ.
From MAS110, we know that e^y = lim_{n→∞} (1 + y/n)ⁿ and, furthermore, since lim_{n→∞} α(n) = 0, it can also be shown that
e^y = lim_{n→∞} ( 1 + y/n + (1/n) α(n) )ⁿ.
Hence, using (5.15) we find that
lim_{n→∞} M_{Y(n)}(t) = exp(t²/2) = MZ(t),
and the result follows, since the mgf determines the distribution.
As a first application of this result we can investigate the normal approximation to the binomial distribution. If we take X1, X2, . . . to be Bernoulli random variables with common parameter p, then S(n) is binomial with parameters n and p (see Problem 8). Since E(S(n)) = np and Var(S(n)) = np(1 − p), we get the following:
Corollary 5.4.1. (de Moivre–Laplace central limit theorem) If X1, X2, . . . are Bernoulli with common parameter p, then
lim_{n→∞} P( a ≤ (S(n) − np)/√(np(1 − p)) ≤ b ) = (1/√(2π)) ∫_a^b exp(−z²/2) dz.    (5.19)
This was historically the first central limit theorem to be established. It is named after Abraham de Moivre (1667–1754), who first used integrals of exp(−z²/2) to approximate probabilities, and Pierre-Simon Laplace (1749–1827), who obtained a formula quite close in spirit to (5.19). Laplace was studying the dynamics of the solar system, and was motivated in these calculations by the need to estimate the probability that the inclination of cometary orbits to a given plane was between certain limits. Corollary 5.4.1 is sometimes referred to as the normal approximation to the binomial distribution.
Because the binomial distribution is discrete, while the normal distribution is continuous, we can get greater accuracy in the CLT by using a
continuity correction. If X is binomial and we want to use the CLT to approximate P (m ≤ X ≤ n) where m and n are positive integers, we get
greater accuracy by applying the CLT to P (m − 0.5 ≤ X ≤ n + 0.5).
Example A woman claims to have psychic abilities, in that she can
predict the outcome of a coin toss. If she is tested 100 times, but she is
really just guessing, what is the probability that she will be right 60 or more
times?
Let X be the number of correct predictions. If the woman is guessing,
and the coin is fair, the probability that she is right on any single occasion
is 0.5. Then
X ∼ Binom(100, 0.5).
Approximately, using the CLT,
X ∼ N (50, 100 × 0.5 × (1 − 0.5)),
so
P(X ≥ 60) = 1 − Φ((60 − 50)/5) = 0.02275013,
using the command 1-pnorm(60,50,5) in R.
If we apply the continuity correction,
P(X ≥ 60) = 1 − Φ((59.5 − 50)/5) = 0.02871656,
using the command 1-pnorm(59.5,50,5) in R.
(Using the command 1-pbinom(59,100,0.5), we calculate the probability to be 0.02844397).
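For convenience, the three calculations above can be run together; this consolidated snippet is added for illustration and simply collects the commands quoted in the text.

# CLT approximation, with and without the continuity correction, and the exact binomial value
1 - pnorm(60, 50, 5)        # 0.02275013
1 - pnorm(59.5, 50, 5)      # 0.02871656
1 - pbinom(59, 100, 0.5)    # 0.02844397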
5.4.1 Example: Sums of Exponential Random Variables
Here we illustrate the CLT by a computer experiment. Let X1 , X2 , . . . be a
sequence of i.i.d exponential random variables, each with rate parameter 1.
Note that the shape of the Exp(rate= 1) distribution is very ‘non-normal’
(compare the pdf of an exponential random variable with that of a normal
random variable). Now µ = E(Xi ) = 1 and σ 2 = Var(Xi ) = 1. For large n,
approximately, we should have
X̄(n) ∼ N(1, 1/n).
We now do a simulation experiment in R to compare this approximation
with what we get if we simulate random values of X̄(n) lots of times, for
different values of n. The results are shown in Figure 5.2. In the top left
plot, we have the case n = 1. The solid line shows the N (1, 1) density
function, and the histogram represents the distribution of X̄(1) (which is
just a single value of X), based on what we see when we simulate random
values of X̄(1) many times. The histogram doesn’t match the shape of the
density function, which is to be expected, as we know that the density of a
single exponential random variable looks nothing like a ‘bell-shaped’ curve.
In the top right plot, we repeat the experiment, but now with n = 10. Now we can see that the histogram of simulated values of X̄(10) is closer in shape to the N(1, 1/10) density function, so even with small n, the CLT is giving a good approximation for the distribution of X̄(n). Simulations for larger values of n are shown in the remaining two plots.
[Figure 5.2 appears here: four panels, for n = 1, n = 10, n = 30 and n = 100, each showing a histogram of simulated values of X̄(n) (density against X(n)) with the CLT density overlaid.]
Figure 5.2: Testing the CLT approximation for different n. The solid line shows the density function under the CLT approximation. The histogram represents the distribution of simulated values of X̄(n), from a simulation in R.
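A sketch of R code that could produce one panel of this kind of experiment is given below (added for illustration; it is not the exact script used for Figure 5.2, and the sample size, number of replicates and seed are arbitrary).

# Illustration: simulate values of X-bar(n) for Exp(1) data and overlay the CLT density
set.seed(4)                  # arbitrary seed
n <- 10                      # sample size for this panel
reps <- 10000                # number of simulated values of X-bar(n)
xbar <- replicate(reps, mean(rexp(n, rate = 1)))
hist(xbar, freq = FALSE, main = paste("n =", n), xlab = "X(n)")
curve(dnorm(x, mean = 1, sd = 1 / sqrt(n)), add = TRUE)    # the N(1, 1/n) density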
5.4.2 Application to Stirling's Approximation
In this subsection we will use the probability theory that we've just studied to gain insight into Stirling's Approximation, which states that, as n → ∞,
n! ∼ √(2π) n^{n+1/2} e^{−n}.    (5.20)
The precise meaning of (5.20) is that lim_{n→∞} n! / ( √(2π) n^{n+1/2} e^{−n} ) = 1.
The approximation goes back to Abraham de Moivre (1667–1754), who showed that n! ∼ C n^{n+1/2} e^{−n}, where C > 0. It was James Stirling (1692–1770) who proved that C = √(2π). Stirling's approximation is very widely used in both pure and applied mathematics. It is even reasonably accurate for small n – for example, 3! = 6 and the approximation gives 5.84; 5! = 120 and you get 118.02 when you use the approximation.
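This accuracy claim is easy to check in R; the lines below (an added illustration) compare n! with the Stirling approximation for a few small values of n.

# Stirling's approximation versus the exact factorial
n <- c(3, 5, 10)
stirling <- sqrt(2 * pi) * n^(n + 0.5) * exp(-n)
cbind(n, factorial = factorial(n), stirling = round(stirling, 2))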
There are many proofs of (5.20) in the literature. Here we will give a "rough proof", based on the work of this section. Let X1, X2, . . . be a sequence of i.i.d. random variables, each having a Poisson distribution, with mean (and variance) λ = 1. Then by Problem 11, S(n) = X1 + X2 + · · · + Xn also has a Poisson distribution, with mean (and variance) λ = n. It follows that
P(S(n) = n) = e^{−n} nⁿ / n!.    (5.21)
But
P(S(n) = n) = P(n − 1 < S(n) ≤ n) = P( −1/√n < (S(n) − n)/√n ≤ 0 ),    (5.22)
and by the central limit theorem (5.17), where we take µ = σ = 1, we find that for large n, P( −1/√n < (S(n) − n)/√n ≤ 0 ) is well-approximated by
(1/√(2π)) ∫_{−1/√n}^{0} e^{−x²/2} dx.
Now if n is very large, 1/√n is very small, and so for −1/√n < x ≤ 0, we can make the approximation e^{−x²/2} ≅ 1.² But then
(1/√(2π)) ∫_{−1/√n}^{0} e^{−x²/2} dx ≅ (1/√(2π)) ∫_{−1/√n}^{0} 1 dx = 1/√(2πn).
When we combine this with (5.22) and (5.21), we conclude that for sufficiently large n,
e^{−n} nⁿ / n! ≅ 1/√(2πn),
which after rearrangement gives
n! ≅ √(2π) n^{n+1/2} e^{−n}.
Before we leave the central limit theorem, here is a wonderful quote from
Sir Francis Galton (1822-1911), one of the founders of statistics:
“I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the “Law of Frequency of Error”.3
The law would have been personified by the Greeks and deified, if they had
known of it. It reigns with serenity and in complete self-effacement, amidst
the wildest confusion. The huger the mob, and the greater the apparent
anarchy, the more perfect is its sway. It is the supreme law of Unreason.
Whenever a large sample of chaotic elements are taken in hand and marshaled in the order of their magnitude, an unsuspected and most beautiful
form of regularity proves to have been latent all along.”
² The symbol ≅ means "approximately equal to"; it is far less precise than ∼.
³ This is another name for the CLT.
Chapter 6
Introducing Statistics
6.1 Introduction
So far in this module, we have focussed entirely on probability theory – the
mathematical theory of chance. It is a distinct branch of mathematics in its
own right; from a pure mathematical point of view, it is a special case of a
more general subject called measure theory; but probability has many important areas of application, e.g. to studying the bulk properties of ensembles of
particles in physics (statistical mechanics), to modelling random mutations
in genetics, and to describing the random behaviour of the stock market in
financial economics.
For the rest of this module, we will be studying the application of probability theory to statistics. But first we should ask the question “What is
statistics?” At the beginning of Randall Pruim’s book “Foundations and
Applications of Statistics”, are listed five alternative definitions, all given by
different experts. Here’s the list:
• a collection of procedures and principles for gaining information in order
to make decisions when faced with uncertainty,
• a way of taming uncertainty, of turning raw data into arguments that
can resolve profound questions,
• the science of drawing conclusions from data with the aid of the mathematics of probability,
• the explanation of variation in the context of what remains unexplained,
• the mathematics of the collection, organisation, and interpretation of
numerical data, especially the analysis of a population’s characteristics
by inference from sampling.
Although not quite the same, these all capture something important. We
have collected some data which tells us something, but not everything, about
its source. So there is uncertainty as we do not have the whole picture.
We wish to gain as much information as we can, and probability will be the
mathematical tool that we use to help us get that. For example, suppose
that a new drug is given to 20 patients with some disease, and 12 recover.
Can we deduce that the probability of recovery is 0.6? What if we gave the
same drug to another group of 20 patients, and only 5 recovered? In the next
two sections (6.2 and 6.3) we will look at some real-life examples.
6.2 Is smoking related to lung cancer?
It is well known that smoking can increase the risk of lung cancer. Of course there are other risk factors for this type of cancer (such as family history), but there are now several studies linking smoking to lung cancer. The 1982 USA Surgeon General's Report stated that: "Cigarette smoking is the major single cause of cancer mortality in the United States." Lung cancer accounts for 5% of UK deaths today. Below is Müller's data (Table 3 of Doll, 1988). The data classify 86 sampled people according to lung cancer intensity and general health, for 5 categories of smokers (from zero to extreme).
Smoking       Lung cancer   Healthy
Extreme           25             4
Very heavy        18             5
Heavy             13            22
Moderate          27            41
Non                3            14
Total             86            86
The question is whether smoking increases the risk of getting lung cancer. The table indicates clearly that the number of lung cancer sufferers appears to increase with the intensity of smoking, while the number of healthy people moves in the opposite direction (e.g. 18 very heavy smokers have lung cancer while only 5 are healthy, whereas the relationship is reversed for non-smokers). This suggests that there may be a link between smoking and lung cancer.
However, if we are going to trust this data, we need to know that it is
representative of the general population, so that by studying this data set we
can draw conclusions, not just for the 86 people used in that sample, but for
any member of the general public. It is reasonable to think that if we pick
another 86 people in a similar study, the results would be different (maybe
only slightly, but perhaps more substantially). In other words, there is some
uncertainty in our conclusion that smoking causes lung cancer. We need to
have a more complex mechanism that gives us information about the level
of uncertainty we have in trusting the above study.
One option would be to conduct many similar studies and to have some
rules to compare them. This is impractical because in many real-life situations, such studies are lengthy and expensive. We should be able to use
the same single data set and to be able to say how “confident” we are in
the conclusions drawn. In addition to that we should be able to address the
following questions:
• Is this sample of 86 people representative?
• Is 86 a large enough sample to draw “good” conclusions? For comparison, we would surely believe that asking only 5 people will not be
enough. What about 15, or 50, or 500?
• How can we “estimate” the effects of this alleged relationship of smoking and cancer? Can we estimate how certain or how uncertain we
believe we are?
• Finally, can we design a model that describes this relationship beyond
the limitations of this data set? Can we make predictions based on this
model?
6.3 Do mobile phones cause brain tumours?
Over the past 20 years, there has been a growing interest in establishing whether mobile phones are linked to brain cancer. Some studies have found a positive link. However, in a recent study (BBC, 21 October 2011), this link is rejected; you can read the related article at http://www.bbc.co.uk/news/health-15387297. The discussion below is based on this article.
The study looked at brain cancer incidence, for a sample of 358,403 mobile
users, over a period of 18 years. The results are summarized below:
No. of mobile users   Gliomas (type of brain cancer)   Other cancers
358,400               356                              846
At first glance we might conclude that the glioma counts are small, relative to the sample size 358,403. We can calculate the relevant proportions: 0.00099 (gliomas) and 0.00236 (other cancers). We may think that these proportions look small, but for a more conclusive comparison one should know
how many brain cancer counts are expected among non mobile users. For
example, if we knew that among non mobile users, 20 gliomas are expected
(for the same sample size 358,400), we may believe that 356 is a significant
number. Furthermore, the following points are important to consider:
• Can we estimate the true proportion of gliomas among mobile users? If
we calculate (as above) this proportion as 0.00099 (based on the above
data set), can we know whether this is generally true? What if we
conduct another study and calculate another proportion? How different
is this likely to be? Based on this data set, can we give a measure of
how certain or how confident we are about these proportions?
• As mentioned above, we should be able to establish a “gold standard”,
to which we can compare the cancer counts of our study. In this case,
this could be brain cancer proportions, or risk among non mobile users.
In statistics it is very common to compare the results of a study with
a gold standard; this is usually called a control group.
• The selected sample should be representative of the general population,
otherwise the findings are likely to be biased and limited in scope. By
“representative” we mean that if the general population shares some
characteristics (relevant to the link of cancer with mobile phones), then
any sample we pick should share these too.
• An important point to consider is that a number of mobile users in
this study could have developed early stage brain cancer which is not
detected, or be at risk (for which there is some non-zero probability)
to develop brain cancer in the future. Although the study considers
such possibilities to be risk-free, they may not be, and this could lead
to different estimation of the overall risk and safety of mobile phones.
6.4 Data and Summary Statistics
In statistics, the word population is used to describe the largest group of
“things” that we are interested in. The “things” might be people, animals,
plants, machines, numbers etc, e.g. registered voters in the UK, cod in the
English channel, breast cancer sufferers in Yorkshire, weights of Mars bars
in a supermarket etc etc
The population as a whole is generally too big or inaccessible for us to
fully investigate. This is one of the most important reasons why we need
statistics. What we do instead is to obtain a representative sample taken
from the population, and then use our knowledge of the sample together with
probability theory, to make inferences about the population as a whole. We
will say more about the important problem of inference in the next chapter.
Suppose that we are interested in the weight (in kg, say) of cod in the
North sea. Let us suppose that there are N cod altogether. This is already
problematic; no–one knows how to count all the cod in the sea at any one
time! We could label the weights of the cod y1, y2, . . . , yN. Then we would love to know the population mean
µ = (1/N) Σ_{i=1}^{N} yi,
and the population variance
σ² = (1/N) Σ_{i=1}^{N} (yi − µ)².
But we cannot ever measure these numbers. They are inaccessible to us. But on the other hand, we can go out and catch some cod from different parts of the sea and obtain a sample of weights x1, x2, . . . , xn. These are the data that we will work with. Here n is the number of fish in the sample. It may be e.g. 50, or 100 or 1000. But it is much smaller than N and we write n << N to indicate this (the symbol << means "much smaller"). This time
we can calculate the sample mean
x̄ = (1/n) Σ_{i=1}^{n} xi,    (6.1)
and the sample variance
s² = (1/(n−1)) Σ_{i=1}^{n} (xi − x̄)².    (6.2)
Notice how in (6.2) we divide by n − 1 rather than by n. One way of
justifying this is to argue that although s2 appears to depend on n unknowns
x1 , x2 , . . . , xn , once we know x̄, we have an equation x1 + x2 + · · · + xn = nx̄
which allows us to eliminate one of the unknowns, so there are only n − 1
unknowns altogether. The sample standard deviation is
s = √(s²) = √( (1/(n−1)) Σ_{i=1}^{n} (xi − x̄)² ).
Just as in probability theory, the sample mean x̄ measures the average
weight of the sample, and the variance s2 and standard deviation s tell us
about the spread of numbers in the sample about the mean. We tend to
prefer to work with s2 as it has nicer mathematical properties; s has the
advantage of being measured in the same units as the data. If there are
many data points that differ from x̄ by a relatively large amount, then s and
s2 will be large, but if all the data points are tightly clustered around x̄ then
these numbers are much smaller.
The following alternative formula for sample variance is sometimes useful:
Proposition 6.4.1.
s² = (1/(n−1)) Σ_{i=1}^{n} xi² − (n/(n−1)) x̄².
Proof. Expanding (6.2) we get
(n − 1)s² = Σ_{i=1}^{n} (xi² − 2 xi x̄ + x̄²)
          = Σ_{i=1}^{n} xi² − 2x̄ Σ_{i=1}^{n} xi + Σ_{i=1}^{n} x̄²
          = Σ_{i=1}^{n} xi² − 2n x̄² + n x̄²
          = Σ_{i=1}^{n} xi² − n x̄²,
and the result follows when we divide both sides by n − 1. Notice that to get to the third line, we used Σ_{i=1}^{n} xi = n x̄ from (6.1). We also used the very useful fact that if c is any constant (which doesn't depend on i), then Σ_{i=1}^{n} c = nc, which we applied with c = x̄².
Note It is very important in statistics to reserve the symbols µ and σ 2
for the population mean and variance, and to use x̄ and s2 for the sample
mean and variance.
Example 6.1. A sample of 24 cod taken from the North sea had weights in kg for which Σ_{i=1}^{24} xi = 207 and Σ_{i=1}^{24} xi² = 2230. Calculate x̄ and s to 3 d.p.'s.
x̄ = 207/24 = 8.625 and s² = 2230/23 − 24 × (8.625)²/23 = 19.332. Hence s = 4.397 to 3 d.p.'s.
Numerical data can be easily entered into R using the command
> x = c(x1 , x2 , . . . , xn ),
where here, of course, x1 , x2 , . . . , xn are the numerical values of the data. The
sample mean, variance, and standard deviation are then obtained using the
commands > mean(x), > var(x) and > sd(x).
Exercise. Enter the data 1, 3, 5, 5, 6, 8, 9, 14, 14, 20 into R and check that
the mean, variance and standard deviation are respectively 8.5, 34.5 and
5.87367.
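For instance, the exercise above can be checked as follows (an added illustration):

# Checking the exercise values
x <- c(1, 3, 5, 5, 6, 8, 9, 14, 14, 20)
mean(x)   # 8.5
var(x)    # 34.5
sd(x)     # 5.87367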
Chapter 7
Estimation and uncertainty
7.1 Populations and random samples
As was discussed in Chapter 6, statistics is concerned with observational
studies where we seek information about the whole from knowledge of the
part. A typical example involves the study of a particular data set which
represents a sample taken from a larger population from which the data is
obtained. In the smoking/cancer example of section 6.2, the data consisted
of 86 people and their related health state, and the population comprised
all the people in a particular region of interest. Statistics is concerned with
studying the properties of the particular sample we observe, and making
inferences or drawing conclusions about the population. This is necessary,
as for example, it is usually not possible to observe and study all elements
of the population; in the above study it may be impossible to ask all people
about their smoking habits, and very costly to examine their state of health.
Thus, we obtain the sample, a small part of the population (86 people), and
we aim to use this sample to draw conclusions for the entire population.
The way that we extract information from the sample and use it to make
estimates about the population, is through probability theory. Suppose that
we are interested in learning about the mean height µ of the students taking
MAS113 in a particular academic year. Suppose that there are 240 students;
then to obtain the mean we might measure the height of all students. However, this could be quite time-consuming. Instead, we may decide to take
a sample of these students, e.g. a sample of 10 students and measure their
height:
Student           1    2    3    4    5    6    7    8    9    10
Sample element    x1   x2   x3   x4   x5   x6   x7   x8   x9   x10
Height (in cm)    167  165  178  190  176  184  189  191  182  169
We may believe that if these 10 students are "representative" of the population, their average height x̄ = (1/10)(x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10) will be a "good estimate" of the population mean µ. However, it is highly unlikely that x̄ is going to be exactly the same as the mean of all 240 student heights. So the question is how close x̄ is to µ? We could obtain another sample x1*, . . . , x10*, and again compute x̄* = (1/10) Σ_{i=1}^{10} xi*. Each time we obtain a different sample, we are likely to receive a different estimate of µ. Thus, there is uncertainty around the values of x̄ that we compute from the data, to the extent of how closely they agree with the true population mean (heights in this example).
7.2 Confidence Intervals for the Mean I
Let us assume that we have a population of size N and we randomly choose
a sample of size n, with n much smaller than N . The population has some
“true distribution” which we don’t know, e.g. if it is the heights of people in
the U.K., it may be that 63% are between 5′ 8″ and 6′ 2″. Let
X1 be the random variable that selects the first point x1 in the sample,
X2 be the random variable that selects the second point x2 in the sample,
··· ··· ··· ··· ···
Xn be the random variable that selects the nth point xn in the sample.
We will make the following assumptions
(I) Each of X1, X2, . . . , Xn has the same distribution, which is the distribution of the population.
(II) The random variables X1 , X2 , . . . , Xn are independent.
So X1 , X2 , . . . , Xn are i.i.d. random variables as studied in section 5.1.
You may criticize both of these assumptions, but we have to start somewhere. All models of the real world make simplifying assumptions in order
to make progress. Later on we may try to refine them.
Let µ and σ 2 be the population mean and variance. So E(Xi ) = µ and
Var(Xi ) = σ 2 for i = 1, . . . , n.
A key topic in statistics is called estimation of parameters. The numbers
µ and σ 2 are unknown parameters and we try to get good “estimates” of
them based on the data we obtain, and probabilistic reasoning. This is also
a part of statistical inference.
In this section we seek to estimate the parameter µ. At this stage we will
assume that σ is known to us. As we discussed previously, we will assume
that we have collected a random sample x1 , x2 , . . . , xn from the population.
We can then calculate the sample mean:
x̄ = (1/n) Σ_{i=1}^{n} xi.
The number x̄ is called a point estimate of µ. We have already pointed out that it may not be very accurate (but would become more so, if n were to get close to N). We can do better by using our knowledge of probability theory to construct an interval estimate, i.e. an interval in which we have "high confidence" that µ will lie. Let us consider the random variable:
X̄ = (X1 + X2 + · · · + Xn)/n.
From section 5.1, we know that
E(X̄) = µ,   Var(X̄) = σ²/n.
The values taken by the random variable X̄ consist of all possible sample
means of size n, which are taken from the population.
Step 1. Assume for now that Xi ∼ N (µ, σ 2 ), so the underlying population
is normally distributed.
In this case Z = (X̄ − µ)/(σ/√n) is a standard normal: Z ∼ N(0, 1). We need to make a choice on the size of the confidence interval we're going to construct. To do this we choose a small number α, which is reasonably close to zero, and find the corresponding critical value z_{α/2} so that
P(Z < −z_{α/2}) = P(Z > z_{α/2}) = α/2.    (7.1)
In practice, given α, we find z_{α/2} from R using the command qnorm(1 − α/2, 0, 1). Common choices are (check), to 4 s.f.:
α = 0.1: z_{α/2} = 1.645;   α = 0.05: z_{α/2} = 1.960;   α = 0.01: z_{α/2} = 2.576.
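These values can be checked directly with the qnorm command quoted above (an added illustration):

# Critical values z_{alpha/2} for alpha = 0.1, 0.05, 0.01
qnorm(1 - 0.1/2)     # 1.644854
qnorm(1 - 0.05/2)    # 1.959964
qnorm(1 - 0.01/2)    # 2.575829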
Now we reason as follows; from (7.1), we have
P(−z_{α/2} ≤ Z ≤ z_{α/2}) = 1 − α,
i.e.
P( −z_{α/2} ≤ (X̄ − µ)/(σ/√n) ≤ z_{α/2} ) = 1 − α.
Rearranging the inequality we conclude that
P( X̄ − z_{α/2} σ/√n ≤ µ ≤ X̄ + z_{α/2} σ/√n ) = 1 − α.    (7.2)
The equation (7.2) gives a mathematically precise probability interval for
the mean µ. But it is not useful practically as it stands. To make it useful,
we must replace the random variable X̄ with the point estimate x̄ that we’ve
obtained from the data. Then we can no longer speak the language of probability, but we have what is called in statistics, a confidence interval. The
key conclusion is:
We are 100(1 − α)% confident that the true value of µ lies in the interval
( x̄ − z_{α/2} σ/√n , x̄ + z_{α/2} σ/√n ).    (7.3)
The result stated in (7.3) is very important. It can also be summed up
as: We are 100(1 − α)% confident that
True value of mean is in the interval given by
“point estimate ± (critical value × standard error).”
What does a confidence interval mean? The parameter µ is not a random
quantity. So when we calculate our interval it is either “true” or “false” that µ
lies in it. Suppose α = 0.05. Then if we generated many different samples of
size n from our population, and computed the interval (7.3) for the different
point estimates x̄ that we’d obtain, we would expect that approximately 95%
of these would contain µ. The other 5% would not.
Step 2. From now on, assume that the Xi s are arbitrary random variables,
each having mean µ and variance σ 2 .
We are now in a much more general situation. But the good news is that
we can proceed exactly in Step 1, and still use (7.3) to calculate confidence
intervals, provided we have a large enough sample. Why? The reason is that
we have the central limit theorem which tells us that for sufficiently large n,
(X̄ − µ)/(σ/√n) is approximately N(0, 1) (c.f. (5.16)),
and that is all we need to, at least approximately, carry out the same reasoning as before. How large should n be? In practice, statisticians usually
require at least 30. If n < 30, we have a small sample, and we will see how
to calculate confidence intervals in this case, later on.
Example 7.1 (Temperature measurement) A chemist wants to
know the melting point of a newly developed compound, under specific laboratory conditions. Temperature is measured using an instrument which is
known to have measurement errors with standard deviation 0.3o C. There
is good theoretical evidence to support the belief that the melting points
are normally distributed. She repeats an experiment to measure the melting
point nine times, and obtains the following values (in o C): 82.38, 82.57, 82.23,
82.05, 82.21, 82.03, 82.04, 82.35, 81.60.
Although the sample size is smaller than 30, because the distribution of the population is normal (and σ is known) we can still use (7.3).
Solution. The estimated melting point in °C is computed using R to give x̄ = 82.16. The standard error is 0.3/√9 = 0.1. If we calculate a 95% confidence interval for the melting point, we obtain
(82.16 − 1.96 × 0.1, 82.16 + 1.96 × 0.1), or (81.96, 82.36).
We might argue that 95% is insufficiently precise, and ask for a 99%
confidence interval. Then we would get
(82.16 − 2.576 × 0.1, 82.16 + 2.576 × 0.1), or (81.90, 82.42).
So we have increased our confidence, but at the price of increased uncertainty,
as the confidence interval for µ is wider.
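In R, this calculation might be carried out along the following lines (an added sketch; the object name temps is arbitrary):

# 95% confidence interval for the melting point, with sigma = 0.3 known
temps <- c(82.38, 82.57, 82.23, 82.05, 82.21, 82.03, 82.04, 82.35, 81.60)
xbar <- mean(temps)                    # about 82.16
se <- 0.3 / sqrt(length(temps))        # 0.1
xbar + c(-1, 1) * qnorm(0.975) * se    # roughly (81.96, 82.36)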
Note 1. In practice, it is unlikely that we will know the value of σ 2 .
In that case, provided n > 30, we can still derive confidence intervals by
replacing it in (7.3) by the sample variance:
s² = (1/(n−1)) Σ_{i=1}^{n} (xi − x̄)².
Of course, this reduces the accuracy of the calculation. We will come back
to this point later on, when we investigate the role of the t-distribution.
7.2.1 Confidence Intervals for a Proportion
Frequently we may want to estimate the proportion of individuals in a population that have a certain property. Then for each member of the sample,
we might ask is it a “success” or a “failure”? (or alternatively, a “yes”, or a
“no”). Let p be the true proportion that has the property. This is the parameter that we seek to estimate. In this case, X1 , X2 , . . . , Xn are Bernoulli(p)
and S(n) = X1 + X2 + · · · + Xn is Binom(n, p), and so has mean np and standard deviation √(np(1 − p)). So S(n)/n has mean p with standard error √(p(1 − p)/n).
We seek a confidence interval for p, and instead of x̄, it is common to use p̂ for the sample proportion that is obtained from the data. We use the same machinery as we developed above (see also (5.19)) to deduce that we are 100(1 − α)% confident that the true value of p lies in the interval:
[ p̂ − z_{α/2} √(p(1 − p)/n) , p̂ + z_{α/2} √(p(1 − p)/n) ].
But we cannot use this in practice, as the interval depends on the quantity we are trying to estimate. To get a usable formula, we apply the philosophy of Note 1, and replace p in √(p(1 − p)/n) with the point estimate p̂ to obtain the 100(1 − α)% confidence interval:
[ p̂ − z_{α/2} √(p̂(1 − p̂)/n) , p̂ + z_{α/2} √(p̂(1 − p̂)/n) ].    (7.4)
The interval in (7.4) is called the Wald interval, in honour of Abraham Wald (1902–1950). The approximation of p by p̂ in the standard error means that (7.4) is not always very accurate. In the next chapter we'll look at a way of improving it.
Example 7.2 (Opinion polling) A random sample of 150 UK university
students has been asked “Will you vote Labour at the next election?”. Of
those polled, 68 say ‘Yes”. Using this sample, we can give a 99% confidence
interval for the proportion p of students (in the population of UK students)
who we expect to vote Labour at the next election.
We have p̂ = 68/150 ≈ 0.453, and the estimated standard error is √(p̂(1 − p̂)/150) ≈ 0.0406. An approximate 99% confidence interval is given by
(0.453 − 2.58 × 0.0406, 0.453 + 2.58 × 0.0406) i.e. (0.348, 0.558).
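A short R version of this calculation is (added as an illustration):

# Wald interval for the proportion who say they will vote Labour
n <- 150
phat <- 68 / n                         # about 0.453
se <- sqrt(phat * (1 - phat) / n)      # about 0.0406
phat + c(-1, 1) * qnorm(0.995) * se    # roughly (0.35, 0.56)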
7.3 Unbiased and Consistent Estimators
In the last section, we have learned how to find confidence intervals for the
mean and for a proportion, by using the normal distribution and/or the CLT.
Later on we will want to find confidence intervals for the variance. At this
stage, it’s a good idea to develop some more theoretical insight.
Let θ be some parameter of the population that we seek to estimate, e.g
we could have θ = µ, or θ = σ 2 . Let X1 , X2 , . . . , Xn be, as in the last
section, the i.i.d. random variables that randomly select the sample of size
n. An estimator of θ is defined to be a function θ̂ of the random variables
X1 , X2 , . . . , Xn . So we could write:
θ̂ = f (X1 , X2 , . . . , Xn ).
Of course there are many such functions, and we need to impose some
more conditions to get a “good” estimator. Whenever we choose a random sample x1 , x2 , . . . , xn , we can calculate a value of the estimator using
f (x1 , x2 , . . . , xn ). We might expect some values to be accurate, and others to
be less so, but “on average”, we would expect to get quite close to the true
value θ. To make this intuition more precise, we define the bias to be the
real number B(θ̂, θ) = E(θ̂) − θ, and we say that an estimator is unbiased if
it has zero bias, i.e. if
E(θ̂) = θ.
The distribution of θ̂ is called the sampling distribution and its standard
deviation is called the standard error which is denoted SE(θ̂).
Example 7.3. Suppose that we seek to estimate the mean µ. We define the estimator
µ̂ = X̄ = (1/n) Σ_{i=1}^{n} Xi.
We have already seen that E(X̄) = µ (recall (5.4)), and by (5.6), SE(X̄) = σ/√n. If the population is normally distributed, then the sampling distribution is N(µ, σ²/n).
There may be many good unbiased estimates of the same parameter.
Another desirable property, is that they should not fluctuate too far from
the mean, and if we have two or more competing unbiased estimators, then
we should always choose the one with the smallest variance. There has been
a great deal of theoretical work on minimal variance unbiased estimators.
Example 7.4
Suppose that for an unknown mean µ, we have X ∼ N(µ, 2), Y ∼ N(µ, 1) and X, Y are independent. We define the estimators
Z = 2X − Y   and   W = (1/2)X + (1/2)Y.
Are Z, W unbiased estimators? Which one would you prefer?
E(Z) = E(2X − Y) = 2E(X) − E(Y) = 2µ − µ = µ,
and so Z is unbiased.
E(W) = E(0.5X + 0.5Y) = 0.5E(X) + 0.5E(Y) = 0.5µ + 0.5µ = µ,
and so W is also unbiased. The variances are
Var(Z) = Var(2X − Y) = 4Var(X) + Var(Y) = 4 × 2 + 1 = 9,
Var(W) = Var(0.5X + 0.5Y) = 0.25Var(X) + 0.25Var(Y) = 0.25 × 2 + 0.25 × 1 = 0.75.
Then W is the better estimator, as both Z and W are unbiased for µ, but W has the smaller variance.
But having a small variance does not in itself create a good estimator. In
Figure 7.1, we return to the data on student height that we examined at the
beginning of this section. Let us suppose, just for illustrative purposes, that
the true mean of the population is µ = 178, and that we select an estimator
X ∼ N (175, 0.1). Since E(X) = 175, this is a biased estimator; although it
has a small variance, it is a poor estimator, as the probability of obtaining
a good approximation to the true mean is quite small (as can be seen from
the figure).
Another property of estimators that we’ll briefly look at is consistency.
Suppose that we have a sequence θ̂n of estimators of θ, where the label n
corresponds to increasing sample size. We say that these are consistent if the
probability that θ̂n is close to θ is close to 1, for large n. More precisely we
require that for any ε > 0 (no matter how small it is)
lim_{n→∞} P(|θ̂n − θ| < ε) = 1.    (7.5)
The next result gives us a convenient method for finding consistent estimators.
Theorem 2. Suppose that (θ̂n) are unbiased estimators of θ. If
lim_{n→∞} Var(θ̂n) = 0,
then θ̂n are consistent.
[Figure 7.1 appears here: plot titled "Distribution of estimator of the mean", probability density function against X.]
Figure 7.1: Distribution of the estimator X̄ ∼ N (175, 0.1); the true mean of
the population is µ = 178 and is indicated with the solid vertical line.
Proof. From Chebyshev's inequality (5.7):
P(|θ̂n − θ| ≥ ε) ≤ Var(θ̂n)/ε²,
and so since Var(θ̂n) → 0 as n → ∞, it follows that P(|θ̂n − θ| ≥ ε) → 0. Hence
P(|θ̂n − θ| < ε) = 1 − P(|θ̂n − θ| ≥ ε) → 1,
as was required.
This theorem enables us to check quickly whether an unbiased estimator is consistent, without needing to directly evaluate the limit of P(|θ̂n − θ| < ε).
For example, if we return to Example 7.3 we see that the estimator X̄
of the mean µ is consistent by applying Theorem 2; indeed we have already
seen that X̄ is unbiased, and its variance σ 2 /n converges to 0 as n → ∞.
Be aware that, in general, it is possible to find consistent estimators that
are biased, and unbiased estimators that fail to be consistent.
[Figure 7.2 appears here: plot titled "Consistent estimators for the estimator of the mean", probability density function against X.]
Figure 7.2: Distribution of the estimator X̄ ∼ N (178, 40/n), for n = 10
(solid line), n = 40 (dashed line) and n = 100 (dotted line).
Figure 7.2 illustrates consistency for the student height data, under the
assumption that the true mean height is 178, using a sequence of estimators
which are N (178, 40/n).
7.3.1 The variance estimator S²
In this subsection we will aim to find an unbiased estimator for the population
variance σ². If µ was known, a candidate for an estimator of σ² is:
S1² = (1/n) Σ_{i=1}^{n} (Xi − µ)².    (7.6)
We can see that S1² is unbiased for σ². Indeed,
E(S1²) = (1/n) Σ_{i=1}^{n} E[(Xi − µ)²]
       = (1/n) Σ_{i=1}^{n} σ²
       = nσ²/n = σ².
However, in most applications we will not know µ. One idea is to replace µ by its estimator X̄, and so to consider as an estimator for σ²
S2² = (1/n) Σ_{i=1}^{n} (Xi − X̄)².
It turns out that S2² is not an unbiased estimator for σ². The reason is that we have lost some information by estimating µ. We slightly modify S2² to
S² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)².    (7.7)
We will show that this is an unbiased estimator for σ². Note the difference between S² and S2².
In order to prove that S² is unbiased, we first write
S² = (1/(n−1)) SXX,  where SXX = Σ_{i=1}^{n} (Xi − X̄)².    (7.8)
Then we have
E(SXX) = E( Σ_{i=1}^{n} (Xi − X̄)² )
       = Σ_{i=1}^{n} E[(Xi − X̄)²]
       = Σ_{i=1}^{n} E[{(Xi − µ) − (X̄ − µ)}²]
       = Σ_{i=1}^{n} E[(Xi − µ)²] + Σ_{i=1}^{n} E[(X̄ − µ)²] − 2 E( Σ_{i=1}^{n} (Xi − µ)(X̄ − µ) )
       = Σ_{i=1}^{n} E[(Xi − µ)²] + Σ_{i=1}^{n} E[(X̄ − µ)²] − 2n E[(X̄ − µ)²]
       = nσ² + nσ²/n − 2nσ²/n
       = (n − 1)σ²,
since Xi ∼ N(µ, σ²) and X̄ ∼ N(µ, σ²/n).
Thus
E(S²) = (1/(n−1)) E(SXX) = (1/(n−1)) (n − 1)σ² = σ²,
and so S² is unbiased for σ². It can also be shown to be a consistent estimator.
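A simulation can make the difference between S² and S2² visible. The sketch below (an added illustration; the values of n, σ² and the seed are arbitrary) averages both estimators over many normal samples: the first average should be close to σ², the second close to (n − 1)σ²/n.

# Illustration: averaging the two variance estimators over many samples
set.seed(5)                            # arbitrary seed
n <- 5; sigma2 <- 4; reps <- 100000
both <- replicate(reps, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  c(var(x), sum((x - mean(x))^2) / n)  # S^2 (divide by n - 1) and S_2^2 (divide by n)
})
rowMeans(both)                         # roughly 4 and 3.2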
7.4 Confidence Intervals for the Mean II
We now return to the problem of using confidence intervals for the mean µ. In particular, we will want to deal with the situation where σ² is unknown, and the sample is small (n < 30), so it is not reasonable to approximate σ² by the sample variance s² and proceed as we did in section 7.2.
Throughout this section we will assume that the population is normally distributed, so our sample random variables X1, X2, . . . , Xn are N(µ, σ²). We will use S = √(SXX/(n − 1)) as our estimator of σ. It is possible to show that the distribution of S/σ depends on n, but not on µ or σ, so the distribution of
(X̄ − µ)/(S/√n)
also depends on n but not on µ or σ. This becomes clearer if we write
(X̄ − µ)/(S/√n) = (X̄ − µ)/(σ/√n) ÷ (S/σ),
since (X̄ − µ)/(σ/√n) ∼ N(0, 1), which is independent of both µ and σ.
This gives rise to a new family of distributions that we have not met
before.
If X1, . . . , Xn are independent N(µ, σ²) random variables, then the distribution of
T = (X̄ − µ)/(S/√n)
is called the Student t distribution with n − 1 degrees of freedom. We write
T ∼ t_{n−1}.
The degrees-of-freedom parameter determines the shape of the distribution.
The fact that the distribution of T depends on n, but not on µ or σ, gives
us a way to calculate confidence intervals, as we’ll see below.
Properties of the t distribution
If a random variable T has a t distribution with ν degrees of freedom, that
is if
T ∼ t_ν,
then T has the density function
f_ν(t) = k(ν)(1 + t²/ν)^{−(ν+1)/2},   −∞ < t < ∞,
for an appropriate constant k(ν), which depends only on ν.¹ This will be
proved in MAS223.
The pdf fν is symmetric about zero. For large values of ν, it is very
similar to the standard normal density N (0, 1). (Why?) Figure 7.3 shows
the central part of the t-distribution density with 2 degrees of freedom.
Note that the t-distribution looks quite similar to the standard normal
near to zero, but the t-distribution has heavier tails. To see more, draw
the graphs of the t distribution with 5 degrees of freedom and the standard
normal; for the t-distribution, use the command:
plot(function(x) dt(x, df=5), -5,5, main="", xlab="Values of t",
ylab="Probability density")
¹ In fact k(ν) = (1/√(πν)) Γ((ν+1)/2)/Γ(ν/2), where the gamma function Γ(ν) = ∫₀^∞ e^{−x} x^{ν−1} dx. Note that Γ(n) = (n − 1)! if n is a natural number; but Γ(1/2) = √π.
[Figure 7.3 appears here: probability density plotted against values of t.]
Figure 7.3: Graph of the t-distribution with 2 degrees of freedom
and for the normal use the command
plot(function(x) dnorm(x), -5,5, main="", xlab="Values of z",
ylab="Probability density")
Cumulative probabilities from t-distributions can be calculated in R using
the pt function, which works in a similar way to the pnorm function; but note
that the value of the degree-of-freedom parameter needs to be specified.
Confidence intervals using the t distribution
For our purposes, the importance of the t-distribution lies in its use for constructing confidence intervals when σ 2 is unknown and has to be effectively
replaced by s2 , which is estimated from the data.
To recap, let X1 , . . . , Xn be independent N (µ, σ 2 ) random variables; we
use X̄ as an estimator for µ, with σ 2 unknown. Then
T = (X̄ − µ)/(S/√n) ∼ t_{n−1}.    (7.9)
As in (7.1) we choose a small number α and (using the symmetry of the t-distribution), seek a critical value t_{n−1;α/2} so that
P(T < −t_{n−1;α/2}) = P(T > t_{n−1;α/2}) = α/2.
Then we have
P(−t_{n−1;α/2} ≤ T ≤ t_{n−1;α/2}) = 1 − α.
Now substitute for T from (7.9) and rearrange to obtain
P( X̄ − (S/√n) t_{n−1;α/2} ≤ µ ≤ X̄ + (S/√n) t_{n−1;α/2} ) = 1 − α.
Replacing the random variables X̄ and S by their observed values, x̄ and s, we find that a 100(1 − α)% confidence interval for µ is given by:
x̄ ± (s/√n) t_{n−1;α/2}.    (7.10)
You should compare (7.10) with (7.3).
For the most common case, where we want a 95% confidence interval
(α = 0.05), some examples of critical values are shown in the following table.
(What happens as n → ∞?)
Sample size n                   3      6      10     25     50     100
Degrees of freedom ν            2      5      9      24     49     99
Critical value t_{n−1;0.025}    4.30   2.57   2.26   2.06   2.01   1.98
Note that these values can be obtained easily in R. The last one, for example,
is produced using the command qt(0.975,99) and rounding.
In R, a calculation of a 95% confidence interval for the mean, based on
estimating the sample standard deviation and using the t-distribution, can
be obtained as follows.
• Make sure that data are collected in an object, say dat.
• Get estimates of the mean (µ̂)=mean(dat) and of the variance (s2 )=var(dat)
of dat.
• Then the interval is
(mean(dat)-qt(0.975,n-1)* sqrt(var(dat))/sqrt(n),
mean(dat)+qt(0.975,n-1)* sqrt(var(dat))/sqrt(n)), where n is
the size of dat, which can be determined in R by length(dat).
Example 7.4 (Density of the Earth)
Twelve observations were made on the density of the earth relative to
water. We require a 95% confidence interval for the true value. Here are the
data:
5.10, 5.39, 5.47, 5.34, 5.30, 5.68, 5.27, 5.42, 5.63, 5.46, 5.75, 5.85.
Let Xi be the ith measurement of the density of the earth relative to
water, and suppose Xi ∼ N (µ, σ 2 ), where µ is the unknown true value and
σ 2 is the unknown variance of the measurements around the true value. Here
we have n = 12, s = 0.2183 and x̄ = 5.4717. From R, t11;0.025 = 2.201. So by
(7.10) the required confidence interval is
(5.4717 − 2.201 × 0.2183/√12, 5.4717 + 2.201 × 0.2183/√12),
or 5.4717 ± 0.1387. Thus a 95% confidence interval for the density of the
earth relative to water is (5.333, 5.610).
In R, following the above steps we get the following interval (unrounded!):
(5.33297, 5.610363).
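The same interval can be reproduced with a few lines of R, following the steps listed earlier (an added illustration; the object name dat is just an example):

# 95% t-based confidence interval for the Earth density data
dat <- c(5.10, 5.39, 5.47, 5.34, 5.30, 5.68, 5.27, 5.42, 5.63, 5.46, 5.75, 5.85)
n <- length(dat)
mean(dat) + c(-1, 1) * qt(0.975, n - 1) * sd(dat) / sqrt(n)   # about (5.333, 5.610)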
7.5 Estimating Variance
We have seen that
S 2 = SXX /(n − 1)
is a useful estimator of the variance of a normal distribution, based on observations X1 , . . . , Xn . In this section, we look at the sampling distribution
of S 2 , and at interval estimation for σ 2 .
The distribution of S 2
We start first with a basic result. If the random variable X follows the
standard normal distribution X ∼ N (0, 1), then the p.d.f. of Y = X 2 is
given by
fY(y) = (1/√(2π)) y^{1/2−1} exp(−y/2).    (7.11)
To see this, we denote by fX (x) the p.d.f. of X, FX (x) the c.d.f. of X and
FY (y) the c.d.f. of Y . Recall that
1 2
1
fX (x) = √ e− 2 x .
2π
(7.12)
Then for y ≥ 0,
FY (y) =
=
=
=
=
=
P (Y ≤ y)
P (X 2 ≤ y)
√
√
P (− y ≤ X ≤ y)
√
√
P (X ≤ y) − P (X ≤ − y)
√
√
FX ( y) − FX (− y)
√
√
√
FX ( y) − 1 + FX ( y) = 2FX ( y) − 1
Now from the definition of the p.d.f. we have
fY(y) = d FY(y)/dy = 2 fX(√y) × 1/(2√y) = fX(√y)/√y,
and using (7.12), we obtain
fY(y) = (1/√(2πy)) exp(−y/2).
The above distribution of Y is known as the chi-square distribution with 1
degree of freedom. The general form of the chi-square distribution with n
degrees of freedom is given by the p.d.f.
fY(y) = (1/(2^{n/2} Γ(n/2))) y^{n/2−1} exp(−y/2),   y ≥ 0.   (7.13)
(If y < 0, then fY(y) = 0.) Here Γ denotes the gamma function. Since Γ(1/2) = √π, if we put n = 1 in (7.13), we obtain (7.11). In terms of notation, we write Y ∼ χ²_n to mean that the random variable Y has a chi-squared distribution with n degrees of freedom.
An important property of the chi-square distribution is that if the independent random variables Y1, Y2 follow the chi-square distributions Y1 ∼ χ²_1 and Y2 ∼ χ²_1, then Y1 + Y2 ∼ χ²_2. It is beyond the scope of this module to
derive this result, or to give a more detailed account of the gamma function.
That is postponed to Year 2. The above property can easily be generalized for a finite number of random variables, to give the following important
result:
If Z1, . . . , Zn are n independent random variables each with the standard normal distribution, then Σ_{j=1}^n Z_j² ∼ χ²_n.
Another way of writing this is that if X1, . . . , Xn are n independent random variables each with the same normal distribution N(µ, σ²), then
Σ_{j=1}^n (X_j − µ)²/σ² ∼ χ²_n.
Figure 7.4 gives the pdf of the chi-squared distribution with 10 degrees
of freedom.
A closely related result is that
S_XX/σ² = Σ_{j=1}^n (X_j − X̄)²/σ² ∼ χ²_{n−1}.   (7.14)
Once again, a proof of this is beyond the scope of this module, but the
“n − 1” relates to the fact that the terms in the sum cannot be regarded as
n distinct sources of information; there is one constraint on them, namely
that ΣXj = nX̄. We’ll see another example of this, arising in a very different
way, later in the module.
For our present purposes, this helps because, combining (7.14) with (7.8)
we can then write
(n − 1)S²/σ² ∼ χ²_{n−1}.   (7.15)
In the following we need critical values for the chi-square distribution. To
be precise if Q has a chi-square distribution with ν degrees of freedom, we
define the value χ²_{ν,α} by
P(Q ≤ χ²_{ν,α}) = α.   (7.16)
It follows that
P(χ²_{ν,α/2} ≤ Q ≤ χ²_{ν,1−α/2}) = 1 − α.   (7.17)
To obtain values of χ²_{ν,α}, we use the R command qchisq(α, ν).
Confidence intervals using the χ2 distribution
In this case, the idea is to seek a confidence interval for σ²: we find a, b ≥ 0 so that
P(aS² ≤ σ² ≤ bS²) = 1 − α.   (7.18)
Now by (7.17) and using (7.15), we have
P(χ²_{n−1,α/2} ≤ (n − 1)S²/σ² ≤ χ²_{n−1,1−α/2}) = 1 − α.
Now since (n − 1)S²/σ² ≤ χ²_{n−1,1−α/2} if and only if σ² ≥ (n − 1)S²/χ²_{n−1,1−α/2}, and (n − 1)S²/σ² ≥ χ²_{n−1,α/2} if and only if σ² ≤ (n − 1)S²/χ²_{n−1,α/2}, we conclude that a 100(1 − α)% confidence interval for σ² is given by
(n − 1)s²/χ²_{n−1,1−α/2} ≤ σ² ≤ (n − 1)s²/χ²_{n−1,α/2},   (7.19)
where the sample variance s2 is computed from the data.
So we have
a = (n − 1)/χ²_{n−1,1−α/2},   b = (n − 1)/χ²_{n−1,α/2}.
For the most common case, where we want a 95% confidence interval
(α = 0.05), some examples of critical values are shown in the following table.
Sample size n   Degrees of freedom ν   χ²_{ν,0.975}   χ²_{ν,0.025}   Multiplier a   Multiplier b
3               2                      7.38           0.0506         0.271          39.50
6               5                      12.8           0.831          0.390          6.015
10              9                      19.0           2.70           0.473          3.333
25              24                     39.4           12.4           0.610          1.935
50              49                     70.2           31.6           0.698          1.553
100             99                     128            73.4           0.771          1.349
So if we calculate a sample variance s2 based on 25 normal observations, then
a 95% confidence interval for the true variance would be roughly (0.6s2 , 1.9s2 ).
Note that if we have
P(aS² ≤ σ² ≤ bS²) = 1 − α,
then we also have
P(√a S ≤ σ ≤ √b S) = 1 − α.
So confidence intervals for σ can be calculated in the obvious way from corresponding intervals for σ 2 .
Example 7.5 (Density of the Earth) In Example 7.4, twelve observations were made on the density of the earth relative to water; the sample variance was s² = (0.2183)² = 0.04765. From R, χ²_{11,0.025} = 3.816 and χ²_{11,0.975} = 21.92. We get a = 11/21.92 = 0.502, b = 11/3.816 = 2.88, so a 95% confidence interval for σ² is (0.024, 0.137).
A 95% confidence interval for σ, the standard deviation of the measurement error on individual observations, is (0.15, 0.37).
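A sketch of the same calculation in R, assuming the earth data are stored as in the previous example; square roots of the endpoints give the interval for σ.

n <- length(earth)
s2 <- var(earth)                                 # sample variance s^2
lower <- (n - 1) * s2 / qchisq(0.975, n - 1)     # divide by the upper critical value, 21.92
upper <- (n - 1) * s2 / qchisq(0.025, n - 1)     # divide by the lower critical value, 3.816
c(lower, upper)        # 95% confidence interval for sigma^2, about (0.024, 0.137)
sqrt(c(lower, upper))  # 95% confidence interval for sigma, about (0.15, 0.37)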
[Figure 7.4: Chi-squared distribution for ten degrees of freedom. The shaded area on the left is P(Q < χ²_{10,0.025}) and that on the right is P(Q > χ²_{10,0.975}).]
Chapter 8
Hypothesis Testing
8.1 Introduction
In Chapter 7 we were concerned with estimating the parameters of a model,
allowing for the uncertainty in them, and relating the values to an underlying real-world problem. Often we want to investigate whether or not the
data are consistent with a particular model, or with a particular value of
a parameter—usually one that has a specific real-world interpretation. For
this chapter, we will concentrate on the latter case, investigating the plausibility of a particular parameter value; later in the course, we will see some
examples of more general model-checking.
There are many real-world questions that can be framed as: “is this
parameter equal to a particular fixed value?”
• Is the mean measurement error equal to zero? For example, if the
measurement errors from an instrument follow a N (µ, σ 2 ) distribution,
is µ = 0?
• Is the probability of success equal to half? Is this coin fair? In a series of Bernoulli(p) trials, is p = 1/2?
• Are these two variables unrelated? Is the correlation between them
zero?
The formal terminology is hypothesis testing: we have a hypothesis, which
is effectively a statement about certain parameter values, and we wish to test
it by looking at the data we have.
For example in medical research, we may be interested in assessing the
effectiveness of a treatment, or the effectiveness of a drug. We may choose
some patients who did receive the treatment and some others who did not,
and we may be interested in whether the group with the treatment have
shown enough evidence of improvement. Here the hypothesis we want to
test is that the treatment has no effect.
We cannot prove that a hypothesis is true. Generally, any data we obtain can be explained by, or are consistent with, a range of possible models.
However, we can have evidence against a hypothesis; the data we see can
effectively rule out certain possibilities. We can talk about ‘rejecting’ a hypothesis. Of course, in many cases this is only because the observed data
would be very unlikely if a particular hypothesis were true, not (usually)
because the data would be impossible. In the medical example above, we
may reject the hypothesis of no effect of the treatment; in that case the
data we have obtained provides evidence in favour of the effectiveness of the
treatment.
The hypothesis that we test is known as the null hypothesis. We need
to be able to describe, statistically, what would happen if the null hypothesis were true. Because we cannot prove a hypothesis, but only reject it or
not, often the null hypothesis has a negative form. For instance, if we are
interested in showing that there is a relationship between two variables, we
can take our null hypothesis to be that there is no relationship; if the data
seem very unlikely given that hypothesis, then we may be able to reject the
hypothesis of ‘no relationship’, which tells us that there is some sort of relationship. In the medical example above, our null hypothesis is that the
treatment has no effect.
Sometimes a hypothesis is described as ‘accepted’; this does not imply
that it has been proven to be true, only that it has not been rejected. This
terminology is potentially misleading, and best avoided.
To carry out a formal test of a null hypothesis, we need to specify what
the alternative is. Often this is obvious, since it may for example consist of all
possible parameter values except the one specified by the null hypothesis; but
this need not always be the case. We do need to know what the alternative
hypothesis is, but it turns out that we do not need to know what the data
would look like if the alternative hypothesis were true, at least not in as much
detail as under the null hypothesis. So the alternative can include a whole
range of parameter values without creating technical problems.
The shorthand H0 is often used to denote the null hypothesis, and HA (or
sometimes H1 ) to denote the alternative hypothesis. The same subscript
zero is used to denote the particular value of a parameter specified by a null
hypothesis; we might for example set µ0 = 0 if we are investigating whether
a particular mean is zero.
To carry out a test of a hypothesis, the basic idea is to look at some
summary of the data—which should be simple enough that we can find its
distribution given that the null hypothesis is true—and assess whether the
observed summary is consistent with what we would expect to see, or whether
it differs from what we’d expect in a way that supports the alternative hypothesis.
8.2 Examples
Example 8.1: The lady tasting tea
This is a famous study by R.A. Fisher (1890–1962), which revolutionized statistics and opened the path to randomization and designed experiments.
Around 1919 a lady, Dr Muriel Bristol, claimed that she could determine whether milk was added before or after the tea, just by taste. Fisher designed an experiment in order to test this claim. Eight cups of tea were prepared and presented in random order, 4 of which had milk added before the tea and 4 of which had milk added after the tea. The objective was to see whether the lady could determine and classify the two groups of 4 cups.
The null hypothesis here is that the lady has no such ability, i.e. that she has probability 0.5 of giving the right answer: H0 : p = 0.5, with the alternative HA : p ≠ 0.5, where p denotes the probability that she gives the right answer. Fisher found the probability function under the null hypothesis H0 and compared this with the evidence of the lady's answers.
Example 8.2: Fairness of a coin
It is claimed that the following sequence of results comes from independent tosses of a fair coin (where H denotes ‘heads’ and T denotes ‘tails’):
T, T, T, H, T, T, T, T, T, T, T, T, H, T, H, T, T, T, H, T.
We wish to test the null hypothesis of independent tosses of a fair coin (H0 )
against the alternative hypothesis of independent tosses of an unfair coin
(HA ), using the number of heads observed, Y , as a test statistic.
If the null hypothesis is true, what is the distribution of Y ?
What form would the distribution of the test statistic take if the alternative hypothesis were true?
Calculate (i) the probability of getting exactly the number of heads observed, and (ii) the probability of a number of heads as low as that observed
or lower, if the null hypothesis is true.
If the null hypothesis is true, then the distribution of the number of heads
Y would be Binom(20, 0.5). Under the alternative hypothesis of an unfair
coin, Y ∼ Binom(20, p), p ≠ 0.5.
Just 4 heads were observed. (i) The probability of getting exactly 4 heads,
under the null hypothesis, is P (Y = 4) given Y ∼ Binom(20, 0.5), which is
0.00462 (using dbinom(x=4, size=20, prob=0.5)).
(ii) The probability of getting 4 or fewer heads is P (Y ≤ 4) given Y ∼
Binom(20, 0.5), which is 0.00591 (using pbinom(q=4, size=20, prob=0.5)).
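These two probabilities can be reproduced directly in R; a minimal sketch, assuming y = 4 heads were observed out of n = 20 tosses.

y <- 4; n <- 20
dbinom(y, size = n, prob = 0.5)   # P(Y = 4)  = 0.00462
pbinom(y, size = n, prob = 0.5)   # P(Y <= 4) = 0.00591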
Example 8.3 Temperature measurement
The chemist introduced in Chapter 7 is investigating the effect of a new
way of producing a material. She thinks that the new technique may change
the melting point of the material from the well-known value it has when
the standard production technique is used, which is 50°C. She measures the melting point of 25 samples of the material, using an instrument known to have measurement errors with standard deviation 0.3°C, and is happy to
assume that observations are independent and normally distributed.
We can formalize this by modelling the measured temperatures as
X1 , . . . , Xn ∼ N (µ, σ 2 ), with n = 25, σ = 0.3, and µ unknown. The null
hypothesis is that µ = 50; the alternative hypothesis is that µ ≠ 50.
If H0 is true, then the data should come from the N (µ0 , σ 2 ) or N (50, 0.32 )
distribution. If HA is true, then the data will be systematically larger or
smaller than under H0 , depending on whether µ is larger or smaller than µ0 .
Because a departure from H0 of this sort will tend to affect all the observations equally, looking at the sample mean is likely to be a good way of
checking for consistency with H0 . Given H0 , we know that the distribution
of X̄ would be N(µ0, σ²/n) which is N(50, 0.06²); given HA, X̄ would be N(µ, 0.06²) for some µ ≠ 50.
So a possible test of H0 is to look at the sample mean, and compare it
with the distribution it should have under H0 . Any value of X̄ is possible
under H0 , but very large or very small values suggest that we can reject H0 in
favour of HA . If we observe x̄ = 50.1, say, there is no reason to doubt the null
hypothesis, since that is a perfectly reasonable value from N(50, 0.06²). This does not prove that H0 is true; 50.1 is also a perfectly reasonable value from N(50.2, 0.06²), say. On the other hand, if we observe x̄ = 51.2, that is a very unlikely value from N(50, 0.06²), although of course not impossible; since
some alternative values of µ would explain that observation much better, we
should regard this as a strong reason to reject H0 in favour of HA .
In the next section, we will formalise and quantify these ideas.
8.3 Overview of Hypothesis Testing
In general we have a model with a parameter θ, a null hypothesis H0 : θ = θ0 ,
and an alternative hypothesis HA . We find a random variable which will
behave differently depending on whether H0 or HA is true—a test statistic.
We need to be able to work out its distribution if H0 is true; this is known
as the null distribution.
Based on the form of the hypotheses, we decide what sort of values of the
test statistic count as evidence against H0 : very large values, or very small
ones, or both. We then quantify the strength of the evidence by calculating
the probability given H0 of getting the observed value of the test statistic or
a more extreme value. This probability is known as the p-value of the test,
and is a measure of the evidence against H0 . A very small p-value means that
what we actually saw would have been very unlikely under H0 (and better
explained by HA ); a p-value that is not small means that what we saw was
just the kind of thing we would expect to see if H0 were true. Note that
in this approach, all probability calculations are done under the assumption
that the null hypothesis is true; the alternative hypothesis affects the choice
of test statistic, and the choice of which extremes of the null distribution
count against H0 . This means that the alternative hypothesis can be less
precisely specified than the null.
8.4 The Z test
As a first example of a hypothesis test in detail, we will consider independent
observations X1 , . . . , Xn ∼ N (µ, σ 2 ), with σ 2 known. We test H0 :µ = µ0
against HA : µ ≠ µ0. We will use the sample mean X̄ to summarize the sample; the test statistic we use is
Z = (X̄ − µ0)/(σ/√n).
We know that if H0 is true, then
X̄ ∼ N (µ0 , σ 2 /n),
X̄ − µ0 ∼ N (0, σ 2 /n),
Z ∼ N (0, 1).
A value of Z that is very large (assuming that Z ∼ N (0, 1)) suggests that
µ > µ0 , and a value of Z that is very small suggests that µ < µ0 ; either
case counts as evidence against H0 and in favour of HA . We quantify this by
calculating the probability of getting a value as unfavourable as the one we
actually got, or more unfavourable, under the assumption that H0 holds: the
p-value. If we write zobs for the actual value we obtain for Z, i.e.
zobs = (x̄ − µ0)/(σ/√n),
then the p-value is
P(Z > |zobs| given H0) + P(Z < −|zobs| given H0).
Because of the symmetry of the normal distribution about zero, this is simply
2P(Z > |zobs| given H0).
See Figure 8.1. Since Z ∼ N(0, 1) given H0, the p-value is
2(1 − Φ(|zobs|)).   (8.1)
The p-value from a Z-test can therefore be looked up in standard normal
tables, or calculated easily in R.
Example 8.3 continued
In the temperature example, we had 25 normal observations, with standard deviation 0.3.
We test H0 : µ = 50 against HA : µ ≠ 50.
So the test statistic is
Z = (X̄ − 50)/(0.3/√25).
If we observed x̄ = 50.1, this would give
zobs = (50.1 − 50)/(0.3/5) = 0.1/0.06 = 1.667.
Using (8.1), the p-value is 2(1 − Φ(1.667)) = 0.096, (where we use the R
command pnorm(1.667, 0, 1) = 0.9522 to find Φ(1.667)). This is quite a
small number and gives some evidence against H0 .
If instead we observed x̄ = 50.3, this would give zobs = 0.3/0.06 = 5. In this case, the p-value is 2(1 − Φ(5)) ≈ 5.7 × 10⁻⁷. This gives considerable evidence against H0.
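The same p-values can be computed directly in R; a sketch using the known σ = 0.3 and n = 25.

sigma <- 0.3; n <- 25; mu0 <- 50
zobs1 <- (50.1 - mu0) / (sigma / sqrt(n))   # 1.667
zobs2 <- (50.3 - mu0) / (sigma / sqrt(n))   # 5
2 * (1 - pnorm(abs(zobs1)))                 # p-value, about 0.096
2 * (1 - pnorm(abs(zobs2)))                 # p-value, about 5.7e-07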
[Figure 8.1: Calculation of the p-value for a Z-test. For zobs > 0, the p-value is given by (1 − Φ(zobs)) + Φ(−zobs), which simplifies to 2(1 − Φ(zobs)).]
8.5 Interpretation of p-Values
8.5.1 General points
Firstly, note that the p-value is not the probability of the null hypothesis being true. Nor is it the probability of the null hypothesis being
false. In the current framework, we cannot make probability statements
about hypotheses, just as we cannot make them directly about parameters.
The p-value is a measure of the evidence against the null hypothesis, which
is calculated as a probability of seeing certain kinds of data given that H0 is
true.
If the p-value is very small, it indicates that either H0 is false, or that
something very unusual has happened. A very small p-value is evidence
against H0 , although of course it can never absolutely prove that H0 is false;
something unusual could have happened.
If the p-value is not small, all that we can say is that there is no evidence
against H0 . Remember that it does not prove that H0 is true. Note also that
there is no difference in interpretation between say p ≈ 1, p ≈ 0.5, p ≈ 0.2;
all indicate that we have seen something that gives us no reason to doubt H0
in favour of HA .
8.5.2 Thresholds for p-values
The most common threshold is that a p-value less than 0.05 is regarded
as ‘good’ evidence to reject a null hypothesis. Of course, this is somewhat
arbitrary; a p-value just above this threshold is saying more or less the same
as one just below it.
A p-value less than 0.01 is regarded as ‘strong’ evidence to reject the null
hypothesis, and smaller and smaller p-values represent stronger and stronger
evidence against it. This is not the same thing as evidence that H0 is ‘more
and more wrong’ i.e. that θ is further and further from θ0 .
Any value above 0.1 is regarded as no evidence against the null hypothesis.
As already mentioned, you should not attach any meaning to the difference
between p values of say 0.2 and 0.9; in both cases, there is simply no evidence
against H0 .
A value between 0.1 and 0.05 is conventionally ‘weak’ or ‘some’ evidence
against H0 ; sometimes it is treated as essentially no evidence. Note that
there is a connection between these conventional thresholds for p-values and
the values for α usually used in constructing confidence intervals. We will
explore the connection between hypothesis testing and interval estimation in
more detail later in the module.
It is also important to appreciate that p-values, although extremely useful
and widely used, are not infallible and can lead to misunderstandings if
misused in conjunction with badly structured experiments. For this and
other reasons, the journal “Basic and Applied Social Psychology” banned
them in February 2015. This is a highly controversial move. To find out
more about it, go to
http://www.nature.com/news/psychology-journal-bans-p-values-1.17001
and/or
http://www.statslife.org.uk/news/2116-academic-journal-bans-p-value-significance
An alternative approach to hypothesis testing called Bayesian statistics is
becoming increasingly popular. This uses conditional probability and Bayes' theorem as inferential tools. It is assumed that you have some prior probability P(H) that a hypothesis of interest is valid. You then gather data and use this as evidence E to update to a new posterior probability P(H|E),
which is the conditional probability that the hypothesis is valid, given the
weight of evidence that you have obtained. You can learn more about this
approach in MAS364.
8.6 Testing a Normal Mean: Variance Unknown
If X1 , . . . , Xn are independent N (µ, σ 2 ) random variables and σ 2 is not known,
Z is no longer suitable as a test statistic; in fact it is no longer a statistic,
since it depends on the unknown value of σ.
There is a simple but important variant of the Z-test that can be used
to test hypotheses about µ in this situation. Define
T = (X̄ − µ0)/(S/√n).
Then if H0 :µ = µ0 is true, we know from section 7.4 that
T ∼ tn−1 .
The hypothesis test based on this result is known as the t-test. More precisely, it is called the one-sample t-test, since it deals with a single sample of
independent observations. The reason for the terminology is to distinguish
this test from an important, closely related test called the two-sample t-test,
based on a different structure for the data. Later in this module, we will look
at the two-sample t-test in detail, along with a wider range of uses for the
one-sample test.
In the most common case, where we have HA : µ ≠ µ0, we regard very
large or very small values of T as evidence against H0 . So the p-value is
P(T more extreme than tobs given H0)
= P(T > |tobs| given H0) + P(T < −|tobs| given H0)
= 2P(T > |tobs| given H0)
by the symmetry of the t distribution about zero. Since T ∼ t_{n−1} given H0, the p-value is
2(1 − F_{t_{n−1}}(|tobs|)),
where F_{t_{n−1}} is the cdf for the t-distribution having n − 1 degrees of freedom.
The cumulative probabilities for t distributions that are needed here can be
found easily using R.
Example 8.4 [Density of the Earth]
The accepted value for the density of the earth relative to water is 5.517.
Are the sample data in Example 7.4 consistent with this hypothesis?
We have H0 : µ = 5.517 and HA : µ ≠ 5.517, and we have already seen that x̄ = 5.4717, s = 0.2183, n = 12. So
T = (X̄ − µ0)/(S/√n) = (X̄ − 5.517)/(S/√12),
and inserting the observed values for x̄ and s we get
tobs = −0.719.
We need to compare this with the t distribution with n − 1 = 11 degrees of freedom. We have |tobs| = 0.719. From R, pt(-0.719,11) or 1-pt(0.719,11) gives 0.2436 to 4 d.p., and so the p-value is 0.487 to 3 d.p.
and there is no evidence against H0 .
This example can also be carried out entirely within R. The command
t.test(earth,mu=5.517)
produces the following output.
One-sample t-Test
data: earth
t = -0.7194, df = 11, p-value = 0.4869
alternative hypothesis: mean is not equal to 5.517
95 percent confidence interval:
5.332970 5.610363
sample estimates:
mean of x
5.471667
Note that R does not give any interpretation. Also, R does need to be
told µ0 = 5.517 here or it will assume that µ0 = 0 by default. The test can
also be carried out from the menus; again, the null mean 5.517 has to be
filled in.
8.7 Fixed significance levels and error probabilities
8.7.1 Testing with a fixed significance level
Rather than simply quantifying evidence, often we need explicitly to decide
whether or not to reject H0 . This amounts to fixing a threshold α, the
significance level of the test, and then rejecting H0 if the p-value is less than
α.
Note that this can equally well be done by calculating the corresponding
threshold level of the test statistic, known as the critical value, and comparing the observed value with that. This is essentially the same notion of
critical value that was used when we calculated confidence intervals.
For example, in the Z-test with a two-sided HA, the p-value is 2(1 − Φ(|zobs|)). To carry out a test with a fixed significance level of α, we reject H0 if
2(1 − Φ(|zobs|)) < α
⇔ 1 − Φ(|zobs|) < α/2
⇔ Φ(|zobs|) > 1 − α/2
⇔ |zobs| > Φ⁻¹(1 − α/2).
So we calculate Φ−1 (1−α/2), and compare zobs with that. In R, Φ−1 (1−α/2)
can be found using the command qnorm(1 − α/2, 0, 1).
For the Z-test at the 5% level, the critical value is Φ⁻¹(1 − 0.05/2) = 1.96, so we reject the null hypothesis at the 5% level if |zobs| > 1.96 (without
calculating the precise p-value; we know that p < 0.05). Similarly, at the 1%
level, the critical value is Φ−1 (1 − 0.01/2) = 2.58.
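As a quick check, both critical values can be obtained with qnorm:

qnorm(1 - 0.05/2, 0, 1)   # 5% level: 1.96
qnorm(1 - 0.01/2, 0, 1)   # 1% level: 2.58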
It is not quite so simple for the t-test; since there is a different distribution
for each value of the degrees of freedom parameter, there is also a different
critical value in each case.
The advantages of a test at a fixed level is that it is slightly easier, and that
it gives a clear decision. The disadvantages are that it gives less information,
and that it depends, of course, on the choice of the significance level. How
we make that choice depends on the probabilities of different outcomes of
the test and on their consequences.
Testing hypotheses at fixed significance level uses a similar way of thinking
to that we employed to calculate confidence intervals. In fact, if we are using
the normal distribution, then for significance level α, we compute the critical
value z_{α/2}. If zobs is in the interval [−z_{α/2}, z_{α/2}] we do not reject H0. If zobs is outside the interval, we reject H0.
e.g. In Example 8.3, if we tested at the 5% significance level, z_{α/2} = 1.96.
Observing x̄ = 50.1, gives zobs = 1.667 which is in the interval [−1.96, 1.96],
so we do not reject H0 . Observing x̄ = 50.3, gives zobs = 5 which is not in
the interval [−1.96, 1.96], so we reject H0 . These are the same conclusions
that we reached before using p-values.
8.7.2 Types and Probabilities of Errors
With a test at a fixed significance level, we have two clear outcomes—either
reject H0 , or don’t reject it. An important issue in choosing the significance
level of a test, and in other choices such as sample size, is the probability
that the test gives the wrong conclusion.
There are two kinds of error that we can make. Suppose that we have a
given significance level α.
• Type I error: reject the null hypothesis when it is in fact true
• Type II error: fail to reject null hypothesis when alternative is true
The first case is easier to deal with.
P (Type I error)
= P (p-value < α given H0 true)
= P (observed value is in region leading to p < α given H0 true)
= α.
We can control the probability of Type I error by choosing α. Note that it is
not the probability that we are wrong on a given occasion, or even in the long
run—it is a measure of the proportion of true null hypotheses that we would
reject. Obviously, all other things being equal, the lower the significance level
the better.
The second type of error is more complicated to quantify.
P (Type II error) = P (p-value > α given HA true).
It depends on the actual parameter value within HA . So usually, this error
probability is a function of the parameter, θ say, as well as of α. In general,
the higher the significance level, the less likely we are to fail to reject H0 , for
any given alternative. This is the argument for a higher significance level.
In practice, choosing a significance level for a test represents a trade-off
between the two sorts of errors. In fact there is a theoretical result, the Neyman-Pearson lemma, which, in principle, enables the probability of Type II error to be minimised for a fixed probability of Type I error.
Sometimes it is more natural to think about the probability of successfully
rejecting H0 when it is false. This is just 1 − P (Type II error), so is again a
function of the particular parameter value within HA , and is known as the
power of the test. Another way of stating the trade-off is that we want high
power but low Type I error probability.
Consider a Z-test with H0 : µ = µ0 and HA : µ ≠ µ0. If we choose to carry
out a test at the 5% level, we know that we reject H0 if | zobs |> 1.96.
The probability of this is 0.05 if H0 is true; if HA is true, the probability of
making a Type II error by not rejecting H0 depends on the actual value of
the population mean. Given the actual value µ, for all i = 1, 2, . . . , n
Xi ∼ N(µ, σ²),
X̄ ∼ N(µ, σ²/n),
X̄ − µ0 ∼ N(µ − µ0, σ²/n),
Y ∼ N((µ − µ0)/(σ/√n), 1),
where
Y = (X̄ − µ0)/(σ/√n).
So the power of the test, for a particular µ ≠ µ0, is
P(Reject H0 given µ) = P(|Y| > 1.96 given µ)
= P(|Y| > 1.96 given Y ∼ N((µ − µ0)/(σ/√n), 1))
= P(Y < −1.96 given Y ∼ N((µ − µ0)/(σ/√n), 1)) + P(Y > 1.96 given Y ∼ N((µ − µ0)/(σ/√n), 1))
= Φ(−1.96 − (µ − µ0)/(σ/√n)) + 1 − Φ(1.96 − (µ − µ0)/(σ/√n)).
Note the lack of symmetry; the two terms are not the same.
In the above calculation, we chose α = 0.05, and so the critical value zcrit = 1.96. In the general case where α is not specified, we have
P(Reject H0 given µ) = Φ(−zcrit − (µ − µ0)/(σ/√n)) + 1 − Φ(zcrit − (µ − µ0)/(σ/√n)),
where zcrit is the critical value for the particular significance level we want.
Example 16: Temperature Measurement Revisited
In the temperature measurement example (Example 8.3), we have X1, . . . , Xn ∼ N(µ, σ²), n = 25, σ = 0.3, H0 : µ = 50, HA : µ ≠ 50. So
Z = (X̄ − 50)/(0.3/√25) = (X̄ − 50)/0.06,
and we will reject H0 at the 5% level if | zobs |> 1.96.
If H0 is true, the probability of Type I error is 5%. If instead HA is true,
and the actual value is µ = 50.2, then
Y ∼ N((µ − µ0)/(σ/√n), 1) = N(0.2/0.06, 1) = N(3.333, 1),
and so the power of the test for µ = 50.2 is
P(Reject H0 given µ = 50.2) = P(|Y| > 1.96 given Y ∼ N(3.333, 1))
= P(Y < −1.96 given Y ∼ N(3.333, 1)) + P(Y > 1.96 given Y ∼ N(3.333, 1))
= Φ(−1.96 − 3.333) + 1 − Φ(1.96 − 3.333)
= 6 × 10⁻⁸ + 0.915
= 0.915 to 3 d.p.
So the power of the test, that is the probability of rejecting H0 , when HA is
true and µ = 50.2, is about 0.9; the Type II error probability is about 0.1.
If instead the actual value is 49.9, then
Y ∼ N((µ − µ0)/(σ/√n), 1) = N(−0.1/0.06, 1) = N(−1.667, 1),
and so
P (Reject H0 given µ = 49.9)
= Φ(−1.96 − (−1.667)) + 1 − Φ(1.96 − (−1.667))
= 0.385 + 0.00014
= 0.385 to 3d.p.
So when the true mean is µ = 49.9, the probability of rejecting H0 is much
lower, around 0.4. The probability of a Type II error is more than 0.6.
Alternative hypotheses close to µ0 are harder to detect.
We can repeat this calculation for a wide range of values of µ. Ideally a
test should have a power function that is always high and a low significance
level, so that all error probabilities are small; in practice, this is not possible,
as the Neyman-Pearson lemma tells us.
As well as depending on the significance level of the test, the power function also depends on the sample size. Larger sample sizes give a better
chance of rejecting the null hypothesis when it is false, for a test of a given
significance level.
If we carried out a Z-test at the 5% level based on just 9 temperature
measurements, we would have
Z = (X̄ − 50)/(0.3/√9) = (X̄ − 50)/0.1,
still rejecting H0 for |zobs | > 1.96. If instead HA is true, and the actual value
is 50.2, then
Y ∼ N((µ − µ0)/(σ/√n), 1) = N(0.2/0.1, 1) = N(2, 1),
and so the power of the test for µ = 50.2 is
P(Reject H0 given µ = 50.2) = P(|Y| > 1.96 given Y ∼ N(2, 1))
= P(Y < −1.96 given Y ∼ N(2, 1)) + P(Y > 1.96 given Y ∼ N(2, 1))
= Φ(−1.96 − 2) + 1 − Φ(1.96 − 2)
≈ 4 × 10⁻⁵ + 0.516
= 0.516 to 3 d.p.
The power when µ = 50.2 is much lower with n = 9 than with n = 25.
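The power calculations in this example can be packaged as a small R function; this is a sketch (the function name z.power is just for illustration), assuming a two-sided Z-test at significance level alpha.

# power of the two-sided Z-test when the true mean is mu
z.power <- function(mu, mu0, sigma, n, alpha = 0.05) {
  zcrit <- qnorm(1 - alpha / 2)
  shift <- (mu - mu0) / (sigma / sqrt(n))
  pnorm(-zcrit - shift) + 1 - pnorm(zcrit - shift)
}
z.power(50.2, 50, 0.3, 25)   # about 0.915
z.power(49.9, 50, 0.3, 25)   # about 0.385
z.power(50.2, 50, 0.3, 9)    # about 0.516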
[Figure 8.2: The distribution of Z under HA with a particular value of µ. Compare this with the distribution of Z under H0 as in Figure 8.1.]
[Figure 8.3: The power function of the Z-test in the worked example, with sample size 9.]
8.8 Two-sample problems
So far, we have concentrated on situations involving just a single sample of
independent observations from some distribution. Much of the rest of the
module is about relaxing these assumptions.
We start by looking at problems involving two samples, where we may be interested in measuring the same quantity, but the distributions of the two populations may be different. Often we will be interested in estimating
the difference between corresponding parameters in two distributions, or in
testing whether or not there is a difference.
Examples might be assessing the difference in effect of a medical treatment
in men and in women, the relative popularities of two policies, or the effect
of physical environment on how well individuals carry out certain tasks.
8.8.1 Estimators and Standard Errors
Firstly, let X1, . . . , X_{nX} and Y1, . . . , Y_{nY} be i.i.d. random variables representing independent samples from two populations, with means µX and µY, and variances σX² and σY², respectively. We already know that X̄ is an unbiased estimator for µX, with standard error σX/√nX, and similarly for Ȳ.
If we are interested in the difference µX − µY then the obvious estimator is X̄ − Ȳ. We have
E[X̄ − Ȳ] = E[X̄] − E[Ȳ] = µX − µY,
so this is an unbiased estimator of the difference. Since X1, . . . , X_{nX} and Y1, . . . , Y_{nY} are all independent,
Var[X̄ − Ȳ] = Var[X̄ + (−1)Ȳ] = Var[X̄] + (−1)²Var[Ȳ] = σX²/nX + σY²/nY.
So X̄ − Ȳ has standard error
√(σX²/nX + σY²/nY).
As usual, when σX² and σY² are not known, we can use the estimated standard error
√(sX²/nX + sY²/nY).
On the other hand, if we have two samples that are clearly linked, as in
Example 17 below, we may have nX = nY = n say; but the two samples
are not independent. In that example Xi and Yi are dependent, for any i; however, we can often assume that if i ≠ j, then Xi and Xj are independent, Yi and Yj are independent, and Xi and Yj are independent. The trick is to define new random variables Di = Xi − Yi, i = 1, . . . , n. Note that these are signed differences, not absolute differences. Then Di and Dj are independent, for i ≠ j, and
E[Di ] = E[Xi − Yi ]
= E[Xi ] − E[Yi ]
= µX − µY .
So we can use D̄ (which is, of course, equal to X̄ − Ȳ) as an estimator of the difference µX − µY. It has standard error σD/√n, where σD² = Var[Di]. The variance σD² cannot be calculated in terms of σX² and σY², since it is affected by the dependence between Xi and Yi, but it can be estimated from the data. The estimated standard error of D̄ is sD/√n. This is not the same as in the case of independent samples.
Example 17: Reaction times
For a sample of people, reaction times to a flashing light are measured in
ordinary daylight and in darkness, giving the following values (in seconds).
Reaction time in daylight   Reaction time in the dark   Difference
2.3                         1.8                         0.5
2.2                         2.3                         −0.1
2.4                         2.0                         0.4
2.3                         1.6                         0.7
2.0                         1.8                         0.2
2.3                         2.1                         0.2
2.8                         2.1                         0.7
1.7                         1.9                         −0.2
2.4                         2.4                         0.0
2.5                         2.1                         0.4
Note that each row corresponds to two measurements (and their difference) for a single person; the two measurements are definitely not independent.
Write Xi for the reaction time in daylight of the ith person, Yi for that person's reaction time in darkness, Di for the (signed) difference, µX, σX², etc. for the population means and variances, x̄, sX², etc. for the sample means and variances, and d̄ = x̄ − ȳ.
Our estimate of the population mean difference µD is d̄ = 0.28, with estimated standard error sD/√n = 0.316/√10 = 0.100. So the estimated mean difference in reaction times between daylight and darkness is 0.28 seconds with estimated standard error 0.10 seconds.
8.8.2 Paired samples
We would like to be able to give interval estimates and carry out hypothesis
tests in the case of paired observations. This would enable us to quantify
the difference in a more satisfactory way, and to test (for example) whether
there really is a difference between the population means.
Again, the trick is to work with the signed differences Di . Under the
assumptions mentioned above, the Di s are a random sample from the population of differences; we can often use familiar methods for inference on µD ,
provided that the other assumptions are met.
Example 18: Reactions Revisited
Assume that the differences in reaction times in Example 17 are normally distributed. Then, since the true standard deviation of the Di's is unknown, we can use the method of section 7.4, based on the t distribution, to get an interval estimate for the mean difference. The 95% interval is
d̄ ± t_{9,0.025} sD/√10 = 0.28 ± 2.262 × 0.316/√10 = (0.054, 0.506),
where we found t9,0.025 by using the R command qt(0.975, 9).
If we wanted to test the null hypothesis that there is no difference, that
is H0 : µD = 0 against HA : µD ≠ 0, say, we could use a t-test on the Di's; this is known as a paired t-test. We define
T = (D̄ − 0)/(sD/√n),
which gives
tobs = 0.28/(0.316/√10) = 2.80.
There is good evidence against H0 ; from R, the p-value is 2(1 − pt(2.8, 9)) =
0.021.
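The whole paired analysis can be reproduced with t.test; a sketch, assuming the daylight and dark reaction times are stored in two vectors in the order given in Example 17.

daylight <- c(2.3, 2.2, 2.4, 2.3, 2.0, 2.3, 2.8, 1.7, 2.4, 2.5)
dark     <- c(1.8, 2.3, 2.0, 1.6, 1.8, 2.1, 2.1, 1.9, 2.4, 2.1)
t.test(daylight, dark, paired = TRUE)   # paired t-test
t.test(daylight - dark, mu = 0)         # equivalent one-sample t-test on the differences
# both give t = 2.8 on 9 df, p-value about 0.021, and a 95% interval of about (0.054, 0.506)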
Example 19: Wearing Shoes.
In an experiment to compare two materials, 'X' and 'Y', for making shoes, one material is assigned at random to the left shoe of a child and the other to the right shoe. The amount of wear is measured as follows¹. In a real experiment there are likely to be many more observations.

Child                                    1     2     3     4     5
Amount of wear with material 'X': xi     4.7   15.4  9.2   7.4   2.1
Amount of wear with material 'Y': yi     3.0   15.2  7.8   5.4   1.6
Difference di                            1.7   0.2   1.4   2.0   0.5

¹ This is based on a classic example originally from Box, G.E.P., W.G. Hunter and J.S. Hunter (1978) Statistics for Experimenters: an Introduction to Design, Data Analysis, and Model Building. New York: John Wiley. Units are not specified.
We can test whether there is a real difference in wear by testing H0 : µD = 0 against the alternative HA : µD ≠ 0. Again, provided the differences are normally distributed, we can use a paired t-test. We have
T = (D̄ − 0)/(sD/√n),
and under H0, T ∼ t4. Here
tobs = 1.16/(0.777/√5) = 3.34.
The p-value is
2P(T > |tobs| given H0)
= 2(1 − P(T < |tobs| given H0))
= 2(1 − P(T < 3.34 given T ∼ t4))
= 2(1 − 0.9856)
= 0.0288.
So there is good evidence of a difference in wear between the two materials.
We can give a confidence interval for the mean difference in wear; a 95%
interval is
d̄ ± t_{4,0.025} sD/√5 = 1.16 ± 2.776 × 0.777/√5 = (0.20, 2.12).
Interpretation of this is limited in this example, since the units are unknown;
clearly the wear with material ‘X’ is greater than with ‘Y’, but the magnitude
of the difference is very uncertain.
8.8.3 Independent samples
For independent samples, we can work directly with the raw data, or with
summaries of each sample, rather than needing to calculate differences. Again,
we will concentrate here on the case where the observations are normally distributed, so
Xi ∼ N(µX, σX²), i = 1, . . . , nX,
and
Yi ∼ N(µY, σY²), i = 1, . . . , nY.
In this case, it follows that
X̄ − Ȳ ∼ N(µX − µY, σX²/nX + σY²/nY).
If both σX² and σY² are known, then this immediately allows us to construct confidence intervals and carry out hypothesis tests; we can use the fact that
((X̄ − Ȳ) − (µX − µY))/√(σX²/nX + σY²/nY) ∼ N(0, 1).
However, the much more common case is that the variances σX² and σY² are unknown. We can replace them by their estimates from the samples, to get the estimated standard error, which suggests using
((X̄ − Ȳ) − (µX − µY))/√(SX²/nX + SY²/nY).
Unfortunately, this statistic does not have a t distribution, but under suitable conditions (provided neither nX nor nY is less than 5) its distribution can be approximated by a suitably chosen t distribution, tν. It can be shown
that the degrees of freedom parameter, ν, lies between min{nX − 1, nY − 1}
and nX + nY − 2, depending on the relative variances and sample sizes. The
best approximation is given by
ν = (sX²/nX + sY²/nY)² / [(sX²/nX)²/(nX − 1) + (sY²/nY)²/(nY − 1)],
the Welch approximation, and this is what R uses. Note that ν is not necessarily an integer in this case. When doing calculations by hand, the simpler
‘approximation’ ν = min{nX − 1, nY − 1} is often used.
We have approximately
((X̄ − Ȳ) − (µX − µY))/√(SX²/nX + SY²/nY) ∼ tν,
so a 100(1 − α)% confidence interval for µX − µY is given by
(x̄ − ȳ) ± t_{ν,1−α/2} √(sX²/nX + sY²/nY)
and a test of H0 : µX − µY = µ0 against HA : µX − µY ≠ µ0 involves comparing
T = ((X̄ − Ȳ) − µ0)/√(SX²/nX + SY²/nY)
with the tν distribution it would have if H0 were true. Note that here, µ0 is
the hypothesised value of µX − µY ; very often µ0 = 0, since the hypothesis
of ‘no difference’ is naturally of interest.
Example 20: Mathematics teaching.
An eight-week trial of teaching mathematics to children aged 6 years has
been carried out. Those in Group 1 were regularly praised, whilst the ones in
Group 2 were not. At the end of the trial, all children took an examination.
A summary of the examination results is as follows, where X1 , . . . , XnX represent results obtained by children in Group 1, and Y1 , . . . , YnY those obtained
by Group 2.
Group   Sample size   Sample mean   Sample standard deviation
1       nX = 21       x̄ = 51.48     sX = 11.022
2       nY = 23       ȳ = 41.52     sY = 17.152
To estimate the effect of the difference in teaching method—that is, the
difference between µX and µY —we can use the estimate x̄ − ȳ = 9.96. The
estimated standard error is
√(sX²/nX + sY²/nY) = √(11.022²/21 + 17.152²/23) = 4.309.
We can approximate the degrees of freedom by ν = min{nX − 1, nY − 1} = 20. So a 95% confidence interval for µX − µY is
9.96 ± t_{20,0.025} × 4.309 = 9.96 ± 2.086 × 4.309 = (0.97, 18.95).
A test of the hypothesis H0 : µX − µY = 0, at a fixed level of 5%, can be
carried out simply by noting that the confidence interval does not contain 0.
To assess the evidence against H0 more usefully, we can calculate a p-value.
We have
T = ((X̄ − Ȳ) − 0)/√(SX²/nX + SY²/nY),
tobs = 9.96/4.309 = 2.31,
2P(T > |tobs| given H0) = 2(1 − F_{t20}(2.31)) = 0.0317.
So there is good evidence (p = 0.032) that the difference in teaching methods
does lead to a difference in performance; a 95% confidence interval for the
mean difference goes from about 1 mark to 19 marks.
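Since only summary statistics are given here, the calculation can be done from them directly in R; a sketch, which also computes the Welch approximation to ν for comparison with the simpler ν = 20 used above.

nx <- 21; xbar <- 51.48; sx <- 11.022
ny <- 23; ybar <- 41.52; sy <- 17.152
se <- sqrt(sx^2/nx + sy^2/ny)               # estimated standard error, about 4.31
tobs <- (xbar - ybar) / se                  # about 2.31
nu <- (sx^2/nx + sy^2/ny)^2 /
      ((sx^2/nx)^2/(nx - 1) + (sy^2/ny)^2/(ny - 1))   # Welch degrees of freedom, about 37.9
2 * (1 - pt(abs(tobs), 20))   # p-value with nu = 20, about 0.032
2 * (1 - pt(abs(tobs), nu))   # p-value with the Welch degrees of freedom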
Note that the 2-sample t-test described here is sometimes called the Welch
corrected, or Welch modified (2-sample) t-test, and does not assume equal
variances or equal sample sizes. In special cases, the calculation of the test
statistic or the degrees of freedom differs slightly.
In R, the Welch modified version is what the function t.test uses by default: the equal-variance option var.equal is set to FALSE, so for example t.test(x, y) carries out the Welch test with no change needed. If you use the menus, or set the options yourself, you should make sure that the equal variance option is deactivated. For more information see the R help pages, e.g. type ?t.test.
8.9 Confidence Intervals for a Proportion, Revisited
In this small final section, we revisit the work of section 7.2.1. Recall that
we obtained a confidence interval (7.4) for an unknown proportion p of the
following form
p̂ ± z_{α/2} √(p̂(1 − p̂)/n).
Here we have a sample of size n, the sample proportion is p̂ and zα/2 is the
critical value. Now let’s take another perspective. We ask the question, what
values of p0 would not be rejected if we tested the hypothesis H0 : p = p0
against the alternative HA : p ≠ p0? Given the level of significance α, the test statistic takes values (p̂ − p0)/√(p0(1 − p0)/n), and we will not reject H0 if
−z_{α/2} ≤ (p̂ − p0)/√(p0(1 − p0)/n) ≤ z_{α/2}.
Consider the case where equality holds. Then we obtain
(p̂ − p0)² = z²_{α/2} p0(1 − p0)/n,
and when we expand this and rearrange, we obtain a quadratic equation
for the unknown p0 :
p0²(1 + z²_{α/2}/n) − p0(2p̂ + z²_{α/2}/n) + p̂² = 0.
When we solve this, after some algebraic manipulation (and you should
check this), we get
p0 = [p̂ + z²_{α/2}/(2n) ± z_{α/2} √(p̂(1 − p̂)/n + z²_{α/2}/(4n²))] / (1 + z²_{α/2}/n),   (8.2)
and this gives the two limits for our required confidence interval. When n is very large, some of the terms in (8.2) are negligible. Also, in the case where we seek a 95% confidence interval, z_{α/2} = 1.96 ≈ 2. Then you can check in the exercises that a good approximation to (8.2) is
p̃ ± 1.96 √(p̃(1 − p̃)/(n + 4)),   (8.3)
where if p̂ = x/n, then p̃ = (x + 2)/(n + 4).
Note that (8.3) is the same as the Wald interval (7.4), but with two additional successes and two additional failures added to the data. For this reason (8.3) is called the plus four confidence interval for a proportion; the exact interval (8.2) from which it comes is known as the Wilson interval, after its creator.
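A sketch of the plus four interval in R, next to the Wald interval (7.4), for a hypothetical sample with x = 15 successes out of n = 50 trials (the numbers are for illustration only).

x <- 15; n <- 50
phat <- x / n
ptilde <- (x + 2) / (n + 4)
z <- 1.96
phat   + c(-1, 1) * z * sqrt(phat   * (1 - phat)   / n)        # Wald interval (7.4)
ptilde + c(-1, 1) * z * sqrt(ptilde * (1 - ptilde) / (n + 4))  # plus four interval (8.3)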
Chapter 9
Count Data, Contingency Tables and Goodness of Fit
9.1 Hypothesis Tests on Proportions
Count data arise when we are interested in counting some observed aspect
of an experiment, e.g. number of Ebola virus outbreaks in a population,
number of votes obtained by a candidate in an election. We might suppose,
for example, that the population from which our count data are taken has
a binomial, Poisson or multinomial distribution. Suppose that the data are
either “success” or “failure”, and we are counting number of successes. If we
have a sample of size n, and there are x successes, then we may calculate
the sample proportion p̂ = x/n. If p is the true probability of success in the
population, then we can test the hypothesis H0 : p = p0 against HA : p ≠ p0.
When testing proportions it is sometimes more realistic to use a one-sided
test and take either HA : p < p0 , or HA : p > p0 .
If n is large, then we can base our test statistic on the normal approximation to the binomial. In this case we should use a continuity correction as described in section 5.4. So Z ∼ N(0, 1), where
Z = (X ± 1/2 − np0)/√(np0(1 − p0)),   (9.1)
and X is the random variable: number of successes in n trials.
Example 21 The manufacturers of "Happy Crunch" claim that at most 20% of all breakfast cereal buyers purchase the rival brand "Sweet Bliss".
Test this claim at α = 0.01 if a random check at several supermarket outlets
found 58 purchases of “Sweet Bliss” among 200 cereal buyers.
Solution. We test H0 : p = 0.2 against HA : p > 0.2. As this is a
one-sided test, we compare with zα and not zα/2 . Using the R-command
qnorm(0.99, 0, 1), we have zα = 2.326. Substituting x = 58, n = 200 and p0 = 0.2 into (9.1) we get
zobs = (57.5 − 40)/√(200(0.2)(0.8)) = 3.09.
Since 3.09 > 2.326, we reject H0 , and conclude that the evidence suggests
that more than 20% of cereal lovers prefer “Sweet Bliss”.
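A sketch of the same calculation in R, together with the corresponding one-sided p-value; prop.test, which uses the continuity-corrected chi-square statistic, gives essentially the same answer.

x <- 58; n <- 200; p0 <- 0.2
zobs <- (x - 0.5 - n * p0) / sqrt(n * p0 * (1 - p0))   # continuity correction (x > n*p0)
zobs                 # about 3.09
1 - pnorm(zobs)      # one-sided p-value, about 0.001
qnorm(0.99, 0, 1)    # critical value z_alpha = 2.326 for alpha = 0.01
prop.test(x, n, p = 0.2, alternative = "greater")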
In more complicated situations we may want to test for differences between proportions. For example, suppose that we have samples of voters in
12 different cities and we want to know if the proportion favouring a given
political party is the same in each. To model this and similar problems, we
suppose that we have r independent random variables X1 , X2 , . . . , Xr where
Xk ∼Binom(pk , nk ) for k = 1, 2, . . . , r. In the example just mentioned, we’d
have r = 12, pk is the probability of a randomly chosen person in the kth city
supporting the party, and nk is the number of people sampled in city number
k. If the nk ’s are sufficiently large, we can use the CLT to approximate each
of these by standard normals Zk ∼ N (0, 1), where
Zk = (Xk − nk pk)/√(nk pk(1 − pk)).
But it is not convenient for hypothesis testing to deal with r different test
statistics. It is better to use just one random variable. In fact from section
7.5 we know that Q ∼ χ²_r, where
Q = Σ_{k=1}^r (Xk − nk pk)²/(nk pk(1 − pk)),   (9.2)
and this is the test statistic that we need. When we replace the random
variables Xk by the observed values xk in (9.2), we will write q instead of Q.
There is a convenient way to think about this problem, one that has far-reaching generalisations. Suppose that we arrange the data in a table as
follows:
              Successes   Failures
Sample 1      x1          n1 − x1
Sample 2      x2          n2 − x2
...           ...         ...
Sample r      xr          nr − xr
The numbers that appear in this table are called observed cell frequencies
and are denoted fij for i = 1, 2, . . . , r and j = 1, 2, so fi1 = xi , fi2 = ni − xi .
Now suppose that we test the null hypothesis H0 : p1 = p2 = · · · = pr
against HA : At least two of the pk ’s are different. If H0 is true, and p0 is the
common value of the pk ’s, then the expected cell frequencies are the numbers
eij for i = 1, 2, . . . , r and j = 1, 2, where ei1 = ni p0 and ei2 = ni (1 − p0 ). The
following result has important consequences.
Proposition 9.1.1.
q = Σ_{i=1}^r Σ_{j=1}^2 (fij − eij)²/eij.   (9.3)
Proof. Since we don't yet know it is q, let q′ denote the right hand side of (9.3). Then
q′ = Σ_{i=1}^r [(xi − ni p0)²/(ni p0) + (ni − xi − ni(1 − p0))²/(ni(1 − p0))]
   = Σ_{i=1}^r (xi − ni p0)² [1/(ni p0) + 1/(ni(1 − p0))]
   = Σ_{i=1}^r (xi − ni p0)²/(ni p0(1 − p0)) = q,
by (9.2), given that H0 is true.
9.2 Contingency Tables
In the last section we dealt with binomial random variables, where there are
only two possible outcomes, “success” or “failure”. The multinomial distribution generalises this to the case where there are c ≥ 2 possible outcomes.
It was covered in detail in Section 3.5, Chapter 3 of Semester 1 notes. Here
we give a brief reminder.
We have a sequence of n independent trials, each of which has c ≥ 2
possible outcomes, with probabilities θ1 , θ2 , . . . , θc respectively, where
θ1 + · · · + θc = 1.
Define
X1 to be the number of times outcome 1 occurs,
X2 to be the number of times outcome 2 occurs,
...
Xc to be the number of times outcome c occurs.
The joint distribution of the random variables X1 , X2 , . . . , Xc is called the
multinomial distribution with parameters n, θ1 , θ2 , . . . , θc , and it is denoted
by
Mn(n; θ1 , θ2 , . . . , θc ).
The joint probability function of X1, X2, . . . , Xc is
p_{X1···Xc}(x1, . . . , xc) = (n!/(x1! · · · xc!)) θ1^{x1} · · · θc^{xc}   if x1 + · · · + xc = n,
and p_{X1···Xc}(x1, . . . , xc) = 0 otherwise.
Using this distribution, we can generalise the table we compiled in the last section, where we had r different samples each of which could be a success or a failure, to the case where there are r samples, each with c possible outcomes. The resulting table (we'll see an example below) is called an r × c contingency table.
              Type 1   Type 2   · · ·   Type c
Sample 1      f11      f12      · · ·   f1c
Sample 2      f21      f22      · · ·   f2c
...           ...      ...              ...
Sample r      fr1      fr2      · · ·   frc
For example, suppose that we visit 12 different cities during the UK
General Election, and ask voters in each city if they plan to vote Conservative,
Labour, Liberal Democrat, Green, Ukip, Other or are Undecided. Then we
will compile a 12 × 7 contingency table of data.
When we have a contingency table, there are many different hypotheses that we might test. Here we will just look at one of these, and test for independence of rows and columns. To be precise, let's suppose that pij is the joint probability that the ith population has outcome j. Then we test the hypothesis H0 : pij = pi uj for all i, j against HA : pij ≠ pi uj for at least one value of i and j. Here pi and uj are the marginal probabilities: pi = Σ_{j=1}^c pij, for i = 1, 2, . . . , r, and uj = Σ_{i=1}^r pij, for j = 1, 2, . . . , c. Generalising (9.3), our test statistic is
q = Σ_{i=1}^r Σ_{j=1}^c (fij − eij)²/eij,   (9.4)
where, as before, eij are the expected cell frequencies if H0 were true, and fij are the observed cell frequencies.
The next point is very important.
We reject H0 if q > χ2ν,1−α where ν = (r − 1)(c − 1), and α is a
given significance level. Alternatively (as in Example 22 below),
we reject H0 if the p-value P (Q > q|H0 ) is smaller than 0.05, where
the random variable Q has a chi-squared distribution with ν degrees
of freedom.
Why is ν = (r − 1)(c − 1)? In general, when we use the chi-squared distribution for count data, we have ν = s − t − 1, where s is the total number of terms in the summation, and t is the number of independent parameters that are estimated from the data. In our case, s = rc. We must estimate the r parameters pi by fi = Σ_{j=1}^c fij/f, where f = Σ_{i=1}^r Σ_{j=1}^c fij is the total sum of all the count data points. Since Σ_{i=1}^r fi = (1/f) Σ_{i=1}^r Σ_{j=1}^c fij = 1, only r − 1 of these are independent. Similarly, we must estimate the c parameters uj by gj = Σ_{i=1}^r fij/f, and only c − 1 of these are independent. Hence
ν = s − t − 1 = rc − (r − 1) − (c − 1) − 1 = (r − 1)(c − 1).
Example 22: Restaurant Types
The following table presents data on restaurants in the UK. In this case
r = c = 3, and the rows label the type of ownership, while the columns
describe the nature of the food served. The table gives the count data along
with row and column totals for 259 restaurants. Is there any evidence of a
relationship between owner type and food type?
                                         Food type
Owner                      1: Fast food   2: Ethnic   3: Other   Total
1: Sole proprietorship     42             30          32         104
2: Partnership             8              9           7          24
3: Corporation             59             35          37         131
Total                      109            74          76         259
The estimates of the marginal probabilities for “owner type” are:
f1 = 104/259 ≈ 0.402,
f2 = 24/259 ≈ 0.093,
f3 = 131/259 ≈ 0.506.
Similarly for “food type”, we have:
g1 = 109/259 ≈ 0.421,
g2 = 74/259 ≈ 0.286,
g3 = 76/259 ≈ 0.293.
The expected counts eij = nfi gj are then as shown in the following table (to
1 d.p.).
                                         Food type
Owner                      1: Fast food   2: Ethnic   3: Other   Total
1: Sole proprietorship     43.8           29.7        30.5       104
2: Partnership             10.1           6.9         7.0        24
3: Corporation             55.1           37.4        38.4       131
Total                      109            74          76         259
The χ² statistic is calculated using (9.4) to give q = (42 − 43.8)²/43.8 + (30 − 29.7)²/29.7 + · · ·, and we get q = 1.736 (3 d.p.). We then compare q with the χ² distribution with (r − 1)(c − 1) = 4 degrees of freedom. The p-value for testing the null hypothesis of independence against the alternative of some form of dependence is 1 − pchisq(1.736, 4) = 0.784 (3 d.p.), and so we conclude that there is no evidence against the null hypothesis.
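The whole analysis can be carried out with chisq.test in R; a sketch, entering the observed counts as a matrix (without the marginal totals).

restaurants <- matrix(c(42, 30, 32,
                         8,  9,  7,
                        59, 35, 37),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(Owner = c("Sole prop.", "Partnership", "Corporation"),
                                      Food  = c("Fast food", "Ethnic", "Other")))
chisq.test(restaurants)             # X-squared = 1.736, df = 4, p-value about 0.78
chisq.test(restaurants)$expected    # table of expected counts e_ij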
For theoretical reasons that we won’t go into here, we should not use the
chi-squared test arbitrarily. One rule of thumb is that we require that none of
the eij ’s are smaller than 5. A weaker one that is also used is that all eij > 1
and at least 80% of the eij ’s are larger than 5. This also applies to the work
in the next section. If some of the eij ’s do turn out to be smaller than 5, then
we should combine cells together (which also reduces the number of degrees
of freedom).
9.3 Goodness of Fit
This applies to situations where we want to determine if a given data set is
a random sample from a population that has a hypothesised probability distribution. Suppose that we have m count data points, and that the observed
frequencies are O1 , O2 , . . . , Om . We want to compare these with the expected
frequencies E1 , E2 , . . . , Em if the hypothesis were true. We compute the test
statistic:
q = Σ_{i=1}^m (Oi − Ei)²/Ei,
and we reject H0 if q > χ²_{m−t−1,1−α}, where t is the number of independent parameters that are estimated from the data.
Example 23: Poisson counts
In a long sequence of observations on the behaviour of a laboratory animal, it is hypothesized that the numbers of actions categorised as ‘grooming’
in a day should follow a Poisson distribution (see Section 3.4.3 of Chapter 3)
with unknown mean.
Actual numbers of actions, over 60 days, are as shown in the following
table. Are these consistent with a Poisson distribution?
Number of actions   0    1    2    3    4   5   6   7 or more
Frequency           13   9    10   16   5   5   2   0
Estimating the Poisson mean as the mean of the observations gives a value
of
(9 + (10 × 2) + (16 × 3) + (5 × 4) + (5 × 5) + (2 × 6)) / (13 + 9 + 10 + 16 + 5 + 5 + 2) = 2.233 (to 3 d.p.).
The probabilities of the different possible values are then 0.107, 0.239, 0.267,
0.199, 0.111, 0.050, 0.018, 0.006, 0.002, . . . (to 3 d.p.) using dpois(0:8,2.233)
in R. The expected numbers of observations taking these values are obtained
by multiplying by n = 60, giving 6.4, 14.4, 16.0, 11.9, 6.7, 3.0, 1.1, 0.4, 0.1,
. . . ; to give high enough expected values in all classes, we combine values of
4 and over into a single class.
Number of actions   0    1    2    3    4   5   6   7+
Frequency           13   9    10   16   5   5   2   0
(the values 4 and over are combined into a single class)

Class               0      1      2      3      4 or more   Total
Observed number     13     9      10     16     12          60
Expected number     6.4    14.4   16.0   11.9   11.2        60
χ² term             6.7    2.0    2.3    1.4    0.1         12.4
The value q = 12.4 should then be compared with the distribution of Q under the null hypothesis; bearing in mind that one parameter has been estimated (using extra information: the original data, not just the counts in the final classes), that distribution is approximately χ²_{5−1−1} = χ²_3 (and, more precisely, lies between χ²_3 and χ²_4). From R, 1-pchisq(12.4,3) is 0.0061, and so there is strong evidence that the data do not come from a Poisson distribution.
It is clear from comparing observed and expected numbers that there are
more zeroes in the data than would be expected from a Poisson distribution
with the appropriate mean, and that the rest of the data are rather higher
than expected.
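A sketch of the goodness-of-fit calculation in R. The p-value is computed directly against χ²_3, because chisq.test itself would use 4 degrees of freedom, not knowing that the Poisson mean was estimated from the data.

counts <- c(13, 9, 10, 16, 5, 5, 2, 0)          # frequencies of 0, 1, ..., 7+ actions
lambda <- sum((0:7) * counts) / sum(counts)     # estimated Poisson mean, 2.233
probs <- dpois(0:3, lambda)
probs <- c(probs, 1 - sum(probs))               # P(0), P(1), P(2), P(3), P(4 or more)
observed <- c(counts[1:4], sum(counts[5:8]))    # combine 4 and over into one class
expected <- 60 * probs                          # 6.4, 14.4, 16.0, 11.9, 11.2
q <- sum((observed - expected)^2 / expected)    # about 12.4
1 - pchisq(q, df = length(observed) - 1 - 1)    # about 0.006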