MSB
(ST217: Mathematical Statistics B)
Aim
To review, expand & apply the ideas from MSA.
In particular, MSA mainly studied one unknown quantity at once. In MSB we’ll study interrelationships.
Lectures & Classes
Monday 12–1, R0.21
Wednesday 10–11, R0.21
Thursday 1–2, PLT
Examples classes will begin in week 3.
Style
• Lectures will be supplemented (NOT replaced!!) with printed notes.
Please take care of these notes—duplicates may not be readily available.
• I shall teach mainly by posing problems (both theoretical and applied) and working through them.
Contents
1. Overview of MSA.
2. Bivariate & Multivariate Probability Distributions.
Joint distributions, conditional distributions, marginal distributions; conditional expectation. The
χ², t, F and multivariate Normal distributions and their interrelationships.
3. Inference for Multiparameter Models.
Likelihood, frequentist and Bayesian inference, prediction and decision-making. Comparison between
various approaches. Point and interval estimation. Classical simple and composite hypothesis testing,
likelihood ratio tests, asymptotic results.
4. Linear Statistical Models.
Linear regression, multiple regression & analysis of variance models. Model choice, model checking
and residuals.
5. Further Topics (time permitting).
Nonlinear models, problems & paradoxes, etc.
Books
The books recommended for MSA are also useful for MSB. Excellent books on mathematical statistics are:
1. ‘Statistical Inference’ by George Casella & Roger L. Berger [C&B], Duxbury Press (1990),
2. ‘Probability and Statistics’ by Morris DeGroot, Addison-Wesley (2nd edition 1989).
A good book discussing the application and interpretation of statistical methods is ‘Introduction to the
Practice of Statistics’ by Moore & McCabe [M&M], Freeman (3rd edition 1998). Many of the data sets
considered below come from the ‘Handbook of Small Data Sets’ [HSDS] by Hand et al., Chapman & Hall,
London (1994).
There are many other useful references on mathematical statistics available in the library, including books
by Hogg & Craig [H&C], Lindgren, Mood, Graybill & Boes [MG&B], and Rice.
These notes are copyright © 1998, 1999, 2000, 2001 by J. E. H. Shaw.
Chapter 1
Overview of MSA
1.1
Basic Ideas
1.1.1
What is ‘Statistics’ ?
Statistics may be defined as:
‘The study of how information should be employed to reflect on, and give guidance for action
in, a practical situation involving uncertainty.’ [italics by JEHS]
Vic Barnett, Comparative Statistical Inference
Figure 1.1: A practical situation involving uncertainty
1.1.2
Statistical Modelling
The emphasis of modern statistics is on modelling the patterns and interrelationships in the existing data,
and then applying the chosen model(s) to predict future data.
Typically there is a measurable response (for example, reduction Y in patient’s blood pressure) that is
thought to be related to explanatory variables xj (for example, treatment applied, dose, patient’s age,
weight, etc.) We seek a formula that relates the observed responses to the corresponding explanatory
variables, and that can be used to predict future responses in terms of their corresponding explanatory
variables:
Observed Response = Fitted Value + Residual,
Future Response = Predicted Value + Error.
Here the fitted values should take account of all the consistent patterns in the data, and the residuals
represent the remaining random variation.
1.1.3
Prediction and Decision-Making
Always remember that the main aim in modelling as above is to predict (for example) the effects of different
medical treatments, and hence to decide which treatment to use, and in what circumstances.
The fundamental assumption is that the future data will be in some sense similar to existing data. The
ideas of exchangeability and conditional independence are crucial.
The following notation is useful:
X ⊥⊥ Y      'X is independent of Y', i.e. Y gives you no information about X,
X ⊥⊥ Y | Z   'X is conditionally independent of Y given Z', i.e. if you know the value taken by the RV Z, then Y gives you no further information about X.
Most methods of statistical inference proceed indirectly from what we know (the observed data and any
other relevant information) to what we really want to know (future, as yet unobserved, data), by assuming
that the random variation in the observed data can be thought of as a sample from an underlying population,
and learning about the properties of this population.
1.1.4
Known and Unknown Features of a Statistical Problem
A statistic is a property of a sample, whereas a parameter is a property of a population. Often it's natural to estimate a parameter θ (such as the population mean µ) by the corresponding property of the sample (here the sample mean X̄). Note that θ may be a vector or more complicated object.
Unobserved quantities are treated mathematically as random variables. Potentially observable quantities are usually denoted by capital letters (Xi, X̄, Y etc.). Once the data have been observed, the values taken by these random variables are known (Xi = xi, X̄ = x̄ etc.). Unobservable or hypothetical quantities are usually denoted by Greek letters (θ, µ, σ² etc.), and estimators are often denoted by putting a hat on the corresponding symbol (θ̂, µ̂, σ̂² etc.).
Nearly all statistics books use the above style of notation, so it will be adopted in these notes. However, sometimes I shall wish to distinguish carefully between knowns and unknowns, and shall denote all unknowns by capitals. Thus Θ represents an unknown parameter vector, and θ represents a particular assumed value of Θ. This is especially useful when considering probability distributions for parameters; one can then write fΘ(θ) and Pr(Θ = θ) by exact analogy with fX(x) and Pr(X = x).
The set of possible values for a RV X is called its sample space ΩX . Similarly the parameter space ΩΘ is
the set of possible values for the parameter Θ.
1.1.5
Likelihood
In general, we can infer properties θ of the population by comparing how compatible are the various possible
values of θ with the observed data. This motivates the idea of likelihood (equivalently, log-likelihood or
support). We need a probability model for the data, in which the probability distribution of the random
variation is a member of a (realistic but mathematically tractable) family of probability distributions,
indexed by a parameter θ.
Likelihood-based approaches have both advantages and disadvantages—
Advantages:
• Unified theory (many practical problems can be tackled in essentially the same way).
• Often get simple sufficient statistics (hence we can summarise a huge data set by a few simple properties).
• CLT suggests likelihood methods work well when there's loads of data.
Disadvantages:
• Is the theory directly relevant? (is likelihood alone enough? and how do we balance realism and tractability?)
• If the probability model is wrong, then results can be misleading (e.g. if one assumes a Normal distribution when the true distribution is Cauchy).
• One seldom has loads of data!
1.1.6
Where Will We Go from Here?
• MSA provided the mathematical toolbox (e.g. probability theory and the idea of random variables)
for studying random variation.
• MSB will add to this toolbox and study interrelationships between (random) variables.
• We shall also consider some important general forms for the fitted/predicted values, in particular
linear models and their generalizations.
1.2
Sampling Distributions
Statistical analysis involves calculating various statistics from the data, for example the maximum likelihood estimator (MLE) θ̂ for θ. We want to understand the properties of these statistics; hence the importance of the central limit theorem (CLT) & its generalizations, and of studying the probability distributions of transformed random variables.
If we have a formula for a summary statistic S, e.g. S = Σ Xi/n = X̄, and are prepared to make certain assumptions about the original random variables Xi, then we can say things about the probability distribution of S.
The probability distribution of a statistic S, i.e. the pattern of values S would take if it were calculated in successive samples similar to the one we actually have, is called its sampling distribution.
1.2.1
Typical Assumptions
1. Standard Assumption (IID RVs):
Xi are IID (independent and identically distributed) with (unknown) mean µ and variance σ².
This implies
(a) E[X̄] = E[Xi] = µ, and
(b) Var[X̄] = (1/n) Var[Xi] = σ²/n.
(c) If we define the standardised random variables
Zn = (X̄ − µ)/(σ/√n),
then as n → ∞, the distribution of Zn tends to the standard Normal N(0, 1) distribution.
2. Additional Assumption (Normality):
The Xi are IID Normal: Xi ∼ N(µ, σ²).
This implies that X̄ ∼ N(µ, σ²/n).
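A quick simulation makes these facts concrete. The sketch below (illustrative only, assuming NumPy) draws many samples of size n from a skewed Exponential distribution, for which µ = σ² = 1, and checks that X̄ has mean µ and variance σ²/n, and that the standardised Zn looks approximately N(0, 1):

import numpy as np

rng = np.random.default_rng(1)
n, reps = 30, 20_000
mu, sigma2 = 1.0, 1.0                       # Exp(1) has mean 1 and variance 1
xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)

print(xbar.mean(), mu)                      # E[Xbar] is close to mu
print(xbar.var(), sigma2 / n)               # Var[Xbar] is close to sigma^2 / n
zn = (xbar - mu) / np.sqrt(sigma2 / n)
print(zn.mean(), zn.std())                  # roughly 0 and 1, as the CLT suggests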
1.2.2
Further Uses of Sampling Distributions
We can also
• compare various plausible estimators (e.g. to estimate the centre of symmetry of a supposedly symmetric distribution we might use the sample mean, median, or something more exotic),
• obtain interval estimates for unknown quantities (e.g. 95% confidence intervals, HPD intervals, support intervals),
• test hypotheses about unknown quantities.
Comments
1. Note the importance of expectations of (possibly transformed) random variables:
E[X] = µ (measure of location)
E[(X − µ)²] = σ² (measure of scale)
E[exp(sX)] = moment generating function
E[exp(itX)] = characteristic function
2. We must always consider whether the assumptions made are reasonable, both from general considerations (e.g.: is independence reasonable? is the assumption of identical distributions reasonable?
is it reasonable to assume that the data follow a Poisson distribution? etc.) and with reference to
the observed set of data (e.g. are there any ‘outliers’—unreasonably extreme values—or unexpected
patterns?)
3. Likelihood and other methods suggest estimators for unknown quantities of interest (parameters etc.)
under certain specified assumptions.
Even if these assumptions are invalid (and in practice they always will be to some extent!) we may still
want to use summary statistics as estimators of properties of the underlying population. Therefore
(a) We’ll want to investigate the properties of estimators under various relaxed assumptions, for
example partially specified models that use only the first and second moments of the unknown
quantities.
(b) It’s useful if the calculated statistics (e.g. MLEs) have an intuitive interpretation (like ‘sample
mean’ or ‘sample variance’).
1.3
(Revision?) Problems
1. First-year students attending a statistics course were asked to carry out the following procedure:
Toss two coins, without showing anyone else the results.
If the first coin showed ‘Heads’ then answer the following question:
“Did the second coin show ‘Heads’ ? (Yes or No)”
If the first coin showed ‘Tails’ then answer the following question:
“Have you ever watched a complete episode of ‘Teletubbies’ ? (Yes or No)”
The following results were recorded:

             Yes    No
   Males      84    48
   Females    23    24
For each sex, and for both sexes combined, estimate the proportion who have watched a complete
episode of ‘Teletubbies’.
Using a chi-squared test, or otherwise, test whether the proportions differ between the sexes.
Discuss the assumptions you have made in carrying out your analysis.
2. Let X and Y be IID RVs with a standard Normal N (0, 1) distribution, and define Z = X/Y .
(a) Write down the lower quartile, median and upper quartile of Z, i.e. the points z25 , z50 & z75
such that Pr(Z < zk ) = k/100.
(b) Show that Z has a Cauchy distribution, with PDF 1/[π(z² + 1)].
HINT : consider the transformation Z = X/Y and W = |Y |.
3. Let X1 , . . . Xn be mutually independent RVs, with respective MGFs (moment generating functions)
MX1 (t), . . . , MXn (t), and let a1 , . . . , an and b1 , . . . , bn be fixed constants.
Show that the MGF of Z = (a1 X1 + b1) + (a2 X2 + b2) + · · · + (an Xn + bn) is
MZ(t) = exp(t Σ bi) × MX1(a1 t) × · · · × MXn(an t).
Hence or otherwise show that any linear combination of independent Normal RVs is itself Normally
distributed.
4. A workman has to move a rectangular stone block a short distance, but doesn’t want to strain himself.
He rapidly estimates:
• height of block = 10 cm, with standard deviation 1 cm.
• width of block = 20 cm, with standard deviation 3 cm.
• length of block = 25 cm, with standard deviation 4 cm.
• density of block = 4.0 g/cc, with standard deviation 0.5 g/cc.
Assuming these estimates are mutually independent, calculate his estimates of the volume V (cc)
and total weight W (Kg) of the block, and their standard deviations.
The workman fears that he might hurt his back if W ≥ 30.
Using Chebyshev’s inequality, give an upper bound for his probability Pr(W ≥ 30).
[Chebyshev’s inequality states that if X has mean µ & variance σ 2 , then Pr(|X − µ| ≥ c) ≤
σ 2 /c2 —see MSA].
What is the workman’s value for Pr(W > 30) under the additional assumption that W is Normally
distributed? Compare this value with the bound found earlier.
How reasonable are the independence and Normality assumptions used in the above analysis?
5. Calculate the MLE of the centre of symmetry θ, given IID RVs X1 , X2 , . . . , Xn , where the common
PDF fX (x) of the Xi s is
(a) Normal (or Gaussian):
fX(x|θ, σ) = [1/(√(2π) σ)] exp{−½ ((x − θ)/σ)²}
(b) Laplacian (or Double Exponential):
fX(x|θ, σ) = [1/(2σ)] exp{−|x − θ|/σ}
(c) Uniform (or Rectangular):
fX(x|θ) = 1 if θ − ½ < x < θ + ½, and 0 otherwise.
Do you consider these MLEs to be intuitively reasonable?
6. Calculate E[X], E[X 2 ], E[X 3 ] and E[X 4 ] under each of the following assumptions:
(a) X ∼ Poi(λ), i.e. X has PMF (probability mass function)
Pr(X = x|λ) = λ^x exp(−λ)/x!   (x = 0, 1, 2, . . . )
(b) X ∼ Exp(β), i.e. X has PDF (probability density function)
fX(x|β) = β e^{−βx} if x > 0, and 0 otherwise.
(c) X ∼ N(µ, σ²), i.e. X has PDF
fX(x|µ, σ) = [1/(√(2π) σ)] exp{−½ ((x − µ)/σ)²}
7. Describe briefly how, and under what circumstances, you might approximate
(a) a binomial distribution by a Normal distribution,
(b) a binomial distribution by a Poisson distribution,
(c) a Poisson distribution by a Normal distribution.
Suppose X ∼ Bin(100, 0.1), Y ∼ Poi(10), and Z ∼ N(10, 3²). Calculate, or look up in tables,
(i) Pr(X ≥ 6),   (ii) Pr(Y ≥ 6),   (iii) Pr(Z > 5.5),
(iv) Pr(X > 16),   (v) Pr(Y > 16),   (vi) Pr(Z > 16.5),
and comment on the accuracy of the approximations here.
8. The t distribution with n degrees of freedom, denoted tn or t(n), has the PDF
f(t) = [Γ(½(n + 1)) / (Γ(½n) √(nπ))] × (1 + t²/n)^{−(n+1)/2},   −∞ < t < ∞,
and the F distribution with m and n degrees of freedom, denoted Fm,n or F(m, n), has PDF
f(x) = [Γ(½(m + n)) / (Γ(½m) Γ(½n))] × m^{m/2} n^{n/2} x^{(m/2)−1} / (mx + n)^{(m+n)/2},   0 < x < ∞,
with f(x) = 0 for x ≤ 0.
Show that if T ∼ tn and X ∼ Fm,n , then T 2 and X −1 both have F distributions.
9. Table 1.1 shows the estimated total resident population (thousands) of England and Wales at
30 June 1993:
Age         <1      1–14      15–44     45–64     65–74      ≥75      Total
Persons    669.6   9,268.0   21,875.0  11,435.8   4,595.9   3,594.9   51,439.2
Males      343.1   4,756.9   11,115.6   5,676.6   2,081.7   1,224.5   25,198.4
Females    326.5   4,511.1   10,759.4   5,759.2   2,514.2   2,370.4   26,240.8

Table 1.1: Estimated resident population of England & Wales, mid 1993, by sex and age-group (simplified from Table 1 of the 1993 mortality tables)
Table 1.2, also extracted from the published 1993 Mortality Statistics, shows the number of deaths
in 1993 among the resident population of England and Wales, categorised by sex, age-group and
underlying cause of death.
Assume that the rates observed in Tables 1.1 and 1.2 hold exactly, and suppose that an individual
I is chosen at random from the population. Define the random variables S (sex), A (age group), D
(death) and C (cause) as follows:
S = 0 if I is male, 1 if I is female,
A = 1 if I is under 1 year old, 2 if I is aged 1–14, 3 if I is aged 15–44, 4 if I is aged 45–64, 5 if I is aged 65–74, 6 if I is 75 years old or over,
D = 0 if I survives the year, 1 if I dies,
C = cause of death (0–17).
For example,
Pr(S=0) = 25198.4/51439.2,
Pr(S=0 & A=6) = 1224.5/51439.2,
Pr(D=0 | S=0 & A=6) = 1 − 138.239/1224.5,
Pr(C=8 | S=0 & A=6) = 28.645/1224.5,
etc.
(a) Calculate Pr(D=1|S=0), and Pr(D=1|S=0 & A=a) for a = 1, 2, 3, 4, 5, 6.
Also calculate Pr(S=0|D=1), and Pr(S=0|D=1 & A=a) for a = 1, 2, 3, 4, 5, 6.
If you were an actuary, and were asked by a non-expert “is the death rate for males higher or
lower than that for females?”, how would you respond based on the above calculations? Justify
your answer.
(b) Similarly, explain how you would respond to the questions
i. “is the death rate from neoplasms higher for males or for females?”
ii. “is the death rate from mental disorders higher for males or for females?”
iii. “is the death rate from diseases of the circulatory system higher for males or for females?”
iv. “is the death rate from diseases of the respiratory system higher for males or for females?”
Cause of death                                        Sex   All ages     <1    1–14   15–44    45–64    65–74      ≥75
0  Deaths below 28 days (no cause specified)           M      1,603   1,603      −       −        −        −        −
                                                       F      1,192   1,192      −       −        −        −        −
1  Infectious & parasitic diseases                     M      1,954      60     79     565      390      346      514
                                                       F      1,452      46     44     169      193      283      717
2  Neoplasms                                           M     74,480      16    195   2,000   16,372   25,644   30,253
                                                       F     67,966       8    138   2,551   15,026   19,141   31,102
3  Endocrine, nutritional & metabolic diseases
   and immunity disorders                              M      3,515      28     43     208      639      959    1,638
                                                       F      4,403      17     37     153      474      901    2,821
4  Diseases of blood and blood-forming organs          M        897       5     12      62      106      204      508
                                                       F      1,084       3     14      28       73      163      803
5  Mental disorders                                    M      2,530       −      8     281      169      334    1,738
                                                       F      5,189       −      1      83       99      297    4,709
6  Diseases of the nervous system and sense organs     M      4,403      59    136     530      675      890    2,113
                                                       F      4,717      42    118     313      546      809    2,889
7  Diseases of the circulatory system                  M    123,717      41     66   1,997   20,682   37,195   63,736
                                                       F    134,439      44     45     834    7,783   23,185  102,548
8  Diseases of the respiratory system                  M     41,802      86     79     608    3,157    9,227   28,645
                                                       F     49,068      59     74     322    2,145    6,602   39,866
9  Diseases of the digestive system                    M      7,848      10     27     511    1,706    2,058    3,536
                                                       F     10,574      20     14     298    1,193    1,921    7,128
10 Diseases of the genitourinary system                M      3,008       4      6      57      215      676    2,050
                                                       F      3,710       4      7      55      219      535    2,890
11 Complications of pregnancy, childbirth
   and the puerperium                                  M          −       −      −       −        −        −        −
                                                       F         27       −      −      27        −        −        −
12 Diseases of the skin and subcutaneous tissue        M        269       1      1       7       22       62      176
                                                       F        748       −      −      15       30       80      623
13 Diseases of the musculoskeletal system
   and connective tissue                               M        785       1      5      28      106      151      494
                                                       F      2,639       −      5      43      173      385    2,033
14 Congenital anomalies                                M        660     131    114     158      118       58       81
                                                       F        675     136    116     133      101       87      102
15 Certain conditions originating in the
   perinatal period                                    M        186      93      8      13       18       16       38
                                                       F        114      60      5       3        4       10       32
16 Signs, symptoms and ill-defined conditions          M      1,642     238     17     126      111       72    1,078
                                                       F      5,146     171     17      50       53       75    4,780
17 External causes of injury and poisoning             M      9,859      34    311   4,749    2,183      941    1,641
                                                       F      5,869      30    162   1,240      882      731    2,824
   Total                                               M    279,158   2,410  1,107  11,900   46,669   78,833  138,239
                                                       F    299,012   1,832    797   6,317   28,994   55,205  205,867

Table 1.2: Deaths in England & Wales, 1993, by underlying cause, sex and age-group (age at death in years; extracted from Table 2 of the 1993 mortality tables)
(c) Now treat the data in Tables 1.1 & 1.2 as subject to statistical fluctuations. One can still estimate
psac = Pr(S=s & A=a & C=c),   p·ac = Pr(A=a & C=c),   ps·· = Pr(S=s),   etc.
from the data, for example p̂0,·,14 = 660/25198400 = 2.62×10⁻⁵. Similarly estimate p1,·,14 and p·,a,14 for a = 1, . . . , 6. Using a chi-squared test or otherwise, investigate whether the relative risk of death from a congenital anomaly between males and females is the same at all ages, i.e. whether it is reasonable to assume that
ps,a,14 = ps,·,14 × p·,a,14.
10. Data were collected on litter size and sex ratios for a large number of litters of piglets. The following
table gives the data for all litters of size between four and twelve:
Number                             Litter size
of males      4     5     6     7     8     9    10    11    12
    0         1     2     3     0     1     0     0     0     0
    1        14    20    16    21     8     2     7     0     0
    2        23    41    53    63    37    23     8     1     0
    3        14    35    78   117    81    72    19     3     1
    4         1    14    53   104   162   101    79    15     8
    5               4    18    46    77    83    82    15     4
    6                     0    21    30    46    48    33     9
    7                           2     5    12    24    13    18
    8                                 1     7    10    12    11
    9                                       0     0     8    15
   10                                             0     1     4
   11                                                   1     0
   12                                                         0
Total        53   116   221   374   402   346   277   102    70
(a) Discuss briefly what sort of probability distributions it might be reasonable to assume for the
total size N of a litter, and for the number M of males in a litter of size N = n.
(b) Suppose now that the litter size N follows a Poisson distribution with mean λ. Write down
an expression for Pr(N = n|4 ≤ N ≤ 12). Hence or otherwise give an expression for the
log-likelihood ℓ(λ; . . .) given the above table of data.
(c) Evaluate ℓ(λ; . . .) at λ = 7.5, 8 and 8.5. By fitting a quadratic to these values, provide point
and interval estimates of λ.
(d) Using a chi-squared test or otherwise, check how well your model fits the data.
(e) Comment on the following argument: ‘Provided λ isn’t too small, we could approximate the
Poisson distribution Poi (λ) by the Normal distribution N (λ, λ). This is symmetric, so we may
simply estimate the mean λ by the mode of the data (8 in our case). The standard deviation is
therefore nearly 3, and so we would expect the counts at litter sizes 8 ± 3 to be nearly 60% of the
count at 8 (note that for a standard Normal, φ(1)/φ(0) = exp(−0.5) ≈ 0.6). Since there are far
fewer litters of size 5 & 11 than this, the Poisson distribution must be a poor fit.’
Data from HSDS, set 176
Education is what survives when what has been learnt has been forgotten.
Burrhus Frederic Skinner
Chapter 2
Bivariate & Multivariate
Distributions
MSA largely concerned IID (independent & identically distributed) random variables.
However in practice we are usually most interested in several random variables simultaneously, and their
interrelationships. Therefore we need to consider the probability distributions of random vectors, i.e. the
joint distribution of the individual random variables.
Bivariate Examples
A. (X1 , X2 ), the number of male & female pigs in a litter.
B. (X, Y ), the systolic and diastolic blood pressure of an individual.
C. (X, Y ), the age and height of an individual.
D. (X, Y ), the height and weight of an individual.
E. (µ̂, σ̂²), the estimated common mean and variance of n IID random variables X1, . . . , Xn.
F. (Θ, X) where Θ ∼ U(0, 1) and X|Θ ∼ Bin(n, Θ), i.e.
fΘ(θ) = 1 if 0 < θ < 1, and 0 otherwise,
fX(x|Θ = θ) = (n choose x) θ^x (1 − θ)^{n−x},   x = 0, 1, . . . , n.
Definition 2.1 (Bivariate CDF)
The joint cumulative distribution function of 2 RVs X & Y is the function
FX,Y(x, y) = Pr(X ≤ x & Y ≤ y),   (x, y) ∈ R².   (2.1)
Comments
1. The joint cumulative distribution function (or joint CDF) may also be called the ‘joint distribution
function’ or ‘joint DF’.
2. If there’s no ambiguity, then we may simply write F (x, y) for FX,Y (x, y).
2.1
Discrete Bivariate Distributions
If RVs X & Y are discrete, then they have a discrete joint distribution and a probability mass function
(PMF) that, similarly to the univariate case, is usually written fX,Y (x, y) or more simply f (x, y):
Definition 2.2 (Bivariate PMF)
The joint probability mass function of discrete RVs X and Y is
f (x, y) = Pr(X = x & Y = y).
Exercise 2.1
Suppose that the numbers X1 and X2 of male and female piglets follow independent Poisson distributions
with means λ1 & λ2 respectively. Find the joint PMF.
k
Exercise 2.2
Now assume the model N ∼ Poi (λ), (X1 |N ) ∼ Bin(N, θ), i.e. the total number N of piglets follows a
Poisson distribution, and, conditional on N = n, X has a Bin(n, θ) distribution (in particular θ = 0.5 if
the sexes are equally likely). Again find the joint PMF.
k
Exercise 2.3
Verify that the two models given in Exercises 2.1 & 2.2 give identical fitted values, and are therefore in
practice indistinguishable.
k
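The two models can also be compared by simulation. A minimal sketch (illustrative only, assuming NumPy): with λ = λ1 + λ2 and θ = λ1/(λ1 + λ2), the empirical joint frequencies of (X1, X2) from the two sampling schemes agree, which is essentially what Exercise 2.3 asks you to verify analytically.

import numpy as np

rng = np.random.default_rng(2)
lam1, lam2, reps = 3.0, 4.0, 200_000

# Model of Exercise 2.1: independent Poisson counts of males and females
x1_a = rng.poisson(lam1, reps)
x2_a = rng.poisson(lam2, reps)

# Model of Exercise 2.2: Poisson total litter size, then a binomial split
n_tot = rng.poisson(lam1 + lam2, reps)
x1_b = rng.binomial(n_tot, lam1 / (lam1 + lam2))
x2_b = n_tot - x1_b

for cell in [(0, 0), (2, 3), (4, 5)]:       # compare a few joint cells
    pa = np.mean((x1_a == cell[0]) & (x2_a == cell[1]))
    pb = np.mean((x1_b == cell[0]) & (x2_b == cell[1]))
    print(cell, round(pa, 4), round(pb, 4))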
2.1.1
Manipulation
A discrete RV has a countable sample space, which without loss of generality can be represented as
N = {0, 1, 2, . . .}. Values of a discrete joint distribution f (x, y) can therefore be tabulated:
                     Y
           0       1       2       3      . . .
   X  0   f00     f01     f02     . . .
      1   f10     f11     f12     . . .
      ..  ..      ..      ..

and the probability of any event E obtained by simple summation:

Pr((X, Y) ∈ E) = Σ_{(xi, yi) ∈ E} f(xi, yi).
Exercise 2.4
Continuing Exercise 2.2, find the PMF of X1 , and hence identify the distribution of X1 .
k
Exercise 2.5
The RV Q is defined on the rational numbers in [0, 1] by Q = X/Y, where f(x, y) = (1 − α)α^{y−1}/(y + 1),
0 < α < 1, y ∈ {1, 2, . . .}, x ∈ {0, 1, . . . , y}.
Show that Pr(Q = 0) = (α − 1)(α + log(1 − α))/α².
k
2.2
Continuous Bivariate Distributions
Definition 2.3 (Continuous bivariate distribution)
Random variables X & Y have a continuous joint distribution if there exists a function f from R² to [0, ∞) such that
Pr((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy   ∀ A ⊆ R².   (2.2)
Definition 2.4 (Bivariate PDF)
The function f (x, y) defined by Equation 2.2 is called the joint probability density function of X & Y .
Comments
1. f(x, y) may be written more explicitly as fX,Y(x, y).
2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.
3. f(x, y) is not unique—it could be arbitrarily defined at a countable set of points (xi, yi) (more generally, any 'set with measure zero') without changing the value of ∫∫_A f(x, y) dx dy for any set A.
4. f(x, y) ≥ 0 at all continuity points (x, y) ∈ R².
Examples
1. As in Example E above, we will want to know properties of the joint distribution of (µ̂, σ̂²), the MLEs of µ and σ² respectively given X1, . . . , Xn IID ∼ N(µ, σ²).
2. In the situation of Example B above, where X is the systolic blood pressure and Y the diastolic blood pressure of an individual, it might be reasonable to assume that
X ∼ N(µS, σS²),
Y|X ∼ N(α + βX, σD²),
and hence obtain
fX,Y(x, y) = fX(x) fY|X(y|x).
Comment
As in Exercise 2.2, a family of multivariate distributions is most easily built up hierarchically using simple univariate distributions and conditional distributions like that of Y |X. Conditional distributions are
considered formally in Section 2.4.
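The hierarchical construction is also exactly how one would simulate from such a model. A minimal sketch (illustrative only, assuming NumPy; the numerical values of µS, σS, α, β and σD below are invented purely for illustration):

import numpy as np

rng = np.random.default_rng(3)
n = 50_000
mu_s, sd_s = 130.0, 15.0              # hypothetical systolic mean and SD
alpha, beta, sd_d = 20.0, 0.45, 8.0   # hypothetical conditional model for diastolic

x = rng.normal(mu_s, sd_s, n)             # X ~ N(mu_S, sigma_S^2)
y = rng.normal(alpha + beta * x, sd_d)    # Y | X ~ N(alpha + beta*X, sigma_D^2)

print(np.corrcoef(x, y)[0, 1])            # the correlation induced between X and Y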
2.2.1
Visualising and Displaying a Continuous Joint Distribution
A continuous bivariate distribution can be represented by a contour or other plot of its joint PDF (Fig. 2.1).
Comments
1. The joint distribution of X and Y may be neither discrete nor continuous, for example:
• Either X or Y may have both continuous and discrete components,
• One of X and Y may have a continuous distribution, the other discrete (like Example F above).
2. Higher dimensional joint distributions are obviously much more difficult to interpret and to represent
graphically, with or without computer help.
Figure 2.1: Contour and perspective plots of a bivariate distribution
2.3
Marginal Distributions
Given a joint CDF FX,Y (x, y), the distributions defined by the CDFs FX (x) = limy→∞ FX,Y (x, y) and
FY (y) = limx→∞ FX,Y (x, y) are called the marginal distributions of X and Y respectively:
Definition 2.5 (Marginal CDF, PMF and PDF—bivariate case)
FX(x) = lim_{y→∞} FX,Y(x, y) is the marginal CDF of X.
If X has a discrete distribution, then fX(x) = Pr(X = x) is the marginal PMF of X.
If X has a continuous distribution, then fX(x) = (d/dx) FX(x) is the marginal PDF of X.
Marginal CDFs and PDFs of Y , and of other RVs for higher-dimensional joint distributions, are defined
similarly.
Exercise 2.6
Suppose that you are given a bag containing five coins:
1 double-tailed,
1 with Pr(head) = 1/4,
2 fair,
1 double-headed.
You pick one coin at random (each with probability 1/5), then toss it twice.
By finding the joint distribution of Θ = Pr(head) and X = number of heads, or otherwise, calculate the
distribution of the number of heads obtained.
k
Comments
1. If you’ve tabulated Pr(Θ = θ & X = x), then it’s simple to find FΘ (θ) and FX (x) by writing the row
sums and column sums in the margins of the table of Pr(Θ = θ & X = x)—hence the name ‘marginal
distribution’.
2. Although the most satisfactory general definition of marginal distributions is in terms of their CDFs,
in practice it’s usually easiest to work with PMFs or PDFs
2.4
Conditional Distributions
2.4.1
Discrete Case
If X and Y are discrete RVs then, by definition,
Pr(Y=y | X=x) = Pr(X=x & Y=y) / Pr(X=x).   (2.3)
In other words (or, more accurately, in other symbols):
Definition 2.6 (Conditional PMF—bivariate case)
If X and Y have a discrete joint distribution with PMF fX,Y (x, y), then the conditional PMF fY |X
of Y given X = x is
fY|X(y|x) = fX,Y(x, y) / fX(x),   (2.4)
where fX(x) = Σ_y fX,Y(x, y) is the marginal PMF of X.
Exercise 2.7
Continuing Exercise 2.6, what are the conditional distributions of [X |Θ = 1/4] and [Θ|X = 0]?
k
2.4.2
Continuous Case
Now suppose that X and Y have a continuous joint distribution. If we observe X = x, then we will want
to know the conditional CDF FY |X (y|X = x). But we CAN’T use Equation 2.3 directly, which would
entail dividing by zero. Therefore, by analogy with Equation 2.4, we adopt the following definition:
Definition 2.7 (Conditional PDF—bivariate case)
If X and Y have a continuous joint distribution with PDF fX,Y (x, y), then the conditional PDF fY |X
of Y given that X = x is
fY|X(y|x) = fX,Y(x, y) / fX(x),   (2.5)
defined for all x ∈ R such that fX(x) > 0.
2.4.3
Independence
Recall that two RVs X and Y are independent (X ⊥⊥ Y) if, for any two sets A, B ⊆ R,
Pr(X ∈ A & Y ∈ B) = Pr(X ∈ A) Pr(Y ∈ B).   (2.6)
Exercise 2.8
Show that X and Y are independent according to Formula 2.6 if and only if
FX,Y(x, y) = FX(x) FY(y),   −∞ < x, y < ∞,   (2.7)
or equivalently if and only if
fX,Y(x, y) = fX(x) fY(y),   −∞ < x, y < ∞,   (2.8)
(where the functions f are interpreted as PMFs or PDFs in the discrete or continuous case respectively).
k
2.5
Problems
1. Let the function f(x, y) be defined by
f(x, y) = 6xy² if 0 < x < 1 and 0 < y < 1, and 0 otherwise.
(a) Show that f (x, y) is a probability density function.
(b) If X and Y have the joint PDF f (x, y) above, show that Pr(X + Y ≥ 1) = 9/10.
(c) Find the marginal PDF fX (x) of X.
(d) Show that Pr(0.5 < X < 0.75) = 5/16.
2. Suppose that the random vector (X, Y ) takes values in the region A = {(x, y)|0 ≤ x ≤ 2, 0 ≤ y ≤ 2},
and that its CDF within A is given by FX,Y (x, y) = xy(x + y)/16.
(a) Find FX,Y (x, y) for values of (X, Y ) outside A.
(b) Find the marginal CDF FX (x) of X.
(c) Find the joint PDF fX,Y (x, y).
3. Suppose that X and Y are RVs with joint PDF
f(x, y) = cx²y if x² ≤ y ≤ 1, and 0 otherwise.
(a) Find the value of c.
(b) Find Pr(X ≥ Y ).
(c) Find the marginal PDFs fX (x) & fY (y)
4. For each of the following joint PDFs f of X and Y , determine the constant c, find the marginal PDFs
of X and Y , and determine whether or not X and Y are independent.
(a) f(x, y) = c e^{−(x+2y)} for x, y ≥ 0, and 0 otherwise.
(b) f(x, y) = c y²/2 for 0 ≤ x ≤ 2 and 0 ≤ y ≤ 1, and 0 otherwise.
(c) f(x, y) = c x e^{−y} for 0 ≤ x ≤ 1 and 0 ≤ y < ∞, and 0 otherwise.
(d) f(x, y) = c xy for x, y ≥ 0 and x + y ≤ 1, and 0 otherwise.
5. Suppose that X and Y are continuous RVs with joint PDF f (x, y) = e−y on 0 < x < y < ∞.
(a) Find Pr(X + Y ≥ 1)
[HINT : write this as 1 − Pr(X + Y < 1)].
(b) Find the marginal distribution of X.
(c) Find the conditional distribution of Y given that X = x.
6. Assume that X and Y are random variables each taking values in [0, 1]. For each of the following
CDFs, show that the marginal distribution of X and Y are both uniform U (0, 1), and determine the
conditional CDF FX|Y (x|Y = 0.5) in each case:
(a) F(x, y) = xy,
(b) F(x, y) = min(x, y),
(c) F(x, y) = 0 if x + y < 1, and x + y − 1 if x + y ≥ 1.
7. Suppose that Θ is a random variable uniformly distributed on (0, 1), i.e. Θ ∼ U (0, 1), and that, once
Θ = θ has been observed, the random variable X is drawn from a binomial distribution [X|θ] ∼
Bin(2, θ).
(a) Find the joint CDF F (θ, x).
(b) How might you display the joint distribution of Θ and X graphically?
(c) What (as simply as you can express them) are the marginal CDFs F1 (θ) of Θ and F2 (x) of X?
8. Suppose that X and Y are two RVs having a continuous joint distribution. Show that X and Y are
independent if and only if fX|Y (x|y) = fX (x) for each value of y such that fY (y) > 0, and for all x.
9. Suppose that X ∼ U (0, 1) and [Y |X = x] ∼ U (0, x). Find the marginal PDFs of X and Y .
2.6
Multivariate Distributions
2.6.1
Introduction
Given a random vector X = (X1 , X2 , . . . , Xn )T , the joint distribution of the random variables X1 , X2 , . . . , Xn
is called a multivariate distribution.
Definition 2.8 (Joint CDF)
The joint cumulative distribution function of RVs X1 , X2 , . . . , Xn is the function
FX(x1, x2, . . . , xn) = Pr(Xk ≤ xk ∀ k = 1, 2, . . . , n).   (2.9)
Comments
1. Formula 2.9 can be written succinctly as FX (x) = Pr(X ≤ x), in an ‘obvious’ vector notation.
2. FX (x) can be called simply the CDF of the random vector X.
3. Properties of FX are similar to the bivariate case. Unfortunately the notation is messier, particularly
for the things we’re generally most interested in for statistical inference, such as
(a) marginal distributions of unknown quantities and vectors,
(b) conditional distributions of unknown quantities and vectors, given what we know.
4. It’s often simpler to blur the distinction between row and column vectors, i.e. to let X denote either
(X1 , X2 , . . . , Xn ) or (X1 , X2 , . . . , Xn )T , depending on context.
Definition 2.9 (Discrete multivariate distribution)
The RV X ∈ Rn has a discrete distribution if it can take only a countable number of possible values.
Definition 2.10 (Multivariate PMF)
If X has a discrete distribution, then its probability mass function (PMF) is
f(x) = Pr(X = x),   x ∈ Rⁿ,   (2.10)
[i.e. the RVs X1, . . . , Xn have joint PMF f(x1, . . . , xn) = Pr(X1 = x1 & · · · & Xn = xn)].
Definition 2.11 (Continuous multivariate distribution)
The RV X = (X1 , X2 , . . . , Xn ) has a continuous distribution if there is a nonnegative function f (x),
where x = (x1, x2, . . . , xn), such that for any subset A ⊂ Rⁿ,
Pr((X1, X2, . . . , Xn) ∈ A) = ∫· · ·∫_A f(x1, x2, . . . , xn) dx1 dx2 . . . dxn.   (2.11)
Definition 2.12 (Multivariate PDF)
The function f in 2.11 is the (joint) probability density function of X.
Comments
1. Without loss of generality, if X is discrete, then we can take its possible values to be Nn (i.e. each
coordinate Xi of X is a nonnegative integer).
2. Equation 2.11 could be simply written
Pr(X ∈ A) = ∫_A f(x) dx.   (2.12)
3. As usual, f (·) may be written more explicitly fX (·), etc.
4. By the fundamental theorem of calculus,
fX(x1, . . . , xn) = ∂ⁿ FX(x1, . . . , xn) / (∂x1 · · · ∂xn)   (2.13)
at all points (x1, . . . , xn) where this derivative exists, i.e. fX(x) = ∂ⁿ FX(x)/∂x.
5. Mixed distributions (neither continuous nor discrete) can be handled using appropriate combinations
of summation and integration.
2.6.2
Useful Notation for Marginal & Conditional Distributions
We’ll sometimes adopt the following notation from DeGroot, particularly when the components Xi of X
are in some way similar, as in the multivariate Normal distribution (see later).
F(x)   denotes the CDF of X = (X1, X2, . . . , Xn) at x = (x1, x2, . . . , xn),
f(x)   denotes the corresponding joint PMF (discrete case) or PDF (continuous case),
fj(xj)   denotes the marginal PMF (PDF) of Xj (integrating over x1, . . . , xj−1, xj+1, . . . , xn),
fjk(xj, xk)   denotes the marginal joint PDF of Xj & Xk (integrating over the remaining xi's),
gj(xj | x1, . . . , xj−1, xj+1, . . . , xn)   denotes the conditional PMF (PDF) of Xj given Xi = xi, i ≠ j,
Fj(xj)   denotes the marginal CDF of Xj,
Gjk   denotes the conditional CDF of (Xj, Xk) given the values xi of all Xi, i ≠ j, k, etc.
2.7
Expectation
2.7.1
Introduction
The following are important definitions and properties involving expectations, variances and covariances:
Var(X) = E[(X − µ)²]   where µ = EX
       = E[X²] − µ²,
E[aX + b] = aEX + b   where a and b are constants,
E[(aX + b)²] = a²E[X²] + 2abEX + b²,
Var(aX + b) = a²Var(X),
E[X1 X2] = (EX1)(EX2)   if X1 ⊥⊥ X2,
Cov(X1, X2) = E[(X1 − µ1)(X2 − µ2)] = E[X1 X2] − µ1µ2,
SD(X) = √Var(X),
corr(X1, X2) = ρ(X1, X2) = Cov(X1, X2) / [SD(X1) SD(X2)].
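These identities are easy to sanity-check numerically. A short sketch (illustrative only, assuming NumPy) for two correlated RVs:

import numpy as np

rng = np.random.default_rng(4)
n = 200_000
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)            # constructed to be correlated with X1

mu1, mu2 = x1.mean(), x2.mean()
cov = np.mean(x1 * x2) - mu1 * mu2            # E[X1 X2] - mu1*mu2
corr = cov / (x1.std() * x2.std())            # Cov / (SD(X1) SD(X2))
print(cov, np.cov(x1, x2, bias=True)[0, 1])   # the two covariance formulae agree
print(corr, np.corrcoef(x1, x2)[0, 1])        # likewise for the correlation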
Note that the definition of expectation applies directly in the multivariate case:
Definition 2.13 (Multivariate expectation)
E[h(X)] = Σ_x h(x) f(x)   if X is discrete,
        = ∫_{Rⁿ} h(x) f(x) dx   if X is continuous.
For example, if X = (X1, X2, X3) has a continuous distribution, then
E[X1] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} x1 f(x1, x2, x3) dx1 dx2 dx3.
Exercise 2.9
Let X and Y be independent continuous RVs. Prove that, for arbitrary functions g(·) and h(·),
E[g(X)h(Y)] = E[g(X)] E[h(Y)].
k
Exercise 2.10
Let X, Y and Z have independent Poisson distributions with means λ, µ, ν respectively. Find E[X 2 Y Z].
k
Exercise 2.11
[Cauchy–Schwarz] By considering E[(tX − Y)²], or otherwise, prove the Cauchy–Schwarz inequality for expectations, i.e. for any two RVs X and Y with finite second moments, (E[XY])² ≤ E[X²] E[Y²], with equality if and only if Pr(Y = cX) = 1 for some constant c.
Hence or otherwise prove that the correlation ρX,Y between X and Y satisfies |ρX,Y| ≤ 1.
Under what circumstances does ρX,Y = 1?
k
2.8
Approximate Moments of Transformed Distributions
The moments of a transformed RV g(X) can often be well approximated via a Taylor series:
Exercise 2.12
[delta method] Let X1, X2, . . . , Xn be independent, each with mean µ and variance σ², and let g(·) be a function with a continuous derivative g′(·).
By considering a Taylor series expansion involving
Zn = (X̄ − µ) / √(σ²/n),
show that
E[g(X̄)] = g(µ) + O(n⁻¹),   (2.14)
Var[g(X̄)] = n⁻¹ σ² g′(µ)² + O(n^{−3/2}).   (2.15)
k
Comments
1. There is similarly a multivariate delta method, outside the scope of this course.
2. Important uses of expansions like the delta method include identifying useful transformations g(·), for example to remove skewness or, when Var(X̄) is a function of µ, to make Var[g(X̄)] (approximately) independent of µ.
3. A useful transformation g(·) is sometimes in practice applied to the original RVs on the (often reasonable) assumption that the properties of (Σ g(Xi))/n will be similar to those of g((Σ Xi)/n).
Exercise 2.13
[Variance stabilising transformations]
Suppose that X1, X2, . . . , Xn are IID and that the (common) variance of each Xi is a function of the (common) mean µ = EXi.
Show that the variance of g(X̄) is approximately constant if
g′(µ) = 1/√Var(µ).
If X ∼ Poi(µ), show that Y = √X has approximately constant variance.
k
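A simulation sketch of the last claim (illustrative only, assuming NumPy): for Poisson counts the raw variance grows with µ, while the variance of √X stays roughly constant (about 1/4) over a wide range of means.

import numpy as np

rng = np.random.default_rng(5)
for mu in [2.0, 5.0, 10.0, 20.0, 50.0]:
    x = rng.poisson(mu, 100_000)
    print(mu, x.var(), np.sqrt(x).var())   # Var(X) grows like mu; Var(sqrt X) stays near 1/4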
2.9
Problems
1. The discrete random vector (X1, X2, X3) has the following PMF:

   (X1 = 1)                          (X1 = 2)
            X3=1   X3=2   X3=3                X3=1   X3=2   X3=3
   X2 = 1    .02    .03    .05       X2 = 1    .08    .04    .03
   X2 = 2    .04    .06    .10       X2 = 2    .12    .11    .07
   X2 = 3    .02    .03    .05       X2 = 3    .05    .05    .05

(a) Calculate the marginal PMFs f1(x1), f2(x2), f3(x3) and f12(x1, x2).
(b) Are X1 and X2 independent?
(c) What are the conditional PMFs g1(x1 | X2 = 1, X3 = 3), g2(x2 | X1 = 1, X3 = 3), g3(x3 | X1 = 1, X2 = 3), and g12(x1, x2 | X3 = 3)?
2. The RVs A, B, C etc. count the number of times the corresponding letter appears when a word is chosen at random from the following list (each being chosen with probability 1/16):
MASCARA, MASK, MERCY, MONSTER, MOVIE, PREY, REPLICA, REPTILES, RITE, SEAT, SNAKE, SOMBRE, SQUID, TENDER, TIME, TROUT.
(a) Complete the following table of the joint distribution of E, M and R:

                 E = 0           E = 1           E = 2
              M = 0  M = 1    M = 0  M = 1    M = 0  M = 1
   R = 0      1/16   1/16
   R = 1                                       2/16

(b) Calculate all three bivariate marginal distributions, and hence find which of the following statements are true:
(a) E ⊥⊥ M,   (b) E ⊥⊥ R,   (c) M ⊥⊥ R.
(c) Similarly discover which of the following statements are true:
(d) M ⊥⊥ R | E=0,   (e) M ⊥⊥ R | E=1,   (f) M ⊥⊥ R | E=2,
(g) M ⊥⊥ R | E,   (h) E ⊥⊥ R | M,   (i) E ⊥⊥ M | R.
3. Find variance stabilizing transformations for
(a) the exponential distribution,
(b) the binomial distribution.
4. Let Z ∼ N(0, 1) and define the RV X by
Pr(X = −√3) = 1/6,   Pr(X = 0) = 4/6,   Pr(X = +√3) = 1/6.
(a) Show that X has the same mean and variance as Z, and that X² has the same mean and variance as Z².
(b) Suppose the RV Y has mean µ and variance σ². Compare the delta method for estimating the mean and variance of the RV T = g(Y) with the alternative estimates µ̂(T) ≈ E[g(µ + σX)], V̂ar(T) ≈ Var[g(µ + σX)]. [Try a few simple distributions for Y and transformations g(·)].
2.10
Conditional Expectation
2.10.1
Introduction
A common practical problem arises when X1 and X2 aren’t independent, we observe X2 = x2 , and we
want to know the mean of the resulting conditional distribution of X1 .
Definition 2.14 (Conditional expectation)
The conditional expectation of X1 given X2 is denoted E[X1 |X2 ]. If X2 = x2 then
E[X1 | x2] = ∫_{−∞}^{∞} x1 g1(x1 | x2) dx1   (continuous case)   (2.16)
           = Σ_{x1} x1 g1(x1 | x2)   (discrete case)   (2.17)
where g1(x1 | x2) is the conditional PDF or PMF respectively.
Comment
Note that before X2 is known to take the value x2 , E[X1 |X2 ] is itself a random variable, being a function
of the RV X2 . We’ll be interested in the distribution of the RV E[X1 |X2 ], and (for example) comparing it
with the unconditional expectation EX1 . The following is an important result:
Theorem 2.1 (Marginal expectation)
For any two RVs X1 & X2 ,
E[E[X1 | X2]] = EX1.   (2.18)
Exercise 2.14
Prove Equation 2.18 (i) for continuous RVs X1 and X2 , (ii) for discrete RVs X1 and X2 .
k
Exercise 2.15
Suppose that the RV X has a uniform distribution, X ∼ U (0, 1), and that, once X = x has been observed,
the conditional distribution of Y is [Y |X = x] ∼ U (x, 1).
Find E[Y |x] and hence, or otherwise, show that EY = 3/4.
k
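This sort of calculation is easy to check by Monte Carlo. A minimal sketch (illustrative only, assuming NumPy): simulate X ∼ U(0, 1), then Y | X = x ∼ U(x, 1), and compare the empirical mean of Y with 3/4.

import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000
x = rng.uniform(0.0, 1.0, n)
y = rng.uniform(x, 1.0)        # Y | X = x ~ U(x, 1), drawn elementwise
print(y.mean())                # close to 0.75, i.e. E[E[Y|X]] = E[Y]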
Exercise 2.16
Suppose that Θ ∼ U (0, 1) and (X|Θ) ∼ Bin(2, Θ).
Find E[X |Θ] and hence or otherwise show that EX = 1.
k
2.10.2
Conditional Expectations of Functions of RVs
By extending Theorem 2.1, we can relate the conditional and marginal expectations of functions of RVs
(in particular, their variances).
Theorem 2.2 (Marginal expectation of a transformed RV)
For any RVs X1 & X2 , and for any function h(·),
E[E[h(X1) | X2]] = E[h(X1)].   (2.19)
Exercise 2.17
Prove Equation 2.19 (i) for discrete RVs X1 and X2 , (ii) for continuous RVs X1 and X2 .
k
An important consequence of Equation 2.19 is the following theorem relating marginal variance to conditional variance and conditional expectation:
Theorem 2.3 (Marginal variance)
For any RVs X1 & X2 ,
Var(X1) = E[Var(X1 | X2)] + Var(E[X1 | X2]).   (2.20)
Comments
1. Equation 2.20 is easiest to remember in English:
'marginal variance = expectation of conditional variance + variance of conditional expectation'.
2. A useful interpretation of Equation 2.20 is:
Var(X1) = average random variation inherent in X1 even if X2 were known
        + random variation due to not knowing X2 and hence not knowing EX1.
i.e. the uncertainty involved in predicting the value x1 taken by a random variable X1 splits into two
components. One component is the unavoidable uncertainty due to random variation in X1 , but the
other can be reduced by observing quantities (here the value x2 of X2 ) related to X1 .
Exercise 2.18
[Proof of Theorem 2.3] Expand E[Var(X1 | X2)] and Var(E[X1 | X2]).
Hence show that Var(X1) = E[Var(X1 | X2)] + Var(E[X1 | X2]).
k
Exercise 2.19
Continuing Exercise 2.16, in which Θ ∼ U(0, 1), (X|Θ) ∼ Bin(2, Θ), and E[X | Θ] = 2Θ, find Var(E[X | Θ]) and E[Var(X | Θ)]. Hence or otherwise show that Var X = 2/3, and comment on the effect on the uncertainty in X of observing Θ.
k
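A Monte Carlo sketch of the decomposition for this example (illustrative only, assuming NumPy): with Θ ∼ U(0, 1) and X|Θ ∼ Bin(2, Θ), both E[Var(X|Θ)] and Var(E[X|Θ]) come out near 1/3, giving Var X ≈ 2/3.

import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
theta = rng.uniform(0.0, 1.0, n)
x = rng.binomial(2, theta)

cond_mean = 2 * theta                      # E[X | Theta] = 2*Theta
cond_var = 2 * theta * (1 - theta)         # Var(X | Theta) = 2*Theta*(1 - Theta)
print(x.var())                             # about 2/3
print(cond_var.mean() + cond_mean.var())   # E[Var(X|Theta)] + Var(E[X|Theta]), also about 2/3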
2.11
Problems
1. Two fair coins are tossed independently. Let A1 , A2 and A3 be the following events:
A1 = 'coin 1 comes down heads',
A2 = 'coin 2 comes down heads',
A3 = 'results of both tosses are the same'.
(a) Show that A1, A2 and A3 are pairwise independent (i.e. A1 ⊥⊥ A2, A1 ⊥⊥ A3 and A2 ⊥⊥ A3) but not mutually independent.
(b) Hence or otherwise construct three random variables X1, X2, X3 such that E[X3 | X1 = x1] and E[X3 | X2 = x2] are constant, but E[X3 | X1 = x1 & X2 = x2] isn't.
2. Construct three random variables X1, X2, X3 with continuous distributions such that X1 ⊥⊥ X2, X1 ⊥⊥ X3 and X2 ⊥⊥ X3, but any two Xi's determine the remaining one.
3. (a) Show that for any random variables X and Y,
i. E[Y] = E[E[Y | X]],
ii. Var[Y] = E[Var[Y | X]] + Var[E[Y | X]].
(b) Suppose that the random variables Xi and Pi, i = 1, . . . , n, have the following distributions:
Xi = 1 with probability Pi, and 0 with probability 1 − Pi,
Pi ∼ IID Beta(α, β), i.e. Pi has density
f(p) = [Γ(α + β) / (Γ(α) Γ(β))] p^{α−1} (1 − p)^{β−1}
with mean µ and variance σ² given by
µ = E[Pi] = α/(α + β),   σ² = Var[Pi] = αβ / [(α + β)²(α + β + 1)],
and Xi has a Bernoulli(Pi) distribution.
Find
i. E[X1 | P1],
ii. Var[X1 | P1],
iii. Var(E[X1 | P1]), and
iv. E(Var[X1 | P1]).
Hence find E[Y] where Y = Σ_{i=1}^{n} Xi, and show that Var[Y] = nαβ/(α + β)².
(c) Express E[Y ] and Var[Y ] in terms of µ and σ 2 , and comment on the result.
From Warwick ST217 exam 1998
4. Suppose that the number N of bye-elections occurring in Government-held seats over a 12-month
period follows a Poisson distribution with mean 10.
Suppose also that, independently for each such bye-election, the probability that the Government
hold onto the seat is 1/4. The number X of seats retained in the N bye-elections therefore follows a
binomial distribution:
[X|N ] ∼ Bin(N, 0.25).
(a) What are E[N ], Var[N ], E[X|N ] and Var[X|N ]?
(b) What are E[X] and Var[X]?
(c) What is the distribution of X?
[HINT : try using generating functions—see MSA]
5. (a) For continuous random variables X and Y, define
i. the marginal density fX(x) of X,
ii. the conditional density fY|X(y|x) of Y given X = x,
iii. the conditional expectation E[Y | X] of Y given X, and
iv. the conditional variance Var[Y | X] of Y given X.
(b) Show that
i. E[g(Y)] = E[E[g(Y) | X]], for an arbitrary function g(·), and
ii. Var[Y] = E[Var[Y | X]] + Var[E[Y | X]].
(c) Suppose that the random variables X and Y have a continuous joint distribution, with PDF f(x, y), means µX & µY respectively, variances σX² & σY² respectively, and correlation ρ. Also suppose the conditional mean of Y given X = x is a linear function of x:
E[Y | x] = β0 + β1 x.
Show that
i. ∫_{−∞}^{∞} y f(x, y) dy = (β0 + β1 x) fX(x),
ii. µY = β0 + β1 µX, and
iii. ρσXσY + µXµY = β0µX + β1(σX² + µX²).
(Hint: use the fact that E[XY] = E[E[XY | X]]).
(d) Hence or otherwise express β0 and β1 in terms of µX, µY, σX, σY & ρ, and write down (or derive) the maximum likelihood estimates of β0 & β1 under the assumption that the data (x1, y1), . . . , (xn, yn) are i.i.d. observations from a bivariate Normal distribution.
From Warwick ST217 exam 1997
6. For discrete random variables X and Y, define:
(i) the conditional expectation of Y given X, E[Y | X], and
(ii) the conditional variance of Y given X, Var[Y | X].
Show that
(iii) E[Y] = E[E[Y | X]], and
(iv) Var[Y] = E[Var[Y | X]] + Var[E[Y | X]].
(v) Show also that if E[Y | X] = β0 + β1X for some constants β0 and β1, then
E[XY] = β0E[X] + β1E[X²].
The random variable X denotes the number of leaves on a certain plant at noon on Monday, Y denotes the number of greenfly on the plant at noon on Tuesday, and Z denotes the number of ladybirds on the plant at noon on Wednesday.
Suppose that, given X = x, Y has a Poisson distribution with mean µx. If X has a Poisson distribution with mean λ, show that
E[Y] = λµ   and   Var[Y] = λµ(1 + µ)
(you may assume that for a Poisson distribution the mean and variance are equal).
Suppose further that, given Y = y, Z has a Poisson distribution with mean νy. Find E[Z], Var[Z], and the correlation between X and Z.
From Warwick ST217 exam 1996
7. Using the relationship
E[E[h(X1) | X2]] = E[h(X1)],
where
h(x1) = (x1 − E[X1 | x2] + E[X1 | x2] − EX1)²,
prove that
Var(X1) = E[Var(X1 | X2)] + Var(E[X1 | X2])
for any two random variables X1 & X2.
8. Prove that, for any three RVs X, Y and Z for which the various expectations exist,
(a) X and Y − E(Y | X) are uncorrelated,
(b) Var(Y − E(Y | X)) = E(Var(Y | X)),
(c) if X and Y are uncorrelated then E[Cov(X, Y | Z)] = −Cov(E(X | Z), E(Y | Z)),
(d) Cov(Z, E(Y | Z)) = Cov(Z, Y).
In scientific thought we adopt the simplest theory which will explain all the facts under consideration and enable us to predict new facts of the same kind. The catch in this criterion lies
in the word ‘simplest’. It is really an aesthetic canon such as we find implicit in our criticisms
of poetry or painting.
J. B. S. Haldane
All models are wrong, some models are useful.
G. E. P. Box
A child of five would understand this. Send somebody to fetch a child of five.
Groucho Marx
Chapter 3
The Multivariate Normal
Distribution
3.1
Motivation
A Normally distributed RV
X ∼ N(µ, σ²)
has PDF
f(x; µ, σ²) = constant × exp{−½ (x − µ)²/σ²}   (3.1)
where
µ is the mean of X,
σ² is the variance of X, and
'constant' is there to make f integrate to 1.
The Normal distribution is important because, by the CLT, as n → ∞, the CDF of an MLE such as θ̂ = Σ Xi/n or θ̂ = Σ(Xi − ΣXj/n)²/n tends uniformly (under reasonable conditions) to the CDF of a Normal RV with the appropriate mean and variance, i.e. the log-likelihood tends to a quadratic in θ.
Similarly it can be shown that, for a model with parameter vector θ = (θ1, . . . , θp)ᵀ, under reasonable conditions the log-likelihood will tend to a quadratic in (θ1, . . . , θp).
Therefore, by analogy with Equation 3.1, we will want to define a distribution with PDF
f(x; µ, V) = constant × exp{−½ (x − µ)ᵀ V⁻¹ (x − µ)}   (3.2)
where
µ is a (p × 1) matrix or column vector,
V is a (p × p) matrix, and
'constant' is again there to make f integrate to 1.
As an example of a PDF of this form, if X1, X2, . . . , Xp IID ∼ N(0, 1), then
f(x) = f1(x1) × f2(x2) × · · · × fp(xp)   by independence
     = (2π)^{−p/2} exp{−½ Σ xi²} = (2π)^{−p/2} exp{−½ xᵀx}.   (3.3)
Definition 3.1 (Multivariate standard Normal)
The distribution with PDF
f(z) = f(z1, z2, . . . , zp) = (2π)^{−p/2} exp{−½ zᵀz}
is called the multivariate standard Normal distribution.
The statement 'Z has a multivariate standard Normal distribution' is often written
Z ∼ N(0, I),   Z ∼ MVN(0, I),   Z ∼ Np(0, I),   or Z ∼ MVNp(0, I),
and the CDF and PDF of Z are often written Φ(z) and φ(z), or Φp(z) and φp(z), respectively.
In the more general case, where the component RVs X1 , X2 , . . . , Xp in Equation 3.2 aren’t independent,
we need an expression for the constant term.
3.2
Digression: Transforming a Random Vector
Exercise 3.1
Suppose that the RVs Z1, Z2, . . . , Zn have a continuous joint distribution, with joint PDF fZ(z). Consider a 1-1 transformation (i.e. a bijection between the corresponding sample spaces) to new RVs X1, X2, . . . , Xn. What is the PDF fX(x) of the transformed RVs?
Solution: Because the transformation is 1-1 we can invert it and write
Z = u(X),
i.e. a given point (z1, . . . , zn) transforms to (x1, . . . , xn), where
z1 = u1(x1, . . . , xn),
z2 = u2(x1, . . . , xn),
. . .
zn = un(x1, . . . , xn).   (3.4)
Now assume that each function ui(·) is continuous and differentiable. Then we can form the following matrix of partial derivatives:

∂u/∂x =
   [ ∂u1/∂x1   ∂u1/∂x2   . . .   ∂u1/∂xn ]
   [ ∂u2/∂x1   ∂u2/∂x2   . . .   ∂u2/∂xn ]      (3.5)
   [    ...       ...    . . .      ...  ]
   [ ∂un/∂x1   ∂un/∂x2   . . .   ∂un/∂xn ]

and its determinant J, which is called the Jacobian of the transformation u [i.e. of the joint transformation (u1, . . . , un)]. Then it can be shown that
fX(x) = |J| × fZ(z)
at all points in the 'sample space' (i.e. set of possible values) of X.
k
Figure 3.1: Bivariate parameter transformation. An infinitesimal δ1 × δ2 rectangle at z has probability content δ1 δ2 fZ(z); its image under u⁻¹ is an infinitesimal parallelogram of area δ1 δ2 / |J| with the same probability content, so the density at x = u⁻¹(z) is |J| × fZ(z).
3.3
The Bivariate Normal Distribution
Suppose that Z1 and Z2 are IID with N(0, 1) distributions, i.e. (as in Equation 3.3):
fZ(z1, z2) = (1/2π) exp{−½ (z1² + z2²)}.
Now let µ1, µ2 ∈ (−∞, ∞), σ1, σ2 ∈ (0, ∞) & ρ ∈ (−1, 1), and define (as in DeGroot §5.12):
X1 = σ1 Z1 + µ1,
X2 = σ2 [ρ Z1 + √(1 − ρ²) Z2] + µ2.   (3.6)
Then the Jacobian of the transformation from Z to X is given by
J = det [ σ1, 0 ; ρ σ2, √(1 − ρ²) σ2 ] = √(1 − ρ²) σ1 σ2.
Therefore the Jacobian of the inverse transformation from X to Z is 1/[√(1 − ρ²) σ1 σ2], and the PDF of X is given by Equations 3.7 & 3.8 below.
Definition 3.2 (Bivariate Normal Distribution)
The continuous bivariate distribution with PDF
fX(x) = (1/|J|) fZ(z)
      = [1 / (2π √(1 − ρ²) σ1 σ2)] × exp{ −Q / [2(1 − ρ²)] },   (3.7)
where
Q = ((x1 − µ1)/σ1)² − 2ρ ((x1 − µ1)/σ1)((x2 − µ2)/σ2) + ((x2 − µ2)/σ2)²,   (3.8)
is called the bivariate Normal distribution.
Exercise 3.2
If the RV X = (X1, X2) has PDF given by Equations 3.7 & 3.8, then show by substituting
v = (x2 − µ2)/σ2   followed by   w = [v − ρ(x1 − µ1)/σ1] / √(1 − ρ²),
or otherwise, that X1 ∼ N(µ1, σ1²).
Hence or otherwise show that the conditional distribution of X1 given X2 = x2 is Normal with mean µ1 + (ρσ1/σ2)(x2 − µ2) and variance σ1²(1 − ρ²).
k
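A simulation sketch of the construction in Equation 3.6 (illustrative only, assuming NumPy; the parameter values are invented): generate (X1, X2) from IID standard Normals, then check the means, standard deviations and correlation, and compare the least-squares slope of X1 on X2 with the conditional-mean coefficient ρσ1/σ2 from Exercise 3.2.

import numpy as np

rng = np.random.default_rng(8)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 2.0, 0.5, 0.7
n = 500_000

z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
x1 = s1 * z1 + mu1
x2 = s2 * (rho * z1 + np.sqrt(1 - rho**2) * z2) + mu2     # Equation 3.6

print(x1.mean(), x2.mean(), x1.std(), x2.std())           # near mu1, mu2, sigma1, sigma2
print(np.corrcoef(x1, x2)[0, 1])                          # near rho
slope, intercept = np.polyfit(x2, x1, 1)
print(slope, rho * s1 / s2)          # regression slope of X1 on X2 is about rho*sigma1/sigma2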
Comments
1. It's easy to show (Problem 2 of Section 3.4) that EXi = µi, Var Xi = σi² and corr(X1, X2) = ρ. This suggests that we will be able to write
X = (X1, X2)ᵀ ∼ MVN(µ, V),   where
µ = (µ1, µ2)ᵀ is the 'mean vector' of X, and
V = [ σ1²       ρ σ1σ2 ]
    [ ρ σ1σ2    σ2²    ]   is the 'variance-covariance matrix' of X.
2. The 'level curves' (i.e. contours in 2-d) of the bivariate Normal PDF are given by Q = constant in formula 3.8; i.e. ellipses provided the discriminant is negative:
(ρ/(σ1σ2))² − 1/(σ1²σ2²) = (ρ² − 1)/(σ1²σ2²) < 0.
This holds as we are only considering 'nonsingular' bivariate Normal distributions with ρ ≠ ±1.
3. PLEASE MAKE NO ATTEMPT TO MEMORISE FORMULAE 3.7 & 3.8!!
Exercise 3.3
Show that the inverse of the variance-covariance matrix
V = [ σ1²       ρ σ1σ2 ]
    [ ρ σ1σ2    σ2²    ]
is
V⁻¹ = 1/(1 − ρ²) × [ 1/σ1²        −ρ/(σ1σ2) ]
                   [ −ρ/(σ1σ2)    1/σ2²     ].
k
3.4
Problems
1. Suppose that the RVs X1, X2, . . . , Xn have a continuous joint distribution with PDF fX(x), and that the RVs Y1, Y2, . . . , Yn are defined by Y = AX, where the (n × n) matrix A is nonsingular. Show that the joint density of the Yi's is given by
fY(y) = [1/|det A|] fX(A⁻¹y)   for y ∈ Rⁿ.
Hence or otherwise show carefully that if X1 and X2 are independent RVs with PDFs f1 and f2 respectively, then the PDF of Y = X1 + X2 is given by
fY(y) = ∫_{−∞}^{∞} f1(y − z) f2(z) dz   for −∞ < y < ∞,
or equivalently by
fY(y) = ∫_{−∞}^{∞} f1(z) f2(y − z) dz   for −∞ < y < ∞.
If Xi IID ∼ Exp(1), i = 1, 2, then what is the distribution of X1 + X2?
2. Suppose that Z1 and Z2 are i.i.d. random variables with standard Normal N(0, 1) distributions. Define the random vector (X1, X2) by:
X1 = µ1 + σ1 Z1,
X2 = µ2 + σ2 [ρ Z1 + √(1 − ρ²) Z2],
where σ1, σ2 > 0 and −1 ≤ ρ ≤ 1.
(a) Show that E[X1] = µ1, E[X2] = µ2, Var[X1] = σ1², Var[X2] = σ2², and corr[X1, X2] = ρ.
(b) Find E[X2 | X1] and Var[X2 | X1].
(c) Derive the joint PDF f(x1, x2).
(d) Find the distribution of [X2 | X1]. Hence or otherwise show that two r.v.s with a joint bivariate Normal distribution are independent if and only if they are uncorrelated.
(e) Now suppose that σ1 = σ2. Show that the RVs Y1 = X1 + X2 and Y2 = X1 − X2 are independent.
3. Suppose that X and Y have the joint density
fX,Y(x, y) = [1 / (2π σX σY √(1 − ρ²))] × exp{ −[((x − µX)/σX)² − 2ρ((x − µX)/σX)((y − µY)/σY) + ((y − µY)/σY)²] / [2(1 − ρ²)] }.
(a) Show by substituting u = (x − µX)/σX and v = (y − µY)/σY followed by w = (u − ρv)/√(1 − ρ²), or otherwise, that fX,Y does indeed integrate to 1.
(b) Show that the 'joint MGF' MX,Y(s, t) = E[exp(sX + tY)] is given by
MX,Y(s, t) = exp{µXs + µYt + ½(σX²s² + 2ρσXσYst + σY²t²)}.
(c) Show that
∂MX,Y/∂s |_{s,t=0} = µX,
∂²MX,Y/∂s² |_{s,t=0} = µX² + σX²,   &
∂²MX,Y/∂s∂t |_{s,t=0} = µXµY + ρσXσY.
(d) Guess the formula for the MGF MX(s) of X, where X ∼ MVN(µ, V).
4. Suppose that (X1, X2) have a bivariate Normal distribution. Show that any linear combination Y = a0 + a1X1 + a2X2 has a univariate Normal distribution.
3.5
The Multivariate Normal Distribution
Definition 3.3 (Multivariate Normal distribution)
Let µ = (µ1, µ2, . . . , µp) be a p-vector, and let V be a symmetric positive-definite (p × p) matrix. Then the multivariate probability density defined by
fX(x; µ, V) = [1 / √((2π)^p |V|)] exp{−½ (x − µ)ᵀ V⁻¹ (x − µ)}   (3.9)
is called a multivariate Normal PDF with mean vector µ and variance-covariance matrix V.
Comments
1. Expression 3.9 is a natural generalisation of the univariate Normal density, with V taking the rôle of
σ 2 in the exponent, and its determinant |V| taking the rôle of σ 2 in the ‘normalising constant’ that
makes the whole thing integrate to 1. Many of the properties of the MVN distribution are guessable
from properties of the univariate Normal distribution—in particular, it’s helpful to think of 3.9 as
‘exponential of a quadratic’.
2. The statement ‘X = (X1 , X2 , . . . , Xp ) has a multivariate Normal distribution with mean vector µ
and variance-covariance matrix V’ may be written
X ∼ N (µ, V),
X ∼ MVN(µ, V),
X ∼ N p (µ, V),
or X ∼ MVNp (µ, V).
3. The mean vector µ is sometimes called just the mean, and the variance-covariance matrix V is
sometimes called the dispersion matrix, or simply the variance matrix or covariance matrix.
4. µ = EX, (or equivalently, componentwise, EXi = µi , i = 1, 2, . . . , p). This fact should be obvious
from the name ‘mean vector’, and can be proved in various ways, e.g. by differentiating a multivariate
generalization of the MGF, or simply by symmetry.
5. V = E[(X − µ)(X − µ)ᵀ] = E(XXᵀ) − µµᵀ; its (j, k) element is E(Xj Xk) − µj µk = vjk, say, so that

    E(XXᵀ) − µµᵀ = [ E(X1²) − µ1²       E(X1X2) − µ1µ2    . . .   E(X1Xp) − µ1µp ]
                   [ E(X2X1) − µ2µ1     E(X2²) − µ2²       . . .   E(X2Xp) − µ2µp ]
                   [       :                  :              ..           :       ]
                   [ E(XpX1) − µpµ1     E(XpX2) − µpµ2     . . .   E(Xp²) − µp²   ]  =  (vjk)  =  V,

from which it follows that

    V = [ σ1²           ρ12 σ1 σ2     . . .   ρ1p σ1 σp ]
        [ ρ12 σ1 σ2     σ2²           . . .   ρ2p σ2 σp ]
        [     :              :          ..         :     ]                         (3.10)
        [ ρ1p σ1 σp     ρ2p σ2 σp     . . .   σp²        ]
where σi is the standard deviation of Xi and ρij is the correlation between Xi and Xj . Again these
results can be proved using a multivariate generalization of the MGF.
6. The p-dimensional MVNp(µ, V) distribution can therefore be parametrised by—
    p means µi,
    p variances σi², and
    ½ p(p − 1) correlations ρij
NB—a total of ½ p(p + 3) parameters.
7. Given n random vectors Xi = (Xi1, Xi2, . . . , Xip) ∼IID MVN(µ, V), i = 1, 2, . . . , n,
a set of minimal sufficient statistics for the unknown parameters is given by:
    Σ_{i=1}^n Xij         j = 1, . . . , p,
    Σ_{i=1}^n Xij²        j = 1, . . . , p,
&   Σ_{i=1}^n Xij Xik     j = 2, . . . , p,  k = 1, . . . , (j − 1),               (3.11)
and MLEs for µ and V are given by:
    µ̂j  = (1/n) Σ_i Xij,                                                          (3.12)
    σ̂j² = (1/n) Σ_i (Xij − µ̂j)²,                                                 (3.13)
    ρ̂jk = [ (1/n) Σ_i (Xij − µ̂j)(Xik − µ̂k) ] / ( σ̂j σ̂k ),                      (3.14)
or, in matrix notation,
    µ̂ = (1/n) Σ_{i=1}^n Xi,                                                       (3.15)
    V̂ = (1/n) Σ_{i=1}^n (Xi − µ̂)(Xi − µ̂)ᵀ                                       (3.16)
       = (1/n) Σ_{i=1}^n Xi Xiᵀ − µ̂ µ̂ᵀ.                                          (3.17)
(A small numerical sketch of these estimates is given after these comments.)
8. The fact that V is positive-definite implies various (messy!) constraints on the correlations ρij .
9. Surfaces of constant density form concentric (hyper-)ellipsoids (concentric hyper-spheres in the case
of the standard MVN distribution). In particular, the contours of a bivariate Normal density form
concentric ellipses (or concentric circles for the standard bivariate Normal).
10. It can be proved that all conditional and marginal distributions of a MVN are themselves MVN. The
proof of this important fact is quite straightforward, quite tedious, and mercifully omitted from this
course.
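As promised above, here is a minimal numerical sketch of the MLE formulae 3.15–3.17 (an addition to these notes, not part of the original text; it assumes Python with numpy is available):

    import numpy as np

    rng = np.random.default_rng(0)
    mu_true = np.array([1.0, -2.0])
    V_true = np.array([[2.0, 0.6],
                       [0.6, 1.0]])
    X = rng.multivariate_normal(mu_true, V_true, size=500)   # n x p data matrix

    n = X.shape[0]
    mu_hat = X.mean(axis=0)                                   # formula 3.15
    centred = X - mu_hat
    V_hat = centred.T @ centred / n                           # formula 3.16 (divisor n, not n-1)
    sd = np.sqrt(np.diag(V_hat))
    corr_hat = V_hat / np.outer(sd, sd)                       # the rho_jk of formula 3.14

    print(mu_hat, V_hat, corr_hat)

With 500 observations the estimates should be close to the true µ and V used in the simulation.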
33
3.6 Distributions Related to the MVN
Because of the CLT, the MVN distribution is important throughout statistics. For example, the joint
distribution of the MLEs θ̂1, θ̂2, . . . , θ̂p of unknown parameters θ1, θ2, . . . , θp will under reasonable conditions
tend to a MVN as the size of the sample from which θ̂ = (θ̂1, θ̂2, . . . , θ̂p)ᵀ was calculated increases.
Therefore various distributions arising from the MVN by transformation are also important.
Throughout this Section we shall usually denote independent standard Normal RVs by Zi, i.e.:
    Zi ∼IID N(0, 1),    i = 1, 2, . . . ,
i.e. Z = (Z1, Z2, . . . , Zn)ᵀ ∼ MVN(0, I).
Exercise 3.4
Show that if a is a constant (n × 1) column vector, B is a constant nonsingular (n × n) matrix, and Z =
(Z1, Z2, . . . , Zn)ᵀ is a random n-vector with a MVN(0, I) distribution, then Y = a + BZ ∼ MVN(a, BBᵀ).
k
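A minimal simulation sketch of Exercise 3.4 (an added illustration, not in the original notes; it assumes numpy): generate Z ∼ MVN(0, I), form Y = a + BZ, and check that the sample mean and covariance of Y are close to a and BBᵀ.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 3
    a = np.array([1.0, 2.0, 3.0])
    B = np.array([[1.0, 0.0, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.2, 0.3, 1.0]])        # any nonsingular matrix will do

    Z = rng.standard_normal((100_000, n))  # rows are independent MVN(0, I) draws
    Y = a + Z @ B.T                        # each row is a + B z

    print(Y.mean(axis=0))                  # should be close to a
    print(np.cov(Y, rowvar=False))         # should be close to B @ B.T
    print(B @ B.T)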
3.6.1 The Chi-squared Distribution
Definition 3.4 (Chi-squared Distribution)
If Zi ∼IID N(0, 1) for i = 1, 2, . . . , n, then the distribution of
    X = Z1² + Z2² + · · · + Zn²
is called a Chi-squared distribution on n degrees of freedom, and we write X ∼ χ2n .
Comments
1. In particular, if Z ∼ N (0, 1), then Z 2 ∼ χ21 .
2. The above construction of the χ2n distribution shows that if X ∼ χ2m, Y ∼ χ2n, and X ⊥⊥ Y,
then (X + Y) ∼ χ2m+n.
This summation property accounts for the importance and usefulness of the χ2 distribution:
essentially a squared length is split into two orthogonal components, as in Pythagoras’ theorem.
3. If X ∼ χ2n, then the (unmemorable) density of X can be shown to be
    fX(x) = 1/( 2^{n/2} Γ(n/2) ) x^{(n/2)−1} e^{−x/2}    for x > 0,               (3.18)
with fX(x) = 0 for x ≤ 0.
Comparing this with the definition of a Gamma distribution (MSA) shows that a Chi-squared distribution on n degrees of freedom is just a Gamma distribution with α = n/2 and β = 1/2 (in the
usual parametrisation).
4. It can be shown that if X ∼ χ2n then EX = n and VarX = 2n.
Note that this implies that E[X/n] = 1 and Var[X/n] = 2/n.
5. The χ2 distributions are positively skewed—for example, χ22 is just an exponential distribution with
mean 2. However, because of the CLT, the χ2n distribution tends (slowly!) to Normality as n → ∞.
6. The PDF 3.18 cannot be integrated analytically except for the special case n = 2. Therefore the
CDFs of χ2n distributions for various n are given in standard Statistical Tables.
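In practice the tabulated values just mentioned are obtained numerically; the following short sketch (an addition for illustration, assuming scipy is installed) reproduces some χ2n CDF values and percentage points, and checks EX = n, VarX = 2n by simulation.

    import numpy as np
    from scipy.stats import chi2

    for n in (1, 2, 5, 20):
        print(n, chi2.cdf(n, df=n),                    # P(X <= n); below 0.5 because of the positive skew
              chi2.ppf([0.025, 0.5, 0.975], df=n))     # 2.5%, 50% and 97.5% points

    rng = np.random.default_rng(2)
    s = (rng.standard_normal((200_000, 5)) ** 2).sum(axis=1)   # simulated chi-squared on 5 d.f.
    print(s.mean(), s.var())                                    # approximately 5 and 10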
34
Figure 3.2: Chi-squared distributions for 1, 2, 5 & 20 d.f. Vertical lines show the 2.5%, 16%, 50%, 84%
and 97.5% points (which for N (0, 1) are at −2, −1, 0, 1, 2).
3.6.2 Student’s t Distribution
Definition 3.5 (t Distribution)
If Z ∼ N(0, 1), Y ∼ χ2n and Y ⊥⊥ Z, then the distribution of
    X = Z / √(Y/n)
is called a (Student’s) t distribution on n degrees of freedom, and we write X ∼ tn .
Comments
1. The shape of the t distribution is like that of a Normal, but with heavier tails (since there is variability
in the denominator of t as well as in the Normally-distributed numerator Z).
However, as n → ∞, the denominator becomes more and more concentrated around 1, so (loosely
speaking!) ‘tn → N (0, 1) as n → ∞’.
2. The (highly unmemorable) PDF of X ∼ tn can be shown to be
    fX(x) = [ Γ( (n + 1)/2 ) / ( √(nπ) Γ(n/2) ) ] (1 + x²/n)^{−(n+1)/2}    for −∞ < x < ∞.    (3.19)
Figure 3.3: t distributions for 1, 2, 5 & 20 d.f. Vertical lines show the 2.5%, 16%, 50%, 84% and 97.5%
points.
3. The t distribution on 1 degree of freedom is also called the Cauchy distribution—note that it arises
as the distribution of Z1/Z2 where Zi ∼IID N(0, 1).
The Cauchy distribution is infamous for not having a mean. More generally, only the first n − 1
moments of the tn distribution exist.
4. Note that if Xi ∼IID N(0, σ²), then the RV
    T = X1 / √( Σ_{i=2}^n Xi² / (n − 1) )
has a tn−1 distribution, and is a measure of the length of X1 compared to the root mean square
length of the other Xi s.
i.e. if X has a spherical MVN (0, σ 2 I) distribution, then we would expect T not to be too large. This
is, in effect, how the t distribution usually arises in practice.
5. The PDF 3.19 cannot be integrated analytically in general (exception: n = 1 d.f.). The CDF must
be looked up in Statistical Tables or approximated using a computer.
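A short sketch of the numerical approximation just mentioned (an added illustration, assuming scipy): the 97.5% point of tn approaches the Normal value 1.96 as n grows, while for small n the heavier tails give much larger quantiles.

    from scipy.stats import t, norm

    for n in (1, 2, 5, 20, 100):
        print(n, t.ppf(0.975, df=n))      # roughly 12.71, 4.30, 2.57, 2.09, 1.98
    print("normal:", norm.ppf(0.975))     # 1.96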
36
3.6.3 Snedecor’s F Distribution
Definition 3.6 (F Distribution)
If Y ∼ χ2m, Z ∼ χ2n and Y ⊥⊥ Z, then the distribution of
    X = (Y/m) / (Z/n)
is called an F distribution on m & n degrees of freedom, and we write X ∼ Fm,n .
Figure 3.4: F distributions for selected d.f. Vertical lines show the 2.5%, 16%, 50%, 84% and 97.5% points.
Comments
1. Note that the numerator Y /m and denominator Z/n of X both have mean 1. Therefore, provided
both m and n are large, X will usually take values around 1.
2. If X ∼ Fm,n, then the (extraordinarily unmemorable) density of X can be shown to be
    fX(x) = [ Γ( (m + n)/2 ) m^{m/2} n^{n/2} / ( Γ(m/2) Γ(n/2) ) ] × x^{(m/2)−1} / (mx + n)^{(m+n)/2}    for x > 0,    (3.20)
with fX(x) = 0 for x ≤ 0.
3.7 Problems
1. Let Z ∼ N (0, 1) & Y = Z 2 , and let φ(·) & Φ(·) denote the PDF & CDF respectively of the standard
Normal N (0, 1) distribution.
(a) Show that FY(y) = Φ(√y) − Φ(−√y).
(b) Express fY(y) in terms of φ(√y).
(c) Hence show that
    fY(y) = (1/√(2π)) y^{−1/2} e^{−y/2}    for y > 0.
(d) Find the MGF of Y .
2. Using Formula 3.18 for the PDF of the χ2 distribution, show that if X ∼ χ2n then the MGF of X is
    MX(t) = (1 − 2t)^{−n/2}.    (3.21)
Deduce that if X ∼ χ2m & Y ∼ χ2n with X ⊥⊥ Y, then (X + Y) ∼ χ2m+n.
3. Given Z1, Z2 ∼IID N(0, 1), what is the probability that the point (Z1, Z2) lies
(a) in the square {(z1 , z2 ) | −1 < z1 < 1 & −1 < z2 < 1},
(b) in the circle {(z1 , z2 ) | (z12 + z22 ) < 1}?
4. Let Z1 , Z2 , . . . be independent random variables, each with mean 0 and variance 1, and let µi , σi and
ρij be constants with −1 ≤ ρij ≤ 1. Let
    Y1 = Z1,
    Y2 = ρ12 Z1 + √(1 − ρ12²) Z2,
and define Xi = µi + σi Yi , i = 1, 2.
(a) Show that E[Xi ] = µi , Var[Xi ] = σi2 (i = 1, 2), and that ρ12 is the correlation between X1 and
X2 .
(b) Find constants c0 , c1 , c2 and c3 such that
Y3 = c0 + c1 Z1 + c2 Z2 + c3 Z3
has mean 0, variance 1, and correlations ρ13 & ρ23 with Y1 and Y2 respectively.
(c) Hence show that the random vector Z = (Z1 , Z2 , Z3 )T with zero mean vector and identity
variance-covariance matrix can be transformed to give a random vector X = (X1 , X2 , X3 )T with
specified first and second moments, subject to constraints on the correlations corr[Xi , Xj ] = ρij
including
ρ212 + ρ213 + ρ223 ∈ [0, 1 + 2ρ12 ρ13 ρ23 ].
(d) What can you say about the distribution of X when Z has a standard trivariate Normal distribution and ρ212 + ρ213 + ρ223 is at one of the extremes of its allowable range (i.e. 0 or 1 + 2ρ12 ρ13 ρ23 )?
From Warwick ST217 exam 2001
5. Let Z = (Z1 , Z2 , . . . , Zm+n )T ∼ MVN m+n (0, I).
(a) Describe the distribution of Y = Z / √( Σ_{i=1}^{m+n} Zi² ).
(b) Show that the RV X = ( n Σ_{i=1}^m Yi² ) / ( m Σ_{i=m+1}^{m+n} Yi² ) has an Fm,n distribution.
(c) Hence show that if Y = (Y1, Y2, . . . , Ym+n)ᵀ has any continuous spherically symmetric distribution
centred at the origin, then X = ( n Σ_{i=1}^m Yi² ) / ( m Σ_{i=m+1}^{m+n} Yi² ) has an Fm,n distribution.
6. Suppose that X has a χ2n distribution with PDF given by Formula 3.18. Find the mean, mode &
variance of X, and an approximate variance-stabilising transformation.
38
7. Suppose that Yi are independent RVs with Poisson distributions: Yi ∼ Poi (λi ), i = 1, . . . , k.
(a) Assuming that λi is large, what is the approximate distribution of Zi = (Yi − λi)/√λi?
(b) Hence or otherwise show that if all the λi s are large, then the RV X = Σ_{i=1}^k (Yi − λi)²/λi has
approximately a χ2k distribution.
8. Suppose that the RVs Oi have independent Poisson distributions: Oi ∼ Poi(npi), i = 1, . . . , k, where
Σ_{i=1}^k pi = 1.
(a) Find EOi and Var Oi. Hence or otherwise show that E[Oi − npi] = 0 and Var[Oi − npi] = npi.
(b) Define the RV N by N = Σ_{i=1}^k Oi. What is the distribution of N?
(c) Define the RVs Ei = N pi, i = 1, . . . , k. Show that EEi = npi and VarEi = npi².
(d) By writing E[O1 E1] = p1 ( E[O1²] + E[O1 Σ_{i=2}^k Oi] ), or otherwise, show that Cov(O1, E1) = np1².
(e) Deduce that the RV (Oi − Ei) has mean 0 and variance npi(1 − pi) for i = 1, . . . , k.
9. (a) Define a multivariate standard Normal distribution N (0, I), where I denotes the identity matrix.
Given Z = (Z1 , Z2 , . . . , Zn )T ∼ N (0, I), write down functions of Z (i.e. transformed random
variables) having
i. a chi-squared distribution on (n − 1) degrees of freedom, and
ii. a t distribution on (n − 1) degrees of freedom.
(b) Let Z = (Z1, Z2, . . . , Zn)ᵀ have a multivariate standard Normal distribution, and let
Z̄ = Σ_{i=1}^n Zi / n.
Also let A = (aij) be an n × n orthogonal matrix, i.e. AAᵀ = I, and define the random vector
Y = (Y1, Y2, . . . , Yn)ᵀ by Y = AZ.
Quoting any properties of probability distributions that you require, show the following:
i. Show that Σ_{i=1}^n Yi² = Σ_{i=1}^n Zi².
ii. Show that Y ∼ N (0, I).
iii. Show that for suitable choices of ki , i = 1, . . . , n (where ki > 0 for all i), the following
matrix A is orthogonal, and find ki :

    A = [ k1       −k1       0        . . .    0                0             ]
        [ k2        k2      −2k2      . . .    0                0             ]
        [  :         :        :         ..       :               :            ]
        [ kn−2      kn−2     kn−2     . . .   −(n − 2)kn−2      0             ]
        [ kn−1      kn−1     kn−1     . . .    kn−1            −(n − 1)kn−1   ]
        [ kn        kn       kn       . . .    kn               kn            ]
iv. With the above definition of A, show that Σ_{i=1}^{n−1} Yi² = Σ_{i=1}^n (Zi − Z̄)² and that Yn = √n Z̄.
v. Hence show that the RVs Z̄ and Σ_{i=1}^n (Zi − Z̄)² are independent and have N(0, 1/n) and
χ2n−1 distributions respectively.
vi. Hence or otherwise show that if X1, X2, . . . , Xn ∼IID N(µ, σ²), and X̄ = Σ_{i=1}^n Xi / n, then the
random variable
    T = X̄ / √( (1/(n(n − 1))) Σ_{i=1}^n (Xi − X̄)² )
has a t distribution on n − 1 degrees of freedom.
From Warwick ST217 exam 2000
10. Let z(m, n, P ) denote the P % point of the Fm,n distribution. Without looking in statistical tables,
what can you say about the relationships between the following values:
(a) z(2, 2, 50) and z(20, 20, 50),
(b) z(2, 20, 50) and z(20, 2, 50),
(c) z(2, 20, 16) and z(20, 2, 84),
(d) z(20, 20, 2.5) and z(20, 20, 97.5)?
39
11. Suppose that Zi ∼IID N(0, 1), i = 1, 2, . . . What is the distribution of the following RVs?
(a) X1 = Z1 + Z2 − Z3
(b) X2 = (Z1 + Z2) / (Z1 − Z2)
(c) X3 = (Z1 − Z2)² / (Z1 + Z2)²
(d) X4 = [ (Z1 + Z2)² + (Z1 − Z2)² ] / 2
(e) X5 = 2Z1 / √( Z2² + Z3² + Z4² + Z5² )
(f) X6 = (Z1 + Z2 + Z3) / √( Z4² + Z5² + Z6² )
(g) X7 = 3(Z1 + Z2 + Z3 + Z4)² / [ (Z1 + Z2 − Z3 − Z4)² + (Z1 − Z2 + Z3 − Z4)² + (Z1 − Z2 − Z3 + Z4)² ]
(h) X8 = 2Z1² + (Z2 + Z3)²
12. For each of the RVs Xi defined in the previous question, use Statistical Tables to find ci (i = 1 . . . 8)
such that Pr(Xi > ci ) = 0.95.
13. Show that the PDFs of the t and F distributions (definitions 3.5 & 3.6) are indeed given by formulae
3.19 & 3.20.
14. (a) Define the Standard Multivariate Normal distribution MVN (0, I).
(b) Given Z = (Z1 , Z2 , . . . , Zm+n )T ∼ MVN (0, I), write down transformed random variables X(Z),
T (Z) and Y (Z) with the following distributions:
i. X ∼ χ2n ,
ii. T ∼ tn ,
iii. Y ∼ Fm,n .
(c) Given that the PDF of X ∼ χ2n is
    fX(x) = 1/( 2^{n/2} Γ(n/2) ) x^{(n/2)−1} e^{−x/2}    for x > 0,
and fX(x) = 0 elsewhere, show that
i. E[X] = n,
ii. E[X²] = n² + 2n, and
iii. E[1/X] = 1/(n − 2) (provided n > 2).
(d) Hence or otherwise find
i. the variance σX² of X ∼ χ2n,
ii. the mean µY of Y ∼ Fm,n and
iii. the mean µT and variance σT² of T ∼ tn,
stating under what conditions σX², µY, µT and σT² exist.
From Warwick ST217 exam 1998
Theory is often just practice with the hard bits left out.
J. M. Robson
Get a bunch of those 3–D glasses and wear them at the same time. Use enough to get it up to
a good, say, 10– or 12–D.
Rod Schmidt
The Normal . . . is the Ordinary made beautiful; it is also the Average made lethal.
Peter Shaffer
Symmetry, as wide or as narrow as you define is meaning, is one idea by which man through
the ages has tried to comprehend and create order, beauty and perfection.
Hermann Weyl
41
This page intentionally left blank (except for this sentence).
42
Chapter 4
Inference for Multiparameter Models
4.1 Introduction: General Concepts
4.1.1 Modelling
Given a random vector X = (X1 , X2 , . . . , Xp ), we can describe the joint distribution of the Xi s by the
CDF FX (x) or, usually more conveniently, by the PMF or PDF fX (x).
Interrelationships between the Xi s can be described using
1. marginals Fi (xi ), fi (xi ), Fij (xi , xj ), etc.,
2. conditionals Gi(xi |xj, j ≠ i), gi(xi |xj, j ≠ i), Gij(xi, xj |xk, k ≠ i, j), etc.,
3. conditional expectations E[Xi |Xj ], Var[Xi |Xj ], etc.
Often FX (x) is assumed to lie in a family of probability distributions:
F = {F (x|θ) | θ ∈ ΩΘ }
(4.1)
where ΩΘ is the ‘parameter space’.
The process of formulating, choosing within, & checking the reasonableness of, such families F, is called
statistical modelling (or probability modelling, or just modelling).
Exercise 4.1
The data-set in Table 4.1, plotted in Figure 1.1 (page 2), shows patients’ blood pressures before and after
treatment. Suggest some reasonable models for the data.
k
4.1.2 Data
In practice, we typically have a set of data in which d variables are measured on each of n ‘cases’ (or
‘individuals’ or ‘units’):
                 var.1    var.2    · · ·    var.d
    D = case.1 [ x11      x12      · · ·    x1d  ]
        case.2 [ x21      x22      · · ·    x2d  ]                                (4.2)
          :    [  :        :         ..      :   ]
        case.n [ xn1      xn2      · · ·    xnd  ]
    Patient            Systolic                        Diastolic
    Number     before    after    change      before    after    change
    1          210       201      -9          130       125      -5
    2          169       165      -4          122       121      -1
    3          187       166      -21         124       121      -3
    4          160       157      -3          104       106       2
    5          167       147      -20         112       101      -11
    6          176       145      -31         101        85      -16
    7          185       168      -17         121        98      -23
    8          206       180      -26         124       105      -19
    9          173       147      -26         115       103      -12
    10         146       136      -10         102        98      -4
    11         174       151      -23          98        90      -8
    12         201       168      -33         119        98      -21
    13         198       179      -19         106       110       4
    14         148       129      -19         107       103      -4
    15         154       131      -23         100        98      -18
Table 4.1: Supine systolic and diastolic blood pressures of 15 patients with moderate
hypertension (high blood pressure), immediately before and 2 hours after taking 25mg
of the drug captopril.
Data from HSDS, set 72
Definition 4.1 (Data Matrix)
A set of data D arranged in the form of 4.2 is called a data matrix or a cases-by-variables array.
The data-set D is assumed to be a representative sample (of size n) from an underlying population of
potential cases. This population may be actual, e.g. the resident population of England & Wales at noon
on June 30th 1993, or purely theoretical/hypothetical, e.g. MVN(µ, V).
Exercise 4.2
Table 4.2 presents data on ten asthmatic subjects, each tested with 4 drugs. Describe various ways that
the data might be set out as a data matrix for analysis by a statistical computing package.
k
4.1.3 Statistical Inference
Statistical inference is the art/science of using the sample to learn about the population (and hence,
implicitly, about future samples).
Typically we use statistics (properties of the sample)
to learn about parameters (properties of the population).
This activity might be:
1. Part of analysing a formal probability model,
b of θ, after making an assumption as in Expression 4.1, or
e.g. calculating the MLEs θ
2. Purely to summarise the data as a part of ‘data analysis’ (Section 4.2),
For example, given X1, X2, . . . , Xn ∼IID FX (unknown), the statistics
    S1 = (1/n) Σ Xi = X̄,    S2 = (1/n) Σ (Xi − X̄)²,    S3 = (1/n) Σ (Xi − X̄)³
    Drug    Time                             Patient number
                        1      2      3      4      5      6      7      8      9      10
    P       −5 mins     0.0    2.3    2.4    1.9    1.6    4.8    0.6    2.7    0.9    1.3
            +15 mins    3.8    9.2    5.4    3.3    4.2    15.1   1.3    6.7    4.2    3.1
    C       −5 mins     0.5    1.0    2.0    1.1    2.1    6.8    0.6    3.1    1.5    3.0
            +15 mins    2.0    5.3    7.5    6.4    4.1    9.1    0.6    14.8   2.4    2.3
    D       −5 mins     0.8    2.3    0.8    0.8    1.2    9.6    1.1    9.7    0.8    4.9
            +15 mins    2.4    4.8    2.4    1.9    1.2    12.5   1.7    12.5   4.3    8.1
    K       −5 mins     0.2    1.7    2.2    0.1    1.7    9.2    0.6    12.7   1.1    2.8
            +15 mins    0.4    3.4    2.0    1.3    3.4    6.7    1.1    12.5   2.7    5.7
Table 4.2: NCF (Neutrophil Chemotactic Factor) of ten individuals, each tested with
4 drugs: P (Placebo), C (Clemastine), D (DSCG), K (Ketotifen). On a given day,
an individual was administered the chosen drug, and his NCF measured 5 minutes
before, and 15 minutes after, being given a ‘challenge’ of allergen.
Data from Dr. R. Morgan of Bart’s Hospital
provide measures of location, scale and skewness.
Note that here we’re implicitly estimating the corresponding population quantities
    µX = EX,    E[(X − µX)²],    E[(X − µX)³],
and using these as measures of population location, scale and skewness. Without a formal probability
model, it can be hard to judge whether these or some other measures may be most appropriate.
In both cases, the CLT & its generalisations (to higher dimensions and to ‘near-independence’) show that,
under reasonable conditions, the joint distribution of the statistics of interest, such as θ̂ or (S1, S2, S3), is
approximately MVN. This approximation improves if
1. the sample size n → ∞, and/or
2. the joint distribution of the random variables being summed (e.g. the original random vectors
X1 , X2 , . . . , Xn ) is itself close to MVN.
QUESTIONS: How should we interpret this? How should we try to link probability models to reality?
4.2 Data Analysis
Data analysis is the art of summarising data while attempting to avoid probability theory.
For example, you can calculate summary statistics such as means, medians, modes, ranges, standard
deviations etc., thus summarising in a few numbers the main features of a possibly huge data-set. In
particular, the (0%, 25%, 50%, 75%, 100%) points of the data distribution (i.e. minimum, lower quartile,
median, upper quartile and maximum) form the five-number summary, and the inter-quartile range (IQR =
upper quartile − lower quartile) is a measure of spread, containing the ‘middle 50%’ of the data.
These summaries can be formalised as follows
Definition 4.2 (Order statistics)
Given RVs X1 , X2 , . . . , Xn , one can order them and denote the smallest of the Xi s by X(1) , the second
smallest by X(2) , etc. Then X(k) is called the kth order statistic.
45
Thus X(1) , X(2) , . . . , X(n) are a permutation of X1 , X2 , . . . , Xn , and x(n) , the observed value of X(n) , denotes
the largest observed value in a sample of size n.
Given ordered data x(1) ≤ x(2) ≤ · · · ≤ x(n) , one can define:
Definition 4.3 (Sample median)
    xM = x((n+1)/2)                        if n is odd,
    xM = ½ [ x(n/2) + x(n/2 + 1) ]         if n is even.
We can always write xM = x(n/2 + 1/2), provided we adopt the following convention:
1. If the number in brackets is exactly half-way between two integers, then take the average of the two
corresponding order statistics.
2. Otherwise round the bracketed subscript to the nearest integer, and take the corresponding order
statistic.
Similarly the quartiles etc. can be formally defined as follows:
Definition 4.4 (Sample lower quartile)
    xL = x(n/4 + 1/2),
Definition 4.5 (Sample upper quartile)
    xU = x(3n/4 + 1/2),
Definition 4.6 (100p th sample percentile)
    x100p% = x(pn/100 + 1/2),
Definition 4.7 (Five number summary)
    x(1), xL, xM, xU, x(n).
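A small computational sketch of these summaries (an addition to the notes, assuming numpy; note that numpy's default quartile convention interpolates, so it can differ slightly from the rounding convention in Definitions 4.4–4.6):

    import numpy as np

    x = np.array([12, 3, 7, 9, 15, 6, 11, 8, 10, 4])
    xs = np.sort(x)                        # order statistics x(1), ..., x(n)

    five_num = (xs[0],                     # minimum x(1)
                np.percentile(x, 25),      # lower quartile
                np.median(x),              # median
                np.percentile(x, 75),      # upper quartile
                xs[-1])                    # maximum x(n)
    iqr = five_num[3] - five_num[1]        # inter-quartile range

    print(five_num, iqr)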
4.3 Classical Inference
4.3.1 Introduction
In ‘classical statistical inference’, the typical procedure is:
1. Choose a family F of models indexed by θ (formula 4.1).
2. Assume temporarily that the true distribution lies in F
i.e. data D ∼ F (d|θ) for some true but unknown parameter vector θ ∈ ΩΘ .
3. Compare possible models according to some criterion of compatibility between the model & the data
(equivalently, between the population & the sample).
4. Assess the chosen model(s), and go back to step (1) or (2) if the model proves inadequate.
46
Comments
1. Step 1 is a compromise between
(a) what we believe is the true underlying mechanism that produced the data, and
(b) what we can do mathematically.
If in doubt, keep it simple.
2. Step 2, by assuming a true θ exists, implicitly interprets probability as a property of Nature
e.g. a ‘fair’ coin is assumed to have an intrinsic property: if you toss it n times, then the proportion
of ‘heads’ tends to 1/2 as n → ∞.
Thus probability represents a ‘long-run relative frequency’.
3. Most statistical computer packages currently use the classical approach, and we’ll mainly be using
classical inference in MSB.
4. There are many possible criteria at step 3. For example, hypothesis-testing and likelihood approaches
are both discussed briefly below.
4.3.2 Point Estimation (Univariate)
Given RVs X = (X1, X2, . . . , Xn), a point estimator for an unknown parameter Θ ∈ ΩΘ is simply a function
Θ̂(X) taking values in the parameter space ΩΘ. Once data X = x are obtained, one can calculate the
corresponding point estimate θ̂ = Θ̂(x).
There are many plausible criteria for Θ̂ to be considered a ‘good’ estimator of Θ. For example:
1. Mean Squared Error
One would like the mean squared error (MSE) of Θ̂ to be small whatever the true value θ of Θ, where
    MSE(Θ̂) = E[ (Θ̂ − θ)² ].                                                      (4.3)
In particular, an estimator Θ̂ has minimum mean squared error if
    MSE(Θ̂) = min_{Θ̂′} MSE(Θ̂′).
2. Unbiasedness
Definition 4.8 (Bias)
The bias of an estimator Θ̂ is
    Bias(Θ̂) = E[ Θ̂ − θ | Θ = θ ].                                                (4.4)
Exercise 4.3
Show that MSE(Θ̂) = Var(Θ̂) + (Bias Θ̂)².
k
Definition 4.9 (Unbiasedness)
An estimator Θ̂ for a parameter Θ is called unbiased if E[Θ̂ | Θ = θ] = θ for all possible true
values θ of Θ.
47
Example
Given a random sample X1, X2, . . . , Xn, i.e. Xi ∼IID FX(x), where FX is a member of some family F
of probability distributions,
(a) X̄ = Σ_{i=1}^n Xi / n is an unbiased estimate of the mean µX = EX of FX.
(b) More generally, any statistic of the form Σ_{i=1}^n wi Xi, where Σ_{i=1}^n wi = 1, is an unbiased estimate
of µX.
(c) σ̂1² = Σ_{i=1}^n (Xi − X̄)² / (n − 1) is an unbiased estimate of the variance σX² of FX, but
(d) σ̂2² = Σ_{i=1}^n (Xi − X̄)² / n is NOT an unbiased estimate of the variance σX² of FX.
3. Efficiency & Minimum Variance Unbiased Estimation
Given two unbiased estimators Θ̂1 & Θ̂2 for a parameter Θ, the efficiency of Θ̂1 relative to Θ̂2 is
defined by
    Eff(Θ̂1, Θ̂2) = Var(Θ̂1) / Var(Θ̂2).                                           (4.5)
Definition 4.10 (MVUE)
The Minimum Variance Unbiased Estimator of a parameter Θ is the unbiased estimator Θ̂, out
of all possible unbiased estimators, that has minimum variance.
Example
Given Xi ∼IID FX(x) ∈ F, the family of all probability distributions with finite mean & variance, it can
be shown that
(a) X̄ is the MVUE of the mean µX = EX of FX, and
(b) Σ_{i=1}^n (Xi − X̄)² / (n − 1) is the MVUE of the variance σX² of FX.
i=1 (Xi − X) /(n − 1) is the MVUE of the variance σX of FX .
Note that there are major problems with using MVUE as a criterion for estimation:
(a) The MVUE may not exist (e.g. in general there is no unbiased estimator for the underlying
standard deviation σX of X).
(b) The MVUE may exist but be nonsensical (see Problems).
(c) Even if the MVUE exists and appears reasonable, other (biased) estimators may be better by
other criteria, for example by having smaller mean squared error, which is much more important
in practice than being unbiased.
4. Consistency
Definition 4.11 (Consistency)
A sequence of estimators Θ̂1, Θ̂2, . . . is consistent for Θ ∈ ΩΘ if, for all ε > 0 and for all θ ∈ ΩΘ,
    lim_{n→∞} Pr( |Θ̂n − θ| > ε | Θ = θ ) = 0.
5. Sufficiency
Θ̂(X1, . . . , Xn) is sufficient for Θ if the conditional distribution of (X1, . . . , Xn) given Θ̂ = θ̂ & Θ = θ
does not depend on θ. See MSA.
6. Maximum likelihood
See MSA.
7. Invariance
See Casella & Berger, page 300.
48
8. The ‘plug in’ property
If θ is a specified property of the CDF F(x), then θ̂ is the corresponding property of the empirical
CDF
    F̂(x) = (1/n) × (number of Xi ≤ x).                                            (4.6)
For example (assuming the named quantities exist):
(a) the sample mean θ̂ = x̄ = Σ xi / n is the plug-in estimate of the population mean θ = EX,
(b) the sample median is the plug-in estimate of the population median F⁻¹(0.5).
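A short sketch of the ‘plug in’ idea (added here for illustration, not part of the original notes; it assumes numpy): the empirical CDF puts mass 1/n on each observation, so plug-in estimates are simply the corresponding sample quantities.

    import numpy as np

    x = np.array([2.1, 3.5, 0.7, 4.2, 2.8, 3.1])

    def ecdf(t, data=x):
        # empirical CDF: proportion of observations <= t (formula 4.6)
        return np.mean(data <= t)

    print(ecdf(3.0))         # F-hat(3.0)
    print(x.mean())          # plug-in estimate of E X
    print(np.median(x))      # plug-in estimate of the population median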
4.3.3 Hypothesis Testing (Introduction)
In this approach you
1. Choose a statistic T that has a known distribution F0 (t) if the true parameter value is θ = θ 0 (for
some particular parameter value θ 0 of interest). The statistic T should provide a measure of the
discrepancy of the data D from what would be reasonable if θ = θ 0 .
2. Test the hypothesis ‘θ = θ 0 ’ using the tail probabilities of F0 .
An example is the ‘Chi-squared’ statistic used in MSA. Hypothesis testing will be covered in more detail
in chapter 5.
Some problems with the standard hypothesis testing approach are:
1. In practice, we don’t really believe that θ = θ 0 is ‘true’ and all other possible values of θ are ‘false’;
instead we just wish to adopt ‘θ = θ 0 ’ as a convenient assumption, because it’s as good as, and
simpler than, other models.
2. If we really do want to make a decision [e.g. to give drug ‘A’ or drug ‘B’ to a particular patient],
then we should weigh up the possible consequences.
3. It’s hard to create appropriate hypothesis tests in complex situations, such as to test whether or not
θ lies in a particular subset Ω0 of the parameter space Ω.
Unfortunately, real life is a complex situation.
4.3.4 Likelihood Methods
Use the likelihood function
    L(θ; D) = Pr(D | θ)                   (discrete case)
            = (constant) × f(D | θ)       (continuous case),                       (4.7)
or equivalently the log-likelihood or ‘support’ function
    ℓ(θ; D) = log L(θ; D)                                                          (4.8)
as a measure of the compatibility between data D and parameter θ.
In particular, the MLE corresponds to the particular F_θ̂ ∈ F that is most compatible with the data D.
Likelihood underlies the most useful general approaches to statistics:
1. It can handle several parameters simultaneously.
2. The CLT implies that in many cases the log-likelihood will be approximately quadratic in θ (at least
near the MLE).
This makes both theory and numerical computation easier.
49
However, there are difficulties with basing inference solely on likelihood:
1. How should we handle ‘nuisance parameters’ (i.e. components θi that we’re not interested in)?
Note that it makes no sense to integrate over values of θi to get a ‘marginal likelihood’ for the other
θj s, since L(θ; d) is NOT a probability density or probability function—we would get a different
marginal likelihood if we reparametrised say by θi ↦ log θi.
2. A more fundamental problem is that likelihood takes no account of how far-fetched the model might
be (‘high likelihood’ does NOT mean ‘likely’ !)
This suggests that in practice we may wish to incorporate information not contained in the likelihood:
1. Prior information/Expert opinion: Are there external reasons for doubting some values of θ more
than others?
2. For decision-making: How relatively important are the possible consequences of our inferences?
[e.g. an innocent person is punished / a murderer walks free].
4.4 Problems
1. How might the mortality data in Tables 1.1 and 1.2 (pages 8 & 9) be set out as a data matrix?
2. Suppose that θ̂(X1, . . . , Xn) is unbiased. Show that θ̂ is consistent iff lim_{n→∞} Var( θ̂(X1, . . . , Xn) ) = 0.
3. Given Xi ∼IID FX(x), where FX is a member of some family F of probability distributions, show that
(a) Any statistic of the form Σ_{i=1}^n wi Xi, where Σ_{i=1}^n wi = 1, is an unbiased estimate of µX = EX,
(b) The mean X̄ = Σ_{i=1}^n Xi / n is the unique UMVUE of this form,
(c) σ̂² = Σ_{i=1}^n (Xi − X̄)² / (n − 1) is an unbiased estimate of the variance σX² of FX.
4. The number of mistakes made each lecture by a certain lecturer follow independent Poisson distributions, each with mean λ > 0.
You decide to attend the Monday lecture, note the number of mistakes X, and use X to estimate
the probability p that there will be no mistakes in the remaining two lectures that week.
(a) Show that p = exp(−2λ).
(b) Show that the only unbiased estimator of p (and hence, trivially, the MVUE), is
    p̂ = 1 if X is even,    p̂ = −1 if X is odd.
(c) What is the maximum likelihood estimator of p?
(d) Discuss (briefly) the relative merits of the MLE and the MVUE in this case.
5. Let T be an unbiased estimator for g(θ), let S be a sufficient statistic for θ, and let φ(S) = E[T |S].
Prove the Rao-Blackwell theorem:
φ(S) is also an unbiased estimator of g(θ), and Var[φ(S)|θ] ≤ Var[T |θ], for all θ,
and interpret this result.
50
6. (a) Explain what is meant by an unbiased estimator for an unknown parameter θ.
(b) Show, using moment generating functions or otherwise, that if X1 & X2 have independent
Poisson distributions with means λ1 & λ2 respectively, then their sum (X1 + X2 ) follows a
Poisson distribution with mean (λ1 + λ2 ).
(c) A particular sports game comprises four ‘quarters’, each lasting 15 minutes, and a statistician
attending the game wishes to predict the probability p that no further goals will be scored before
full time.
The statistician assumes that the numbers Xk of goals scored in the kth quarter follow independent Poisson distributions, each with (unknown) mean λ, so that
    Pr(Xk = x) = (λ^x / x!) e^{−λ}    (k = 1, 2, 3, 4;  x = 0, 1, 2, . . .).
Suppose that the statistician makes his prediction halfway through the match (i.e. after observing
X1 = x1 & X2 = x2 ). Show that an unbiased estimator of p is
    T = 1 if (x1 + x2) = 0,    T = 0 otherwise.
(d) Suppose the statistician also made a prediction after 15 minutes. Show that in this case the
ONLY unbiased estimator of p given X1 = x1 is
    T = 2^{x1} if x1 is even,    T = −2^{x1} if x1 is odd.
(e) What are the maximum likelihood estimators of p after 15 and after 30 minutes?
(f) Briefly compare the advantages of maximum likelihood and unbiased estimation for this situation.
From Warwick ST217 exam 1997
7. (a) Explain what is meant by a minimum variance unbiased estimator (MVUE).
(b) Let X and Y be random variables. Write down (without proof) expressions relating E[Y ] and
Var[Y ] to the conditional moments E[Y |X] and Var[Y |X].
(c) Let S be a sufficient statistic for a parameter θ, let T be an unbiased estimator for τ (θ), and
define W = E[T |S]. Show that
i. W is an unbiased estimator for τ (θ), and
ii. Var[W ] ≤ Var[T ] for all θ.
Deduce that a MVUE, if one exists, must be a function of a sufficient statistic.
(d) Let X1 , X2 , . . . , Xn be IID Bernoulli random variables, i.e.
    Pr(Xi = 1) = θ,    Pr(Xi = 0) = 1 − θ,    i = 1, 2, . . . , n.
i. Show that S = Σ_{i=1}^n Xi is a sufficient statistic for θ.
ii. Define T by
    T = 1 if X1 = 1 and X2 = 0,    T = 0 otherwise.
What is E[T]?
iii. Find E[T |S], and hence show that S(n − S)/(n − 1) is an MVUE of Var[S] = nθ(1 − θ).
From Warwick ST217 exam 1999
51
8. Given Xi ∼IID Poi(θ), compare the following possible estimators for θ in terms of unbiasedness, consistency, relative efficiency, etc.
    θ̂1 = X̄ = (1/n) Σ_{k=1}^n Xk,
    θ̂2 = (1/n) ( 100 + Σ_{k=1}^n Xk ),
    θ̂3 = ½ (X2 − X1)²,
    θ̂4 = (1/n) Σ_{k=1}^n (Xk − X̄)²,
    θ̂5 = (1/(n − 1)) Σ_{k=1}^n (Xk − X̄)²,
    θ̂6 = (θ̂1 + θ̂5)/2,
    θ̂7 = median(X1, X2, . . . , Xn),
    θ̂8 = mode(X1, X2, . . . , Xn),
    θ̂9 = (2/(n(n + 1))) Σ_{k=1}^n k Xk,
    θ̂10 = (1/(n − 1)) Σ_{k=2}^n Xk.
9. [Light relief]
Discuss the following possible defence submission at a murder trial:
‘The supposed DNA match placing the defendant at the scene of the crime would have arisen with
even higher probability if the defendant had a secret identical twin
[the more people with that DNA, the more chances of getting a match at the crime scene].
‘Now assume that my client has been cloned θ times, θ ∈ {0, 1, . . . , n} for some n > 0. Clearly the
larger the value of θ, the higher the probability of obtaining the observed DNA results
[every increase in θ means another clone who might have been at the scene of the crime].
‘Therefore the MLE of θ is n.
‘But then, even assuming somebody with my client’s DNA committed this terrible crime, the probability that it was my client is only 1/(n + 1) (under reasonable assumptions).
‘Therefore you cannot say that my client is, beyond a reasonable doubt, guilty.
‘The defence rests.’
4.5 Bayesian Inference
4.5.1 Introduction
Classical inference regards probability as a property of physical objects (e.g. a ‘fair coin’).
An alternative interpretation uses probability to represent an individual’s (lack of) understanding of an
uncertain situation.
52
Examples
1. ‘I have no reason to suspect that “heads” or “tails” are more likely. Therefore, by symmetry, my
current probability for this particular coin’s coming down “heads” is 1/2.’
2. ‘I doubt the accused has any previously-unknown identical siblings. I’d bet 100,000 to 1 against’
(i.e. if θ is the number of identical siblings, then my probability for θ > 0 is 1/100001).
Different people, with different knowledge, can legitimately have different probabilities for real-world events
(therefore it’s good discipline to say ‘my probability for. . . ’ rather than ‘the probability of. . . ’).
As you learn, your probabilities can be continually updated using Bayes’ theorem, i.e.
    Pr(A|B) = Pr(B|A) × Pr(A) / Pr(B)                                              (4.9)
assuming Pr(B) is positive, and using the fact that Pr(A&B) = Pr(A|B) Pr(B) = Pr(B|A) Pr(A) .
The Bayesian approach to statistical inference treats all uncertainty via probability, as follows:
1. You have a probability model for the data, with PMF p(D|Θ).
2. Your prior PMF for Θ (i.e. your PMF for Θ based on a combination of expert opinion, previous
experience, and your own prejudice), is p(θ).
3. Then Bayes’ theorem says
    p(θ | D) = p(D | θ) p(θ) / p(D)
or, since once the data have been obtained p(D) is a constant,
    p(θ | D) ∝ p(D | θ) p(θ) ∝ L(θ; D) p(θ)
i.e. ‘posterior probability’ ∝ ‘likelihood’ × ‘prior’.                             (4.10)
Formula 4.10 also applies in the continuous case, in which case p(·) represents a PDF.
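A tiny numerical sketch of formula 4.10 (an illustrative addition, assuming scipy/numpy): for a Binomial likelihood and a discretised prior on θ, the posterior is obtained by multiplying prior by likelihood pointwise and renormalising.

    import numpy as np
    from scipy.stats import binom

    theta = np.linspace(0.01, 0.99, 99)         # grid of parameter values
    prior = np.ones_like(theta)                 # flat prior (up to a constant)
    likelihood = binom.pmf(7, n=10, p=theta)    # hypothetical data: 7 successes in 10 trials

    posterior = prior * likelihood              # posterior proportional to likelihood x prior
    posterior /= posterior.sum()                # renormalise over the grid

    print(theta[np.argmax(posterior)])          # posterior mode; close to the MLE 0.7 here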
Comments
1. Further applications to decision theory are given in the third year course ST301.
2. Note that if θ = (θ1 , θ2 , . . . , θp ), then p(θ|D) is a p-dimensional function, and may prove difficult to
manipulate, summarise or visualise.
3. Treating all uncertainty via probability has the advantage that one-off events (e.g. management
decisions, or the results of horse races) can be handled. However, it’s not at all obvious that all
uncertainty can be treated via probability!
4. As with Classical inference, a Bayesian analysis of a problem should involve checking whether the
assumptions underlying p(D|θ) and p(θ) are reasonable, and rethinking & reanalysing the model if
necessary.
Exercise 4.4
Describe the Bayesian approach to statistical inference, denoting the data by x, the prior by fΘ (θ), and
the likelihood by L(θ; x) = fX|Θ (x|θ).
k
53
4.6 Nonparametric Methods
Standard Classical and Bayesian methods make strong assumptions, e.g. Xi ∼IID F(x|θ) for some θ ∈ Ω.
Assumptions of independence are critical (what aspects of the problem provide information about other
aspects?)
Assumptions about the form of probability distributions are often less important, at least provided the
sample size n is large. However, there are exceptions to this:
1. It might be that the probability distribution encountered in practice is fundamentally different from
the form assumed in our model. For example, some probability distributions are so ‘heavy-tailed’
that their means don’t exist, e.g. the Cauchy distribution with f(x) = 1/[π(1 + x²)], x ∈ R.
2. Some data may be recorded incorrectly, or there may be a few atypically large/small data values
(‘outliers’), etc.
3. In any case, what if n is small and the CLT can’t be invoked?
‘Nonparametric’ methods don’t assume that the actual probability distribution F (·|θ) lies in a particular
parametric family F; instead they make more general assumptions, for example
1. ‘F (x) is symmetric about some unknown value Θ’.
Note that this may be a reasonable assumption even if EX doesn’t exist.
Θ is the (unknown) median of the population, i.e. Pr(X < Θ) = Pr(X > Θ).
Therefore one could estimate Θ by the median of the data (though better methods may exist).
2. ‘F(x, y) is such that if (Xi, Yi) ∼IID F, (i = 1, 2), then Pr(Y1 < Y2 | X1 < X2) = 1/2’.
This is a nonparametric version of the statement ‘X & Y are uncorrelated’.
Many statistical methods involve estimating means, as we’ll see in the rest of the course (t-tests, linear
regression, many MLEs etc.)
Corresponding nonparametric methods typically involve medians—or equivalently, various probabilities.
Exercise 4.5
Suppose that X has a continuous distribution. Show that a test of the statement ‘median of X is θ0 ’ is
equivalent to a test of the statement ‘Pr(X < θ0 ) = 1/2’.
If Xi are IID, what is the distribution of R = (number of Xi < θT ), where θT is the true value of θ?
k
Other nonparametric methods involve ranking the data Xi : replacing the smallest Xi by 1, the next
smallest by 2, etc. Classical statistical methods can then be applied to the ranks. Note that the effect of
outliers will be reduced.
Example
Given data (Xi , Yi ), i = 1, . . . , n from a continuous bivariate distribution, ‘Spearman’s rank correlation’
(often written ρS ) can be calculated as follows:
1. replace the Xi values by their ranks Ri ,
2. similarly replace the Yi values by their ranks Si ,
3. calculate the usual (‘product-moment’ or ‘Pearson’s’) correlation between the Ri s and Si s.
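A minimal sketch of the three steps above (an added illustration, assuming scipy): rank each variable, then apply the ordinary product-moment correlation to the ranks; scipy's built-in spearmanr should agree.

    import numpy as np
    from scipy.stats import rankdata, pearsonr, spearmanr

    x = np.array([1.2, 3.4, 2.2, 5.0, 4.1])
    y = np.array([2.0, 3.9, 2.5, 4.8, 5.5])

    r = rankdata(x)                  # step 1: ranks of the x's (ties get averaged ranks)
    s = rankdata(y)                  # step 2: ranks of the y's
    rho_S = pearsonr(r, s)[0]        # step 3: Pearson correlation of the ranks

    print(rho_S, spearmanr(x, y)[0])     # the two values should agree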
54
Comments
1. If the distribution of the original RVs is not continuous, then some data values may be repeated (‘tied
ranks’). Repeated Xi s are given averaged ranks (for example, if there are two Xi with the smallest
value, then they are each given rank 1.5 = (1 + 2)/2).
2. If X ⊥⊥ Y, so the ‘true’ ρS is zero, then the distribution of the calculated ρS is easily approximated
(using the standard formulae for Σ_{i=1}^n i^k).
3. ‘Easily approximated’ does not necessarily mean ‘well approximated’ !
4. Most books give another formula for ρS , which is equivalent unless there are tied ranks, but which
obscures the relationship with the standard product-moment correlation
    ρ = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² Σ (yi − ȳ)² ).
5. Other, perhaps better, types of nonparametric correlation have been defined (‘Kendall’s τ ’).
4.7 Graphical Methods
A vital part of data analysis is to plot the data using bar-charts, histograms, scatter diagrams etc. Plotting
the data is important no matter what further formal statistical methods will be used:
1. It enables you to ‘get a feel for’ the data,
2. It helps you look for patterns and anomalies,
3. It helps in checking assumptions (such as independence, linearity or Normality).
Many useful plots can be easily churned out using a computer, though sometimes you have to devise original
plots to display the data in the most appropriate way.
Exercise 4.6
The following table shows 66 measurements on the speed of light, made by S. Newcomb in 1882. Values
are the times in nanoseconds (ns), less 24,800 ns, for light to travel from his laboratory to a mirror and
back. Values are to be read row-by-row, thus the first two observations are 24,828 ns and 24,826 ns.
    28   26   33   24   34  -44   27   16   40   -2
    29   22   24   21   25   30   23   29   31   19
    24   20   36   32   36   28   25   21   28   29
    37   25   28   26   30   32   36   26   30   22
    36   23   27   27   28   27   31   27   26   33
    26   32   32   24   39   28   24   25   32   25
    29   27   28   29   16   23
Produce a histogram, a Normal probability plot and a time plot of Newcomb’s data. Decide which (if any)
observations to ignore, and produce a normal probability plot of the remaining reduced data set. Finally
compare the mean of this reduced data set with (i) the mean and (ii) the 10% trimmed mean of the original
data. Solution: Plots are shown in Figure 4.1. There are clearly 2 large outliers, but the time plot also
suggests that the 6th to 10th observations are unusually variable, and that the last two observations are
atypically low (both being lower than the previous 20 observations).
The Normal probability plot is calculated by computing y(i) (the sorted data) and zi as follows, and
plotting y(i) against zi.

    i       y(i)      xi = (i + 0.5)/(n + 1)      zi = Φ⁻¹(xi)
    1       −44       0.0075                      −2.434
    2       −2        0.0224                      −2.007
    3       16        0.0373                      −1.783
    4       16        0.0522                      −1.624
    :        :           :                            :
    65      39        0.9776                       2.007
    66      40        0.9925                       2.434
Omitting the first 10 and the last 2 recorded observations leaves a data-set where the Normality and
independence assumptions are much more reasonable—see plot (d) of Figure 4.1.
Location estimates are (i) 26.2, (ii) 27.4, (iii) 27.9. The trimmed mean is reasonably close to the mean of
observations 11–64.
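The calculations described in this solution can be reproduced in a few lines (an added sketch, not part of the original notes; it assumes numpy/scipy, and uses one common plotting-position convention, (i − 0.5)/n, which differs slightly from the table above):

    import numpy as np
    from scipy.stats import norm, trim_mean

    newcomb = np.array([28, 26, 33, 24, 34, -44, 27, 16, 40, -2,
                        29, 22, 24, 21, 25, 30, 23, 29, 31, 19,
                        24, 20, 36, 32, 36, 28, 25, 21, 28, 29,
                        37, 25, 28, 26, 30, 32, 36, 26, 30, 22,
                        36, 23, 27, 27, 28, 27, 31, 27, 26, 33,
                        26, 32, 32, 24, 39, 28, 24, 25, 32, 25,
                        29, 27, 28, 29, 16, 23])          # time order, read row-by-row from the table

    n = newcomb.size
    y = np.sort(newcomb)                                   # order statistics y(i)
    z = norm.ppf((np.arange(1, n + 1) - 0.5) / n)          # normal scores; plot y against z

    reduced = newcomb[10:-2]                               # drop the first 10 and last 2 observations
    print(reduced.mean(), newcomb.mean(), trim_mean(newcomb, 0.1))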
Figure 4.1: Plots of Newcomb’s data: (a) histogram, (b) Normal probability plot, (c) time plot, (d) Normal
probability plot of data after excluding the first 10 and last 2 observations.
k
4.8 Bootstrapping
‘Bootstrap’ methods have become increasingly used over the past few years. They address the general
question:
‘What are the properties of the calculated statistics (e.g. MLEs θ̂) given that the underlying
distributional assumptions may be false (and, in reality, will be false)?’
Bootstrapping uses the observed data directly as an estimate of the underlying population, then uses
‘plug-in’ estimation, and typically involves computer simulation.
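A minimal bootstrap sketch (an added illustration, assuming numpy): resample the observed data with replacement many times and look at the spread of the recomputed statistic, here the sample median.

    import numpy as np

    rng = np.random.default_rng(3)
    data = np.array([28, 26, 33, 24, 34, 27, 16, 40, 29, 22, 24, 21, 25, 30])   # any observed sample

    B = 2000
    boot_medians = np.empty(B)
    for b in range(B):
        resample = rng.choice(data, size=data.size, replace=True)   # plug-in: treat the data as the population
        boot_medians[b] = np.median(resample)

    print(np.median(data), boot_medians.std())        # point estimate and bootstrap standard error
    print(np.percentile(boot_medians, [2.5, 97.5]))   # simple percentile interval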
Several other computer-intensive approaches to statistical inference have also become very popular recently.
4.9 Problems
1. [Light relief]
Discuss the following quote:
‘As a statistician, I want to use mathematics to help deal with practical uncertainty. The natural
mathematical way to handle uncertainty is via probability.
‘About the simplest practical probability statement I can think of is “The probability that a fair coin,
tossed at random, will come down ‘heads’ is 1/2”.
‘Now try to define “fair coin”, “at random” and “probability 1/2” without using subjective probability
or circular definitions.
‘Summary: if a practical probability statement is not subjective, then it must be tautologous, ill-defined, or useless.
‘Of course, for balance, some of the time I teach subjective methods, and some of the time I teach
useless methods :-).’
Ewart Shaw (Internet posting 13–Aug–1993).
2. (a) Plot the captopril data (Table 4.1), and suggest what sort of models seem reasonable.
(b) Roughly estimate from your graph(s) the effect of captopril (C) on systolic and diastolic blood
pressure (SBP & DBP).
(c) Suggest a single summary measure (SBP, DBP or a combination of the two) to quantify the
effect of treatment.
(d) Do you think a transformation of the data would be appropriate?
(e) Comment on the number of parameters in your model(s).
(f) Calculate ρS and ρ between ∆S, the change (after − before) in SBP, and ∆D, the change (after − before) in DBP.
Suggest some advantages and disadvantages in using ρS and ρ here.
(g) Calculate some further summary statistics such as means, variances, correlations and fivenumber summaries, and comment on how useful they are as summaries of the data.
(h) Are there any problems in using the data to estimate the effect of captopril? What further
information would be useful?
(i) What advantages/disadvantages would there be in using bootstrapping here, i.e. using the discrete distribution that assigns probability 1/15 to each of the 15 points x1 = (210, 201, 130, 125),
x2 = (169, 165, 122, 121), . . . , x15 = (154, 131, 100, 82) as an estimate of the underlying population, and working out the properties of ρS , ρ, etc. based on that assumption?
57
This page intentionally left blank (except for this sentence).
58
Chapter 5
Hypothesis Testing
5.1 Introduction
A hypothesis is a claim about the real world; statisticians will be interested in hypotheses like:
1. ‘The probabilities of a male panda or a female panda being born are equal’,
2. ‘The number of flying bombs falling on a given area of London during World War II follows a Poisson
distribution’,
3. ‘The mean systolic blood pressure of 35-year-old men is no higher than that of 40-year-old women’,
4. ‘The mean value of Y = log(systolic blood pressure) is independent of X = age’
(i.e. E[Y |X = x] = constant).
These hypotheses can be translated into statements about parameters within a probability model:
1. ‘p1 = p2’,
2. ‘N ∼ Poi(λ) for some λ > 0’, i.e.: pn = Pr(N = n) = λ^n exp(−λ)/n! (within the general probability
model pn ≥ 0 ∀n = 0, 1, . . . ;  Σ pn = 1),
3. ‘θ1 ≤ θ2 ’ and
4. ‘β1 = 0’ (assuming the linear model E[Y |x] = β0 + β1 x).
Definition 5.1 (Hypothesis test)
A hypothesis test is a procedure for deciding whether to accept a particular hypothesis as a reasonable
simplifying assumption, or to reject it as unreasonable in the light of the data.
Definition 5.2 (Null hypothesis)
The null hypothesis H0 is the simplifying assumption we are considering making.
Definition 5.3 (Alternative hypothesis)
The alternative hypothesis H1 is the alternative explanation(s) we are considering for the data.
Definition 5.4 (Type I error)
A type I error is made if H0 is rejected when H0 is true.
Definition 5.5 (Type II error)
A type II error is made if H0 is accepted when H0 is false.
59
Comments
1. In the first example above (pandas) the null hypothesis is H0 : p1 = p2 .
2. The alternative hypothesis in the first example would usually be H1 : p1 ≠ p2, though it could also
be (for example)
(a) H1 : p1 < p2 ,
(b) H1 : p1 > p2 , or
(c) H1 : p1 − p2 = δ for some specified δ ≠ 0.
5.2 Simple Hypothesis Tests
The simplest type of hypothesis testing occurs when the probability distribution giving rise to the data is
specified completely under the null and alternative hypotheses.
Definition 5.6 (Simple hypotheses)
A simple hypothesis is of the form Hk : θ = θk ,
i.e. the probability distribution of the data is specified completely.
Definition 5.7 (Composite hypotheses)
A composite hypothesis is of the form Hk : θ ∈ Ωk ,
i.e. the parameter θ lies in a specified subset Ωk of the parameter space ΩΘ .
Definition 5.8 (Simple hypothesis test)
A simple hypothesis test tests a simple null hypothesis H0 : θ = θ0 against a simple alternative
H1 : θ = θ1 , where θ parametrises the distribution of our experimental random variables X =
X 1 , X 2 , . . . Xn .
There may be many seemingly sensible approaches to testing a given hypothesis. A reasonable criterion
for choosing between them is to attempt to minimise the chance of making a mistake: incorrectly rejecting
a true null hypothesis, or incorrectly accepting a false null hypothesis.
Definition 5.9 (Size)
A test of size α is one which rejects the null hypothesis H0 : θ = θ0 in favour of the alternative
H1 : θ = θ1 iff
X ∈ Cα
where Pr(X ∈ Cα | θ = θ0 ) = α
for some subset Cα of the sample space S of X.
Definition 5.10 (Critical region)
The set Cα in Definition 5.9 is called the critical region or rejection region of the test.
Definition 5.11 (Power & power function)
The power function of a test with critical region Cα is the function
β(θ) = Pr(X ∈ Cα | θ),
and the power is β = β(θ1 ), i.e. the probability that we reject H0 in favour of H1 when H1 is true.
A hypothesis test typically uses a test statistic T (X), whose distribution is known under H0 , and such that
extreme values of T(X) are more compatible with H1 than with H0.
Many useful hypothesis tests have the following form:
60
Definition 5.12 (Simple likelihood ratio test)
A simple likelihood ratio test (SLRT) of H0 : θ = θ0 against H1 : θ = θ1 rejects H0 iff
    X ∈ Cα* = { x : L(θ0; x) / L(θ1; x) ≤ Aα }
where L(θ; x) is the likelihood of θ given the data x, and the number Aα is chosen so that the size of
the test is α.
Exercise 5.1
Suppose that X1, X2, . . . , Xn ∼IID N(θ, 1). Show that the likelihood ratio for testing H0 : θ = 0 against
H1 : θ = 1 can be written
    λ(x) = exp{ n (x̄ − ½) }.
Hence show that the corresponding SLRT of size α rejects H0 when the test statistic T(X) = X̄ satisfies
    T > Φ⁻¹(1 − α)/√n.
k
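A simulation sketch of Exercise 5.1 (an added illustration, assuming numpy/scipy): the SLRT rejects when X̄ exceeds Φ⁻¹(1 − α)/√n; simulating under θ = 0 and θ = 1 estimates its size and power.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(4)
    n, alpha = 10, 0.05
    cutoff = norm.ppf(1 - alpha) / np.sqrt(n)             # rejection threshold for the sample mean

    xbar_H0 = rng.normal(0.0, 1.0, size=(50_000, n)).mean(axis=1)
    xbar_H1 = rng.normal(1.0, 1.0, size=(50_000, n)).mean(axis=1)

    print((xbar_H0 > cutoff).mean())    # estimated size, approximately 0.05
    print((xbar_H1 > cutoff).mean())    # estimated power under theta = 1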
Comments
1. For a simple hypothesis test, both H0 and H1 are ‘point hypotheses’, each specifying a particular
value for the parameter θ rather than a region of the parameter space.
2. The size α is the probability of rejecting H0 when H0 is in fact true; clearly we want α to be small
(α = 0.05, say).
3. Clearly for a fixed size α of test, the larger the power β of a test the better.
However, there is an inevitable trade-off between small size and high power (as in a jury trial: the
more careful one is not to convict an innocent defendant, the more likely one is to free a guilty one
by mistake).
4. In practice, no hypothesis will be precisely true, so the whole foundation of classical hypothesis testing
seems suspect!
5. Regarding likelihood as a measure of compatibility between data and model, an SLRT compares the
compatibility of θ0 and θ1 with the observed data x, and accepts H0 iff the ratio is sufficiently large.
6. One reason for the importance of likelihood ratio tests is the following theorem, which shows that
out of all tests of a given size, an SLRT (if one exists) is ‘best’ in a certain sense.
Theorem 5.1 (The Neyman-Pearson lemma)
Given random variables X1 , X2 , . . . , Xn , with joint density f (x|θ), the simple likelihood ratio test of
a fixed size α for testing H0 : θ = θ0 against H1 : θ = θ1 is at least as powerful as any other test of
the same size.
Exercise 5.2
[Proof of Theorem 5.1] Prove the Neyman-Pearson lemma. Solution: Fix the size of the test to be α.
Let A be a positive constant and C0 a subset of the sample space satisfying
1. Pr(X ∈ C0 | θ = θ0 ) = α,
2. X ∈ C0  ⇐⇒  L(θ0; x) / L(θ1; x) = f(x|θ0) / f(x|θ1) ≤ A.
Suppose that there exists another test of size α, defined by the critical region C1 , i.e.
61
Figure 5.1: Proof of Neyman-Pearson lemma (sketch of the sample space ΩX showing the regions C0, C1 and B1, B2, B3).
Reject H0 iff x ∈ C1 , where Pr(x ∈ C1 |θ = θ0 ) = α.
Let B1 = C0 ∩ C1 , B2 = C0 ∩ C1c , B3 = C0c ∩ C1 .
Note that B1 ∪ B2 = C0 , B1 ∪ B3 = C1 , and B1 , B2 & B3 are disjoint.
Let the power of the likelihood ratio test be I0 = Pr(X ∈ C0 | θ = θ1 ),
and the power of the other test be I1 = Pr(X ∈ C1 | θ = θ1 ).
We want to show that I0 − I1 ≥ 0.
But
    I0 − I1 = ∫_{C0} f(x|θ1) dx − ∫_{C1} f(x|θ1) dx
            = ∫_{B1∪B2} f(x|θ1) dx − ∫_{B1∪B3} f(x|θ1) dx
            = ∫_{B2} f(x|θ1) dx − ∫_{B3} f(x|θ1) dx.
Also B2 ⊆ C0, so f(x|θ1) ≥ A⁻¹ f(x|θ0) for x ∈ B2;
similarly B3 ⊆ C0ᶜ, so f(x|θ1) ≤ A⁻¹ f(x|θ0) for x ∈ B3.
Therefore
    I0 − I1 ≥ A⁻¹ [ ∫_{B2} f(x|θ0) dx − ∫_{B3} f(x|θ0) dx ]
            = A⁻¹ [ ∫_{C0} f(x|θ0) dx − ∫_{C1} f(x|θ0) dx ]
            = A⁻¹ [ α − α ]
            = 0
as required.
k
5.3 Simple Null, Composite Alternative
Suppose that we wish to test the simple null hypothesis H0 : θ = θ0 against the composite alternative
hypothesis H1 : θ ∈ Ω1 .
The easiest way to investigate this is to imagine the collection of simple hypothesis tests with null hypothesis
H0 : θ = θ0 and alternative H1 : θ = θ1 , where θ1 ∈ Ω1 . Then, for any given θ1 , an SLRT is the most
powerful test for a given size α. The only problem would be if different values of θ1 result in different
SLRTs.
62
Definition 5.13 (UMP Tests)
A hypothesis test is called a uniformly most powerful test of H0 : θ = θ0 against H1 : θ = θ1 , θ1 ∈ Ω1 ,
if
1. There exists a critical region Cα corresponding to a test of size α not depending on θ1 ,
2. For all values of θ1 ∈ Ω1 , the critical region Cα defines a most powerful test of H0 : θ = θ0
against H1 : θ = θ1 .
Exercise 5.3
Suppose that X1, X2, . . . , Xn ∼IID N(0, σ²).
1. Find the UMP test of H0 : σ² = 1 against H1 : σ² > 1.
2. Find the UMP test of H0 : σ² = 1 against H1 : σ² < 1.
3. Show that no UMP test of H0 : σ² = 1 against H1 : σ² ≠ 1 exists.
k
Comments
1. If a UMP test exists, then it is clearly the appropriate test to use.
2. Often UMP tests don’t exist!
3. A UMP test involves the data only via a likelihood ratio, so is a function of the sufficient statistics.
4. The critical region Cα therefore often has a simple form, and is usually easily found once the distribution of the sufficient statistics has been determined
(hence the importance of the χ2 , t and F distributions).
5. The above three examples illustrate how important the form of the alternative hypothesis is. The first two are one-sided alternatives, whereas H1 : σ² ≠ 1 is a two-sided alternative
hypothesis, since σ 2 could lie on either side of 1.
5.4 Composite Hypothesis Tests
The most general situation we’ll consider is where the parameter space Ω is divided into two subsets:
Ω = Ω0 ∪ Ω1 , where Ω0 ∩ Ω1 = ∅, and the hypotheses are H0 : θ ∈ Ω0 , H1 : θ ∈ Ω1 .
For example, one may want to test the null hypothesis that the data come from an exponential distribution
against the alternative that the data come from a more general gamma distribution. Note that here, as in
many other cases, dim(Ω0 ) < dim(Ω1 ) = dim(Ω).
One possible approach to this situation is to regard the maximum possible likelihood over θ ∈ Ωi as a
measure of compatibility between the data and the hypothesis Hi (i = 0, 1). It’s therefore convenient to
define the following:
    θ̂    is the MLE of θ over the whole parameter space Ω,
    θ̂0   is the MLE of θ over Ω0, i.e. under the null hypothesis H0, and
    θ̂1   is the MLE of θ over Ω1, i.e. under the alternative hypothesis H1.
Note that θ̂ must therefore be the same as either θ̂0 or θ̂1, since Ω = Ω0 ∪ Ω1.
One might consider using the likelihood ratio criterion L(θ̂1; x)/L(θ̂0; x), by direct analogy with the SLRT.
However, it’s generally easier to use the equivalent ratio L(θ̂; x)/L(θ̂0; x):
63
Definition 5.14 (Likelihood Ratio Test (LRT))
A likelihood ratio test rejects H0 : θ ∈ Ω0 in favour of the alternative H1 : θ ∈ Ω1 = Ω \ Ω0 iff
    λ(x) = L(θ̂; x) / L(θ̂0; x) ≥ λ,                                               (5.1)
where θ̂ is the MLE of θ over the whole parameter space Ω, θ̂0 is the MLE of θ over Ω0, and the
value λ is fixed so that
    sup_{θ∈Ω0} Pr( λ(X) ≥ λ | θ ) = α
where α, the size of the test, is some chosen value.
Equivalently, the test criterion uses the log LRT statistic:
    r(x) = ℓ(θ̂; x) − ℓ(θ̂0; x) ≥ λ′,                                              (5.2)
where ℓ(θ; x) = log L(θ; x), and λ′ is chosen to give chosen size α = sup_{θ∈Ω0} Pr( r(X) ≥ λ′ | θ ).
Comments
1. The size α is typically chosen by convention to be 0.05 or 0.01.
2. Note that high values of the test statistic λ(x), or equivalently of r(x), are taken as evidence against
the null hypothesis H0 .
3. The test given in Definition 5.14 is sometimes referred to as a generalized likelihood ratio test, and
Equation 5.1 a generalized likelihood ratio test statistic.
4. Equation 5.2 is often easier to work with than Equation 5.1—see the exercises and problems.
Exercise 5.4
[Paired t-test] Suppose that X1, X2, . . . , Xn ∼IID N(µ, σ²), and let X̄ = Σ Xi/n, S² = Σ (Xi − X̄)²/(n − 1).
What is the distribution of T = X̄/(S/√n)?
Is the test based on rejecting H0 : µ = 0 for large T a likelihood ratio test?
Assuming that the observed differences in diastolic blood pressure (after – before) are IID and Normally
distributed with mean δD, use the captopril data (Table 4.1) to test the null hypothesis H0 : δD = 0 against the
alternative hypothesis H1 : δD ≠ 0.
Comment: this procedure is called the paired t test.
k
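The calculation in Exercise 5.4 is easy to script. Below is a minimal Python sketch (not part of the original notes); the diffs array is a placeholder to be filled with the fifteen captopril differences (after − before), which are not reproduced here.

    import numpy as np
    from scipy import stats

    diffs = np.array([])                    # placeholder: captopril DBP differences (after - before)
    if diffs.size > 1:
        n = diffs.size
        T = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(n))   # T = Xbar / (S / sqrt(n))
        p = 2 * stats.t.sf(abs(T), df=n - 1)                  # two-sided p-value from t_{n-1}
        print(T, p)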
Exercise 5.5
[Two sample t-test] Suppose X1 , X2 , . . . , Xm ∼IID N (µX , σ²) and Y1 , Y2 , . . . , Yn ∼IID N (µY , σ²).
1. Derive the LRT for testing H0 : µX = µY versus H1 : µX ≠ µY .
2. Show that the LRT can be based on the test statistic

       T = ( X̄ − Ȳ ) / ( Sp √(1/m + 1/n) ),                                  (5.3)

   where

       Sp² = [ Σ_{i=1}^m (Xi − X̄)² + Σ_{i=1}^n (Yi − Ȳ)² ] / (m + n − 2).    (5.4)

3. Show that, under H0 , T ∼ t_{m+n−2} .
4. Two groups of female rats were placed on diets with high and low protein content, and the gain
in weight (grammes) between the 28th and 84th days of age was measured for each rat, with the
following results:
High protein diet
134 146 104 119 124 161 107 83 113 129 97 123
Low protein diet
70 118 101 85 107 132 94
Using the test statistic T above, test the null hypothesis that the mean weight gain is the same under
both diets.
Comment: this is called the two sample t-test, and Sp2 is the pooled estimate of variance.
k
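As a rough check on part 4, here is a Python sketch (not from the notes) that evaluates T and Sp² of (5.3)–(5.4) for the rat weight-gain data above; scipy's ttest_ind with equal_var=True should agree.

    import numpy as np
    from scipy import stats

    high = np.array([134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123], float)
    low = np.array([70, 118, 101, 85, 107, 132, 94], float)
    m, n = len(high), len(low)
    sp2 = (((high - high.mean())**2).sum() + ((low - low.mean())**2).sum()) / (m + n - 2)
    T = (high.mean() - low.mean()) / np.sqrt(sp2 * (1/m + 1/n))
    p = 2 * stats.t.sf(abs(T), df=m + n - 2)      # two-sided p-value
    print(T, p)                                   # compare with stats.ttest_ind(high, low, equal_var=True)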
Exercise 5.6
[F-test] Suppose X1 , X2 , . . . , Xm ∼IID N (µX , σX²) and Y1 , Y2 , . . . , Yn ∼IID N (µY , σY²), where µX , µY , σX and σY are all unknown.
Suppose we wish to test the hypothesis H0 : σX² = σY² against the alternative H1 : σX² ≠ σY².
1. Let SX² = Σ_{i=1}^m (Xi − X̄)² and SY² = Σ_{i=1}^n (Yi − Ȳ)².
   What are the distributions of SX²/σX² and SY²/σY² ?
2. Under H0 , what is the distribution of the statistic

       V = [ SX²/(m − 1) ] / [ SY²/(n − 1) ] ?

3. Taking values of V much larger or smaller than 1 as evidence against H0 , and given data with m = 16, n = 16, Σ xi = 84, Σ yi = 18, Σ xi² = 563, Σ yi² = 72, test the null hypothesis H0 .
Comment: with the alternative hypothesis H1 : σX² > σY², the above procedure is called an F test.
k
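A short Python sketch (an illustration, not part of the notes) of the arithmetic needed in part 3, recovering the corrected sums of squares SX² and SY² from the quoted summary statistics and treating both very large and very small V as evidence against H0:

    from scipy import stats

    m = n = 16
    sum_x, sum_xx = 84.0, 563.0
    sum_y, sum_yy = 18.0, 72.0
    SX2 = sum_xx - sum_x**2 / m          # sum of (xi - xbar)^2
    SY2 = sum_yy - sum_y**2 / n          # sum of (yi - ybar)^2
    V = (SX2 / (m - 1)) / (SY2 / (n - 1))
    p = 2 * min(stats.f.cdf(V, m - 1, n - 1), stats.f.sf(V, m - 1, n - 1))   # two-sided
    print(V, p)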
Even in simple cases like this, the null distribution of the log likelihood ratio test statistic r(x) (5.2) can be
difficult or impossible to find analytically. Fortunately, there is a very powerful and very general theorem
that gives the approximate distribution of r(x):
Theorem 5.2 (Wald’s Theorem)
Let X1 , X2 , . . . , Xn ∼IID f (x|θ) where θ ∈ Ω, and let r(x) denote the log likelihood ratio test statistic

    r(x) = ℓ(θ̂; x) − ℓ(θ̂0 ; x),

where θ̂ is the MLE of θ over Ω and θ̂0 is the MLE of θ over Ω0 ⊂ Ω.
Then under reasonable conditions on the PDF (or PMF) f (·|·), the distribution of 2r(x) converges to a χ² distribution on dim(Ω) − dim(Ω0 ) degrees of freedom as n → ∞.
Comments
1. A proof is beyond the scope of this course, but may be found in e.g. Kendall & Stuart, ‘The Advanced
Theory of Statistics’, Vol. II.
2. Wald’s theorem implies that, provided the sample size is large, you only need tables of the χ2
distribution to find the critical regions for a wide range of hypothesis tests.
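As an illustration of how Theorem 5.2 is used in practice, here is a hedged Python sketch (not from the notes) of the exponential-versus-gamma test mentioned at the start of Section 5.4; the data vector x is an arbitrary illustrative sample, and the gamma fit uses scipy's maximum-likelihood fitting with the location fixed at zero.

    import numpy as np
    from scipy import stats

    x = np.array([0.8, 1.9, 0.3, 2.7, 1.1, 0.4, 3.5, 0.9])   # illustrative positive data only
    n = x.size
    # H0: exponential with rate theta; the MLE is theta = 1/xbar
    loglik0 = n * np.log(1 / x.mean()) - n
    # H1: gamma with shape a and scale s, fitted by maximum likelihood
    a, loc, s = stats.gamma.fit(x, floc=0)
    loglik1 = stats.gamma.logpdf(x, a, loc=0, scale=s).sum()
    twice_r = 2 * (loglik1 - loglik0)
    p = stats.chi2.sf(twice_r, df=1)     # dim(Omega) - dim(Omega0) = 2 - 1
    print(twice_r, p)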
Another important theorem, see Problem 3.7.9, page 39, is the following:

Theorem 5.3 (Sample Mean and Variance of Xi ∼IID N (µ, σ²))
Let X1 , X2 , . . . , Xn ∼IID N(µ, σ²). Then
1. X̄ = Σ Xi /n and Y = Σ (Xi − X̄)² are independent RVs,
2. X̄ has a N(µ, σ²/n) distribution,
3. Y /σ² has a χ²_{n−1} distribution.
Exercise 5.7
Suppose X1 , X2 , . . . , Xn ∼IID N (θ, 1), with hypotheses H0 : θ = 0 and H1 : θ arbitrary.
Show that 2r(x) = n x̄², and hence that Wald’s theorem holds exactly in this case.
k
Exercise 5.8
Suppose now that Xi ∼ N (θi , 1), i = 1, . . . , n are independent, with null hypothesis H0 : θi = θ ∀i and alternative hypothesis H1 : θi arbitrary.
Show that 2r(x) = Σ_{i=1}^n (xi − x̄)², and hence (quoting any other theorems you need) that Wald’s theorem again holds exactly.
k
5.5
Problems
1. Suppose that X ∼ Bin(n, p). Under the null hypothesis H0 : p = p0 , what are EX and Var X?
   Show that if n is large and p0 is not too close to 0 or 1, then

       ( X/n − p0 ) / √( p0 (1 − p0 )/n )  ∼  N (0, 1)   approximately.

   Out of 1000 tosses of a given coin, 560 were heads and 440 were tails. Is it reasonable to assume that the coin is fair? Justify your answer.
2. Out of 370 new-born babies at a Hospital, 197 were male and 173 female.
Test the null hypothesis H0 : p < 1/2 versus H1 : p ≥ 1/2, where p denotes the probability that a
baby born at the Hospital will be male.
Discuss any assumptions you make.
3. X is a single observation whose density is given by

       f (x) = (1 + θ) x^θ    if 0 < x < 1,
               0              otherwise.

   Find the most powerful size α test of H0 : θ = 0 against H1 : θ = 1.
   Is there a U.M.P. test of H0 : θ ≤ 0 against H1 : θ > 0? If so, what is it?
4. Suppose X1 , X2 , . . . , Xn ∼IID N (µ, σ²) with null hypothesis H0 : σ² = 1 and alternative H1 : σ² arbitrary. Show that the LRT will reject H0 for large values of the test statistic r(x) = n( v̂ − 1 − log v̂ ), where v̂ = Σ_{i=1}^n (xi − x̄)²/n.
5. Let X1 , . . . , Xn be independent each with density

       f (x) = λ x⁻² e^{−λ/x}    if x > 0,
               0                 otherwise,

   where λ is an unknown parameter.
   (a) Show that the UMP test of H0 : λ = 1/2 against H1 : λ > 1/2 is of the form:
       ‘reject H0 if Σ_{i=1}^n Xi⁻¹ ≤ A∗ ’, where A∗ is chosen to fix the size of the test.
   (b) Find the distribution of Σ_{i=1}^n Xi⁻¹ under the null & alternative hypotheses.
   (c) You observe values 0.59, 0.36, 0.71, 0.86, 0.13, 0.01, 3.17, 1.18, 3.28, 0.49 for X1 , . . . , X10 .
       Test H0 against H1 , & comment on the test in the light of any assumptions made.
6. (a) Define the size and power of a hypothesis test of a simple null hypothesis H0 : θ = θ0 against a
simple alternative hypothesis H1 : θ = θ1 .
(b) State and prove the Neyman-Pearson Lemma for continuous random variables X1 , . . . , Xn when
testing the null hypothesis H0 : θ = θ0 against the alternative H1 : θ = θ1 .
(c) Assume that a particular bus service runs at regular intervals of θ minutes, but that you do not know θ. Assume also that the times you find you have to wait for a bus on n occasions, X1 , . . . , Xn , are independent and identically distributed with density

        f (x|θ) = θ⁻¹    if 0 ≤ x ≤ θ,
                  0      otherwise.
i. Discuss briefly when the above assumptions would be reasonable in practice.
ii. Find the likelihood L(θ; x) for θ given the data (X1 , . . . , Xn ) = x = (x1 , . . . , xn ).
iii. Find the most powerful test of size α of the hypothesis H0 : θ = θ0 = 20 against the
alternative H1 : θ = θ1 > 20.
From Warwick ST217 exam 1997
7. The following problem is quoted verbatim from Osborn (1979), ‘Statistical Exercises in Medical
Research’ :
A study of immunoglobulin levels in mycetoma patients in the Sudan involved 22 patients to be
compared to 22 normal individuals. The levels of IgG recorded for the 22 mycetoma patients are
shown below. The mean level for the normal individuals was calculated to be 1,477 mg/100ml before
the data for this group was lost overboard from a punt on the river Nile. Use the data below to
estimate the within group variance and hence perform a ‘t’ test to investigate the significance of the
difference between the mean levels of IgG in mycetoma patients and normals.
   IgG levels (mg/100ml) in 22 mycetoma patients:
   1,047  1,377  1,210  1,103  1,270  1,135  1,375  1,067    907  1,230  1,350
     804  1,032    960  1,122  1,062  1,002    960  1,345  1,204  1,053    936
Osborn (1979) 4.6.16
8. Let X1 , X2 , . . . , Xn ∼IID Exp(θ), i.e. f (x|θ) = θ e^{−θx} for θ ∈ (0, ∞).
   Show that a likelihood ratio test for H0 : θ ≤ θ0 versus H1 : θ > θ0 has the form:

       ‘Reject H0 iff θ0 x̄ < k, where k is given by α = ∫_0^{nk} [1/Γ(n)] z^{n−1} e^{−z} dz’.

   Show that a test of this form is UMP for testing H0 : θ = θ0 versus H1 : θ > θ0 .
9. (a) Define the size and power function of a hypothesis test procedure.
(b) State and prove the Neyman-Pearson lemma in the case of a test statistic that has a continuous
distribution.
(c) Let X1 , X2 , . . . , Xn ∼IID N (µ, σ²), where σ² is known. Find the likelihood ratio

        fX (x|µ1 ) / fX (x|µ0 )

    and hence show that the most powerful test of size α for testing the null hypothesis H0 : µ = µ0 against the alternative H1 : µ = µ1 , for some µ1 < µ0 , has the form:

        ‘Reject H0 if X̄ < µ0 + σ Φ⁻¹(α)/√n ’,

    where X̄ = Σ_{i=1}^n Xi /n is the sample mean, and Φ⁻¹(α) is the 100α% point of the standard Normal N (0, 1) distribution.
(d) Define a uniformly most powerful (UMP) test, and show that the above test is UMP for testing
H0 : µ = µ0 against H1 : µ < µ0 .
(e) What is the UMP test of H0 : µ = µ0 against H1 : µ > µ0 ?
(f) Deduce that no UMP test of size α exists for testing H0 : µ = µ0 against H1 : µ ≠ µ0 .
(g) What test would you choose to test H0 : µ = µ0 against H1 : µ ≠ µ0 , and why?
From Warwick ST217 exam 1999
10. A group of clinicians wish to study survival after heart attack, by classifying new heart attack patients
according to
(a) whether they survive at least 7 days after admission, and
(b) whether they currently smoke 10 or more cigarettes per day.
From previous experience, the clinicians predict that after N days the observed counts

                    Survive   Die
    Smoker            R1       R2
    Non-smoker        R3       R4

will follow independent Poisson distributions with means

                    Survive   Die
    Smoker           N r1     N r2
    Non-smoker       N r3     N r4

The clinicians intend to estimate the population log-odds ratio ℓ = log(r1 r4 /r2 r3 ) by the sample value L = log(R1 R4 /R2 R3 ), and they wish to choose N to give a probability 1 − β of being able to reject the hypothesis H0 : ℓ = 0 at the 100α% significance level, when the true value of ℓ is ℓ0 > 0.
Using the formula Var f (X) ≈ f′(EX)² Var(X), show that L has approximate variance

    1/(N r1 ) + 1/(N r2 ) + 1/(N r3 ) + 1/(N r4 ),

and hence, assuming a Normal approximation to the distribution of L, that the required number of days is roughly

    N = (1/ℓ0²) ( 1/r1 + 1/r2 + 1/r3 + 1/r4 ) ( Φ⁻¹(α/2) + Φ⁻¹(β) )²,

where Φ is the standard Normal cumulative distribution function.
Comment critically on the clinicians’ method for choosing N .
From Warwick ST332 exam 1988
11. (a) Define the size and power of a hypothesis test, and explain what is meant by a simple likelihood
ratio test and by a uniformly most powerful test.
(b) Let X1 , X2 , . . . , Xn be independent random variables, each having a Poisson distribution with
mean λ. Find the likelihood ratio test for testing H0 : λ = λ0 against H1 : λ = λ1 , where
λ1 > λ 0 .
Show also that this test is uniformly most powerful.
(c) Twenty-five leaves were selected at random from each of six similar apple trees. The number of
adult female European red mites on each was counted, with the following results:
        No. of mites    0    1    2    3    4    5    6    7
        Frequency      70   38   17   10    9    3    2    1
Assuming that the number of mites per leaf follow IID Poisson distributions, and using a Normal
approximation to the Poisson distribution, carry out a test of size 0.05 of the null hypothesis
H0 that the mean number of mites per leaf is 1.0, against the alternative H1 that it is greater
than 1.0.
Discuss briefly whether the assumptions you have made in testing H0 appear reasonable here.
From Warwick ST217 exam 2000
12. Hypothesis test procedures can be inverted to produce confidence intervals or more generally confidence regions. Thus, given a size α test of the null hypothesis H0 : θ = θ0 , the set of all values θ0
that would NOT be rejected forms a ‘100(1 − α)% confidence interval for θ’.
An amateur statistician argues as follows:
Suppose something starts at time t0 and ends at time t1 . Then at time t ∈ (t0 , t1 ), the ratio r of its remaining lifetime (t1 − t) to its current age (t − t0 ), i.e.

    r(t) = (t1 − t) / (t − t0 ),

is clearly a monotonic decreasing function of t. Also it is easy to check that r = 39 after (1/40)th of the total lifetime, and that r = 1/39 after (39/40)th of the total lifetime.
Therefore, for 95% of something’s existence, its remaining lifetime lies in the interval

    ( (t − t0 )/39, 39(t − t0 ) ),

where t is the time under consideration, and t0 is the time the thing came into existence.
The statistician is also an amateur theologian, and firmly believes that the World came into existence
6006 years ago. Using his pet procedure outlined above, he says he is ‘95% confident that the World
will end sometime between 154 years hence, and 234234 years hence’.
His friend, also an amateur statistician, says she has an even more general procedure to produce
confidence intervals:
In any situation I simply roll an icosahedral (fair 20-sided) die. If the die shows ‘13’ then I
quote the empty set ∅ as a 95% confidence interval, otherwise I quote the whole real line R.
She rolls the die, which comes up 13. She therefore says she is ‘95% confident that the World ended
before it even began (although presumably no-one has noticed yet).’
Discuss.
5.6 The Multinomial Distribution and χ² Tests

5.6.1 Multinomial Data
Definition 5.15 (Multinomial Distribution)
The multinomial distribution Mn(n, θ) is a probability distribution on points y = (y1 , y2 , . . . , yk ), where yi ∈ {0, 1, 2, . . .}, i = 1, 2, . . . , k, and Σ_{i=1}^k yi = n, with PMF

    f (y1 , y2 , . . . , yk ) = [ n! / (y1 ! y2 ! · · · yk !) ] Π_{i=1}^k θi^{yi} ,            (5.5)

where θi > 0 for i = 1, . . . , k, and Σ_{i=1}^k θi = 1.
Comments
1. The multinomial distribution arises when one has n independent observations, each classified in one
of k ways (e.g. ‘eye colour’ classified as ‘Brown’, ‘Blue’ or ‘Other’; here k = 3).
Let θi denote the probability that any given observation lies in category number i, and let Yi denote
the number of observations falling in category i. Then the random vector Y = (Y1 , Y2 , . . . , Yk ) has a
Mn(n, θ) distribution.
2. A binomial distribution is the special case k = 2, and is usually parametrised by p = θ1 (so θ2 = 1−p).
Exercise 5.9
By partial differentiation of the likelihood function, show that the MLEs θ̂i of the parameters θi of the Mn(n, θ) satisfy the equations

    yi /θ̂i − yk /( 1 − Σ_{j=1}^{k−1} θ̂j ) = 0,        (i = 1, . . . , k − 1)

and hence that θ̂i = yi /n for i = 1, . . . , k.
k
5.6.2
Chi-Squared Tests
Suppose one wishes to test the null hypothesis H0 that, in the multinomial distribution 5.5, θ is some
function θ(φ) of another parameter φ. The alternative hypothesis H1 is that θ is arbitrary.
Exercise 5.10
Suppose H0 is that X1 , X2 , . . . , Xn ∼IID Bin(3, φ). Let Yi (for i = 1, 2, 3, 4) denote the number of observations Xj taking value i − 1. What is the null distribution of Y = (Y1 , Y2 , Y3 , Y4 )?
k
The log likelihood ratio test statistic r(X) is given by

    r(X) = Σ_{i=1}^k Yi log θ̂i − Σ_{i=1}^k Yi log θi (φ̂),                     (5.6)

where θ̂i = yi /n for i = 1, . . . , k.
By Wald’s theorem, under H0 , 2r(X) has approximately a χ² distribution:

    2 Σ_{i=1}^k Yi [ log θ̂i − log θi (φ̂) ]  ∼  χ²_{k1 − k0} ,                  (5.7)

where
    θ̂i = Yi /n,
    k0 is the dimension of the parameter φ, and
    k1 = k − 1 is the dimension of θ under the constraint Σ_{i=1}^k θi = 1.
Comments
1. In Exercise 5.10, k = 4, k0 = 1, k1 = 3, and φ̂ = X̄/3, where X̄ = Σ_{i=1}^4 (i − 1)Yi /n is the sample mean of the Xj .
   We would reject H0 , that the sample comes from a Bin(3, φ) distribution for some φ, if 2r(x) is greater than the 95% point of the χ²_2 distribution, where r(x) is given in Formula 5.6.
2. It is straightforward to check, using a Taylor series expansion of the log function, that provided EYi is large ∀ i,

       2 Σ_{i=1}^k Yi [ log θ̂i − log θi (φ̂) ]  ≈  Σ_{i=1}^k (Yi − µi )² / µi ,         (5.8)

   where µi = n θi (φ̂) is the expected number of individuals (under H0 ) in the ith category.
Definition 5.16 (Chi-squared Goodness of Fit Statistic)

    X² = Σ_{i=1}^k (oi − ei )² / ei ,                                          (5.9)

where oi is the observed count in the ith category and ei is the corresponding expected count under the null hypothesis, is called the χ² goodness-of-fit statistic.
Comments
1. Under H0 , X 2 has approximately a χ2 distribution with number of degrees of freedom being (number
of categories) - 1 - (number of parameters estimated under H0 ).
This approximation works well provided all the expected counts are reasonably large (say all are at
least 5).
2. This χ2 test was suggested by Karl Pearson before the theory of hypothesis testing was fully developed.
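A minimal Python sketch (an illustration only, not from the notes) of the X² statistic of Definition 5.16; 'observed' holds arbitrary illustrative counts and 'expected' the corresponding ei under whatever null hypothesis is being entertained.

    import numpy as np
    from scipy import stats

    observed = np.array([18, 22, 31, 29])                  # illustrative oi only
    expected = np.full(4, observed.sum() / 4)              # ei under, say, equal cell probabilities
    X2 = ((observed - expected)**2 / expected).sum()
    df = len(observed) - 1 - 0                             # categories - 1 - parameters estimated
    print(X2, stats.chi2.sf(X2, df))
    # scipy.stats.chisquare(observed, expected) returns the same X2 when no
    # parameters have been estimated from the data.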
5.7
Problems
1. In a genetic experiment, peas were classified according to their shape (‘round’ or ‘angular’) and
colour (‘yellow’ or ‘green’). Out of 556 peas, 315 were round+yellow, 108 were round+green, 101
were angular+yellow and 32 were angular+green.
Test the null hypothesis that the probabilities of these four types are 9/16, 3/16, 3/16 and 1/16
respectively.
2. A sample of 300 people was selected from a population, and classified into blood type (O/A/B/AB, and Rhesus positive/negative), as shown in the following table:

                      O    A    B   AB
     Rh positive     82   89   54   19
     Rh negative     13   27    7    9
The null hypothesis H0 is that being Rhesus negative is independent of whether an individual’s blood
group is O, A, B or AB. Estimate the probabilities under H0 of falling into each of the 8 categories,
and hence test the hypothesis H0 .
3. The random variables X1 , X2 , . . . , Xn are IID with Pr(Xi = j) = pj for j = 1, 2, 3, 4, where Σ pj = 1 and pj > 0 for each j = 1, 2, 3, 4.
Interest centres on the hypothesis H0 that p1 = p2 and simultaneously p3 = p4 .
(a) Define the following terms
i. a hypothesis test,
ii. simple and composite hypotheses, and
iii. a likelihood ratio test.
(b) Letting θ = (p1 , p2 , p3 , p4 ), X = (X1 , . . . , Xn )T with observed values x = (x1 , . . . , xn )T , and
letting yj denote the number of x1 , x2 , . . . , xn equal to j, what is the likelihood L(θ|x)?
(c) Assume the usual regularity conditions, i.e. that the distribution of −2 log L(θ|x) tends to χ2ν
as the sample size n → ∞. What are the dimension of the parameter space Ωθ and the number
of degrees of freedom ν of the asymptotic chi-squared distribution?
(d) By partial differentiation of the log-likelihood, or otherwise, show that the maximum likelihood
estimator of pj is yj /n.
(e) Hence show that the asymptotic test statistic of H0 : p1 = p2 and p3 = p4 is

        −2 log L(x) = 2 Σ_{j=1}^4 yj log(yj /mj ),

    where m1 = m2 = (y1 + y2 )/2 and m3 = m4 = (y3 + y4 )/2.
(f) In a hospital casualty unit, the numbers of limb fractures seen over a certain period of time are:

                   Side
                Left   Right
        Arm      46     49
        Leg      22     32
Using the test developed above, test the hypothesis that limb fractures are equally likely to
occur on the right side as on the left side.
Discuss briefly whether the assumptions underlying the test appear reasonable here.
From Warwick ST217 exam 1998
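For concreteness, the statistic derived in part (e) can be evaluated for the fracture counts of part (f) with a few lines of Python (a sketch, not a model answer); the cell ordering y1 , y2 = arm left/right and y3 , y4 = leg left/right is an assumption about how the table maps onto p1 , . . . , p4 .

    import numpy as np
    from scipy import stats

    y = np.array([46, 49, 22, 32], float)                  # arm L, arm R, leg L, leg R
    m = np.array([(y[0] + y[1]) / 2, (y[0] + y[1]) / 2,
                  (y[2] + y[3]) / 2, (y[2] + y[3]) / 2])   # fitted counts under H0
    stat = 2 * (y * np.log(y / m)).sum()
    print(stat, stats.chi2.sf(stat, df=2))                 # nu = 3 - 1 = 2 degrees of freedom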
Prudens quaestio dimidium scientiae.
Half of science is asking the right questions.
Roger Bacon
We all learn by experience, and your lesson this time is that you should never lose sight of the
alternative.
Sir Arthur Conan Doyle
One forms provisional theories and then waits for time or fuller knowledge to explode them.
Sir Arthur Conan Doyle
What used to be called prejudice is now called a null hypothesis.
A. W. F. Edwards
The conventional view serves to protect us from the painful job of thinking.
John Kenneth Galbraith
Science must begin with myths, and with the criticism of myths.
Sir Karl Raimund Popper
Chapter 6
Linear Statistical Models
6.1
Introduction
Definition 6.1 (Response Variable)
A response variable is a random variable Y whose value we wish to predict.
Definition 6.2 (Explanatory Variable)
An explanatory variable is a random variable X whose values can be used to predict Y .
Definition 6.3 (Linear Model)
A linear model is a prediction function for Y in terms of the values x1 , x2 , . . . , xk of X1 , X2 , . . . , Xk
of the form
E[Y |x1 , x2 , . . . , xk ] = β0 + β1 x1 + β2 x2 + · · · + βk xk
(6.1)
Thus if Y1 , Y2 , . . . , Yn are the responses for cases 1, 2, . . . , n, and xij is the value of Xj (j = 1, . . . , k) for
case i, then
E[Y|X] = Xβ
(6.2)
where

    Y = (Y1 , Y2 , . . . , Yn )ᵀ   is the vector of responses,

    X = (xij ), with xi0 = 1 for i = 1, . . . , n,   is the matrix of explanatory variables, and

    β = (β0 , β1 , . . . , βk )ᵀ   is the (unknown) parameter vector.
Examples
Consider the captopril data (page 44), and let

    X1 = Diastolic BP before treatment,      X2 = Systolic BP before treatment,
    X3 = Diastolic BP after treatment,       X4 = Systolic BP after treatment,
    Z1 = 2X1 + X2 ,                          Z2 = 2X3 + X4 .
Some possible linear models of interest are:
1. Response Y = X4 ,
(a) explanatory variable X2 (this is a ‘simple linear regression model ’, with just 1 explanatory
variable),
(b) explanatory variable X3
(c) explanatory variables X1 and X2 (a ‘multiple regression model ’).
2. Response Y = Z2 ,
(a) explanatory variable Z1
(b) explanatory variables Z1 and Z1² (a ‘quadratic regression model ’).
Note how new explanatory variables may be obtained by transforming and/or combining old ones.
3. Looking just at the interrelationship between SBP and DBP at a given time:
(a) response Y = X2 , explanatory variable X1 ,
(b) response Y = X1 , explanatory variable X2 ,
(c) response Y = X4 , explanatory variable X3 , etc.
Comments
1. A linear relationship is the simplest possible relationship between response variables and explanatory
variables, so linear models are easy to understand, interpret and also to check for plausibility.
2. One can (in theory) approximate an arbitrarily complicated relationship by a linear model, for example quadratic regression can obviously be extended to ‘polynomial regression’
E[Y |x] = β0 + β1 x + β2 x2 + · · · + βm xm .
3. Linear models have nice links with
• geometry,
• linear algebra,
• conditional expectations and variances,
• the Normal distribution.
4. Distributional assumptions (if any!) will typically be made ONLY about the response variable Y ,
NOT about the explanatory variables.
Therefore the model makes sense even if the Xi s are chosen nonrandomly (‘designed experiments’).
5. The response variable Y is sometimes called the ‘dependent variable’, and the explanatory variables
are sometimes called ‘predictor variables’, ‘regressor variables’, or (very misleadingly) ‘independent
variables’.
6.2
Simple Linear Regression
Definition 6.4
A simple linear regression model is a linear model with one response variable Y and one explanatory
variable X, i.e. a model of the form
E[Y |x1 ] = β0 + β1 x1 .
(6.3)
Typically in practice we have n data points (xi , yi ) for i = 1, . . . , n, and we want to predict a future
response Y from the corresponding observed value x of X.
Often there’s a natural candidate for which variable should be treated as the response:
1. X may precede Y in time, for example
(a) X is BP before treatment and Y is BP after treatment, or
(b) X is number of hours revision and Y is exam mark;
2. X may be in some way more fundamental, for example
(a) X is age and Y is height or
(b) X is height and Y is weight;
3. X may be easier or cheaper to observe, so we hope in future to estimate Y without measuring it.
In simple linear regression we don’t know β0 or β1 , but need to estimate them in order to predict Y by Ŷ = β̂0 + β̂1 x.
To make accurate predictions we require the prediction error

    Y − Ŷ = Y − (β̂0 + β̂1 x)

to be small.
This suggests that, given data (xi , yi ) for i = 1, . . . , n, we should fit β̂0 and β̂1 by simultaneously making all the vertical deviations of the observed data points from the fitted line y = β̂0 + β̂1 x small.
The easiest way to do this is to minimise the sum of squared deviations Σ (yi − ŷi )², i.e. to use the ‘least squares’ criterion.
6.3
Method of Least Squares
For simple linear regression,

    ŷi = β0 + β1 xi      (i = 1, . . . , n).                                   (6.4)

Therefore to estimate β0 and β1 by least squares, we need to minimise

    Q = Σ_{i=1}^n [ yi − (β0 + β1 xi ) ]² .                                    (6.5)

Exercise 6.1
Show that Q in equation 6.5 is minimised at values β0 and β1 satisfying the simultaneous equations

    β0 n      + β1 Σ xi   = Σ yi ,
    β0 Σ xi   + β1 Σ xi²  = Σ xi yi ,                                          (6.6)

and hence that

    β̂1 = ( Σ xi yi − n x̄ ȳ ) / ( Σ xi² − n x̄² ),                              (6.7)
    β̂0 = ȳ − β̂1 x̄ .                                                           (6.8)

k
Comments
1. Forming ∂²Q/∂β0², ∂²Q/∂β1² and ∂²Q/∂β0 ∂β1 verifies that Q is minimised at β = β̂.
2. Equations 6.6 are called the ‘normal equations’ for β0 and β1 (‘normal’ as in ‘perpendicular’ rather than as in ‘standard’ or as in ‘Normal distribution’).
3. y = β̂0 + β̂1 x is called the ‘least squares fit’ to the data.
4. From equations 6.7 and 6.8, the least squares fitted line passes through (x̄, ȳ), the centroid of the data points.
5. Concentrate on understanding and remembering the method for finding β̂, rather than on memorising the formulae 6.7 and 6.8 for β̂0 and β̂1 .
6. Geometrical interpretation
   We have a vector y = (y1 , y2 , . . . , yn )ᵀ of observed responses, i.e. a point in n-dimensional space, together with a surface S representing possible joint predicted values under the model (for simple linear regression, it’s the 2-dimensional surface β0 + β1 x for real values of β0 and β1 ).
   Minimising Σ (yi − ŷi )² is equivalent to dropping a perpendicular from the point y to the surface S; the perpendicular hits the surface at ŷ. Thus we are literally finding the model closest to the data.
6.4
Problems
1. Show that the expression Σ xi yi − n x̄ ȳ occurring in the formula for β̂1 could also be written as Σ (xi − x̄)(yi − ȳ), Σ (xi − x̄)yi , or Σ xi (yi − ȳ).
2. Show that the ‘residual sum of squares’, Σ_{i=1}^n (yi − ŷi )², satisfies the following identity:

       Σ_{i=1}^n (yi − ŷi )² = Σ_{i=1}^n (yi − β̂0 − β̂1 xi )² = Σ_{i=1}^n (yi − ȳ)² − β̂1 Σ_{i=1}^n (xi − x̄)(yi − ȳ).
3. For the captopril data, find the least squares lines
(a) to predict SBP before captopril from DBP before captopril,
(b) to predict SBP after captopril from DBP after captopril,
(c) to predict DBP before captopril from SBP before captopril.
Compare these three lines.
Discuss whether it is sensible to combine the before and after measurements in order to obtain a
better prediction of SBP at a given time from DBP measured at that time.
4. Illustrate the geometrical interpretation of least squares (see above comments) in the following two
cases
(a) model E[Y |x] = β0 + β1 x with 3 data points (x1 , y1 ), (x2 , y2 ) and (x3 , y3 ),
(b) model E[Y |x] = βx with 2 data points (x1 , y1 ) and (x2 , y2 ).
What does Pythagoras’ theorem tell us in the second case?
6.5
The Normal Linear Model (NLM)
6.5.1
Introduction
Definition 6.5 (NLM)
Given n response RVs Yi (i = 1, 2, . . . , n), with corresponding values of explanatory variables xiᵀ, the NLM makes the following assumptions:
1. (Conditional) Independence
   The Yi are mutually independent given the xiᵀ.
2. Linearity
   The expected value of the response variable is linearly related to the unknown parameters β: EYi = xiᵀβ.
3. Normality
   The random variation Yi |xi is Normally distributed.
4. Homoscedasticity (Equal Variances)
   i.e. Yi |xi ∼ N(xiᵀβ, σ²).
6.5.2
Matrix Formulation of NLM
The NLM for responses y = (y1 , y2 , . . . , yn )ᵀ can be recast as follows:
1. E[Y] = Xβ for some parameter vector β = (β1 , β2 , . . . , βp )ᵀ,
2. ε = Y − E[Y] ∼ MVN(0, σ²I), where I is the (n × n) identity matrix.
It can be shown that the least squares estimates of β are given by solving the simultaneous linear equations

    Xᵀy = XᵀXβ                                                                 (6.9)

(the normal equations), with solution (assuming that XᵀX is nonsingular)

    β̂ = (XᵀX)⁻¹ Xᵀy.                                                           (6.10)
Comments
1. Note that, by formula 6.10, each estimator β̂j is a linear combination of the Yi s.
   Therefore under the NLM, β̂ has a MVN distribution.
2. Even if the Normality assumption doesn’t hold, the CLT implies that, provided the number n of cases is large, the distribution of the estimator β̂ will still be approximately MVN.
3. The most important assumption is independence, since it’s relatively easy to modify the standard NLM to account for
   • nonlinearity: transform the data, or include e.g. xij² as an explanatory variable,
   • unequal variances (‘heteroscedasticity’): e.g. transform from yi − ŷi to zi = (yi − ŷi )/σ̂i ,
   • non-Normality: transform, or simply get more data!
4. In the general formulation the constant term β0 is omitted, though in practice the first column of the matrix X will often contain 1’s and the corresponding parameter β1 will be the ‘constant term’.
5. The corresponding fitted values are ŷ = Xβ̂, and the vector of residuals is r = y − ŷ, i.e. ri = yi − ŷi , where ŷi = xiᵀβ̂ = Σ_{j=1}^p xij β̂j .
Definition 6.6 (RSS)
The residual sum of squares (RSS) in the fitted NLM is

    s² = (y − Xβ̂)ᵀ(y − Xβ̂) = Σ_{i=1}^n (yi − ŷi )² .                          (6.11)
Important Fact about the RSS
Considering the RSS s² to be the observed value of a corresponding RV S², it can be shown that
• S²/σ² ∼ χ²_{n−p} ,
• S² is independent of β̂.
Exercise 6.2
1. Show that the log-likelihood function for the NLM is

       (constant) − (n/2) log(σ²) − [1/(2σ²)] (y − Xβ)ᵀ(y − Xβ).               (6.12)

2. Show that the maximum likelihood estimate of β is identical to the least squares estimate.
   What is the distribution of β̂?
3. Show that the MLE σ̂² of σ² is

       σ̂² = s²/n .                                                             (6.13)

   What are the mean and variance of σ̂²?
4. Show that an unbiased estimator of σ² is given by the formula

       (Residual Sum of Squares) / (Residual Degrees of Freedom).

k

6.5.3 Examples of the NLM
1. Simple Linear Regression (again)

       Yi = β0 + β1 xi + εi ,                                                  (6.14)

   where εi ∼IID N (0, σ²).
2. Two-sample t-test

       y = (x1 , x2 , . . . , xm , y1 , y2 , . . . , yn )ᵀ,

       X is the (m + n) × 2 matrix whose first m rows are (1, 0) and whose last n rows are (0, 1),

       β = (β0 , β1 )ᵀ,                                                        (6.15)

   and we’re interested in the hypothesis H0 : (β0 − β1 ) = 0.
3. Paired t-test
   Some quantity Y is measured on each of n individuals under 2 different conditions (e.g. drugs A and B), and we want to test whether the mean of Y can be assumed equal in both circumstances.

       y = (y11 , y21 , . . . , yn1 , y12 , y22 , . . . , yn2 )ᵀ,

       X is the 2n × (n + 1) matrix formed by stacking the n × n identity matrix twice, with a final column consisting of n 0’s followed by n 1’s,

       β = (α1 , α2 , . . . , αn , δ)ᵀ,                                         (6.16)

   where δ is the difference between the expected responses under the two conditions, and the αi are ‘nuisance parameters’ representing the overall level of response for the ith individual.
   The null hypothesis is H0 : δ = 0.
4. Multiple Regression (example thereof)
   Y = SBP after captopril, x1 = SBP before captopril, x2 = DBP before captopril:

         y     x1    x2
        201   210   130
        165   169   122
        166   187   124
        157   160   104
        147   167   112
        145   176   101
        168   185   121
        180   206   124
        147   173   115
        136   146   102
        151   174    98
        168   201   119
        179   198   106
        129   148   107
        131   154   100

   Here X is the 15 × 3 matrix whose ith row is (1, xi1 , xi2 ), y is the vector of 15 responses, and β = (β0 , β1 , β2 )ᵀ,      (6.17)

   where (roughly speaking) β1 represents the increase in EY per unit increase in SBP before captopril (x1 ), allowing for the fact that EY also depends partly on DBP before captopril (x2 ), and β2 has a similar interpretation in terms of the effect of x2 allowing for x1 .
In all the above examples, it’s straightforward to calculate β̂ = (XᵀX)⁻¹Xᵀy, and also (for example) to calculate the sampling distribution of β̂i under the null hypothesis H0 : βi = 0.
Exercise 6.3
Verify the following calculations from the data given in 6.17 above:

    XᵀX = [   15      2654     1685 ]            Xᵀy = [   2370 ]
          [ 2654    475502   300137 ]                  [ 424523 ]
          [ 1685    300137   190817 ]                  [ 268373 ]

    (XᵀX)⁻¹ = [  8.563      −0.009165   −0.06120   ]       β̂ = [ −20.7  ]
              [ −0.009165    0.0003026  −0.0003951 ]            [  0.724 ]
              [ −0.06120    −0.0003951   0.001167  ]            [  0.450 ]

k
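A quick numerical check of Exercise 6.3 (a sketch, not part of the notes): entering the three columns from 6.17 and solving the normal equations with numpy should reproduce XᵀX, Xᵀy and β̂ above, up to rounding.

    import numpy as np

    y  = np.array([201, 165, 166, 157, 147, 145, 168, 180, 147, 136, 151, 168, 179, 129, 131], float)
    x1 = np.array([210, 169, 187, 160, 167, 176, 185, 206, 173, 146, 174, 201, 198, 148, 154], float)
    x2 = np.array([130, 122, 124, 104, 112, 101, 121, 124, 115, 102, 98, 119, 106, 107, 100], float)
    X = np.column_stack([np.ones(len(y)), x1, x2])
    XtX, Xty = X.T @ X, X.T @ y
    beta = np.linalg.solve(XtX, Xty)
    print(XtX)
    print(Xty)
    print(beta)        # approximately (-20.7, 0.724, 0.450)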
6.6
Checking Assumptions of the NLM
Clearly it’s very important in practice to check that your assumptions seem reasonable; there are various ways to do this.

6.6.1 Formal hypothesis testing
χ² tests are not very powerful, but are simple and general: count the number of data points satisfying various (exhaustive & mutually exclusive) conditions, and compare with the expected counts under your assumptions.
Other tests, for example to test for Normality, have been devised. However, a general problem with
statistical tests is that they don’t usually suggest what to do if your null hypothesis is rejected.
Exercise 6.4
How might you use a χ2 test to check whether SBP after captopril is independent of SBP before captopril?
k
Exercise 6.5
A possible test for linearity in the simple Normal linear regression model (i.e. the NLM with just one
explanatory variable x) is to fit the quadratic NLM
EY = β0 + β1 x + β2 x2
(6.18)
and test the null hypothesis H0 : β2 = 0.
Suppose that Y is SBP and x is dose of drug, and that you have rejected the above null hypothesis.
Comment on the advisability of using Formula 6.18 for predicting Y given x.
k
6.6.2
Graphical Methods and Residuals
If all the assumptions of the NLM are valid, then the residuals

    ri = yi − ŷi = yi − xiᵀβ̂                                                   (6.19)

should resemble observations on IID Normal random variables.
Therefore plots of ri against ANYTHING should be patternless.
SEE LECTURE
Comments
1. Before fitting a formal statistical model (including e.g. performing a t-test), you should plot the data,
particularly the response variable against each explanatory variable.
2. After fitting a model, produce several residual plots. The computer is your friend!
3. Note that it’s the residual plots that are most informative. For example, the NLM DOESN’T assume
that the Yi are Normally distributed about µY , but DOES assume that each Yi is Normally distributed
about EYi |xi .
i.e. it’s the conditional distributions, not the marginal distributions, that are important.
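As a rough sketch (not from the notes) of the sort of residual plots meant here, the following Python function plots residuals against fitted values and case number, and adds a Normal quantile plot; X and y stand for any fitted NLM’s design matrix and response vector.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    def residual_plots(X, y):
        beta = np.linalg.lstsq(X, y, rcond=None)[0]    # least squares fit
        r = y - X @ beta                               # residuals
        fig, axes = plt.subplots(1, 3, figsize=(12, 4))
        axes[0].scatter(X @ beta, r)
        axes[0].set_xlabel('fitted value'); axes[0].set_ylabel('residual')
        axes[1].scatter(np.arange(len(r)), r)
        axes[1].set_xlabel('case number'); axes[1].set_ylabel('residual')
        stats.probplot(r, dist='norm', plot=axes[2])   # Normal Q-Q plot
        plt.show()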
6.7
Problems
1. Show that the following is an equivalent formulation of the two-sample t-test to that given above in Formulae 6.15:

       Y = (x1 , x2 , . . . , xm , y1 , . . . , yn )ᵀ,

       X is the (m + n) × 2 matrix whose first m rows are (1, 0) and whose last n rows are (1, 1),

       β = (β0 , β1 )ᵀ,                                                        (6.20)

   with null hypothesis H0 : β1 = 0.
2. Independent samples of 10 U.S. men aged 25–34 years, and 15 U.S. men aged 45–54 years were taken.
Their heights (in inches) were as follows:
(a) Age 25–34
73.3 64.8 72.1 68.9 68.7 70.4 66.8 70.7 74.4 71.8
(b) Age 45–54
73.2 68.5 62.4 65.5 71.3 69.5 74.5 70.6 69.3 67.1 64.7 73.0 66.7 68.1 64.3
Use a two-sample t-test to test the hypothesis that the population means of the two age-groups are
equal (the 90%, 95%, 97.5%, and 99% points of the t23 distribution are 1.319, 1.714, 2.069 and 2.500
respectively).
Comment on whether the underlying assumptions of the two-sample t-test appear reasonable for this
set of data.
Comment also on whether the data can be used to suggest that the population of the U.S. has (or
hasn’t) tended to get taller over the last 20 years.
3. Verify that the least squares estimates in simple linear regression

       β̂1 = ( Σ xi yi − n x̄ ȳ ) / ( Σ xi² − n x̄² ),        β̂0 = ȳ − β̂1 x̄ ,

   are a special case of the general formula β̂ = (XᵀX)⁻¹Xᵀy.
4. The following data-set shows average January minimum temperature in degrees Fahrenheit (y), together with Latitude (x1 ) and Longitude (x2 ) for 28 US cities. Plot y against x1 , and comment on
what this plot suggests about the reasonableness of the various assumptions underlying the NLM for
predicting y from x1 and x2 .
      y    x1     x2        y    x1     x2        y    x1     x2
     44   31.2   88.5      38   32.9   86.8      35   33.6  112.5
     31   35.4   92.8      47   34.3  118.7      42   38.4  123.0
     15   40.7  105.3      22   41.7   73.4      26   40.5   76.3
     30   39.7   77.5      45   31.0   82.3      65   25.0   82.0
     58   26.3   80.7      37   33.9   85.0      22   43.7  117.1
     19   42.3   88.0      21   39.8   86.9      11   41.8   93.6
     22   38.1   97.6      27   39.0   86.5      45   30.8   90.2
     12   44.2   70.5      25   39.7   77.3      23   42.7   71.4
     21   43.1   83.9       2   45.9   93.9      24   39.3   90.5
      8   47.1  112.4
Data from HSDS, set 262
5. (a) Assuming the model

        E[Y |x] = β0 + β1 x,       Var[Y |x] = σ²  independently of x,

    derive formulae for the least squares estimates β̂0 and β̂1 from data (xi , yi ), i = 1, . . . , n.
What advantages are gained if the corresponding random variables Yi |xi can be assumed to be
independently Normally distributed?
(b) The following table shows the tensile strength (y) of different batches of cement after being
‘cured’ (dried) for various lengths of time x: 3 batches were cured for 1 day, 3 for 2 days, 5 for
3 days, etc. The batch means and standard deviations (s.d.) are also given.
Curing time
Tensile strength
2
(kg/cm ) y
(days) x
1
2
3
7
28
13.0
21.9
29.8
32.4
41.8
13.3
24.5
28.0
30.4
42.6
11.8
24.7
24.1
34.5
40.3
24.1
33.1
35.7
26.2
35.7
37.3
mean
s.d.
12.7
23.7
26.5
33.2
40.0
0.8
1.6
2.5
2.0
3.0
Plot y against x and discuss briefly how reasonable seem each of the following assumptions:
(i) linearity: E[Yi |xi ] = β0 + β1 xi for some constants β0 and β1 .
(ii) independence: the Yi are mutually independent given the xi .
If conditional independence (ii) is assumed true, then how reasonable here are the further assumptions:
(iii) homoscedasticity: Var[Yi |xi ] = σ 2 for all i = 1, . . . , n,
(iv) Normality: the random variables Yi are each Normally distributed.
Say briefly whether you consider any of the above assumptions (i)–(iv) would be more plausible
following
(A) transforming from y to y 0 = loge (y), and/or
(B) transforming x in an appropriate way.
NOTE: you do not need to carry out numerical calculations such as finding the
least-squares fit explicitly.
From Warwick ST217 exam 2000
84
6. To monitor an industrial process for converting ammonia to nitric acid, the percentage of ammonia
lost (y) was measured on each of 21 consecutive days, together with explanatory variables representing
air flow (x1 ), cooling water temperature (x2 ) and acid concentration (x3 ). The data, together with
the residuals after fitting the model yb = 3.614 + 0.072 x1 + 0.130 x2 − 0.152 x3 , are given in the
following table:
Day
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
y
Air
Flow
(x1 )
Water
Temp.
(x2 )
Acid
Conc.
(x3 )
Resid.
4.2
3.7
3.7
2.8
1.8
1.8
1.9
2.0
1.5
1.4
1.4
1.3
1.1
1.2
0.8
0.7
0.8
0.8
0.9
1.5
1.5
80
80
75
62
62
62
62
62
58
58
58
58
58
58
50
50
50
50
50
56
70
27
27
25
24
22
23
24
24
23
18
18
17
18
19
18
18
19
19
20
20
20
58.9
58.8
59.0
58.7
58.7
58.7
59.3
59.3
58.7
58.0
58.9
58.8
58.2
59.3
58.9
58.6
57.2
57.9
58.0
58.2
59.1
0.323
−0.192
0.456
0.570
−0.171
−0.301
−0.239
−0.139
−0.314
0.127
0.264
0.278
−0.143
−0.005
0.236
0.091
−0.152
−0.046
−0.060
0.141
−0.724
Some residual plots are shown on the next page (Fig. 6.1).
(a) Discuss whether the pattern of residuals casts doubt on any of the assumptions underlying the
Normal Linear Model (NLM).
Describe any further plots or calculations that you think would help you assess whether the
fitted NLM is appropriate here.
Continued. . .
85
(b) Various suggestions could be made for improving the model, such as
i.
ii.
iii.
iv.
v.
vi.
vii.
viii.
transforming the response (e.g. to log y or to y/x1 ),
transforming some or all of the explanatory variables,
deleting outliers,
including quadratic or even higher-order terms (e.g. x22 ),
including interaction terms (e.g. x1 x3 ),
carrying out a nonparametric analysis of the data,
applying a bootstrap procedure,
fitting a nonlinear model.
Outline the merits and disadvantages of each of these suggestions here. What would be your
next step in analysing this data-set?
Figure 6.1: Residual plots
From Warwick ST217 exam 1999
86
7. Table 6.1, originally from Narula & Wellington (1977), shows data on selling prices of 28 houses
in Erie, Pennsylvania, together with explanatory variables that could be used to predict the selling
price. The variables are:
X1
X2
X3
X4
X5
X6
X7
X8
X9
Y
=
=
=
=
=
=
=
=
=
=
current taxes (local, school and county) ÷ 100,
number of bathrooms,
lot size ÷ 1000 (square feet),
living space ÷ 1000 (square feet),
number of garage spaces,
number of rooms,
number of bedrooms,
age of house (years),
number of fireplaces,
actual sale price ÷ 1000 (dollars).
Find a function of X1 –X9 that predicts Y reasonably accurately (such functions are used to fix
property taxes, which should be based on the current market value of each property).
X1
X2
X3
X4
X5
X6
X7
X8
X9
Y
4.9176
5.0208
4.5429
4.5573
5.0597
3.8910
5.8980
5.6039
15.4202
14.4598
5.8282
5.3003
6.2712
5.9592
5.0500
8.2464
6.6969
7.7841
9.0384
5.9894
7.5422
8.7951
6.0931
8.3607
8.1400
9.1416
12.0000
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
2.5
2.5
1.0
1.0
1.0
1.0
1.0
1.5
1.5
1.5
1.0
1.0
1.5
1.5
1.5
1.5
1.0
1.5
1.5
3.4720
3.5310
2.2750
4.0500
4.4550
4.4550
5.8500
9.5200
9.8000
12.8000
6.4350
4.9883
5.5200
6.6660
5.0000
5.1500
6.9020
7.1020
7.8000
5.5200
4.0000
9.8900
6.7265
9.1500
8.0000
7.3262
5.0000
0.9980
1.5000
1.1750
1.2320
1.1210
0.9880
1.2400
1.5010
3.4200
3.0000
1.2250
1.5520
0.9750
1.1210
1.0200
1.6640
1.4880
1.3760
1.5000
1.2560
1.6900
1.8200
1.6520
1.7770
1.5040
1.8310
1.2000
1.0
2.0
1.0
1.0
1.0
1.0
1.0
0.0
2.0
2.0
2.0
1.0
1.0
2.0
0.0
2.0
1.5
1.0
1.5
2.0
1.0
2.0
1.0
2.0
2.0
1.5
2.0
7
7
6
6
6
6
7
6
10
9
6
6
5
6
5
8
7
6
7
6
6
8
6
8
7
8
6
4
4
3
3
3
3
3
3
5
5
3
3
2
3
2
4
3
3
3
3
3
4
3
4
3
4
3
42
62
40
54
42
56
51
32
42
14
32
30
30
32
46
50
22
17
23
40
22
50
44
48
3
31
30
0
0
0
0
0
0
1
0
1
1
0
0
0
0
1
0
1
0
0
1
0
1
0
1
0
0
1
25.9
29.5
27.9
25.9
29.9
29.9
30.9
28.9
84.9
82.9
35.9
31.5
31.0
30.9
30.0
36.9
41.9
40.5
43.9
37.5
37.9
44.5
37.9
38.9
36.9
45.8
41.0
Table 6.1: House price data
Weisberg (1980)
87
8. The number of ‘hits’ recorded on J.E.H.Shaw’s WWW homepage in late 1999 are given below. ‘Local’
means the homepage was accessed from within Warwick University, ‘Remote’ means it was accessed
from outside. Data for the week beginning 7–Nov–1999 were unavailable. Note that there was an
exam on Wednesday 8–Dec–1999 for the course ST104, taught by J.E.H.Shaw.
Week
Beginning
Number of Hits
Local Remote Total
26 Sept
3 Oct
10 Oct
17 Oct
24 Oct
31 Oct
7 Nov
14 Nov
21 Nov
28 Nov
5 Dec
12 Dec
19 Dec
0
35
901
641
1549
823
—
1136
2114
2097
3732
5
0
182
253
315
443
525
344
—
383
584
536
461
352
296
182
288
1216
1084
2074
1167
—
1519
2698
2633
4193
357
296
(a) Fit a linear least-squares regression line to predict the number of remote hits (Y ) in a week from
the observed number x of local hits.
(b) Calculate the residuals and plot them against date. Does the plot give any evidence that the
interrelationship between X and Y changes over time?
(c) Using both general considerations and residual plots, comment on how reasonable here are the
assumptions underlying the simple Normal linear regression model, and suggest possible ways
to improve the prediction of Y .
9. The following table shows the assets x (billions of dollars) and net income y (millions of dollars) for
the 20 largest US banks in 1973.
Bank
x
y
Bank
x
y
Bank
x
y
Bank
x
y
1
2
3
4
5
49.0
42.3
36.3
16.4
14.9
218.8
265.6
170.9
85.9
88.1
6
7
8
9
10
14.2
13.5
13.4
13.2
11.8
63.6
96.9
60.9
144.2
53.6
11
12
13
14
15
11.6
9.5
9.4
7.5
7.2
42.9
32.4
68.3
48.6
32.2
16
17
18
19
20
6.7
6.0
4.6
3.8
3.4
42.7
28.9
40.7
13.8
22.2
(a) Plot income (y) against assets (x), and also log(income) against log(assets).
(b) Verify that the least squares fit regression lines are
fit 1:
fit 2:
y = 4.987 x + 7.57,
log(y) = 0.963 log(x) + 1.782
(Note: logs to base e),
and show the fitted lines on your plots.
(c) Produce Normal probability plots of the residuals from each fit.
(d) Which (if either) of these models would you use to describe the relationship between total assets
and net income? Why?
(e) Bank number 19 (the Franklin National Bank) failed in 1974, and was the largest ever US bank to
fail. Identify the point representing this bank on each of your plots, and discuss briefly whether,
from the data presented, one might have expected beforehand that the Franklin National Bank
was in trouble.
88
10. The following data show the blood alcohol levels (mg/100ml) at post mortem for traffic accident
victims. Blood samples in each case were taken from the leg (A) and from the heart (B). Do these
results indicate that blood alcohol levels differ systematically between samples from the leg and the
heart?
Case
A
B
Case
A
B
1
2
3
4
5
6
7
8
9
10
44
265
250
153
88
180
35
494
249
204
44
269
256
154
83
185
36
502
249
208
11
12
13
14
15
16
17
18
19
20
265
27
68
230
180
149
286
72
39
272
277
39
84
228
187
155
290
80
50
290
Osborn (1979) 4.6.5
11. (a) Assume the linear model

            E[Y|X] = Xβ,        Var[Y|X] = σ² In ,

        where In denotes the n × n identity matrix, and XᵀX is nonsingular. By writing Y − Xβ = (Y − Xβ̂) + X(β̂ − β), or otherwise, show that for this model, the residual sum of squares

            (Y − Xβ)ᵀ(Y − Xβ)

        is minimised at β = β̂ = (XᵀX)⁻¹XᵀY.
    (b) Show that E[β̂] = β and that Var[β̂] = σ²(XᵀX)⁻¹.
    (c) Let A = X(XᵀX)⁻¹Xᵀ. Show that A and In − A are both idempotent, i.e. AA = A and (In − A)(In − A) = In − A.
    (d) For the particular case of a Normal linear model, find the joint distribution of the fitted values Ŷ = Xβ̂, and show that Y − Ŷ is independent of Ŷ. Quote carefully any properties of the Normal distribution you use.
    (e) For the simple linear regression model (EYi = β0 + β1 xi ), write down the corresponding matrix X and vector Y, find (XᵀX)⁻¹, and hence find the least squares estimates β̂0 and β̂1 and their variances.
    From Warwick ST217 exam 2001
6.8
The Analysis of Variance (ANOVA)
6.8.1
One-Way Analysis of Variance: Introduction
This is a generalization of the two-sample t-test to p > 2 groups.
Suppose there are observations yij (j = 1, 2, . . . , ni ) in the ith group (i = 1, 2, . . . , p),
and let n = n1 + n2 + · · · + np denote the total number of observations.
Denote the corresponding RVs by Yij , and assume that Yij ∼ N (βi , σ 2 ) independently.
Traditionally the main aim has been to test the null hypothesis

    H0 : β1 = β2 = · · · = βp ,    i.e. β = β0 = (β0 , β0 , . . . , β0 ).

The idea is to fit MLEs β̂ and β̂0 and apply a likelihood ratio test, i.e. test whether the ratio

    change in RSS / RSS  =  (squared distance from ŷ to ŷ0 ) / (squared distance from y to ŷ)

(where ŷ and ŷ0 are the corresponding fitted values) is larger than would be expected by chance.
A useful notation for group means etc. uses overbars and ‘+’ suffixes, as follows:

    ȳi+ = (1/ni ) Σ_{j=1}^{ni} yij ,      ȳ++ = (1/n) Σ_{i=1}^p Σ_{j=1}^{ni} yij = (1/n) Σ_{i=1}^p ni ȳi+ ,      etc.
The underlying models fit naturally in the NLM framework:

Definition 6.7 (One-Way ANOVA)
The one-way ANOVA model is a NLM of the form

    Y ∼ MVN(Xβ, σ²I),                                                          (6.21)

where Y = (Y1 , Y2 , . . . , Yn )ᵀ, β = (β1 , β2 , . . . , βp )ᵀ, and X is the n × p indicator matrix whose first n1 rows are (1, 0, . . . , 0), whose next n2 rows are (0, 1, 0, . . . , 0), . . . , and whose last np rows are (0, . . . , 0, 1), with n1 + n2 + · · · + np = n.

Exercise 6.6
Show that for one-way ANOVA, XᵀX = diag(n1 , n2 , . . . , np ), and hence β̂ = (Ȳ1+ , Ȳ2+ , . . . , Ȳp+ )ᵀ.
k
6.8.2
One-Way Analysis of Variance: ANOVA Table
Let

    β0 = E[Ȳ++ ] = (1/n) Σ_{i=1}^p Σ_{j=1}^{ni} EYij = (1/n) Σ_{i=1}^p ni βi ,

    αi = βi − β0        (i = 1, 2, . . . , p).
Typically the p groups correspond to p different treatments, and αi is then called the ith treatment effect.
We’re interested in the hypotheses

    H0 : αi = 0    (i = 1, 2, . . . , p),
    H1 : the αi are arbitrary.

Note that
1. Ȳ++ is the MLE of β0 under H0 (where all βi = β0 ),
2. Ȳi+ is the MLE of β0 + αi = βi , i.e. the mean response given the ith treatment.
Hence the fitted values under H0 and H1 are given by Y ++ and Y i+ respectively.
If we also include the ‘null model’ that all the βi are zero, then the possible models of interest are:

    Model                                # params    DF        RSS
    βi = 0 ∀ i     (i.e. ŷij = 0)           0         n        Σi,j y²ij                 (1)
    βi = β0 ∀ i    (i.e. ŷij = ȳ++ )        1         n − 1    Σi,j (yij − ȳ++ )²        (2)
    βi arbitrary   (i.e. ŷij = ȳi+ )        p         n − p    Σi,j (yij − ȳi+ )²        (3)
The calculations needed to test H0 , involving the RSS formulae given above, can be conveniently presented in an ‘ANOVA table’:

    Source of       Degrees of      Sum of squares (SS)                      Mean square (MS) = SS/DF
    variation       freedom (DF)
    Overall mean    1               (1)−(2) = n ȳ²++
    Treatment       p − 1           (2)−(3) = Σi ni (ȳi+ − ȳ++ )²            Σi ni (ȳi+ − ȳ++ )² / (p − 1)
    Residual        n − p           (3) = Σi,j (yij − ȳi+ )²                 Σi,j (yij − ȳi+ )² / (n − p)
    Total           n               (1) = Σi,j y²ij

Finally, calculate the ‘F ratio’

    F = Treatment MS / Residual MS = [ Treatment SS/(p − 1) ] / [ Residual SS/(n − p) ],              (6.22)

which, under H0 , has an F distribution on (p − 1) and (n − p) d.f.
Large values of F are evidence against H0 .
Note: DON’T try too hard to remember formulae for sums of squares in an ANOVA table.
Instead THINK OF THE MODELS BEING FITTED. The ‘lack of fit’ of each model is given by the
corresponding RSS, & the formulae for the differences in RSS simplify.
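Thinking of the models being fitted also makes the computation short. Here is a hedged Python sketch (not part of the notes) of the F ratio in (6.22), built directly from the RSS of models (2) and (3); scipy.stats.f_oneway(*groups) should give the same F and p-value.

    import numpy as np
    from scipy import stats

    def one_way_anova(groups):
        """groups: a list of 1-d arrays, one array of observations per group."""
        y = np.concatenate(groups)
        n, p = y.size, len(groups)
        rss2 = ((y - y.mean())**2).sum()                          # model (2): common mean
        rss3 = sum(((g - g.mean())**2).sum() for g in groups)     # model (3): separate means
        F = ((rss2 - rss3) / (p - 1)) / (rss3 / (n - p))
        return F, stats.f.sf(F, p - 1, n - p)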
6.9
Problems
1. Show that the formulae for sums of squares in one-way ANOVA simplify:

       Σ_{i=1}^p ni (Ȳi+ − Ȳ++ )² = Σ_{i=1}^p ni Ȳ²i+ − n Ȳ²++ ,

       Σ_{i=1}^p Σ_{j=1}^{ni} (Yij − Ȳi+ )² = Σ_{i=1}^p Σ_{j=1}^{ni} Y²ij − Σ_{i=1}^p ni Ȳ²i+ .
2. (a) Define the Normal Linear Model, and describe briefly how each of its assumptions may be
informally checked by plotting residuals.
(b) The following data summarise the number of days survived by mice inoculated with three strains
of typhoid (31 mice with ‘9D’, 60 mice with ‘11C’ and 133 mice with ‘DSCI’).
Days
to
Death
2
3
4
5
6
7
8
9
10
11
12
13
14
Total
P
P X2i
Xi
Numbers of Mice
Inoculated with. . .
9D 11C DSCI
Total
6
4
9
8
3
1
1
3
3
6
6
14
11
4
6
2
3
1
3
5
5
8
19
23
22
14
14
7
8
4
1
10
12
17
22
28
38
33
18
20
9
11
5
1
31
125
561
60
442
3602
133
1037
8961
224
1604
13124
(Xi is the survival time of the ith mouse in the given group).
Without carrying out any calculations, discuss briefly how reasonable seem the assumptions
underlying a one-way ANOVA on the data, and whether a transformation of the data may be
appropriate.
(c) Carry out a one-way ANOVA on the untransformed data. What do you conclude about the
responses to the three strains of typhoid?
From Warwick ST217 exam 1997
3. The amount of nitrogen-bound bovine serum albumin produced by three groups of mice was measured.
The groups were: normal mice treated with a placebo (i.e. an inert substance), alloxan-diabetic mice
treated with a placebo, and alloxan-diabetic mice treated with insulin. The resulting data are shown
in the following table:
92
Normal
+ placebo
Alloxan-diabetic
+ placebo
Alloxan-diabetic
+ insulin
156
282
197
297
116
127
119
29
253
122
349
110
143
64
26
86
122
455
655
14
391
46
469
86
174
133
13
499
168
62
127
276
176
146
108
276
50
73
82
100
98
150
243
68
228
131
73
18
20
100
72
133
465
40
46
34
44
(a) Produce appropriate graphical display(s) and numerical summaries of these data, and comment
on what can be learnt from these.
(b) Carry out a one-way analysis of variance on the three groups. You may feel it necessary to
transform the data first.
Data from HSDS, set 304
4. The following table shows measurements of the steady-state haemoglobin levels for patients with
different types of sickle-cell anaemia (‘HB SS’, ‘HB S/-thalassaemia’ and ‘HB SC’). Construct an
ANOVA table and hence test whether the steady-state haemoglobin levels differ between the three
types.
HB SS
HB S/-thalassaemia
HB SC
7.2
7.7
8.0
8.1
8.3
8.4
8.4
8.5
8.6
8.7
9.1
9.1
9.1
9.8
10.1
10.3
8.1
9.2
10.0
10.4
10.6
10.9
11.1
11.9
12.0
12.1
10.7
11.3
11.5
11.6
11.7
11.8
12.0
12.1
12.3
12.6
12.6
13.3
13.3
13.8
13.9
Data from HSDS, set 310
93
5. The data in Table 6.2, collected by Brian Everitt, are described in HSDS as being the ‘weights, in
kg, of young girls receiving three different treatments for anorexia over a fixed period of time with
the control group receiving the standard treatment’.
(a) Using a one-way ANOVA on the weight gains, compare the three methods of treatment.
(b) Plot the data so as to clarify the effects of the three treatments, and discuss whether the above
formal analysis was appropriate.
Cognitive
behavioural
treatment
Control
Weight
before after
Weight
before after
80.5
84.9
81.5
82.6
79.9
88.7
94.9
76.3
81.0
80.5
85.0
89.2
81.3
81.3
76.5
70.0
80.4
83.3
83.0
87.7
84.2
86.4
76.5
80.2
87.8
83.3
79.7
84.5
80.8
87.4
82.2
85.6
81.4
81.9
76.4
103.6
98.4
93.4
73.4
82.1
96.7
95.3
82.4
82.4
72.5
90.9
71.3
85.4
81.6
89.1
83.9
82.7
75.7
82.6
100.4
85.2
83.6
84.6
96.2
86.7
80.7
89.4
91.8
74.0
78.1
88.3
87.3
75.1
80.6
78.4
77.6
88.7
81.3
81.3
78.1
70.5
77.3
85.2
86.0
84.1
79.7
85.5
84.4
79.6
77.5
72.3
89.0
Family
therapy
Weight
before after
80.2
80.1
86.4
86.3
76.1
78.1
75.1
86.7
73.5
84.6
77.4
79.5
89.6
89.6
81.4
81.8
77.3
84.2
75.4
79.5
73.0
88.3
84.7
81.4
81.2
88.2
78.8
83.8
83.3
86.0
82.5
86.7
79.6
76.9
94.2
73.4
80.5
81.6
82.1
77.6
77.6
83.5
89.9
86.0
87.3
95.2
94.3
91.5
91.9
100.3
76.7
76.8
101.6
94.9
75.2
77.8
95.5
90.7
90.7
92.5
93.8
91.7
98.0
Table 6.2: Anorexia data
Data from HSDS, set 285
94
6. The following data come from a study of pollution in inland waterways. In each of seven localities,
five pike were caught and the log concentration of copper in their livers measured.
Locality
1.
2.
3.
4.
5.
6.
7.
Windermere
Grassmere
River Stour
Wimbourne St Giles
River Avon
River Leam
River Kennett
Log concentration of copper (ppm)
0.187
0.449
0.628
0.412
0.243
0.134
0.471
0.836
0.769
0.193
0.286
0.258
0.281
0.371
0.704
0.301
0.810
0.497
-0.276
0.529
0.297
0.938
0.045
0.000
0.417
-0.538
0.305
0.691
0.124
0.846
0.855
0.337
0.041
0.459
0.535
(a) The data are plotted in Figure 6.2. Discuss briefly what the plot suggests about the relative
copper pollution in the various localities.
Figure 6.2: Concentration of copper in pike livers
(b) Carry out a one-way analysis of variance to test for differences between the data between localities. Do the results of the formal analysis agree with your subjective impressions from
Figure 6.2?
95
6.10
Two-Way Analysis of Variance
Here there are two factors (e.g. two treatments, or patient number and treatment given) that can be varied
independently.
Factor A has I ‘levels’ 1, 2, . . . , I, and factor B has J ‘levels’ 1, 2, . . . , J. For example:
(a) A is patient number 1, 2, . . . , I, every patient receiving each treatment j = 1, 2, . . . , J in turn,
(b) A is treatment number 1, 2, . . . , I, and B is one of J possible supplementary treatments.
Data can be conveniently tabulated:

                          Factor B
    Factor A      1      2     ...     J
        1        Y11    Y12    ...    Y1J
        2        Y21    Y22    ...    Y2J
        3        Y31    Y32    ...    Y3J
        .         .      .             .
        I        YI1    YI2    ...    YIJ

i.e. there is precisely one observation Yij at each (i, j) combination of factor levels.
Again assume the NLM with

    E[Yij ] = θi + φj      for i = 1 . . . I and j = 1 . . . J,
    i.e. Yij ∼ N (θi + φj , σ²) independently.                                 (6.23)

A problem here is that one could transform θi 7→ θi + c and φj 7→ φj − c for each i and j, where c is arbitrary. Therefore for identifiability one needs to impose some (arbitrary) constraints.
The simplest and most symmetrical reformulation for the two-way ANOVA model is

    Yij ∼ N (µ + αi + βj , σ²),    where   Σ_{i=1}^I αi = 0   and   Σ_{j=1}^J βj = 0.        (6.24)
Exercise 6.7
What is the matrix formulation of the model 6.24?
k
Particular models of interest within the framework of Formulae 6.24 are:

(1) Yij ∼ N (0, σ²),              RSS = Σi,j Y²ij ,                                   DF = n = IJ.

(2) Yij ∼ N (µ, σ²),              RSS = Σi,j (Yij − Ȳ++ )²,                           DF = n − 1 = IJ − 1.

(3) Yij ∼ N (µ + αi , σ²),        Ŷij = µ̂ + α̂i = Ȳi+ .
                                  Therefore RSS = Σi,j (Yij − Ȳi+ )²,                 DF = n − I = I(J − 1).

(4) Yij ∼ N (µ + βj , σ²),        Ŷij = µ̂ + β̂j = Ȳ+j .
                                  Therefore RSS = Σi,j (Yij − Ȳ+j )²,                 DF = n − J = (I − 1)J.

(5) Yij ∼ N (µ + αi + βj , σ²),   Ŷij = µ̂ + α̂i + β̂j = Ȳi+ + Ȳ+j − Ȳ++ .
                                  Therefore RSS = Σi,j (Yij − Ȳi+ − Ȳ+j + Ȳ++ )²,     DF = n − I − J + 1 = (I−1)(J−1).
Again, we can form an ANOVA table summarising the independent ‘sources of variation’.
The degrees of freedom are the differences between the DFs associated with the various models.
The sums of squares are the differences between the SSs associated with the various models.

    Source of variation     Degrees of freedom (DF)     Sum of squares (SS)     Mean square (MS)
    Overall mean            1                           (1)−(2)
    Effect of Factor A      I−1                         (2)−(3)                 [(2)−(3)] / (I−1)
    Effect of Factor B      J−1                         (2)−(4)                 [(2)−(4)] / (J−1)
    Residuals               (I−1)(J−1)                  (5)                     (5) / [(I−1)(J−1)]
    Total                   IJ = n                      (1)

                         Table 6.3: Two-way ANOVA table
Comments
1. DeGroot gives a more general version.
2. As with one-way ANOVA, one can test H0 : αi = 0, i = 1 . . . I, by comparing
   [ (SS due to A)/(I − 1) ] / [ (Residual SS)/((I − 1)(J − 1)) ]
   with the 95% point of F_{(I−1), (I−1)(J−1)} .
3. Similarly one can test H0 : βj = 0, j = 1 . . . J, by comparing
   [ (SS due to B)/(J − 1) ] / [ (Residual SS)/((I − 1)(J − 1)) ]
   with the 95% point of F_{(J−1), (I−1)(J−1)} .
4. The above two F tests are using completely separate aspects of the data (row sums of the Yij table, column sums of the Yij table).
5. The case J = 2 is equivalent to the paired t-test (Exercise 5.4).
6. As for one-way ANOVA, the formulae for sums of squares simplify: ‘sum over each observation the squared difference between the fitted values under the two models being considered’. The residual SS is then most easily obtained by subtraction.
   See problem 6.11.1.
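The two-way decomposition is equally short to compute. Below is a Python sketch (not part of the notes) of the sums of squares in Table 6.3 for a complete I × J table Y with one observation per cell; as an illustration it could be applied to the pertussis table of Problem 6.11.2 below (rows = days, columns = vaccines).

    import numpy as np
    from scipy import stats

    def two_way_anova(Y):
        """Y: an I x J array with exactly one observation per (i, j) cell."""
        I, J = Y.shape
        row = Y.mean(axis=1, keepdims=True)       # ybar_i+
        col = Y.mean(axis=0, keepdims=True)       # ybar_+j
        grand = Y.mean()                          # ybar_++
        ss_A = (J * (row - grand)**2).sum()
        ss_B = (I * (col - grand)**2).sum()
        ss_res = ((Y - row - col + grand)**2).sum()
        df_res = (I - 1) * (J - 1)
        F_A = (ss_A / (I - 1)) / (ss_res / df_res)
        F_B = (ss_B / (J - 1)) / (ss_res / df_res)
        return (F_A, stats.f.sf(F_A, I - 1, df_res)), (F_B, stats.f.sf(F_B, J - 1, df_res))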
6.11
Problems
1. For the two-way analysis of variance (Table 6.3, page 97), find simplified formulae for the sums of
squares analogous to those found for the one-way ANOVA (exercise 6.9.1).
2. Three pertussis vaccines were tested on each of ten days. The following table shows estimates of the
log doses of vaccine (in millions of organisms) required to protect 50% of mice against a subsequent
infection with pertussis organisms.
                   Vaccine
    Day       A       B       C      Total
     1      2.64    2.93    2.93      8.50
     2      2.00    2.52    2.56      7.08
     3      3.04    3.05    3.35      9.44
     4      2.07    2.97    2.55      7.59
     5      2.54    2.44    2.45      7.43
     6      2.76    3.18    3.25      9.19
     7      2.03    2.30    2.17      6.50
     8      2.20    2.56    2.18      6.94
     9      2.38    2.99    2.74      8.11
    10      2.42    3.20    3.14      8.76
    Total  24.08   28.14   27.32     79.54
Test the statistical significance of the differences between days and between vaccines.
Osborn (1979) 8.1.2
3. (a) Explain what is meant by the Normal Linear Model (NLM), and show how the two-way analysis
of variance may be formulated in this way.
(b) The following table gives the average UK cereal yield (tonnes per hectare) from 1994 to 1998,
together with the row, column, and overall totals.
                 1994     1995     1996     1997     1998     Total
Wheat            7.35     7.70     8.15     7.38     7.56     38.14
Barley           5.37     5.73     6.14     5.76     5.29     28.29
Oats             5.50     5.52     6.14     5.78     6.00     28.94
Other cereal     5.65     5.52     5.86     5.52     5.04     27.59
Total           23.87    24.47    26.29    24.44    23.89    122.96
Calculate the fitted yields and residuals for Wheat in each of the five years
i. under the NLM assuming no column effect, and
ii. under the NLM assuming that row & column effects are additive.
(c) Describe briefly how to test the null hypothesis that there is no column effect (i.e. no consistent
change in yield from year to year). You do not need to carry out the numerical calculations.
(d) A nonparametric test of the above hypothesis may be carried out as follows: rank the data for
each row from lowest to highest (thus for Wheat the values 7.35, 7.70, 8.15, 7.38 and 7.56 are
replaced by 1, 4, 5, 2 and 3 respectively), then sum the four ranks for each year, and finally
carry out a one-way analysis of variance on the five sums of ranks.
Comment on the advantages and disadvantages of applying this procedure, rather than the
standard two-way ANOVA, to the above data.
From Warwick ST217 exam 2001
4. The following table gives the estimated hospital waiting lists (000s) by month & region, throughout
the years 2000 & 2001.
Month   Year     NY       T       E       L      SE      SW      WM      NW
  1     2000   137.6   105.5   121.6   173.3   192.6   111.0    99.4   177.6
  2     2000   132.0   103.2   118.7   167.9   190.4   107.7    95.8   172.2
  3     2000   125.1    98.7   111.9   162.3   184.0   100.7    89.6   164.8
  4     2000   129.5    99.5   114.3   163.5   186.9   101.5    92.1   166.5
  5     2000   129.9    99.5   114.6   163.2   186.4   100.6    92.4   166.2
  6     2000   128.9    99.9   114.4   163.3   183.7   100.2    91.9   165.5
  7     2000   127.9    99.0   113.6   160.8   183.9    99.6    90.7   164.9
  8     2000   126.7    98.9   113.2   159.6   183.4   100.1    90.2   165.6
  9     2000   124.6    98.2   111.9   158.1   183.8    99.6    91.0   164.6
 10     2000   123.4    97.1   112.0   156.0   183.4    98.8    91.1   163.1
 11     2000   121.0    97.2   111.7   155.7   183.4    98.9    92.1   161.2
 12     2000   121.3    99.0   112.7   158.0   188.2    99.4    92.8   162.9
  1     2001   122.4    97.7   113.3   159.5   188.8    99.8    93.4   164.0
  2     2001   121.6    96.9   113.0   159.2   186.7   100.1    92.1   163.3
  3     2001   119.6    95.3   109.7   156.3   181.1    97.1    87.2   160.5
  4     2001   122.6    96.5   110.7   158.9   184.1    99.3    88.8   162.7
  5     2001   124.0    97.2   111.0   160.0   185.9   100.3    90.1   164.4
  6     2001   124.2    98.0   111.8   160.6   187.2    99.7    90.8   165.6
  7     2001   123.2    98.9   111.4   160.9   187.4   100.2    91.0   165.5
  8     2001   123.2    99.6   111.9   161.4   185.7   100.0    91.6   166.2
  9     2001   123.0    99.4   111.6   159.1   185.2   100.2    91.0   165.8
 10     2001   124.1    99.0   111.8   156.7   184.6   101.1    90.7   165.5
 11     2001   123.0    99.6   113.2   155.4   183.7   102.0    89.8   164.7
 12     2001   124.3   100.7   115.8   159.1   186.7   103.6    92.1   168.0
Key:  NY  Northern & Yorkshire     T   Trent          E   Eastern          L   London
      SE  South East               SW  South West     WM  West Midlands    NW  North West
[Data extracted from archived Press Releases at http://tap.ccta.gov.uk/doh/intpress.nsf]
Fit a two-way ANOVA model, possibly after transforming the data, and address (briefly) the following
questions:
(a) Does the pattern of change in waiting lists differ across the regions?
(b) Is there a simple (but not misleading) description of the overall change in waiting lists over the
two years?
(c) Predict the values for the eight regions in March 2002 (to the nearest 100, as in the Table).
(d) The set of figures for March 2001 was the latest available at the time of the General Election
in May 2001. A cynical acquaintance suggests to you that the March 2001 waiting lists were
‘unusually good’. What do you think?
5. Table 4.2, page 45, presented data on the preventive effect of four different drugs on allergic response
in ten patients.
A simple way to analyse the data is via a two-way ANOVA on a suitable measure of patient response,
such as the increase in √NCF, which is tabulated below (for example, 1.95 = √3.8 − √0.0 and
1.52 = √9.2 − √2.3).
                                  Patient number
Drug      1      2      3      4      5      6      7      8      9     10
 P      1.95   1.52   0.77   0.44   0.78   1.69   0.37   0.95   1.10   0.62
 C      0.71   1.30   1.32   1.48   0.58   0.41   0.00   2.09   0.32  −0.22
 D      0.65   0.67   0.65   0.48   0.00   0.44   0.26   0.42   1.18   0.63
 K      0.19   0.54  −0.07   0.82   0.54  −0.44   0.27  −0.03   0.59   0.71
(a) Test the statistical significance of the differences between drugs and between patients.
(b) Plot the original data (Table 4.2) in a way that would help you assess whether the assumptions
underlying the above two-way ANOVA are reasonable.
(c) Comment on the analysis you have made, suggesting possible improvements where appropriate.
You do NOT need to carry out any further complicated calculations.
6. Table 6.4 shows purported IQ scores of identical twins, one raised in a foster home (Y ), and the
other raised by natural parents (X). The data are also categorised according to the social class of
the natural parents (upper, middle, low). The data come from Burt (1966), and are also available in
Weisberg (1980).
      upper class                  middle class                  lower class
Case     Y      X           Case     Y      X            Case     Y      X
  1      82     82            8      71     78            14      63     68
  2      80     90            9      75     79            15      77     73
  3      88     91           10      93     82            16      86     81
  4     108    115           11      95     97            17      83     85
  5     116    115           12      88    100            18      93     87
  6     117    129           13     111    107            19      97     87
  7     132    131                                        20      87     93
                                                          21      94     94
                                                          22      96     95
                                                          23     112     97
                                                          24     113     97
                                                          25     106    103
                                                          26     107    106
                                                          27      98    111

Table 6.4: Burt's twin IQ data
(a) Plot the data.
(b) Fit simple linear regression models to predict Y from X within each social class.
(c) Fit parallel lines predicting Y from X within each social class (i.e. fit regression models with
the same slope in each of the three classes, but possibly different intercepts).
(d) Produce an ANOVA table and an F-test to test whether the parallelism assumption is reasonable. Comment on the calculated F ratio.
For we know in part, and we prophesy in part.
But when that which is perfect is come, then that which is in part shall be done away.
1 Corinthians 13:9–10
Everything should be made as simple as possible, but not simpler.
Albert Einstein
A theory is a good theory if it satisfies two requirements: it must accurately describe a large
class of observations on the basis of a model that contains only a few arbitrary elements, and
it must make definite predictions about the results of future observations.
Stephen William Hawking
The purpose of models is not to fit the data but to sharpen the question.
Samuel Karlin
Science may be described as the art of systematic oversimplification.
Sir Karl Raimund Popper
Chapter 7
Further Topics
7.1 Generalisations of the Linear Model
One can generalise the systematic part of the linear model (i.e. the formula for E[Y |x]) and/or the random part (i.e. the distribution of Y − E[Y |x]).
7.1.1 Nonlinear Models
These are models of the form

    E[Y |x] = g(x, β)                                                          (7.1)

where Y is the response, x is a vector of explanatory variables, β = (β1 , . . . , βp )T is a parameter vector, and
the function g is nonlinear in the βi s.
Examples
1. Asymptotic regression:

       Yi = α − β γ^{xi} + εi        (i = 1, 2, . . . , n),        where the εi are IID N(0, σ²).

   There are four parameters to be estimated: α, β, γ and σ².
   Assuming that 0 < γ < 1 (and β > 0), we have:
   (a) E[Y |x] is monotonic increasing in x,
   (b) E[Y |x = 0] = α − β,
   (c) as x → ∞, E[Y |x] → α.
   This 'asymptotic regression' model might be appropriate, for example, if
   (a) x = age of an animal, y = height or weight, or
   (b) x = time spent training, y = height jumped (for n people of similar build).
2. The 'Michaelis–Menten' equation in enzyme kinetics:

       E[Y |x] = β1 x / (β2 + x),

   with various possible distributional assumptions, the simplest of which is [Y |x] ∼ N(β1 x/(β2 + x), σ²).
Comments
1. Nonlinear models can be fitted, in principle, by maximum likelihood.
2. In practice one needs computers and iteration.
3. Even if the random variation is assumed to be Normal, the likelihood may have a very non-Normal
shape.
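To make comments 1 and 2 concrete, here is a minimal sketch (in Python; not part of the original notes) of fitting the asymptotic regression model of Example 1 by iterative least squares, which under the Normal-error assumption coincides with maximum likelihood for (α, β, γ). The data values, starting values and bounds below are invented purely for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def asymptotic(x, alpha, beta, gamma):
    """E[Y | x] = alpha - beta * gamma**x  (asymptotic regression)."""
    return alpha - beta * gamma ** x

# Invented illustrative data, e.g. x = age of an animal, y = weight.
x = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 8.0, 10.0, 12.0])
y = np.array([3.2, 5.1, 6.4, 7.0, 7.8, 8.1, 8.3, 8.4])

# Iterative least squares needs starting values; the bounds keep 0 < gamma < 1,
# as assumed in the notes.
theta_hat, cov = curve_fit(asymptotic, x, y, p0=(9.0, 7.0, 0.5),
                           bounds=([-np.inf, -np.inf, 1e-6],
                                   [np.inf, np.inf, 1.0 - 1e-6]))
alpha_hat, beta_hat, gamma_hat = theta_hat
sigma2_hat = np.mean((y - asymptotic(x, *theta_hat)) ** 2)   # ML estimate of sigma^2
print(alpha_hat, beta_hat, gamma_hat, sigma2_hat)
```

Even with Normal errors the resulting likelihood need not have a Normal shape as a function of (α, β, γ) (comment 3), so standard errors derived from the returned covariance matrix should be treated with some caution.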
7.1.2 Generalised Linear Models
Definition 7.1 (GLM)
A generalized linear model (GLM) has a random part and a systematic part:
Random Part
1. The ith response Yi has a probability distribution with mean µi .
2. The distributions are all of the same form (e.g. all Normal with variance σ 2 , or all Poisson, etc.)
3. The Yi s are independent.
Systematic Part
    g(µi ) = xiT β = Σ_{j=1}^{p} βj xij ,      where
1. xi = (xi1 . . . xip )T is a vector of explanatory variables,
2. β = (β1 . . . βp )T is a parameter vector, and
3. g(·) is a monotonic function called the link function.
Comments
1. If Yi ∼ N (µi , σ 2 ) and g(·) is the identity function, then we have the NLM.
2. Other GLMs typically must have their parameters estimated by maximising the likelihood numerically
(iteratively in a computer).
3. The principles behind fitting GLMs are similar to those for fitting NLMs.
Example: ‘logistic regression’
1. Random part: binary response, e.g.

       Yi | xi = 1 if individual i survived,   0 if individual i died

   (and all Yi s are conditionally independent given the corresponding xi s).
Note that µi = E[Yi |xi ] is here the probability of surviving given explanatory variables xi , and is
usually written pi or πi .
2. Systematic part:

       g(πi ) = log( πi / (1 − πi ) ).
Exercise 7.1
Show that under the logistic regression model, if n patients have identical explanatory variables x say, then
1. each of these n patients has probability of survival given by

       π = exp(xT β) / (1 + exp(xT β)),

2. the number R surviving out of n has expected value nπ and variance nπ(1 − π).
k
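As noted in the comments above, GLM parameter estimates are usually obtained by maximising the likelihood numerically. The sketch below (Python; illustrative only, not part of the original notes) implements the standard iteratively reweighted least squares (IRLS) iteration for logistic regression, using only the link function and the mean–variance relation given above; the function name and tolerances are arbitrary choices.

```python
import numpy as np

def logistic_irls(X, y, max_iter=25, tol=1e-8):
    """Fit a logistic regression by iteratively reweighted least squares.

    X : (n, p) design matrix (include a column of ones for an intercept).
    y : (n,) array of 0/1 responses.
    Returns the maximum likelihood estimate of beta.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                        # linear predictor x_i^T beta
        pi = 1.0 / (1.0 + np.exp(-eta))       # pi_i = exp(eta_i) / (1 + exp(eta_i))
        w = pi * (1.0 - pi)                   # Var[Y_i | x_i] = pi_i (1 - pi_i)
        z = eta + (y - pi) / w                # 'working' response
        XtW = X.T * w                         # X^T W  with  W = diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # weighted least squares step
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Illustrative use: survival (0/1) against a single explanatory variable x:
#   X = np.column_stack([np.ones_like(x), x]);  beta_hat = logistic_irls(X, y)
```

Each iteration is just a weighted least-squares fit, which is the sense in which fitting a GLM resembles fitting an NLM.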
7.2 Simpson's Paradox
Simpson's paradox occurs when there are three RVs X, Y and Z, such that the conditional distributions
[X, Y |Z] show one relationship between X and Y at each value of Z, but the marginal distribution [X, Y ]
apparently shows a very different relationship between X and Y . For example,
1. X(Y ) = male (female) death rate, Z = age,
2. X(Y ) = male (female) admission rate to University, Z = admission rate for student's chosen course.
A small numerical sketch of the second example is given below.
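The following tiny illustration (hypothetical counts, deliberately not the Berkeley figures of problem 2 below) shows the admission-rate version of the paradox: conditionally on course, women are admitted at the higher rate, yet marginally they are admitted at the lower rate, because they apply mainly to the course with the lower admission rate.

```python
# Hypothetical (admitted, applicants) counts for two courses.
men   = {"easy course": (80, 100), "hard course": (20, 100)}
women = {"easy course": (18,  20), "hard course": (45, 180)}

def rate(admitted, applicants):
    return admitted / applicants

# Conditional on course (Z): women do better on BOTH courses
# (0.90 vs 0.80 on the easy course, 0.25 vs 0.20 on the hard course).
for course in ("easy course", "hard course"):
    print(course, "men:", rate(*men[course]), "women:", rate(*women[course]))

# Marginally (ignoring Z): men do better overall (0.50 vs 0.315),
# because most women applied to the hard course.
overall_men   = rate(*map(sum, zip(*men.values())))
overall_women = rate(*map(sum, zip(*women.values())))
print("overall", "men:", overall_men, "women:", overall_women)
```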
7.3 Problems
1. (a) Explain what is meant by
i. the Normal linear model,
ii. simple linear regression, and
iii. nonlinear regression.
(b) For simple linear regression applied to data (xi , yi ), i = 1, . . . , n, show that the maximum likelihood estimators β̂0 and β̂1 of the intercept β0 and slope β1 satisfy the simultaneous equations

        β̂0 n + β̂1 Σ_{i=1}^{n} xi = Σ_{i=1}^{n} yi

    and

        β̂0 Σ_{i=1}^{n} xi + β̂1 Σ_{i=1}^{n} xi² = Σ_{i=1}^{n} xi yi .

    Hence find β̂0 and β̂1 .
(c) The following table shows Y , the survival time (weeks) of leukaemia patients and x, the corresponding log of initial white blood cell count.
   x      Y        x      Y        x      Y
 3.36     65     4.00    121     4.54     22
 2.88    156     4.23      4     5.00      1
 3.63    100     3.73     39     5.00      1
 3.41    134     3.85    143     4.72      5
 3.78     16     3.97     56     5.00     65
 4.02    108     4.51     26
Plot the data and, without carrying out any calculations, discuss how reasonable the assumptions underlying simple linear regression are in this case.
From Warwick ST217 exam 1998
2. Because of concerns about sex discrimination, a study was carried out by the Graduate Division at
the University of California, Berkeley. In fall 1973, there were 8,442 male applications and 4,321
female applications to graduate school. It was found that about 44% of the men and 35% of the
women were admitted.
When the data were investigated further, it was found that just 6 of the more than 100 majors
accounted for over one-third of the total number of applicants. The data for these six majors (which
Berkeley forbids identifying by name) are summarized in the table below.
                     Men                           Women
          Number of      Percent         Number of      Percent
Major     applicants     admitted        applicants     admitted
  A          825            62              108            82
  B          560            63               25            68
  C          325            37              593            34
  D          417            33              375            35
  E          191            28              393            24
  F          373             6              341             7
Discuss the possibility of sex discrimination in admission, with particular reference to explanatory
variables, conditional probability, independence and Simpson’s paradox.
Data from Freedman et al. (1991), page 17
3. (a) At a party, the POTAS¹ of your dreams approaches you, and says by way of introduction:
Hi—I’m working on a study of human pheromones, and need some statistical help. Can
you explain to me what’s meant by ‘logistic regression’, and why the idea’s important?
Give a brief verbal explanation of logistic regression, without (i) using any formulae, (ii) saying anything that’s technically incorrect, (iii) boring the other person senseless and ruining a
potentially beautiful friendship, (iv) otherwise embarrassing yourself.
(b) Repeat the exercise, replacing logistic regression successively with: Bayesian inference, a multinomial distribution, nuisance parameters, the Poisson distribution, statistical independence, conditional expectation, multiple regression, one-way ANOVA, a linear model, likelihood, the Neyman–Pearson lemma, order statistics, size & power, and a t-test.
(c) Suddenly, a somewhat inebriated student (SIS) appears and interrupts your rather impressive
explanation with the following exchange:
SIS:          Think of a number from 1 to 10.
POTASOYD:     Erm—seven?
SIS:          Wrong. Get your clothes off.
You then watch aghast while he starts introducing himself in the same way to everyone in the
room. As a statistician, you of course note down the numbers xi he is given, namely
7, 2, 3, 1, 5, 2, 10, 10, 7, 3, 9, 1, 2, 2, 7, 10, 5, 8, 5, 7, 3, 10, 6, 1, 5, 3, 2, 7, 8, 5, 7.
His response yi is 'Wrong' in each case, and you formulate the hypotheses

    H0 : yi = 'Wrong' irrespective of xi ,
    H1 : yi = 'Right' if xi = x0 , for some x0 ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
         yi = 'Wrong' if xi ≠ x0 .
How might you test the null hypothesis H0 against the alternative H1 ?
¹ Person Of The Appropriate Sex
4.
(i) Explain what is meant by:
(a) a generalised linear model,
(b) a nonlinear model.
(ii) Discuss the models you would most likely consider for the following data sets:
(a) Data on the age, sex, and weight of 100 people who suffered a heart attack (for the first
time), and whether or not they were still alive two years later.
(b) Data on the age, sex and weight of 100 salmon in a fish farm.
From Warwick ST217 exam 1996
I have yet to see any problem, however complicated, which, when you looked at it the right
way, did not become still more complicated.
Poul Anderson
The manipulation of statistical formulas is no substitute for knowing what one is doing.
Hubert M. Blalock, Jr.
A judicious man uses statistics, not to get knowledge, but to save himself from having ignorance
foisted upon him.
Thomas Carlyle
The best material model of a cat is another, or preferably the same, cat.
A. Rosenblueth & Norbert Wiener
A little inaccuracy sometimes saves tons of explanation.
Saki (Hector Hugh Munro)
karma police arrest this man he talks in maths he buzzesLikeAfridge hes like a detuned radio.
Thom Yorke
Better is the end of a thing than the beginning thereof.
Ecclesiastes 7:8
Bibliography
[1] V. Barnett. Comparative Statistical Inference. John Wiley and Sons, New York, second edition, 1982.
[2] C. Burt. The genetic determination of differences in intelligence: A study of monozygotic twins reared
together and apart. Brit. J. Psych., 57:137–153, 1966.
[3] G. Casella and R. L. Berger. Statistical Inference. Wadsworth & Brooks/Cole, Pacific Grove, CA,
1990.
[4] G. Casella and R. L. Berger. Statistical Inference. Wadsworth & Brooks/Cole, Pacific Grove, CA,
second edition, 2001.
[5] M. H. DeGroot. Probability and Statistics. Addison-Wesley, Reading, Mass., second edition, 1989.
[6] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap, volume 57 of Monographs on Statistics
and Applied Probability. Chapman and Hall, New York, 1993.
[7] D. Freedman, R. Pisani, R. Purves, and A. Adhikari. Statistics. W. W. Norton, New York, second
edition, 1991.
[8] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice.
Chapman and Hall, London, 1996.
[9] D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski, editors. A Handbook of Small
Data Sets. Chapman and Hall, London, 1994.
[10] R. V. Hogg and A. T. Craig. Introduction to Mathematical Statistics. MacMillan, New York, 1970.
[11] B. W. Lindgren. Statistical Theory. Chapman and Hall, London, fourth edition, 1994.
[12] A. M. Mood, F. A. Graybill, and D. C. Boes. Introduction to the Theory of Statistics. McGraw-Hill,
New York, third edition, 1974.
[13] D. S. Moore and G. S. McCabe. Introduction to the Practice of Statistics. W. H. Freeman & Company
Limited, Oxford, UK, third edition, 1998.
[14] S. C. Narula and J. F. Wellington. Prediction, linear regression and minimum sum of relative errors.
Technometrics, 19:185–190, 1977.
[15] O.P.C.S. 1993 Mortality Statistics, volume 20 of DH2. Her Majesty’s Stationery Office, London, 1995.
[16] J. F. Osborn. Statistical Exercises in Medical Research. Blackwell Scientific Publications, Oxford, UK,
1979.
[17] J. A. Rice. Mathematical Statistics and Data Analysis. Wadsworth, Pacific Grove, CA, second edition,
1995.
[18] P. Sprent. Data Driven Statistical Methods. Chapman and Hall, London, 1998.
[19] S. Weisberg. Applied Linear Regression. John Wiley and Sons, New York, 1980.