Statistics 1: MATH11400
Oliver Johnson: [email protected]
Twitter: @BristOliver
School of Mathematics, University of Bristol
Teaching Block 2, 2017
Course outline
10 chapters, 20 lectures, 10 weekly problem sheets
Notes have gaps, model solutions will only be handed out on paper:
IT IS YOUR RESPONSIBILITY TO ATTEND LECTURES
AND TO ENSURE YOU HAVE A FULL SET OF NOTES AND
SOLUTIONS
Course webpage for notes, problem sheets, links etc:
https://people.maths.bris.ac.uk/~maotj/teaching.html
Datasets for the course
https://people.maths.bris.ac.uk/~maotj/teach/stats1.RData
Drop-in sessions: Tuesdays 1-2. Just turn up to Room 3.17 in these
times. (Other times, I may be out or busy - but just email
[email protected] to fix an appointment).
This material is copyright of the University unless explicitly stated
otherwise. It is provided exclusively for educational purposes at the
University and is to be downloaded or copied for your private study
only.
Contents
1 Introduction
2 Section 1: Basics of data analysis
3 Section 2: Parametric families and method of moments
4 Section 3: Likelihood and Maximum Likelihood Estimation
5 Section 4: Assessing the Performance of Estimators
6 Section 5: Sampling distributions related to the Normal distribution
7 Section 6: Confidence intervals
8 Section 7: Hypothesis Tests
9 Section 8: Comparison of population means
10 Section 9: Linear regression
11 Section 10: Linear Regression: Confidence Intervals & Hypothesis Tests
Textbooks
The recommended textbook for the unit is:
Mathematical statistics and data analysis by JA Rice
This covers both Probability and Statistics material and some of the
second year Statistics unit. It combines modern ideas of data analysis,
using graphical and computational techniques, with more traditional
approaches to mathematical statistics.
The statistical package R will be used to illustrate ideas in lectures
and you will need to use it for set work. The notes and handouts
should provide sufficient information, but a good introductory text for
further reading is:
Introductory Statistics with R by Peter Dalgaard
It will be particularly useful for students who intend to continue
studying statistics in their second, third (and fourth) year.
Other books are listed on the course web page.
Section 1: Basics of data analysis
Aims of this section:
This section introduces a selection of simple graphical and
numerical methods for exploring and summarizing single data sets.
These methods generally form part of an approach called
Exploratory Data Analysis.
Such analysis and evaluation can be informative in its own right.
It also forms an essential first step before any detailed statistical
analysis is performed on the data.
The section also introduces the statistical package R through its
use for simple graphical and numerical computation of plots and
summary statistics.
Suggested reading: Rice Sections 10.1-10.6
Objectives: by the end of this section you should be able to
Construct simple graphical plots of data sets (stem-and-leaf plot,
histogram, boxplot and (if appropriate) time-plot).
Use simple graphical plots to comment on the overall pattern of
data in a data set, and identify and comment on any striking
deviations from this pattern.
Calculate simple measures of location for a data set (median,
mean and trimmed mean).
Calculate simple measures of spread for a data set (variance,
standard deviation, hinges, quartiles and inter-quartile range).
Use the statistical package R to produce simple graphical plots
and compute simple measures of location and spread for a given
set of real-valued data.
Compute the order statistics for a given set of real-valued data.
Section 1.1: Our framework
Definition 1.1.
Many statistical problems can be described by a simple framework of:
a population of objects
a real-valued variable X associated with each member of the population
some quantity of interest determined by the overall distribution of X values in the population
a sample of n members of the population
a data set {x1 , . . . , xn } of observed values of X for the sampled members.
The key problem is to infer the unknown value of the population
quantity from the known sample data.
Motivating Example
Example 1.2.
the population is ‘all students graduating from Bristol in 2016’,
the variable is ‘debt of each student at the time of graduation’,
the quantity of interest is the average debt in the population,
the sample is ‘100 students chosen randomly from email database’,
the data set {x1 , . . . , x100 } is their individual level of debt.
Example 1.3.
the population is all lightbulbs made by Firm X,
the variable is the lifetime of each lightbulb,
the quantity of interest is the proportion of lifetimes exceeding 2 years.
the sample is ‘1000 lightbulbs fitted in 2013, checked in 2015’
the data set x1 , . . . , x1000 is ‘the date that they failed’ (some might
not have failed yet)
More general settings are considered in later chapters
May have sample data from several populations and want to determine if there is any pattern in variation in the quantity of interest between populations.
e.g. given data on debt for a sample of students from several universities, and want to explore how the average debt varies from university to university.
e.g. given data on lifetimes of a sample of lightbulbs from several manufacturers, and want to explore how the proportion of lifetimes exceeding 2 years varies from manufacturer to manufacturer.
Simple random samples
Definition 1.4.
We say that sample data values x1 , . . . , xn are the observed values of a simple random sample of size n from the population if
each sample member is chosen independently of the other sample members, and
each population member is equally likely to be included in the sample.
Note that this can be hard to achieve in practice.
Remark 1.5.
For simple random samples, data values are representative of values in
the population as a whole.
On average, different values occur in the sample in the same
proportion as they occur in the population.
Thus we can use data values from a (possibly small) sample to make inferences about values in the population as a whole.
Section 1.2: Exploratory Data Analysis (EDA)
Exploratory Data Analysis refers to a collection of techniques for
exploration of a data set.
EDA can help check that data is compatible with assumptions that:
1. observations are independent,
2. observations all come from a common distribution,
3. the distribution is a particular type (e.g. normal, exponential).
EDA can give simple direct estimates of some population quantities,
without assuming any particular type of distribution.
Features of EDA which we will use include:
Numerical summaries of centre or location of data (Section 1.4): median, mean, trimmed mean
Numerical summaries of spread of data (Section 1.5): variance, standard deviation, hinges, quartiles and inter-quartile range
Initial graphical plots (Section 1.6): stem-and-leaf plots, histograms (or bar charts), time plots, boxplots
Section 1.3: Random samples and order statistics
Definition 1.6.
Write {x1 , . . . , xn } for a data set where the n values are arranged in
time order – for example, the order of observation or recording.
The first value seen is x1 , . . . , the last seen is xn .
Definition 1.7.
Write {x(1) , . . . , x(n) } for the order statistics of the sample – that is,
the data set rearranged so the values are increasing in size.
The smallest (minimum) value seen is x(1) .
The largest (maximum) is x(n) .
We say that x(i) has rank i.
We can find x(i) if we know all xi .
We cannot find xi if we know all x(i) .
If the xi are observations of independent identically distributed (IID) random variables, the x(i) are neither independent nor identically distributed.
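As a quick illustration (a small sketch, not part of the printed notes; the data values here are made up), order statistics and ranks are easy to obtain in R:
> x <- c(3, 1, 5, 9, 0, -1)    # data in time order: x1, ..., x6
> sort(x)                      # order statistics x(1), ..., x(6): -1 0 1 3 5 9
> rank(x)                      # rank of each observation: 4 3 5 6 2 1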
Section 1.4: Measures of centre or location of data
Definition 1.8.
The sample median is the ‘middle observation’ in the ranked order
x(1) ≤ x(2) ≤ . . . ≤ x(n) .
If n = 2m + 1 is odd, the median is x(m+1) .
If n = 2m is even, the median is the average (x(m) + x(m+1) )/2.
Pro: not sensitive to extreme values.
Con: not easy to calculate when combining two samples.
Definition 1.9.
Sample mean is defined as x̄ = (x1 + . . . + xn )/n.
Equivalently, x̄ = (x(1) + . . . + x(n) )/n.
Pro: easy to calculate, easy to combine two samples, easy to derive
statistical properties.
Con: sensitive to extreme values.
Trimmed Sample Mean
Definition 1.10.
We define the ∆% trimmed mean as follows:
First take k = ⌊n∆/100⌋, where ⌊·⌋ denotes the integer part (floor).
Remove the smallest k values and the largest k values of the sample.
Calculate the sample mean of the remaining values.
This is harder to calculate, but less sensitive to extreme values.
Section 1.5: Numerical measures of range or spread of data
Definition 1.11.
The sample variance is defined as

s² = Σⱼ₌₁ⁿ (xⱼ − x̄)² / (n − 1).

Equivalently (easier to calculate – check they are equivalent!!)

s² = ( Σⱼ₌₁ⁿ xⱼ² − (Σⱼ₌₁ⁿ xⱼ)²/n ) / (n − 1) = ( Σⱼ₌₁ⁿ xⱼ² − n x̄² ) / (n − 1).

NB: Please note the normalization of the second term in the numerator.
Sample standard deviation s = √s².
s² represents spread: s² large means large spread around x̄, small s² means small spread.
Logic for dividing by (n − 1): the sum of the (xⱼ − x̄) is zero, so there are only (n − 1) independent values of (xⱼ − x̄).
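As an informal check (a sketch, not in the printed notes), the two formulas can be compared in R on any small data vector:
> x <- c(3, 1, 5, 9, 0, -1)
> sum((x - mean(x))^2) / (length(x) - 1)                 # definition of s^2
> (sum(x^2) - sum(x)^2 / length(x)) / (length(x) - 1)    # equivalent form
> var(x)                                                 # built-in sample variance, same value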
Hinges and quartiles
Definition 1.12.
Want to divide sample in 4 roughly equal parts.
Lower hinge H1 is median of the set
{data values with rank ≤ rank of sample median}.
Upper hinge H3 is median of set
{data values with rank ≥ rank of sample median}.
Definition 1.13.
A related idea is that of quartiles: definitions vary slightly.
Rice defines Q1 to be data value with rank (n + 1)/4 and Q3 to be
data value with rank 3(n + 1)/4.
That is Q1 = x((n+1)/4) etc.
If these ranks aren’t integers, we will interpolate.
Different authors/packages interpolate in different ways (see sheet ).
For large samples, H1 ≈ Q1 and H3 ≈ Q3 .
IQR, outliers, skewness
Definition 1.14.
A five number summary can be calculated from the data – give the median, the upper and lower hinges, and the maximum and minimum.
These numbers roughly divide the data into four equally-sized groups.
Definition 1.15.
Interquartile range: IQR = Q3 − Q1 measures spread around median.
In this course, we define outliers to be points more than 1.5 × (H3 − H1 ) ≈ 1.5 × IQR away from the hinges.
We say a distribution is skewed to the right if the histogram has a long right tail, that is if H3 − median > median − H1 .
Distribution is skewed to the left if H3 − median < median − H1 .
Example calculations
Example 1.16.
If the data is seen in order 3, 1, 5, 9, 0, −1 then
x1 = 3, x2 = 1, x3 = 5, x4 = 9, x5 = 0, x6 = −1.
x(1) = −1, x(2) = 0, x(3) = 1, x(4) = 3, x(5) = 5, x(6) = 9.
n = 6, so the median is (x(3) + x(4) )/2 = (1 + 3)/2 = 2.
x̄ = (−1 + . . . + 9)/6 = 17/6 ≈ 2.83.
The 20% trimmed mean is found by taking k = ⌊6 × 20/100⌋ = ⌊1.2⌋ = 1.
Hence, discarding the 1 largest and 1 smallest value gives a 20% trimmed mean of (0 + 1 + 3 + 5)/4 = 2.25.
Since Σⱼ xⱼ² = 3² + 1² + 5² + 9² + 0² + (−1)² = 117, we have
s² = (117 − 17²/6)/(6 − 1) = (117 − 289/6)/5 ≈ 13.77.
H1 = 0, the median of {−1, 0, 1}.
H3 = 5, the median of {3, 5, 9}.
(n + 1)/4 = 7/4, so take Q1 = x(1) + (3/4)(x(2) − x(1) ) = −1 + (3/4)(0 − (−1)) = −1/4.
3(n + 1)/4 = 21/4, so take Q3 = x(5) + (1/4)(x(6) − x(5) ) = 5 + (1/4)(9 − 5) = 6.
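The same summaries can be checked in R (an illustrative sketch, not part of the printed notes), using the data of Example 1.16:
> x <- c(3, 1, 5, 9, 0, -1)
> median(x)             # 2
> mean(x)               # 17/6 = 2.8333
> mean(x, trim = 0.2)   # 20% trimmed mean: 2.25
> fivenum(x)            # min, H1, median, H3, max: -1 0 2 5 9
Note that R's quantile() and IQR() use a different default interpolation rule from the one used by Rice, so Q1 and Q3 may differ slightly from the values computed above.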
Section 1.6: Initial Graphical Plots
In graphical plots, should check:
the overall pattern of variation within the data (e.g. symmetric, skew, bi-modal)
any unusual features within a pattern or striking deviations from a pattern (outliers)
whether any striking features are just random occurrences or are systematic features
any evidence of rounding or granularity (data clumping at certain sequences of values, reflecting measurement scale)
For more than one variable, first examine each variable by itself, then
study relationships between variables.
For data from more than one population, compare variation within
each data set with variation between data sets – use numerical (e.g.
summary statistics) or graphical (e.g. boxplots) summaries
Best to generate graphical plots using R – powerful language for
statistics (and applications)
Typing data() will give a list of available datasets.
Course dataset is online
EDA (a): Stem-and-leaf plots in R – command stem
Example 1.17.
> load(url("https://people.maths.bris.ac.uk/~maotj/teach/stats1.RData"))
> sort(quakes)
 [1]   9  30  33  36  38  40  40  44  46  76  82  83  92
[14]  99 121 129 139 145 150 157 194 203 209 220 246 263
[27] 280 294 304 319 328 335 365 375 384 402 434 436 454
[40] 460 556 562 567 584 599 638 667 695 710 721 735 736
[53] 759 780 832 840 887 937 1336 1354 1617 1901
> stem(quakes)
The decimal point is 2 digit(s) to the right of the |
0 | 133444445888902345569
2 | 01256890234788
4 | 034566678
6 | 0470124468
8 | 3494
10 |
12 | 45
14 |
16 | 2
18 | 0
Formal description of Stem-and-Leaf Plot
1. If necessary, truncate or round the data values so that all the variation is in the last two or three significant digits.
2. Separate each data value into a stem (consisting of all the digits except the rightmost), and a leaf (the rightmost digit).
3. Write the stems in a vertical column – smallest at the top – and draw a separator (e.g. a vertical line) to the right of this column.
4. Write each leaf in the row to the right of the corresponding stem, in increasing order out from the stem.
5. Record any strikingly low or high values separately from the main stem, displaying the individual values in a group above the main stem (low values) or below it (high values).
Earthquakes stem-and-leaf plot example (cont.)
Example 1.17.
R decided to put the decimal point 2 digits to the right of the | (the bar) and to use a scale where each stem corresponds to intervals of 200 days – and hence each leaf corresponds to an interval of 10 days.
Thus the first line represents (rounded) data values of 10, 30, 30, 40,
40 ,. . . , 90, 100, 120, . . . , 190 in that order.
Because of the scale and the small number of data values, it can be difficult to tell that the last line, for example, represents the data value 1900 rather than 1800.
Change the scale on which the data is displayed using e.g. stem(quakes, scale=2), which produces a scale where each stem corresponds to 100 days.
Earthquakes stem-and-leaf plot example (cont.)
Example 1.17.
> stem(quakes, scale=2)
The decimal point is 2 digit(s) to the right of the |
0 | 1334444458889
1 | 02345569
2 | 0125689
3 | 0234788
4 | 03456
5 | 6678
..
.
18 |
19 | 0
EDA (b): Histograms in R – command hist
Example 1.18.
Produce histogram of earthquake times using hist(quakes)
The standard histogram on the left, a customised version on the
right. This is done using par(mfrow= c(1,2)).
[Figure: left, the standard 'Histogram of quakes' (Frequency against quakes); right, the customised version 'Earthquakes - maotj' (Density against Time between earthquakes in days).]
Formal description of histogram
1
Divide the range of data values into K intervals (cells or bins) of equal width. If the width is too large, the plot may be too coarse to see the details of any pattern; if too small, many cells may have just one or two observations.
2
Count the number (frequency) or the percentage of observations
falling into each interval. Be consistent with the allocation of
values that equal the end points of intervals.
3
Display a plot of joined columns or bars above each interval, with height proportional to the frequency or percentage for that interval.
Customising histograms in R
Example 1.18.
The plots can be customised using sub-commands, such as:
freq=FALSE to display densities (i.e. proportions) rather than counts;
specifying breaks to give a certain number of cells, or to give cells of a desired width;
adding titles using main="Plot name - Your ID";
labelling axes using xlab="Label for X axis", similarly ylab.
ALWAYS ADD YOUR OWN ID before printing a plot.
For example we can customise the histogram above as follows:
> hist(quakes, breaks=seq(0,2000,100), freq=FALSE,
xlab="Time between earthquakes in days",
main="Earthquakes - maotj")
gives histogram of proportions rather than frequencies; breaks
between cells form a sequence starting at 0 days, finishing at 2000
days, and of length 100 days apart; adds a label to the x-axis; and
revises the title of the plot.
EDA (c): Time Plots in R – command plot
A plot of the data in the order it was recorded may give valuable
information when values represent outcomes of repetitions of a single
statistical experiment repeated over time.
In R this is obtained with command plot, e.g.
> load(url("https://people.maths.bris.ac.uk/~maotj/teach/stats1.RData"))
> plot(quakes)
EDA (d): Boxplot and five number summary
A boxplot is a very simple graphical summary of a data set, devised by
John Tukey, based on the five number summary (see Definition 1.14).
The plot consists of a box with
top drawn level with the value of the upper hinge,
bottom drawn level with the value of the lower hinge,
horizontal line drawn across the middle, level with the median.
Vertical lines (sometimes called whiskers) are drawn from the top of the box to a point level with the maximum value, and from the bottom of the box to a point level with the minimum value.
If there are any outliers, then the whiskers are drawn to the largest
data value within 1.5 × IQR from the corresponding hinge and the
remaining outlier data points are plotted individually.
Creating boxplots in R
In R a boxplot and the corresponding five numbers on which it is
based can be obtained with the commands boxplot and fivenum.
For example, for the dataset quakes
> boxplot(quakes)
> fivenum(quakes)
A typical boxplot looks something like the following plot:
[Figure: a typical boxplot, annotated from top to bottom with: individual values of outliers; largest value within 1.5 IQR of the upper hinge; upper hinge; median; lower hinge; smallest value within 1.5 IQR of the lower hinge; individual values of outliers.]
Section 1.7: Newcomb data example
Example 1.19.
Newcomb data sample refers to attempts to measure speed of light.
Data values record each time in terms of deviations from a standard
time of 0.000024800 seconds - so a value of 28 indicates a recorded
time of 0.000024828 seconds etc
> load(url("https://people.maths.bris.ac.uk/~maotj/teach/stats1.RData"))
> newcomb
 [1]  28  26  33  24  34 -44  27  16  40  -2  29  22  24  21  25  30  23  29  31
[20]  19  24  20  36  32  36  28  25  21  28  29  37  25  28  26  30  32  36  26
[39]  30  22  36  23  27  27  28  27  31  27  26  33  26  32  32  24  39  28  24
[58]  25  32  25  29  27  28  29  16  23
> plot(newcomb, main="Plot of Newcomb data - maotj")
> hist(newcomb, breaks=seq(-45, 45,2.5),
+ main="Histogram of Newcomb data - maotj")
Example 1.19.
[Figure: top, 'Plot of Newcomb data - maotj' (newcomb against Index); bottom, 'Histogram of Newcomb data - maotj' (Frequency against newcomb).]
Numerical summaries
Example 1.19.
We can use R commands median(), mean(), summary(), var(),
sd() and IQR() to calculate numerical summaries.
> median(newcomb)
[1] 27
> mean(newcomb)
[1] 26.21212
> mean(newcomb, trim = 0.1)
[1] 27.42593
> mean(newcomb, trim = 0.2)
[1] 27.35
> summary(newcomb)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -44.00   24.00   27.00   26.21   30.75   40.00
> var(newcomb)
[1] 115.462
> sd(newcomb)
[1] 10.74532
> IQR(newcomb)
[1] 6.75
Graphical summary
Example 1.19.
We can also produce a graphical summary of the data with the
command boxplot(), for which the relevant numerical values are
given by the command fivenum().
> boxplot(newcomb, main="Boxplot of Newcomb data - maotj")
> fivenum(newcomb)
[1] -44 24 27 31 40
The boxplot scale can be distorted by the presence of outliers.
We can produce a boxplot of the newcomb data set which excludes
the outliers (the 6th and 10th data values) with the command
boxplot(newcomb[-c(6,10)])
The standard boxplot is shown on the left below, while a boxplot
without outliers is shown on the right.
Boxplot with and without outliers
Example 1.19.
[Figure: left, 'Boxplot of Newcomb data - maotj' with the outliers included; right, the same boxplot with the outliers excluded.]
Section 2: Parametric families and method of moments
Aims of this section:
This section introduces the idea of modelling the distribution of a
variable in a population in terms of a family of parametric
distributions, where the parameters of the distribution relate to
specific quantities of interest in the population, such as the
population mean or variance.
If our model is appropriate for the data, then we can make
inferences about the population from which data was obtained
simply by estimating the parameters for the distribution.
The section starts by introducing one of the simplest methods of
parametric estimation - the method of moments - but note that
other methods (maximum likelihood methods and least squares
methods) will be introduced in later sections.
Aims of this section:
For discrete observations representing counts, the correct choice
of parametric model is often determined by the context.
However, for continuous uni-modal data, the correct choice of
model can be more difficult. There are often a number of feasible
models whose distributions appear at first glance to be similar while differing in essential details.
In the second half of the section we introduce probability plots (plots of the quantiles of the data against the quantiles of the fitted distribution) as a way of assessing the fit of the data to the chosen parametric family.
Suggested reading: Rice Sections 8.1-8.4, Section 10.2.1 and Section 9.9
Objectives: by the end of this section you should be able to
Understand how parametric families can be used to provide
flexible models for the distribution of population random variables.
Recall the formulae for the probability mass function and the
probability density function for the standard parametric families of
discrete and continuous distributions (Binomial, Geometric,
Poisson, Exponential, Gamma, Normal, Uniform) and be able to
relate the parameters of each distribution to population quantities
such as the population mean and the population variance.
Write down equation(s) defining the method of moments
estimators for parametric families defined in terms of one or more
unknown parameters.
Solve the equations for the methods of moments estimators in
simple one and two parameter cases, and hence compute
appropriate methods of moments estimates from a given data set.
Objectives: by the end of this section you should be able to
Use R to numerically calculate the quantiles of simple standard
distributions (Uniform, Exponential, Gamma, Normal).
Construct simple probability plots of the quantiles of a data set
against the quantiles of a given or fitted distribution (by hand or
in R ), and use the plot to assess the fit of the data to the
specified distribution.
Section 2.1: Parametric Models
Definition 2.1.
When X is a continuous variable and the population size is large, some probability density function fX (x) may give a reasonable, if idealised, model of the frequency of X values in the population.
Call X the population random variable and call fX (x) the population
probability density function (pdf). Similarly call pX (x) the population
probability mass function (pmf) if X is discrete.
Although we do not know the population distribution, we often have
theory, experience or data suggesting a certain family of probability density functions is appropriate for the population in question.
Example 2.2.
Inspection of the data may lead us to believe the earthquake times in
Section 1.6 come from an Exponential(θ) distribution and the Newcomb
times in Section 1.7 come from a Normal(µ, σ 2 ) distribution.
We would like to estimate the population parameters θ, µ and σ 2 .
Definition of parametric families
Definition 2.3.
A parametric family is a collection of distributions of the same type which differ only in the value of one (or more) parameter, say θ.
In other words, the form of the distribution is a function of θ.
Write fX (x; θ) for the pdf in the parametric family corresponding to
the parameter θ.
Write E(X ; θ) for the mean of the corresponding distribution etc.
A summary sheet of parametric families and graphs comparing
probability density functions is provided in the appendix.
Discrete: Bernoulli(θ), Binomial(K , θ), Geometric(θ), Poisson(θ)
Continuous: U(0, θ), Exp(θ), Γ(α, β), N(µ, σ 2 ).
Other parametric families such as the Lognormal, the Pareto and the
Weibull families can, for example, provide better models of skewed
data populations.
Estimation
Definition 2.4.
An estimate θ̂(x1 , . . . , xn ) (usually abbreviated to θ̂) is our ‘best guess’ for the unknown parameter, based on the data {x1 , . . . , xn }.
Sometimes we refer to the function (x1 , . . . , xn ) → θ̂ as the ‘estimator’.
Remark 2.5.
θ is the true, fixed but unknown value.
We hope that θ̂ will be close to θ in some sense.
However θ̂ depends on the data, so is itself random.
IT IS VITAL TO USE THE NOTATION θ AND θ̂ CORRECTLY: THEY ARE NOT INTERCHANGEABLE!
Will distinguish between methods of estimating (method of moments,
maximum likelihood, least squares) with subscripts if necessary.
Estimating quantities of interest
Remark 2.6.
Although the original ‘quantity of interest’ will not necessarily correspond to θ itself, it must be a function of θ, say τ(θ).
One way to estimate τ(θ) is to first estimate θ, then to plug the estimate θ̂ into the expression τ(θ) to get the estimate τ(θ̂).
See Example 3.10 below for more details.
Section 2.2: Sampling from parametric families
From now onwards, at the start of each problem we will assume that an appropriate parametric family has been identified.
The family has probability density function f (x; θ) (or probability mass function p(x; θ) if discrete).
Can have single unknown parameter θ, or in general k unknown
parameters.
Definition 2.7.
We assume the data values x1 , . . . , xn are the observed values of a
simple random sample of size n from the population represented by
the parametric family.
i.e. assume X1 , . . . , Xn independent, identically distributed ∼ f (x; θ).
For simple random samples, Remark 1.5 explains that the data values
are representative of the values in the population as a whole.
Thus we can use data values from the (possibly small) sample to
make inferences about the values in the population as a whole.
Joint probability density of simple random sample
Lemma 2.8.
For a simple random sample from a distribution in a parametric family,
fX1,...,Xn (x1 , . . . , xn ; θ) = ∏ᵢ₌₁ⁿ fX (xᵢ ; θ).
Proof.
Since X1 , . . . , Xn are independent, their joint probability density
function factorises as the product of the marginal density functions:
fX1 ,...,Xn (x1 , . . . , xn ; θ) = fX1 (x1 ; θ)fX2 (x2 ; θ) · · · fXn (xn ; θ).
X1 , . . . , Xn are identically distributed with the same distribution as X ,
so each marginal density function has the same form as the density
function for X , i.e. for all i: fXi (xi ; θ) = fX (xi ; θ).
Section 2.3: Method of moments estimation
Definition 2.9.
Assume the population random variable X comes from a parametric family with parameter θ.
For k = 1, 2, 3, . . . we call E(X k ; θ) the kth population moment (i.e.
the average value of X k in the population).
Hence E(X ; θ) is the first population moment, etc.
Can look up E(X^k ; θ) in the Appendix, or calculate it as ∫ x^k f (x; θ) dx (integrating from −∞ to ∞).
Definition 2.10.
For a sample with data values {x1 , . . . , xn }, for k = 1, 2, 3, . . . define the kth sample moment by

mk = (x1^k + x2^k + . . . + xn^k)/n.

Hence mk is the average value of x^k in the sample.
Hence m1 = (x1 + . . . + xn )/n is the sample mean etc.
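In R the sample moments are one-liners; for instance (a minimal sketch, using the course quakes data as an example vector):
> m1 <- mean(quakes)     # first sample moment = sample mean
> m2 <- mean(quakes^2)   # second sample moment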
Method of moments estimation: motivation
If data comes from a simple random sample, the sample values are representative of population values.
Suggests population moment E(X^k ; θ) ≈ sample moment mk .
Definition 2.11.
Given a sample {x1 , . . . , xn } from a parametric family:
If the family has one parameter θ: define the method of moments estimator θ̂mom to be the solution of E(X ; θ̂mom ) = m1 .
If the family has two parameters α, β: define the method of moments estimators α̂mom and β̂mom to be the (simultaneous) solutions of E(X ; α̂mom , β̂mom ) = m1 and E(X² ; α̂mom , β̂mom ) = m2 .
If there are k unknown parameters, compare the first k population and sample moments.
Section 2.4: Method of moments examples
Example 2.12.
Assume {x1 , . . . , xn } come from a simple random sample from an Exponential(θ), with θ unknown.
One unknown parameter, so we need one equation, involving the mean.
Factsheet tells us that E(X ; θ) = 1/θ.
Method of moments implies that m1 = E(X ; θ̂mom ) = 1/θ̂mom .
Rearranging, θ̂mom = 1/m1 = 1/x̄ = n/(x1 + . . . + xn ).
Example 2.13.
For earthquake data from Section 1.6, sample histogram ‘looks
exponential’ – no reason to think Exp(θ) is not an adequate model.
(See Section 2.5 below for a more rigorous way of doing this).
In this case n = 62 and sample mean m1 = x̄ ≈ 437.2097 (found using mean(quakes) in R).
Hence method of moments estimator θ̂mom = 1/m1 ≈ 0.002287.
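The calculation in Example 2.13 can be reproduced directly in R (assuming the course dataset has been loaded as in Section 1.6):
> m1 <- mean(quakes)   # approx 437.2097
> 1/m1                 # method of moments estimate, approx 0.002287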
Method of moments examples
Example 2.14.
Assume {x1 , . . . , xn } come from a simple random sample from an N(µ, σ²), with µ, σ² unknown.
Two unknown parameters, so we need two equations, involving the mean and variance.
Factsheet tells us that E(X ; µ, σ²) = µ.
Further Var(X ; µ, σ²) = σ², so that E(X² ; µ, σ²) = σ² + µ².
Method of moments estimates are joint solutions of two equations:
µ̂mom = E(X ; µ̂mom , σ̂²mom ) = m1
σ̂²mom + (µ̂mom)² = E(X² ; µ̂mom , σ̂²mom ) = m2
First equation gives µ̂mom = m1 = x̄.
Second gives σ̂²mom = m2 − (µ̂mom)² = m2 − m1².
Rearranging gives σ̂²mom = Σⱼ₌₁ⁿ (xⱼ − x̄)²/n.
Newcomb data example
Example 2.15.
For Newcomb data from Section 1.7, after removing outliers x6 and
x10 , sample histogram gives no reason to think N(µ, σ 2 ) is not an
adequate model.
Using R (since var gives sample variance, i.e. divides by (n − 1) = 63)
> mean(newcomb[-c(6,10)])
[1] 27.75
> 63/64*var(newcomb[-c(6,10)])
[1] 25.4375
Hence µ̂mom = 27.75, σ̂²mom = 25.4375.
Equivalently, taking square roots, σ̂mom = √σ̂²mom = 5.04356.
Obviously would get different answers if we didn’t remove outliers.
Method of moments examples
Example 2.16.
Assume {x1 , . . . , xn } come from a simple random sample from a U(0, θ) distribution.
Since for X ∼ U(0, θ), the mean E(X ; θ) = θ/2, we solve
θ̂mom /2 = E(X ; θ̂mom ) = x̄.
In this case θ̂mom = 2x̄.
This may not make sense if maxᵢ(xᵢ) > θ̂mom = 2x̄.
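A small simulated illustration of this point (a sketch with randomly generated data, not the course dataset):
> x <- runif(10, 0, 1)   # simple random sample from U(0, 1)
> 2 * mean(x)            # theta-hat (mom)
> max(x)                 # for some samples this exceeds 2*mean(x)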
Section 2.5: Assessing Fit
Say we have:
a data set of n values x1 , . . . , xn
assumed to be a random sample from a population whose distribution function and pdf have parametric forms FX (x; θ) and fX (x; θ)
an estimate θ̂ = θ̂(x1 , . . . , xn ) calculated from the data.
Good practice to assess how well our model fits the data, by comparing the observations x1 , . . . , xn with the values we might expect for a random sample from FX (x; θ̂).
If the observations show striking or systematic differences from what
we would expect, our assumed model may not be appropriate.
Heuristics based on order statistics
If the model FX (x; θ̂) is correct, for any i and y, the probability P(Xi ≤ y) = FX (y ; θ̂).
This means #{Xi ≤ y} ∼ Bin(n, FX (y ; θ̂)) ≈ nFX (y ; θ̂).
Equivalently, if we write k = nFX (y ; θ̂), there are about k values less than y, so the kth order statistic

x(k) ≈ y = FX⁻¹(k/n ; θ̂).

Here FX⁻¹ denotes the inverse of FX for fixed θ (not the reciprocal 1/FX ). That is, FX⁻¹(FX (y ; θ); θ) = y.
In fact, for k = 1, . . . , n in practice (to avoid issues with k = n) we use

x(k) ≈ FX⁻¹(k/(n + 1) ; θ̂),   or equivalently   FX (x(k) ; θ̂) ≈ k/(n + 1).
Assessing fit with order statistics
If model FX (x; θ̂) is correct, then – on average – the observations are
likely to be equally spaced out according to this distribution.
In other words the n sample values should roughly split the range of X
values into n + 1 intervals each of which has probability 1/(n + 1)th.
Examples showing the expected values of the order statistics for a simple random sample of size n = 4 for two example distributions: the values split each range into 5 intervals, each with probability 1/5.
[Figure: two number lines, each marked with the four expected order statistic positions (x), dividing the range into five intervals of probability 1/5 each.]
Section 2.6: Quantile (Q-Q) plots and Probability plots
For a given value of n, and a given distribution FX (x), we will call the n values

FX⁻¹(k/(n + 1)),   k = 1, . . . , n,

the quantiles of the distribution.
Similarly the n ordered sample values x(1) , . . . , x(n) that split the sample into roughly equal parts are called the sample quantiles.
The discussion above leads to two simple graphical methods for
assessing the fit of a model: quantile (or Q-Q) plots and probability
plots.
(Some authors use the term ’probability plots’ for both methods.)
Quantile plot
For this you must have an analytic or numerical method for computing values of FX⁻¹(x; θ) (e.g. using R). The procedure is as follows:
1. Compute an estimate θ̂ for θ (e.g. the method of moments estimate).
2. Order the observations to obtain the order statistics (i.e. the sample quantiles) x(1) , . . . , x(n) .
3. For k = 1, . . . , n, compute FX⁻¹(k/(n + 1); θ̂). These are the expected quantile values (i.e. the values we would expect for the sample quantiles if the model was correct).
4. For k = 1, . . . , n, plot the pairs (FX⁻¹(k/(n + 1); θ̂), x(k) ).
5. Add the line y = x to the plot (i.e. the line through the origin with slope 1).
Analysing a Q-Q plot
If the points show only small, random deviations from the line, then there is no reason to reject the model.
If there are striking or systematic deviations from the line, then this may be evidence that the model is not appropriate.
An alternative plot is a probability plot:
This proceeds in a similar way to a quantile plot, except that you now need to be able to compute values of FX (x; θ), and you plot the values of the sample probabilities FX (x(k) ; θ̂) against the expected values k/(n + 1), k = 1, . . . , n.
Quantile and Probability plots in R
These plots are easy to produce in R for standard distribution families.
Consider a family called name with probability density function f (x; θ)
and distribution function F (x; θ) which depend on a parameter θ.
Then, for given numerical values of x and θ, we can use the R
functions
dname(x, θ) – which returns the value of the density f (x; θ)
pname(x, θ) – which returns the value of the probability F (x; θ) = P(X ≤ x; θ)
qname(x, θ) – which returns the value of the quantile F⁻¹(x; θ)
For more information on exactly what parameters need to be specified
for each distribution, use the help facility in R - for example try typing
help(dexp), help(dunif), help(dgamma) or help(dnorm).
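For instance (a few illustrative calls; the parameter names follow the standard R conventions):
> qnorm(0.975)                    # 97.5% quantile of N(0, 1), about 1.96
> pexp(2, rate = 0.5)             # P(X <= 2) for Exp(0.5), i.e. 1 - exp(-1)
> qexp(0.5, rate = 0.5)           # median of Exp(0.5), i.e. log(2)/0.5
> dgamma(1, shape = 2, rate = 1)  # Gamma(2, 1) density at x = 1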
Section 2.7: Earthquakes Example
Example 2.17.
The quakes data set records the times between successive serious earthquakes.
Suggested model is an Exponential distribution with parameter θ.
Write θ̂ for the method of moments estimate θ̂mom .
We have seen in Example 2.12 that for this model θ̂ = 1/m1 = 1/x̄.
Remember that to access the data you may first need to type
load(url("https://people.maths.bris.ac.uk/~maotj/teach/stats1.RData")).
Q-Q and probability plots for Earthquake data
[Figure: left, 'Quakes quantile plot - maotj' (Sample quantiles against Quantiles of fitted distribution); right, 'Quakes probability plot - maotj' (Sample probabilities against Probabilities of fitted distribution); each plot includes the line y = x.]
Although the points do not lie exactly on a straight line, there does
not appear to be any significant systematic deviation from the line
y = x, and no substantial reason to reject the Exponential model.
Q-Q and probability plots in R for Exp(θ) model
The following R commands compute a quantile plot:
> m1 <- mean(quakes)
> theta <- 1/m1
> quakes.ord <- sort(quakes)
> quant <- seq(1:62)/63
> quakes.fit <- qexp(quant,theta)
> plot(quakes.fit, quakes.ord,
ylab="Sample quantiles",
xlab="Quantiles of fitted distribution",
main="Title - id", abline(0,1))
The probability plot is produced by
> quakes.pfit <- pexp(quakes.ord,theta)
> plot(quakes.pfit, quant, abline(0, 1))
Section 2.8: Interpreting Quantile Plots
The plots below show ways in which the sample data may differ systematically from the predictions of the fitted model for a sample of size n = 1000. The sample:
(a) comes from the fitted model, we see the points lying fairly well along
the line.
(b) is from a distribution with longer left and right tails than the fitted
model. The sample quantiles at each end are much more spread out
than one would expect if the model was correct, so they are smaller at
the left-hand end and larger at the right-hand end than the
corresponding expected quantiles for the fitted distribution.
(c) is from a distribution with shorter left and right tails than the fitted
model. The sample quantiles at each end are much less spread out
than one would expect if the model was correct.
(d) is from a distribution which corresponds to a random variable which is
a location/scale mapping (X 7→ aX + b) of that specified by the fitted
model. This does not affect the fit to a straight line, but it does affect
the slope and intercept of the line of fit.
[Figure: four quantile plots of Sample quantiles against Expected quantiles, for the cases (a) Observations fit the model distribution, (b) Observations have longer tails, (c) Observations have shorter tails, (d) Observations fit a linear transformation of the model.]
Section 3: Likelihood and Maximum Likelihood Estimation
Aims of this section:
In this section we introduce the concepts of the likelihood
function and the maximum likelihood estimate.
For a distribution in a given parametric family, the likelihood
function acts as a summary of all the information about the
unknown parameter contained in the observations.
Many important and powerful statistical procedures have the
likelihood function as their starting point.
Here we focus on the method of maximum likelihood estimation,
which could be said to provide the most plausible estimate of the
unknown parameter for the given data.
Suggested reading: Rice Section 8.5
Objectives: by the end of this section you should be able to
Write down the general form of the likelihood function and the
log-likelihood function, based on a simple random sample from a
distribution in a general parametric family.
Define the maximum likelihood estimate(s) for one or more
unknown parameters, based on a simple random sample from a
distribution in a general parametric family.
Write down the general form of the likelihood equation(s), based on a simple random sample from a distribution in a general parametric family, and understand how and when the maximum likelihood estimate(s) can be obtained from the solution of the likelihood equation(s).
Find the explicit form of the likelihood equation(s) for a simple one- or two-parameter family of distributions, and solve to find the maximum likelihood estimate(s) of the unknown parameter(s).
Objectives: by the end of this section you should be able to
Find the maximum likelihood estimate directly from the likelihood
function for a simple one-parameter range family.
Compute the maximum likelihood estimate for population
quantities which are simple functions of the unknown
parameter(s). Examples of such functions include the population
mean, the population median and the population variance, or
appropriate population probabilities.
Section 3.1: Motivation
Example 3.1.
Consider a coin for which P(Head) = θ and P(Tail) = (1 − θ), where
θ is an unknown parameter in (0, 1) which we wish to estimate.
One way to gain information about θ is to repeatedly toss the coin
and count the number of tosses until we get a Head.
Assume the outcome of each toss is independent of all the other
tosses, and let X denote the number of the toss on which we first get
a Head.
Then X ∼ Geom(θ) and

p(x; θ) = (1 − θ)^(x−1) θ,   x = 1, 2, 3, . . . ;   θ ∈ (0, 1).     (3.1)
Example 3.1.
Say we perform the experiment once and get a single observation
x = 4 (so the first head was observed on the fourth toss).
Write L(θ) [or more properly L(θ; x)] for the probability of getting this
particular observation x as a function of the unknown parameter θ.
Thus in this case L(θ) is got by putting x = 4 in Equation (3.1),
giving
L(θ) = p (4; θ) = (1 − θ)3 θ
θ ∈ (0, 1).
We call L(θ) the likelihood function for the given observation. A
graph of L(θ) against θ is shown below.
The value of θ that maximises the likelihood function L(θ), i.e. that
maximises the probability of getting this particular observation, is
called the maximum likelihood estimate of θ.
Example 3.1.
[Figure: graph of the likelihood function L(θ) against θ ∈ (0, 1) for the single observation x = 4.]
Example 3.1.
We can see that the maximum here is a turning point, so here the
maximising value satisfies the equation
dL(θ)/dθ = 0,
where
dL(θ)/dθ = (1 − θ)³ − 3θ(1 − θ)² = (1 − θ)²(1 − 4θ),
and you can check that d²L/dθ² < 0 there, so the likelihood function is maximised at θ = 1/4.
Thus this single observation x = 4 has a greater likelihood of occurring when the parameter θ takes the value 0.25 than when θ takes any other possible value in (0, 1).
We call this value 0.25 the maximum likelihood estimate of θ for the single observation x = 4 and we denote it by θ̂mle [or more properly θ̂mle (x)] and write θ̂mle = 0.25.
Example with multiple observations
Example 3.2.
We can extend our analysis to the case of multiple observations.
For example, say we repeat the experiment three times and get three independent observations x1 = 4, x2 = 5, and x3 = 1.
Then the corresponding random variables X1 , X2 , X3 are independent,
each with pmf p(x; θ), so they have joint probability mass function
(see Lemma 2.8)
pX1 ,X2 ,X3 (x1 , x2 , x3 ; θ) = p (x1 ; θ)p (x2 ; θ)p (x3 ; θ)
where the expression for p (x; θ) is given in Equation (3.1) above.
Example 3.2.
In this case the likelihood function denotes the probability of
observing these three numerical values x1 , x2 , x3 as a function of the
unknown parameter θ and is given by
L(θ) ≡ L(θ; x1 , x2 , x3 ) = pX1,X2,X3 (x1 , x2 , x3 ; θ)
= p(x1 ; θ) p(x2 ; θ) p(x3 ; θ)
= (1 − θ)^(x1−1) θ (1 − θ)^(x2−1) θ (1 − θ)^(x3−1) θ
= (1 − θ)³ θ (1 − θ)⁴ θ (1 − θ)⁰ θ
= θ³ (1 − θ)⁷ .
As before, L(θ) is now maximised at the value θ = 0.3.
Thus, as a function of the unknown parameter θ, the likelihood of these three numerical observations x1 = 4, x2 = 5, x3 = 1 occurring is maximised by taking θ = 0.3.
Again, we say that 0.3 is the maximum likelihood estimate of θ and we write θ̂mle (x1 , x2 , x3 ) = 0.3 – or, since the context is clear, θ̂mle = 0.3.
Section 3.2: Definition – Likelihood function
Definition 3.3.
General case: Assume the data x1 , x2 , . . . , xn are the observed values
of random variables X1 , X2 , . . ., Xn , whose joint distribution depends
on one or more unknown parameters θ.
The likelihood function L(θ) ≡ L(θ; x1 , x2 , . . . , xn ) is the joint
probability mass function (discrete case) or joint probability density
function (continuous case) regarded as a function of the unknown
parameter θ for these fixed numerical values of x1 , x2 , . . . , xn .
Definition 3.4.
For observed values {x1 , . . . , xn }, the maximum likelihood estimator
(mle) θ̂mle (x1 , . . . , xn ) is the value of θ which maximises the likelihood
function L(θ; x1 , . . . , xn ).
Usually just write θbmle , for the value that provides the most plausible
overall explanation of the individual observations.
Random sample case
Example 3.5.
Usual case: If X1 , X2 , . . . , Xn , is a random sample of size n from a
distribution with probability mass function p (x; θ) (or probability
density function f (x; θ)) then the Xi are i.i.d. and their joint
distribution factorises into the product of marginals.
Thus for a random sample

L(θ) ≡ L(θ; x1 , . . . , xn ) = p(x1 ; θ) p(x2 ; θ) · · · p(xn ; θ)   (discrete case)
                             = f (x1 ; θ) f (x2 ; θ) · · · f (xn ; θ)   (continuous case).

L(θ) is a function of θ for fixed data x1 , x2 , . . . , xn .
L(θ) gives a combined measure of how well the value θ explains the set of observations, and hence of the ‘plausibility’ of θ.
e.g. if the values x1 , . . . , xn are collectively unlikely as observations from fX (x; θ) then L(θ) is small, and vice-versa.
Log-likelihood function
In practice, instead of maximising the likelihood, we usually maximise
the log-likelihood `(θ) := log L(θ), where log is the natural logarithm
(and we take log 0 = −∞).
Since the logarithm is an increasing function, L(θ) and `(θ) are
maximised at the same value of θ.
However, `(θ) is often easier to deal with in practice.
Example 3.6.
Let {x1 , . . . , xn } be a simple random sample from Exp(θ).
Here (see handout) f (x; θ) = θe^(−θx), for x > 0 and θ > 0, so

L(θ) = f (x1 ; θ) f (x2 ; θ) · · · f (xn ; θ)
= (θe^(−θx1))(θe^(−θx2)) · · · (θe^(−θxn))
= θⁿ exp(−θ(x1 + . . . + xn )).

This means that ℓ(θ) = n log θ − θ Σᵢ₌₁ⁿ xᵢ = Σᵢ₌₁ⁿ (log θ − θxᵢ).
Section 3.3: Finding the maximum likelihood estimate
The additive form of the log-likelihood function in Example 3.6 is not
a coincidence.
Lemma 3.7.
For observations taken from a simple random sample,

ℓ(θ) = Σᵢ₌₁ⁿ log f (xᵢ ; θ).
Proof.
`(θ) = log L(θ)
= log (f (x1 ; θ) . . . f (xn ; θ))
= log f (x1 ; θ) + . . . + log f (xn ; θ).
Log-likelihood function and likelihood equation
In ‘regular’ cases, the maximum of `(θ) will be a stationary point.
That is, the mle θ̂mle is the solution to the likelihood equation

∂ℓ(θ)/∂θ = 0.     (3.2)

Here ∂/∂θ denotes differentiation with respect to θ, keeping other variables fixed.
By ‘regular’ cases, we mean those such that f is a smooth function of θ with a range not depending on θ.
This includes all distributions on the handout except the uniform (see Example 3.12 below).
Differentiating Lemma 3.7 we obtain ∂ℓ/∂θ (θ) = Σᵢ₌₁ⁿ ∂/∂θ log f (xᵢ ; θ).
Hence the likelihood Equation (3.2) becomes

0 = Σᵢ₌₁ⁿ ∂/∂θ log f (xᵢ ; θ).
Procedure for calculating θ̂mle
These observations suggest a recipe for finding the mle, when {x1 , . . . , xn } are observations from a simple random sample, from a continuous regular distribution with density f (x; θ).
1. Calculate the function ∂/∂θ log f (x; θ).
2. Compute and simplify the sum
∂/∂θ log f (x1 ; θ) + . . . + ∂/∂θ log f (xn ; θ),
which we will consider as a function of θ.
3. The mle θ̂mle is the value of θ satisfying the likelihood equation
0 = Σᵢ₌₁ⁿ ∂/∂θ log f (xᵢ ; θ).
Example: exponential data
Example 3.8.
Return to the setting of Example 2.12. That is, assume {x1 , . . . , xn } come from a simple random sample from an Exp(θ), with θ unknown.
Here f (x; θ) = θ exp(−θx), so that log f (x; θ) = log θ − θx.
Treating x as a constant, ∂/∂θ log f (x; θ) = 1/θ − x.
This means that

∂/∂θ log f (x1 ; θ) + . . . + ∂/∂θ log f (xn ; θ) = n/θ − Σⱼ₌₁ⁿ xⱼ .

Hence θ̂mle solves the likelihood equation n/θ̂ − Σⱼ₌₁ⁿ xⱼ = 0.
That is, θ̂mle = n/(Σⱼ₌₁ⁿ xⱼ) = 1/x̄.
In this case, this coincides with the θ̂mom found in Example 2.12 – however that is not true in general.
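As a rough numerical check of the recipe above (a sketch with made-up data values, not the course dataset), we can maximise ℓ(θ) directly in R and compare with the closed form 1/x̄:
> x <- c(0.8, 1.7, 0.3, 2.5, 1.1)                        # hypothetical data
> loglik <- function(theta) sum(log(theta) - theta * x)  # l(theta) for Exp(theta)
> optimize(loglik, interval = c(1e-6, 10), maximum = TRUE)$maximum
> 1/mean(x)                                              # closed-form mle, essentially the same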
Section 3.4: Maximum likelihood estimate of τ (θ)
Lemma 3.9.
If the quantity of interest is a function τ(θ) of θ, the mle of τ(θ) is simply the plug-in estimate τ(θ̂).
This is the invariance property of maximum likelihood estimation.
Proof: Not examinable – though the Lemma itself is examinable.
If we re-parameterize in terms of τ and solve ∂ℓ(τ)/∂τ = 0 for τ̂ we would get the same answer, at least when τ(θ) is one-to-one (injective).
Under the new parameterization the likelihood satisfies ℓ^new(τ(θ)) = ℓ^old(θ).
Note that by the chain rule applied to ℓ(τ(θ)),

∂/∂θ ℓ^old(θ) = ∂/∂θ ℓ^new(τ(θ)) = (∂ℓ^new/∂τ)(τ(θ)) × τ′(θ),

so if τ′(θ) ≠ 0, then ∂/∂θ ℓ^new(τ(θ)) = 0 if and only if ∂/∂τ ℓ^new(τ) = 0.
Example 3.10.
Again let x1 , x2 , . . . , xn be observed values of a simple random sample from the Exp(θ) distribution with θ unknown.
We found in Example 3.8 that θ̂mle = 1/x̄.
Suppose we are interested in the population variance Var(X ; θ) = 1/θ², i.e. in τ(θ) = 1/θ².
Then the mle of the population variance is τ(θ̂) = 1/θ̂² = x̄².
This is not the same as the sample variance!
Suppose we are interested in the proportion of the population taking values ≥ 1, that is, in τ(θ) = P(X ≥ 1; θ) = e^(−θ) for the Exp(θ) case.
Then the mle of this proportion is τ(θ̂) = e^(−θ̂) = exp(−1/x̄).
This is not the same as the sample proportion of values ≥ 1!
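For instance, for the Exp(θ) model of the quakes data (a short sketch, assuming the dataset is loaded as before):
> theta.hat <- 1/mean(quakes)   # mle from Example 3.8
> mean(quakes)^2                # mle of the population variance 1/theta^2
> var(quakes)                   # sample variance - generally different
> exp(-theta.hat)               # mle of P(X >= 1; theta)
> mean(quakes >= 1)             # sample proportion of values >= 1 - generally different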
Section 3.5: Mle with multiple parameters
Mles can be easily extended to the case of multiple unknown parameters.
For example, for two parameters α, β, the α̂mle and β̂mle are the simultaneous solutions to the two likelihood equations

0 = Σᵢ₌₁ⁿ ∂/∂α log f (xᵢ ; α, β)   and   0 = Σᵢ₌₁ⁿ ∂/∂β log f (xᵢ ; α, β).
Example 3.11.
For example, consider a simple random sample from the N(µ, σ 2 ) with
unknown mean and variance. This family is continuous and regular.
Since there are two parameters µ, σ, the µ̂mle and σ̂mle are the simultaneous solutions to the two likelihood equations.
Since f (x; µ, σ) = (2πσ²)^(−1/2) exp(−(x − µ)²/(2σ²)),

log f (x; µ, σ) = −(1/2) log(2π) − log σ − (x − µ)²/(2σ²).
Mle with multiple parameters example
Example 3.11.
This means that

∂/∂µ log f (x; µ, σ) = (x − µ)/σ² ,

giving the first likelihood equation

0 = Σᵢ₌₁ⁿ (xᵢ − µ)/σ² = (Σᵢ₌₁ⁿ xᵢ − nµ)/σ² ,

which we can solve to give µ̂mle = x̄.
Mle with multiple parameters example
Example 3.11.
Similarly

∂/∂σ log f (x; µ, σ) = −1/σ + (x − µ)²/σ³ ,

giving the second likelihood equation

0 = −n/σ + Σᵢ₌₁ⁿ (xᵢ − µ)²/σ³ ,

and substituting in µ = µ̂mle = x̄, we obtain

σ̂mle = √( Σᵢ₌₁ⁿ (xᵢ − x̄)²/n ).
Again, notice that these mles happen to coincide with the mom
estimates of Example 2.14.
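A quick numerical check with the Newcomb data, outliers removed as in Example 2.15 (a sketch, assuming the dataset is loaded):
> x <- newcomb[-c(6,10)]
> mean(x)                       # mu-hat (mle): 27.75
> sqrt(mean((x - mean(x))^2))   # sigma-hat (mle), dividing by n rather than n - 1: 5.04356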
Section 3.6: Non-regular density
Recall that the likelihood equation (3.2) is based on the idea that the likelihood is maximised at a turning point, because the density is regular.
However, if the density is not regular, then the likelihood can be maximised at one end of the interval.
In this case, it is best to work directly with the likelihood function L(θ).
Non-regular density example
Example 3.12.
Consider a simple random sample {x1 , . . . , xn } from the U(0, θ) distribution.
In this case f (x; θ) = 1/θ if 0 ≤ x ≤ θ, and 0 otherwise.
As a result

L(θ; x1 , . . . , xn ) = f (x1 ; θ) . . . f (xn ; θ) = 1/θⁿ if θ ≥ x1 , . . . , θ ≥ xn , and 0 otherwise.

Hence, the likelihood is 1/θⁿ (a decreasing function of θ) if θ ≥ x(n) = max(x1 , . . . , xn ), and zero otherwise.
This means (plot a graph?) that the likelihood is maximised at θ̂mle = x(n) = max(x1 , . . . , xn ).
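To see this, one can plot the likelihood for a small made-up sample (a sketch; the data values are hypothetical):
> x <- c(2.1, 0.7, 1.5, 3.2, 2.8)
> theta <- seq(0.1, 6, by = 0.01)
> L <- ifelse(theta >= max(x), theta^(-length(x)), 0)   # zero until theta reaches max(x), then decreasing
> plot(theta, L, type = "l", main = "U(0, theta) likelihood - id")
> max(x)                                                # the mle x(n)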
Section 3.7: Example with not-identically-distributed
observations
A strength of the maximum likelihood approach is that it still provides
answers when the observations cannot be treated as identically distributed.
The likelihood is just the joint probability density or mass function of
the data, regarded as a function of the parameters.
This makes sense, and can be maximised with respect to the
parameters, whatever the model.
Here we just consider one example, where the observations are still
independent, but with different distributions.
Poisson data with unequal means
Example 3.13.
The Poisson distribution is used to model counts of events that can
be assumed to occur completely at random.
For example, consider counts of photons arriving at a detector in
different intervals of time.
Suppose that the rate of arrival is λ per unit time, and that Xi is the
number of arrivals in a time interval of known length ti (the ti not
necessarily equal).
It is natural to assume that Xi ∼ Poisson(λti ), for i = 1, 2, . . . , n.
If the different intervals are not overlapping, then the counts Xi
should be independent.
Poisson data with unequal means
Example 3.13.
The joint probability mass function of X1, X2, . . . , Xn is
P{X1 = x1, X2 = x2, . . . , Xn = xn}
= P{X1 = x1} × P{X2 = x2} × · · · × P{Xn = xn}
= (e^(−λt1)(λt1)^x1/x1!) × (e^(−λt2)(λt2)^x2/x2!) × · · · × (e^(−λtn)(λtn)^xn/xn!)
= e^(−λ(t1+t2+···+tn)) λ^(x1+x2+···+xn) (t1^x1 t2^x2 · · · tn^xn)/(x1! x2! · · · xn!).
So the log-likelihood is just the logarithm of this:
`(λ) = −λ(t1 + t2 + · · · + tn ) + (log λ)(x1 + x2 + · · · + xn )
+ (terms not containing λ).
Poisson data with unequal means
Example 3.13.
Then
∂ℓ(λ)/∂λ = −(t1 + t2 + · · · + tn) + (x1 + x2 + · · · + xn)/λ.
So ∂ℓ(λ)/∂λ = 0 if and only if
λ = λ̂ = (x1 + x2 + · · · + xn)/(t1 + t2 + · · · + tn),
so this is the mle.
The turning point we have found is obviously a maximum, since we
can see that ∂ℓ(λ)/∂λ is decreasing in λ.
Note that the mle λ̂mle is the total count of photons divided by the
total time of observation.
Note that, when the ti are unequal, this is not the same as the
average of the individual estimates (xi/ti).
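A quick numerical check of Example 3.13 (the counts and interval lengths below are hypothetical) shows that the pooled mle differs from the unweighted average of the individual estimates:

x <- c(3, 7, 2, 11)     # hypothetical counts
ti <- c(1, 2, 1, 3)     # corresponding interval lengths
sum(x)/sum(ti)          # mle: total count / total time
mean(x/ti)              # unweighted average of x_i/t_i -- not the same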
Section 4: Assessing the Performance of Estimators
Aims of this section:
In the previous sections we have seen several different ways of
estimating a population parameter, or population quantity of
interest, from a given set of sample data.
However, the sample data is just one of many possible samples
that could be drawn from the population.
Each sample would have different values, and so would give a
different value for the estimate.
In this section we use simulation based methods to investigate
how the value of an estimate would vary as we took different
independent random samples and hence evaluate and compare
the performance of different estimators.
Aims of this section:
Many estimators are based on the sum of the observations in a
random sample from an underlying population distribution.
The exact distribution of these quantities may be difficult to
compute, and varies with the underlying population distribution.
The Central Limit Theorem gives a simple way of approximating
the distribution of the sum or mean, that depends only on the
population mean and the population variance.
It also provides a plausible explanation for the fact that the
distribution of many random variables studied in physical
experiments are approximately Normal, in that their value may
represent the overall addition of a number of random factors.
Suggested reading: Ross Sections 10.1–10.3; Rice Sections 8.4, 8.5, 8.8; Rice Section 5.3.
Objectives: by the end of this section you should be able to
Generate random samples from a given standard distribution
using the random number generator for that distribution in a
statistical package such as R .
Understand how the performance of an estimator can be related
to systematic and random error through the bias and variance of
the estimator.
Evaluate the performance of an estimator for a single quantity of
interest, both qualitatively from a boxplot or histogram of
estimates from repeated samples and quantitatively or numerically
from summary statistics derived from the repeated samples.
Recall the statement and the implications of the Central Limit
Theorem.
Objectives: by the end of this section you should be able to
Apply the Central Limit Theorem to find the approximate
distribution of the sum or mean of a random sample from a
population distribution with known mean and variance.
Apply a continuity correction to improve the approximation given
by the Central Limit Theorem when the underlying variable is an
integer valued random variable representing counts.
Section 4.1: Different methods of Estimation
We have seen two general parametric, model-based methods for
estimating a population quantity τ = τ(θ) (the method of moments
and the method of maximum likelihood).
These estimate θ (by θbmom and θbmle respectively), and plug the
estimates θb into τ (θ) to give an estimate of τ .
There may also be a direct non-parametric alternative, in which we
simply use the relevant sample quantity to estimate the population
value.
Example
Example 4.1.
Consider estimating the population median for a population which has
a Uniform(0, θ) distribution where θ is unknown, using a random
sample with values x1 , . . . , xn .
The parametric methods use the fact that the population median for
this distribution is θ/2.
- The method of moments estimates θ by θ̂mom = 2x̄ and so estimates
  the population median by θ̂mom/2 = x̄.
- The method of maximum likelihood estimates θ by θ̂mle = x(n), where
  x(n) = max{x1, . . . , xn} is the largest value in the sample, and thus
  estimates the population median by θ̂mle/2 = x(n)/2.
- The non-parametric method estimates the population median by the
  sample median (for n odd, this is x((n+1)/2)).
Comparing estimates
For a given set of data, the three methods will result in three
different estimates.
The questions are:
- which estimate (or method of estimation) is best?
- how can we compare the methods when we don't actually know the
  value of the quantity we want to estimate?
If we do not know the true value we are trying to estimate, we cannot
usefully compare methods of estimation using only the resulting
numerical estimates from a single sample.
The main way we compare methods of inference is to see how they
perform under repeated sampling.
We imagine future hypothetical samples of the same size from the
same distribution, and examine how well each method performs in the
long run.
Section 4.2: Repeated sampling, and sampling distributions
In probability language, we treat the sample as a collection of random
variables X1 , X2 , . . . , Xn , regard the estimators as functions of these
random variables, and look at the distributions of these estimators.
We make a Key Definition, that motivates much of the rest of the
course:
Definition 4.2.
We refer to the distribution of an estimator θ̂ = θ̂(X1, . . . , Xn) as a
sampling distribution, as opposed to the original population distribution.
Remark 4.3.
A good estimator is one whose sampling distribution is concentrated close
to the true value of the quantity it is trying to estimate. A poor estimate
is one where the sampling distribution is either very spread out, or is
concentrated around the wrong value.
Good estimators
Definition 4.4.
Let θb be an estimator of an unknown parameter θ. We define two key
properties of its sampling distribution:
bias(θ̂) = E(θ̂ − θ) = E(θ̂) − θ. Say θ̂ is unbiased if bias(θ̂) = 0.
mse(θ̂) = E[(θ̂ − θ)²].
Can check that mse(θ̂) = Var(θ̂) + bias(θ̂)².
Example 4.5.
In some rare cases, we can use methods from the Probability course
to calculate the exact sampling distribution theoretically.
For example, if X1, X2, . . . , Xn ∼ N(µ, σ²), we know that µ̂mom = X̄,
so that µ̂mom = X̄ ∼ N(µ, σ²/n) (see Theorem 5.9 below).
Hence bias(µ̂mom) = E(µ̂mom) − µ = µ − µ = 0, and
mse(µ̂mom) = Var(µ̂mom) + (bias(µ̂mom))² = σ²/n + 0² = σ²/n.
Section 4.3: Sampling distributions by simulation
A more general, but empirical, approach is to use simulation.
In statistics, simulation is the process of artificially generating a
data set as independent observations from a given probability
distribution.
Simulation-based procedures for evaluating a method of estimation
replace the above idea of hypothetical future samples and probability
calculations, with actual simulated numerical samples and numerical
calculations.
Thus, for a particular type of population distribution f (x; θ), we take
particular values for the parameter θ and the sample size n.
Then we generate a number (say B) of artificial data sets, each of
which looks like a simple random sample of size n from f(x; θ).
Calculate an estimate for each data set, giving a total of B different
estimates.
Motivation
The idea is that this process represents, say, the experience of B
independent statisticians each using the method.
If B is large, the estimates generated by these independent
experiments should give a good indication of the sampling
distribution, and hence the overall performance of the method.
Moreover, we can understand the strengths and weaknesses of
different methods of estimation by comparing their overall
performances on the same B data sets.
In R we use the following procedure:
- Generate B × n numbers from f and arrange them in B groups of n.
- Calculate the estimates for each sample.
- Analyse the results numerically or graphically.
Section 4.4: Graphical summaries of performance
One way of exploring the performance of different methods is just to
plot a histogram of the B estimates obtained in the simulation study.
Example 4.6.
Consider again the problem of estimating the population median for
the Uniform(0, θ) distribution.
The histograms below were constructed by simulating B = 1000 samples,
each of size n = 10, from a Uniform distribution with θ = 1 (so the
true population median was θ/2 = 0.5).
For each sample we compute the sample median, the method of
moments estimate (x̄) and the maximum likelihood estimate
(max{x1 , . . . , xn }/2).
We plot histograms of the 1000 estimates produced by each method.
Histograms of results
Example 4.6.
[Figure: three histograms (density scale, values in (0, 1)) of the 1000 estimates, one panel for each method – Sample median, Method of moments, Maximum likelihood.]
For this example, the differences in the shape of the histograms are
particularly striking.
Graphical summary of performance – boxplots
Example 4.6.
Boxplots are also convenient to compare visually the sampling
distributions of different estimators.
They use the median of each as a measure of the centre of the
distribution and the upper and lower quartiles as a measure of spread.
[Figure: boxplots of the estimators of the population median – sample median, MOM and MLE – for Unif(0, 1) with n = 10, with estimates ranging roughly from 0.2 to 0.8.]
Example 4.6.
Clearly the estimates produced by the sample median and the method
of moments are both centred on the true population median value of
0.5 but fairly widely spread about this value (sample median more
than mom).
The maximum likelihood estimates are centred on a value just below
0.5 and so the method slightly, but consistently, underestimates the
true value.
However the narrow spread of mle values means that for most
samples the mle is closest to the true value.
Section 4.5: Numerical summaries of performance
Can sometimes derive explicit analytic expressions for the bias and
mean square error of an estimator (see Example 4.5).
Usually we can only estimate these numerically (via simulation).
Say we want to estimate a function τ(θ) whose true value is τ and
our simulation study produces B estimates τ̂1, . . . , τ̂B with sample
mean τ̄ = Σ_{i=1}^B τ̂i/B and sample variance Σ_{i=1}^B (τ̂i − τ̄)²/(B − 1).
The average error (indicator of bias) is τ̄ − τ.
To estimate mse, consider
average squared error = Σ_{i=1}^B (τ̂i − τ)²/B.
You can check that Σ_{i=1}^B (τ̂i − τ)² = Σ_{i=1}^B (τ̂i − τ̄)² + B(τ̄ − τ)², so
that for B large the
average squared error ≃ sample variance + (average error)²
(cf. mse(θ̂) = Var(θ̂) + bias(θ̂)²).
Example 4.7.
For our Uniform distribution example, we estimate the population
median τ , (true value is τ = 0.5) with B = 1000.
The table confirms numerically the impression from the graphical
summary – the size of the average error is larger for the mle
than for the other two methods.
Mle has a much smaller sample variance than using the method of
moments, which in turn has a smaller sample variance than the
non-parametric method using the sample median.
Overall mle has the smallest average squared error, then the method
of moments, then the estimate based on the sample median.
Methods of estimating the population median:
                                        sample median     mom        mle
average error (estimates bias)             +0.00497    +0.00376   −0.04403
sample variance (estimates variance)        0.01861     0.00798    0.00168
average squared error (estimates mse)       0.01863     0.00800    0.00362
Section 4.6: Simulation using R
We illustrate this in the context of Example 4.6, estimating θ by
θ̂mom = 2x̄, taking B = 1000 and n = 10.
Example 4.8.
xvalues <- runif(10000)
xsamples <- matrix(xvalues, nrow=1000)
sample.mean <- apply(xsamples, 1, mean)
theta.mom <- 2*sample.mean
par(mfrow=c(1,2))
hist(theta.mom, main = "Histogram of theta.mom")
boxplot(theta.mom, main= "Boxplot of theta.mom")
true.theta <- 1
mean(theta.mom - true.theta)
var(theta.mom)
mean( (theta.mom - true.theta)^2)
Explaining R commands
Example 4.8.
xvalues <- runif(10000) generates 10000 U(0, 1) values,
and assigns them to a vector called xvalues. (By default runif
simulates from U(0, 1), but runif(10000, min=-1, max=2) would
simulate from U(−1, 2)).
xsamples <- matrix(xvalues, nrow=1000) arranges this into a
matrix with 1000 rows, each of length 10, each row
corresponding to a random sample.
sample.mean <- apply(xsamples, 1, mean). The apply
command applies the command mean to the set of values that share
subscript 1 (i.e. to the each row in turn). This generates a vector
sample.mean made up of the means of each sample. (Try
apply(xsamples, 1, var) or apply(xsamples, 1, max) to
generate the sample variance and maximum).
par(mfrow=c(1,2)) puts the plots next to each other.
Graphical output
Example 4.8.
The remaining code generates boxplots and histograms, and
calculates the average error and average squared error.
[Figure: histogram of theta.mom (values roughly 0.4 to 1.6) on the left and a boxplot of theta.mom on the right.]
Graphical output – comparison via boxplots
Example 4.8.
As in Example 4.6, we can compare the output of several estimators
with boxplots generated with the following code.
xvalues
<- runif(10000)
xsamples
<- matrix(xvalues, nrow=1000)
sample.mean <- apply(xsamples, 1, mean)
sample.median <- apply(xsamples, 1, median)
sample.max <- apply(xsamples, 1, max)
tau.nonparam <- sample.median
tau.mom <- sample.mean
tau.mle <- sample.max/2
true.tau <- 0.5
boxplot(tau.nonparam, tau.mom, tau.mle,
        names = c("sample median","mom","mle"))
abline(h=true.tau, lty=2)
Graphical output – comparison via histograms
Example 4.8.
Here abline(h=true.tau, lty=2) plots a horizontal line at the true value of τ for
comparison purposes, with lty=2 creating a dashed line.
Similarly, we can generate multiple histograms using the following
code, which fixes the range of x and y to make clear comparison
easier.
par(mfrow = c(1,3))
hist(tau.nonparam, xlim=c(0,1), ylim=c(0,350))
hist(tau.mom, xlim=c(0,1), ylim=c(0,350))
hist(tau.mle, xlim=c(0,1), ylim=c(0,350))
Section 4.7: Approximate methods based on the Central
Limit Theorem
One disadvantage of simulation-based methods is that each
simulation only provides information about one particular situation
and gives no direct information about what would happen for:
- other sample sizes n
- other values of the true parameter
- other types of population distribution f(x; θ)
- other methods of estimation
Also, the numerical accuracy of estimates of quantities like the bias is
limited by the finite size of B, the number of samples.
Therefore, as before we use probability theory to find sampling
distributions whenever this is possible.
For example when estimating µ using a simple random sample from
N(µ, σ²) (see Example 4.5), we can calculate these key quantities
analytically: the bias is 0, and the variance and mse are both σ²/n.
In general, we cannot do this.
Approximations using the CLT
However, many estimators are based on the sum or mean of a random
sample.
The Central Limit Theorem lets us approximate the distribution of
such estimators, whatever the distribution of the sample.
The limiting distribution depends only on the mean and variance (not
the actual distribution).
The speed of convergence depends on X.
For example, if X is symmetric with not too heavy tails, convergence
is faster.
The CLT is one of the most fundamental results in statistics – it can
explain why many ‘real world’ data samples seem to be close to
normal.
Central Limit Theorem
Theorem 4.9 (Central Limit Theorem).
Let X1, . . . , Xn be a random sample from a population with mean
µ = E(X) and variance σ² = Var(X). Write X̄n = (X1 + . . . + Xn)/n for
the sample mean. Then for n large, whatever the distribution of X,
the normalized sample mean (X̄n − µ)/(σ/√n) ≃ N(0, 1).
That is, for Z ∼ N(0, 1) with distribution function Φ:
P( (X̄n − µ)/(σ/√n) ≤ x ) ≃ P(N(0, 1) ≤ x) = Φ(x).
Equivalently (a) X̄n = (X1 + . . . + Xn)/n ≃ N(µ, σ²/n)
and (b) (X1 + . . . + Xn) ≃ N(nµ, nσ²).
Proof.
See Section 5.
Example
Example 4.10.
Let X1, . . . , X10 be a random sample of size n = 10 from the Exp(2)
distribution, which has mean µ = 1/2 and variance σ² = 1/4.
The CLT tells us that
(X̄n − µ)/(σ/√n) = ((X1 + . . . + X10)/10 − 1/2)/(1/(2√10)) ≃ N(0, 1).
Hence if we want to approximate P(X1 + . . . + X10 ≤ 5.2) then
P(X1 + . . . + X10 ≤ 5.2) = P(X̄10 ≤ 5.2/10)
= P( (X̄10 − 1/2)/(1/(2√10)) ≤ (0.52 − 1/2)/(1/(2√10)) )
≃ P(N(0, 1) ≤ 0.1265) ≈ 0.55.
This is found using pnorm(0.1265).
Example
Example 4.10.
In this case, X1 + . . . + X10 ∼ Γ(10, 2) (exactly), and
pgamma(5.2,10,2) gives
, showing the approximation is OK,
but not amazing.
For n = 1000, the P(X1 + . . . + Xn ≤ 502) is approximated by
again, and the gamma value is pgamma(502,1000,2) =
(much better).
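The two approximations in Example 4.10 can be compared against the exact gamma probabilities directly in R (the sum of n Exp(2) variables is Γ(n, 2)):

pnorm((0.52 - 0.5)/(0.5/sqrt(10)))      # CLT approximation, n = 10
pgamma(5.2, 10, 2)                      # exact value, n = 10
pnorm((0.502 - 0.5)/(0.5/sqrt(1000)))   # CLT approximation, n = 1000
pgamma(502, 1000, 2)                    # exact value, n = 1000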
Section 4.8: Continuity correction
Consider X taking integer values, and consider T = X1 + . . . + Xn ,
where Xi are IID with the same distribution as X .
Theorem 4.9 says that P(T ≤ x) ≃ P(S ≤ x), where S ∼ N(nµ, nσ²).
However, since T can only take integer values, better to approximate
P(T = x) ' P(x − 1/2 ≤ S ≤ x + 1/2),
P(T ≤ x) ' P(S ≤ x + 1/2)
where the second result follows on summing the first.
This is referred to as a continuity correction.
Example 4.11.
Let X1 , . . . , X10 be IID Bernoulli(p), with p = 1/4. Then (see
appendix) µ = p = 1/4, and σ 2 = p(1 − p) = 3/16.
Consider T = X1 + . . . + X10 .
Theorem 4.9 suggests T ' S ∼ N(nµ, nσ 2 ) = N(10/4, 30/16).
Section 4.8: Continuity correction
Example 4.11.
Consider approximating P(T ≤ 2):
1. T ∼ Bin(10, 1/4) exactly, so using pbinom(2,10,0.25):
   P(T ≤ 2) = P(T = 0) + P(T = 1) + P(T = 2) ≈ 0.526.
2. Without a continuity correction the approximation is inaccurate; using pnorm(-0.3651):
   P(T ≤ 2) ≃ P(S ≤ 2) = P( (S − 10/4)/√(30/16) ≤ (2 − 10/4)/√(30/16) )
   = P(N(0, 1) ≤ −0.3651) ≈ 0.36.
3. With a continuity correction we get a better result; using pnorm(0):
   P(T ≤ 2) ≃ P(S ≤ 2.5) = P( (S − 10/4)/√(30/16) ≤ (2.5 − 10/4)/√(30/16) )
   = P(N(0, 1) ≤ 0) = 0.5.
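The three probabilities in Example 4.11 can be reproduced in R as follows:

pbinom(2, 10, 0.25)             # exact binomial probability
pnorm((2 - 2.5)/sqrt(30/16))    # normal approximation, no continuity correction
pnorm((2.5 - 2.5)/sqrt(30/16))  # with continuity correction (gives 0.5)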
Section 5: Sampling distributions related to the Normal
distribution
Aims of this section:
Knowing the exact distribution of an estimator helps us to
understand how its behaviour depends, for example, on the
sample size or the unknown population parameter values.
It also enables us to incorporate our theoretical results into other
aspects of our statistical analysis.
In this section we derive the exact distribution of some sample
statistics associated with random samples from a range of
distributions, particularly focussing on the mean and variance of a
random sample from the Normal distribution.
Suggested reading: Rice Section 2.3; Rice Sections 6.1–6.3.
Objectives: by the end of this section you should be able to
Recall the distribution of the sample mean for a simple random
sample from a Normal distribution and use statistical tables or an
appropriate statistical package to compute relevant probabilities
associated with its distribution.
Recall the distribution of the sample variance for a simple random
sample from a Normal distribution; understand that it is
independent of the sample mean; and use statistical tables or an
appropriate statistical package to compute relevant probabilities
associated with its distribution.
Identify the distribution of a sum or linear combination of
independent Normally distributed random variables, and use
statistical tables or an appropriate statistical package to compute
relevant probabilities associated with its distribution.
Objectives: by the end of this section you should be able to
Identify the distribution of a sum of squares of independent
random variables, each with the standard Normal distribution,
and use statistical tables or an appropriate statistical package to
compute relevant probabilities associated with its distribution.
Identify the distribution of a sum of independent random
variables, each with the same Exponential distribution, and use
statistical tables or an appropriate statistical package to compute
relevant probabilities associated with its distribution.
Define the chi-square and t distributions, and look up percentile
points of each distribution in tables or with R .
Apply the results above to find the mean and variance of an
estimator of a parameter or other population quantity of interest,
and hence find its bias and mean square error.
Section 5.1: Revision of moment generating functions
Definition 5.1.
For random variable X define the moment generating function (mgf) MX
by
MX(t) ≡ E(e^(tX)) = ∫ e^(tx) f(x) dx   for continuous X,
MX(t) ≡ E(e^(tX)) = Σ_x e^(tx) P(X = x)   for discrete X.
MX is defined for whatever values of t the integral (or sum) is well defined.
MX uniquely determines the distribution: two random variables with
the same mgf (assuming it is finite in an interval around the origin)
have the same distribution.
Example 5.2.
X ∼ N(µ, σ²)         ⟺  MX(t) = exp{µt + σ²t²/2},    t ∈ R
X ∼ Exp(θ)           ⟺  MX(t) = θ/(θ − t),            t < θ
X ∼ Gamma(α, β)      ⟺  MX(t) = β^α/(β − t)^α,        t < β
Lemma 5.3.
If Y = aX + b then MY (t) = E(e tY ) = E(e taX +tb ) = e tb MX (ta)
Definition 5.4.
The joint mgf of X and Y is MX ,Y (s, t) ≡ E(e sX +tY ).
Lemma 5.5.
1. The marginal moment generating functions for X and Y are given in
   terms of the joint moment generating function by
   MX(s) = E(e^(sX)) = MX,Y(s, 0)  and  MY(t) = E(e^(tY)) = MX,Y(0, t).
2. Two random variables X and Y are independent if and only if
   MX,Y(s, t) = MX(s)MY(t) = MX,Y(s, 0)MX,Y(0, t).
Independence
Lemma 5.6.
1. If X1, . . . , Xn are independent and Y = X1 + X2 + · · · + Xn, then
   MY(t) = MX1(t)MX2(t) . . . MXn(t).
2. If X1, . . . , Xn are a random sample, i.e. they are all independent and
   all ∼ X, then this simplifies to MY(t) = [MX(t)]^n.
Proof.
Observe
MY(t) = E(e^(tY)) = E(e^(tX1) e^(tX2) . . . e^(tXn)) = E(e^(tX1))E(e^(tX2)) . . . E(e^(tXn)),
using independence for the last equality.
Section 5.2: Transforming, adding and sampling normals
Lemma 5.7.
Let X ∼ N(µ, σ 2 ) then aX + b ∼ N(aµ + b, a2 σ 2 ).
Let X ∼ N(µ, σ 2 ) then (X − µ)/σ ∼ N(0, 1).
Proof.
Already know mean and variances; just checking distribution.
Example 5.2 shows that MX (t) = exp(µt + σ 2 t 2 /2).
Lemma 5.3 shows that
MaX +b (t) = MX (at) exp(bt) = exp(µat + σ 2 a2 t 2 /2) exp(bt)
= exp (µa + b)t + a2 σ 2 t 2 /2
We recognise this as the mgf of a N(aµ + b, a2 σ 2 ), and the result
follows by uniqueness of mgfs.
The second part follows on taking a = 1/σ and b = −µ/σ.
Addition of normal distributions
Lemma 5.8.
If X1, . . . , Xn are independent with Xi ∼ N(µi, σi²) then for any weights ai,
the sum Σ_i ai Xi ∼ N( Σ_i ai µi , Σ_i ai² σi² ).
Proof.
Combining Lemmas 5.6 and 5.3 we find that
M_{Σ ai Xi}(t) = Π_{i=1}^n M_{ai Xi}(t) = Π_{i=1}^n MXi(ai t)
= Π_{i=1}^n exp(µi ai t + σi² ai² t²/2)
= exp[ (Σ_{i=1}^n ai µi) t + (Σ_{i=1}^n ai² σi²) t²/2 ].
Again the result follows by uniqueness of mgfs.
Sample mean for normal distribution
Theorem 5.9.
Let X1, . . . , Xn be a random sample of size n from N(µ, σ²) and let
X̄ = (X1 + . . . + Xn)/n be the sample mean.
(i) X̄ ∼ N(µ, σ²/n).
(ii) √n (X̄ − µ)/σ ∼ N(0, 1).
Compare this with the Central Limit Theorem 4.9, which states that
this result is approximately true for large n.
If σ² is known, this result tells us how close we expect µ and X̄ to be.
Hence, we can make inference about unknown µ based on known X .
Much of the remainder of the chapter extends this to the more
realistic case of σ 2 unknown, to prove Theorem 5.18.
Proof of Theorem 5.9(i)
Proof.
Taking ai ≡ 1/n in Lemma 5.8, since µi ≡ µ and σi² ≡ σ², we deduce
X̄ = Σ_{i=1}^n (1/n) Xi ∼ N( n (1/n) µ , n (1/n²) σ² ) = N(µ, σ²/n).
The second result follows on noticing that
√n (X̄ − µ)/σ = aX̄ + b,
where a = √n/σ and b = −√n µ/σ.
Hence applying Lemma 5.7 we deduce that
√n (X̄ − µ)/σ ∼ N(aµ + b, a²(σ²/n)) = N(0, 1).
Sketch Proof of Central Limit Theorem
Proof.
We now prove Theorem 4.9, which states that if the Xi are IID and
distributed like X with finite mean and variance, then
Sn = (X̄n − µ)/(σ/√n) converges in distribution to Z ∼ N(0, 1).
We use a result (not proved here), that states that if
MSn (t) → MZ (t) for all t, then P(Sn ≤ x) → P(Z ≤ x).
This is the sense of the Central Limit Theorem that we claimed in
Theorem 4.9.
First recall from Lemma 5.6 that Tn = X1 + . . . + Xn has
MTn (t) = [MX (t)]n .
Sketch Proof of Central Limit Theorem
Proof.
Further, note that
Sn = (√n/σ) (X1 + . . . + Xn)/n − µ√n/σ = Tn/(σ√n) − µ√n/σ.
As before, this is aTn + b, where a = 1/(σ√n) and b = −µ√n/σ.
Lemma 5.3 gives MSn(t) = exp(tb)MTn(at) = [exp(tb/n)MX(at)]^n.
We expand MX(u) in terms of Mi, the ith moment of X. Since
M1 = µ and M2 = σ² + µ²,
MX(u) = Σ_{i=0}^∞ Mi u^i/i! = 1 + µu + (1/2)(σ² + µ²)u² + . . . ,
so that MX(at) = 1 + µt/(σ√n) + ((σ² + µ²)/2) t²/(σ²n) + . . .
Sketch Proof of Central Limit Theorem
Proof.
Hence we can simplify
exp(tb/n) MX(at)
= (1 − tµ/(σ√n) + t²µ²/(2σ²n) + . . .)(1 + µt/(σ√n) + (σ² + µ²)t²/(2σ²n) + . . .)
= 1 + t²/(2n) + O(1/n²).
Hence, using the fact that lim_{n→∞} (1 + a/n)^n = exp a:
MSn(t) = (1 + t²/(2n) + . . .)^n → exp(t²/2) = MZ(t).
We recognise this as the mgf of a standard normal, and we are done.
Section 5.3: Independence of X̄ and Σ_{j=1}^n (Xj − X̄)²
We now state a result which plays a fundamental role in underpinning
the theory of many of the remaining parts of the course.
The statement is very simple, the proof is fairly technical.
Result at first sight looks extremely surprising!
Theorem 5.10.
If X1, . . . , Xn is a random sample of size n from the N(µ, σ²) distribution
then X̄ and Σ_{j=1}^n (Xj − X̄)² are independent.
Sketch Proof of Theorem 5.10
The proof is made up of the following steps:
(i) the definition gives the joint moment generating function of
X , X1 − X , X2 − X , . . . , Xn − X as the expected value of a
function of the variables;
(ii) simple manipulation reduces this complicated expression to a simple
product of terms of the form exp{aj Xj};
(iii) since the Xj are independent the expectation of this product is just
the product of the expectations, each of which is MXj(aj), giving the
joint mgf;
(iv) finally we observe that the joint mgf is the product of the marginal mgf
for X̄ and the (marginal) joint mgf for X1 − X̄, X2 − X̄, . . . , Xn − X̄.
By analogy with Lemma 5.5.2, X̄ is independent of
X1 − X̄, X2 − X̄, . . . , Xn − X̄ and hence of Σ_{j=1}^n (Xj − X̄)².
Full Proof of Theorem 5.10: Part (i)
Proof.
Let X̄ denote (X1 + · · · + Xn)/n.
Let s̄ denote (s1 + s2 + · · · + sn)/n (so Σ_{j=1}^n (sj − s̄) = 0).
Let M(t, s1 , . . . , sn ) denote the joint moment generating function of
the n + 1 random variables X , X1 − X , X2 − X , . . . , Xn − X .
Then by definition
M(t, s1 , . . . , sn )
= E(exp{tX + s1 (X1 − X ) + s2 (X2 − X ) + · · · + sn (Xn − X )}).
Full Proof of Theorem 5.10: Part (ii)
Proof.
Now rearranging the terms in the curly brackets gives
tX̄ + s1(X1 − X̄) + s2(X2 − X̄) + · · · + sn(Xn − X̄)
= (t − s1 − s2 − · · · − sn)X̄ + s1X1 + s2X2 + · · · + snXn
= a1X1 + · · · + anXn,
where aj = (t − Σ_i si)/n + sj = t/n + (sj − s̄).
Hence
M(t, s1, . . . , sn)
= E(exp{tX̄ + s1(X1 − X̄) + s2(X2 − X̄) + · · · + sn(Xn − X̄)})
= E(exp{a1X1 + · · · + anXn})
= E(exp{a1X1} exp{a2X2} · · · exp{anXn}).
Full Proof of Theorem 5.10: Part (iii)
Proof.
Since the Xj are independent, using Lemma 5.3
M(t, s1, . . . , sn) = E(exp{a1X1})E(exp{a2X2}) · · · E(exp{anXn})
= MX1(a1)MX2(a2) · · · MXn(an)
= exp{µa1 + σ²a1²/2} · · · exp{µan + σ²an²/2}
= exp{ µ Σ_j aj + (σ²/2) Σ_j aj² }
= exp{ µt + σ²t²/2n + (σ²/2) Σ_{j=1}^n (sj − s̄)² }.
The last equality follows from the facts that Σ_{j=1}^n aj = t and
Σ_{j=1}^n aj² = t²/n + Σ_{j=1}^n (sj − s̄)².
Full Proof of Theorem 5.10: Part (iv)
Proof.
Hence
M(t, 0, . . . , 0) = exp{µt + σ²t²/2n},
M(0, s1, . . . , sn) = exp{σ² Σ_{j=1}^n (sj − s̄)²/2},
giving
M(t, s1, . . . , sn) = M(t, 0, . . . , 0) M(0, s1, . . . , sn).
Thus X̄ is independent of the random variables
(X1 − X̄, X2 − X̄, . . . , Xn − X̄) and in particular X̄ is independent of
Σ_{j=1}^n (Xj − X̄)².
Section 5.4: The χ2 distribution
Definition 5.11.
We say that a random variable W has the chi-square (χ²) distribution with
r degrees of freedom, and write W ∼ χ²_r, if W has mgf
MW(t) = (1 − 2t)^(−r/2)   for t < 1/2.
Remark 5.12.
(i) Comparison with Example 5.2 shows that χ²_r ≡ Γ(r/2, 1/2), since
they have the same mgf.
(ii) If W ∼ χ2r then (see handout), EW = (r /2)/(1/2) = r and
Var (W ) = (r /2)/(1/2)2 = 2r .
χ2 are squared normals
Lemma 5.13.
If Z ∼ N(0, 1) then Y = Z² ∼ χ²_1.
Proof.
MY(t) = E exp(tY) = ∫ exp(tz²) φ(z) dz
= ∫ exp(tz²) (1/√(2π)) exp(−z²/2) dz
= (1 − 2t)^(−1/2) ∫ (1/√(2π/(1 − 2t))) exp(−z²(1 − 2t)/2) dz
= (1 − 2t)^(−1/2) [1].
The last equation holds since we recognise the integrand as the
density of a N(0, 1/(1 − 2t)), which integrates to 1.
Since Y has the same mgf as a χ21 , the result holds by uniqueness.
Sums of χ2 are χ2
Lemma 5.14.
1. If U ∼ χ²_r and V ∼ χ²_s are independent then U + V ∼ χ²_{r+s}.
2. If Z1, . . . , Zn are independent with Zi ∼ N(0, 1) then Σ_{i=1}^n Zi² ∼ χ²_n.
Proof.
By definition MU (t) = (1 − 2t)−r /2 and MV (t) = (1 − 2t)−s/2 .
Hence by Lemma 5.6,
MU+V (t) = MU (t)MV (t)
= (1 − 2t)−r /2 (1 − 2t)−s/2 = (1 − 2t)−(r +s)/2 .
This is the mgf of χ2r +s , and so the result follows by uniqueness.
The final part follows by Lemma 5.13.
Section 5.5: Normal sampling distributions
Theorem 5.15.
Let X1 , . . . , Xn be a random sample of size n from the N(µ, σ 2 )
distribution. Then:
(i) Σ_{j=1}^n (Xj − µ)²/σ² ∼ χ²_n.
(ii) Σ_{j=1}^n (Xj − X̄)²/σ² ∼ χ²_{n−1}.
Proof: Part (i).
Writing Yj = (Xj − µ)/σ, Lemma 5.7 gives Yj ∼ N(0, 1).
Further, the Yj are independent, so Σ_{j=1}^n Yj² ∼ χ²_n by Lemma 5.14.
But Σ_{j=1}^n (Xj − µ)²/σ² = Σ_{j=1}^n Yj², so we are done.
Proof of Theorem 5.15(ii)
Proof: Part (ii).
Set W3 ≡ Σ_{j=1}^n Yj², W2 ≡ Σ_{j=1}^n (Yj − Ȳ)² and W1 ≡ nȲ².
Note that using Σ_{j=1}^n Yj = nȲ we can write
W3 = Σ_{j=1}^n Yj² = Σ_{j=1}^n (Yj − Ȳ)² + nȲ² = W2 + W1.
(To see this, try expanding Σ_j (Yj − Ȳ)².)
Further, W1 and W2 are independent, from Theorem 5.10.
Thus MW3(t) = MW1+W2(t) = MW1(t)MW2(t), or equivalently,
MW2(t) = MW3(t)/MW1(t).
But W1 ∼ χ²_1 from Lemma 5.13, as √n Ȳ ∼ N(0, 1) from Theorem
5.9, since here µ = 0 and σ² = 1.
Similarly, W3 ∼ χ²_n, using Theorem 5.15(i). Hence
MW2(t) = (1 − 2t)^(−n/2)/(1 − 2t)^(−1/2) = (1 − 2t)^(−(n−1)/2).
This is the mgf for χ²_{n−1}, hence W2 = Σ_{j=1}^n (Xj − X̄)²/σ² ∼ χ²_{n−1}.
Section 5.6: The t distribution
Definition 5.16.
Let U and V be independent random variables with U ∼ N(0, 1) and
V ∼ χ²_r, and let
W = U/√(V/r).
We say that W has the t distribution with r degrees of freedom, and write
W ∼ tr .
Remark 5.17.
It is vital that U and V be independent. It’s why we needed to go to
the effort of proving Theorem 5.10.
If W ∼ tr, the density of W is symmetric about 0.
It is similar to N(0, 1) but with heavier tails.
The density approaches that of N(0, 1) as r → ∞.
W has EW = 0 and Var(W) = r/(r − 2) (for r > 2).
The most important slide of the whole course?
Theorem 5.18.
Let X1, . . . , Xn be a random sample of size n from the N(µ, σ²)
distribution; then, writing X̄ = (X1 + . . . + Xn)/n and
S² = Σ_{j=1}^n (Xj − X̄)²/(n − 1):
1. U := √n (X̄ − µ)/σ ∼ N(0, 1)
2. V := Σ_{j=1}^n (Xj − X̄)²/σ² ∼ χ²_{n−1}
3. U and V are independent.
4. √n (X̄ − µ)/S ∼ t_{n−1}.
This result allows us to know how far apart we expect µ and X̄ to be,
even when σ² is unknown.
This makes it the counterpart of Theorem 5.9.
Proof of Theorem 5.18
Proof.
Most of these facts are already known:
Here U = √n (X̄ − µ)/σ ∼ N(0, 1) by Theorem 5.9.
Here V = Σ_{j=1}^n (Xj − X̄)²/σ² ∼ χ²_{n−1} by Theorem 5.15(ii), so that
S²/σ² = V/(n − 1) ∼ χ²_{n−1}/(n − 1).
U and V are independent, by Theorem 5.10.
The result is proved since
√n (X̄ − µ)/S = [√n (X̄ − µ)/σ] × 1/√(S²/σ²) ∼ N(0, 1) × 1/√(χ²_{n−1}/(n − 1)),
where the two terms are independent as required.
Section 5.7: Percentage points of distributions
In applications, we need to know values xα such that P(X ≥ xα) = α.
Using this: P(X ≤ x1−α ) = 1 − P(X ≥ x1−α ) = 1 − (1 − α) = α.
Often need these for α = 0.1, 0.05, 0.025, 0.01 etc.
Traditionally given in tables, but now more commonly calculated by R
.
RV              Notation                  Symmetry?                        R command
Z ∼ N(0, 1)     P(Z ≥ zα) = α             Yes: z_{1−α} = −zα               qnorm(1 − a)
T ∼ t_r         P(T ≥ t_{r;α}) = α        Yes: t_{r;1−α} = −t_{r;α}        qt(1 − a, r)
W ∼ χ²_r        P(W ≥ χ²_{r;α}) = α       No: χ²_{r;1−α} ≠ −χ²_{r;α}       qchisq(1 − a, r)
Remark 5.19.
Using this notation, we deduce (draw a picture?):
(1 − α) = P(−zα/2 ≤ Z ≤ zα/2 )
(1 − α) = P(−tr ;α/2 ≤ T ≤ tr ;α/2 )
(1 − α) = P(χ2r ;1−α/2 ≤ W ≤ χ2r ;α/2 )
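For example, the percentage points in the table above can be obtained in R by:

qnorm(0.975)       # z_{0.025}, approximately 1.96
qnorm(0.025)       # equals -qnorm(0.975), by symmetry
qt(0.975, 10)      # t_{10; 0.025}
qchisq(0.95, 5)    # chi^2_{5; 0.05}
qchisq(0.05, 5)    # chi^2_{5; 0.95} -- not minus the previous value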
Section 5.8: Similar theory for Γ distributions
We can give some similar results giving exact distributions for sums
and means of exponentials.
The following result is the analogue of Theorem 5.9 for the exponential distribution.
It can be proved in a similar way using facts about mgfs from
Example 5.2.
Lemma 5.20.
Let X1 , . . . , Xn be a random sample of size n from the Exp(θ) distribution.
Then
(i) Σ_{j=1}^n Xj ∼ Γ(n, θ).
(ii) X̄ = (Σ_{j=1}^n Xj)/n ∼ Γ(n, nθ).
(iii) 2θ Σ_{j=1}^n Xj ∼ Γ(n, 1/2) = χ²_{2n}.
Section 5.9: The F distribution
Definition 5.21.
Let U ∼ χ²_r and V ∼ χ²_s independently, and let
W = (U/r)/(V/s).
Then W has the F distribution with r and s degrees of freedom and
we write W ∼ F_{r,s}.
Define the percentage point F_{r,s;α} as the value such that
P(W ≥ F_{r,s;α}) = α when W ∼ F_{r,s}.
The density function is heavily skewed with a long right tail.
If W ∼ Fr ,s then by definition, 1/W ∼ Fs,r so Fr ,s;1−α = 1/Fs,r ;α .
This distribution forms a starting point for statistical inference about
equality of variances for two normal populations, linear regression and
analysis of variance (see Linear Models).
Applications of the F distribution
Theorem 5.22.
Let X1, . . . , Xm be a random sample of size m from N(µX, σX²).
Independently, let Y1, . . . , Yn be a random sample of size n from
N(µY, σY²). When σX² = σY² = σ² (say),
T = [ Σ_{i=1}^m (Xi − X̄)²/(m − 1) ] / [ Σ_{j=1}^n (Yj − Ȳ)²/(n − 1) ] ∼ F_{m−1,n−1}.
Proof.
From Theorem 5.15(ii), independently,
σ̂X² = Σ_{i=1}^m (Xi − X̄)²/(m − 1) ∼ σX² χ²_{m−1}/(m − 1) and
σ̂Y² = Σ_{j=1}^n (Yj − Ȳ)²/(n − 1) ∼ σY² χ²_{n−1}/(n − 1).
Hence, when σX² = σY² = σ², the ratio T = σ̂X²/σ̂Y² has the distribution
of (χ²_{m−1}/(m − 1))/(χ²_{n−1}/(n − 1)), which is F_{m−1,n−1} by Definition 5.21.
The distribution of T is independent of the unknown parameters
µX, µY and σ².
Section 6: Confidence intervals
Aims of this section:
In previous sections we have seen how to use the observations in a
simple random sample from a given population to estimate a
population parameter or some other population quantity of
interest.
However, we have also seen that different samples would give
different estimates, so our estimate cannot be ’exactly’ correct.
In this section we derive procedures for reporting the accuracy of
our estimate by constructing a confidence interval - an interval of
values around the estimate which has a pre-set level of probability
of containing the true value of the parameter or other quantity
being estimated.
Suggested reading: Rice Sections 7.3.3, 8.5.3, 10.4.5.
Objectives: by the end of this section you should be able to
Construct an exact confidence interval for the population mean,
with a given confidence level, based on a simple random sample
from a Normal distribution.
Construct an exact confidence interval for the population
variance, with a given confidence level, based on a simple random
sample from a Normal distribution.
Recall and explain the assumptions under which the standard
formulae for confidence intervals are applicable and be aware of
how the validity of these assumptions might be explored using
Exploratory Data Analysis.
Explain how the length of a confidence interval for a population
mean depends qualitatively on the required confidence level and
on the size of the simple random sample.
Objectives: by the end of this section you should be able to
Construct an approximate confidence interval for a population
mean, with a given confidence level, based on the mean of a
simple random sample from the underlying population
distribution.
Construct an approximate confidence interval for a proportion,
with a given confidence level, based on the mean of a simple
random sample from a Bernoulli distribution.
Section 6.1: Introduction
Example 6.1.
Consider the simple case of a random sample of size n from a
N(µ, σ²) distribution.
Suppose the population mean µ is an unknown parameter which we
wish to estimate and (unrealistically) σ² is known (say σ² = σ0²).
The natural estimator is µ̂mom = µ̂mle = X̄.
Recall from Section 4 that any estimator is random (depends on the
data), with a particular sampling distribution.
Hence need to report the value of the estimate, together with some
measure of its accuracy or margin of error.
For example, we could give an interval (centred on the estimate) that
we are 95% confident contains the true value of µ.
Knowing sampling distribution allows us to calculate this.
Definition
Definition 6.2.
Take 0 < α < 1. A 100(1 − α)% confidence interval for a parameter θ is
an interval of the form (cL, cU) (here L is for lower limit, U for upper
limit) such that:
The parameter lies in the interval with probability (1 − α):
P(cL ≤ θ ≤ cU) = 1 − α.    (6.1)
cL and cU are calculated only using the value of n, the sample data
(x1 , . . . , xn ), and any known parameters.
Remark 6.3.
It is very important to understand that in Equation (6.1), θ is fixed
(it is not random).
cL and cU depend on the data (are random), so vary from sample to
sample.
(6.1) is an assertion about the joint distribution of cL and cU .
Procedure for finding the interval
Remark 6.4.
Our procedure for finding the interval is based on some function f(X) of
the data. It usually effectively comes in two stages:
1. We treat θ as fixed but unknown. We use our collection of facts
about distributions to find an interval depending on θ such that
P (g1 (θ) ≤ f (X) ≤ g2 (θ)) = 1 − α.
2. Then we invert (rearrange) the interval, to rewrite the same interval as a
function of θ:
P(cL(f(X)) ≤ θ ≤ cU(f(X))) = 1 − α.
Section 6.2: N(µ, σ 2 ): Confidence Interval for µ; σ 2 known
Example 6.5.
Return to the setting of Example 6.1 (normal sample, known σ²).
From Theorem 5.9: X̄ ∼ N(µ, σ0²/n) and (X̄ − µ)/(σ0/√n) ∼ N(0, 1).
Since (using qnorm(0.975)) we know z0.025 = 1.96, then (see
Remark 5.19)
P( −1.96 ≤ (X̄ − µ)/(σ0/√n) ≤ 1.96 ) = 0.95.
But (X̄ − µ)/(σ0/√n) ≤ 1.96 ⟺ X̄ − µ ≤ 1.96 σ0/√n ⟺ X̄ − 1.96 σ0/√n ≤ µ.
And −1.96 ≤ (X̄ − µ)/(σ0/√n) ⟺ −1.96 σ0/√n ≤ X̄ − µ ⟺ µ ≤ X̄ + 1.96 σ0/√n.
So the event
−1.96 ≤ (X̄ − µ)/(σ0/√n) ≤ 1.96 ⟺ X̄ − 1.96 σ0/√n ≤ µ ≤ X̄ + 1.96 σ0/√n.
Example 6.5.
Thus we report a 95% confidence interval with
cL = X̄ − 1.96 σ0/√n   and   cU = X̄ + 1.96 σ0/√n.
Suppose we take a large number of simple random samples from the
N(µ, σ0²) distribution, each of fixed size n.
In roughly 95% of the samples the interval X̄ ± 1.96 σ0/√n will
contain the true parameter value µ (and in 5% it will not).
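As a minimal sketch of the calculation in Example 6.5 (the numerical values below are hypothetical, purely for illustration):

xbar <- 10.3; sigma0 <- 2; n <- 25                 # hypothetical sample summary
xbar + c(-1, 1) * qnorm(0.975) * sigma0/sqrt(n)    # 95% interval (cL, cU)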
General 100(1 − α)% confidence interval
Example 6.6.
More generally we find a 100(1 − α)% confidence interval. Remark 5.19
gives
P( −zα/2 ≤ (X̄ − µ)/(σ0/√n) ≤ zα/2 ) = 1 − α.
Rearranging in the same way, the event
−zα/2 ≤ (X̄ − µ)/(σ0/√n) ≤ zα/2 ⟺ X̄ − zα/2 σ0/√n ≤ µ ≤ X̄ + zα/2 σ0/√n.
We can therefore report a 100(1 − α)% confidence interval with
cL = X̄ − zα/2 σ0/√n   and   cU = X̄ + zα/2 σ0/√n.
In about 100(1 − α)% of the samples the interval X̄ ± zα/2 σ0/√n will
contain the true parameter value µ (and in 100α% it will not).
Remark 6.7.
In ∼ 100(1 − α)% of cases the interval will contain the true value of µ, and in
the remainder it will not.
It is impossible to tell for each sample whether the interval does or
does not contain µ.
The interval is of the form
√
√ (cL , cU ) = X − zα/2 σ0 / n, X + zα/2 σ0 / n
with end-points which depend on the data as well as on the value of n
and the value of the known parameter σ0
The length of the confidence interval is 2 zα/2 σ0/√n. All things being
equal this:
equal this:
1. DECREASES as the sample size n INCREASES
2. INCREASES as the population variance σ02 INCREASES
3. INCREASES as the confidence level 100(1 − α) INCREASES (since
this means α DECREASES and so zα/2 INCREASES).
Section 6.3: N(µ, σ 2 ): CI for µ; σ 2 unknown
Example 6.8.
A more realistic setting than Example 6.1 is the following:
Assume X1, . . . , Xn is a random sample from N(µ, σ²) with σ² unknown.
Can't apply Theorem 5.9. However, Theorem 5.18 implies that
(X̄ − µ)/(S/√n) ∼ t_{n−1},
where X̄ = (X1 + . . . + Xn)/n and S² = Σ_{j=1}^n (Xj − X̄)²/(n − 1).
Write t_{n−1;α/2} for the value such that P(T ≥ t_{n−1;α/2}) = α/2, where
T ∼ t_{n−1}.
By symmetry P(T ≤ −tn−1;α/2 ) = α/2.
Can find tn−1;α/2 using qt(1-alpha/2, n-1).
Section 6.3: N(µ, σ 2 ): CI for µ; σ 2 unknown
Example 6.8.
By definition (see Theorem 5.18 and Remark 5.19):
P( −t_{n−1;α/2} ≤ (X̄ − µ)/(S/√n) ≤ t_{n−1;α/2} ) = 1 − α.
As before, (X̄ − µ)/(S/√n) ≤ t_{n−1;α/2} ⟺ X̄ − S t_{n−1;α/2}/√n ≤ µ.
Similarly −t_{n−1;α/2} ≤ (X̄ − µ)/(S/√n) ⟺ µ ≤ X̄ + S t_{n−1;α/2}/√n.
Hence P( X̄ − S t_{n−1;α/2}/√n ≤ µ ≤ X̄ + S t_{n−1;α/2}/√n ) = 1 − α.
Equivalently cL = X̄ − S t_{n−1;α/2}/√n and cU = X̄ + S t_{n−1;α/2}/√n
define a 100(1 − α)% confidence interval for µ.
The interval has (random) length 2 S t_{n−1;α/2}/√n.
Points 1. and 3. of Remark 6.7 also apply here (larger sample size
gives a smaller interval, larger confidence 100(1 − α) gives a larger
interval).
Example: Newcomb data
Example 6.9.
Return to the setting of Example 1.19: speed of light data, outliers
removed (x6 = −44 and x10 = −2), leaving n = 64 data points.
Nothing in the histogram contradicts the model that the data is a simple
random sample from N(µ, σ²) with µ and σ² unknown.
Here Σ_{i=1}^n xi = 1776 and Σ_{i=1}^n xi² = 50912, hence x̄ = 27.75 and
s² = (Σ_{i=1}^n xi² − n(x̄)²)/(n − 1) = 25.84127, so s = 5.0834.
For a 95% confidence interval, we can find t63;0.025 = 1.998341 using
qt(0.975,63) in R.
Substituting into Example 6.8, a 95% confidence interval for µ is:
(cL, cU) = ( X̄ − S t_{n−1;α/2}/√n , X̄ + S t_{n−1;α/2}/√n )
= ( 27.75 − 5.0834 × 1.998/√64 , 27.75 + 5.0834 × 1.998/√64 )
≈ (26.48, 29.02).
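The same interval can be computed directly in R from the summary statistics above:

xbar <- 27.75; s <- 5.0834; n <- 64
xbar + c(-1, 1) * qt(0.975, n - 1) * s/sqrt(n)    # 95% interval (cL, cU)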
Section 6.4: Confidence interval for σ 2 for N(µ, σ 2 ) data
Lemma 6.10.
Assume X1, . . . , Xn is a random sample from N(µ, σ²) with σ² unknown.
Theorem 5.15(ii) gives
Σ_{j=1}^n (Xj − X̄)²/σ² ∼ χ²_{n−1}.
Hence for any given α, Remark 5.19 gives
P( χ²_{n−1;1−α/2} ≤ Σ_{j=1}^n (Xj − X̄)²/σ² ≤ χ²_{n−1;α/2} ) = 1 − α.
Hence, we obtain a 100(1 − α)% confidence interval for σ² taking
(cL, cU) = ( Σ_{j=1}^n (Xj − X̄)²/χ²_{n−1;α/2} , Σ_{j=1}^n (Xj − X̄)²/χ²_{n−1;1−α/2} ).
Example: Newcomb data
Example 6.11.
Return to the setting of Example 1.19.
Here n = 64, Σ_{i=1}^n xi = 1776, Σ_{i=1}^n xi² = 50912.
Hence Σ_{i=1}^n (xi − x̄)² = Σ_{i=1}^n xi² − n(x̄)² = 1628.
R (using the qchisq(., 63) command) gives χ²_{63;0.975} = 42.95 and
χ²_{63;0.025} = 86.83.
Lemma 6.10 shows that
(cL, cU) = ( Σ_{j=1}^n (Xj − X̄)²/χ²_{n−1;α/2} , Σ_{j=1}^n (Xj − X̄)²/χ²_{n−1;1−α/2} )
gives a 100(1 − α)% confidence interval for σ².
Substituting in, we obtain
(cL, cU) = (1628/86.83, 1628/42.95) = (18.75, 37.9).
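The same calculation in R, using the sum of squared deviations above:

ss <- 1628; n <- 64                    # Newcomb data summary from Example 6.11
ss / qchisq(c(0.975, 0.025), n - 1)    # 95% interval (cL, cU) for sigma^2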
Section 6.5: Confidence interval for θ in U(0, θ) population
Lemma 6.12.
Consider X1, . . . , Xn, a simple random sample from U(0, θ).
Recall from Example 3.12 that θ̂mle = X(n) = max(X1, . . . , Xn), so
P(X(n) ≤ x) = P(X1 ≤ x, X2 ≤ x, . . . , Xn ≤ x) = (x/θ)ⁿ   if 0 ≤ x ≤ θ.
Hence, if we take x = θu^(1/n), for 0 ≤ u ≤ 1, then
P( X(n)/θ ≤ u^(1/n) ) = P( X(n) ≤ θu^(1/n) ) = u.
If we choose u1 = α/2 and u2 = 1 − α/2, the usual "inversion" gives
1 − α = u2 − u1 = P( u1^(1/n) ≤ X(n)/θ ≤ u2^(1/n) )
= P( X(n)/u2^(1/n) ≤ θ ≤ X(n)/u1^(1/n) ).
Hence we can take (cL, cU) = ( X(n)(1 − α/2)^(−1/n) , X(n)(α/2)^(−1/n) ).
Section 6.6: Confidence interval for θ in Exp(θ) population
Lemma 6.13.
Take a simple random sample X1, . . . , Xn from an Exp(θ) population.
We want to construct a 100(1 − α)% confidence interval for θ.
The standard estimates for θ are θ̂mle = θ̂mom = n/Σ_{j=1}^n Xj.
From Lemma 5.20 we know 2θ Σ_{j=1}^n Xj ∼ χ²_{2n}.
To construct an 'equal tailed' confidence interval we note that
1 − α = P( χ²_{2n;1−α/2} ≤ 2θ Σ_{j=1}^n Xj ≤ χ²_{2n;α/2} )
= P( χ²_{2n;1−α/2}/(2 Σ_{j=1}^n Xj) ≤ θ ≤ χ²_{2n;α/2}/(2 Σ_{j=1}^n Xj) ).
Thus, a 100(1 − α)% confidence interval for θ is given by
cL = χ²_{2n;1−α/2}/(2 Σ_{j=1}^n Xj) and cU = χ²_{2n;α/2}/(2 Σ_{j=1}^n Xj).
Example - Earthquakes - 90% Confidence Interval
Example 6.14.
Consider again the quakes data set (Example 1.17) with n = 62
observations.
From graphical plots in Example 1.17 and the assessment of fit in Section
2.7 it is reasonable to assume the data comes from the Exp(θ) family.
For this dataset, n = 62 and Σ_{j=1}^n xj = 27107, so that
θ̂ = 1/x̄ = 0.002287.
Lemma 6.13 gives that
(cL, cU) = ( χ²_{2n;1−α/2}/(2 Σ_{j=1}^n Xj) , χ²_{2n;α/2}/(2 Σ_{j=1}^n Xj) ).
Example - Earthquakes - 90% Confidence Interval
Example 6.14.
We illustrate the effect of different confidence levels 100(1 − α) in the
following table.
Increasing 1 − α leads to a wider interval, as you might expect.
The values of χ2124;1−α/2 and χ2124;α/2 are obtained using the
command qchisq(.,124).
100(1 − α)    χ²_{124;1−α/2}    χ²_{124;α/2}    100 × cL    100 × cU    100 × length
90%               99.28            150.99         0.1831      0.2785        0.0954
95%               95.07            156.71         0.1754      0.2891        0.1137
99%               87.19            168.31         0.1608      0.3105        0.1496
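The 90% row of the table above can be reproduced in R using the quakes totals from Example 6.14:

n <- 62; sum.x <- 27107                    # quakes summary from Example 6.14
qchisq(c(0.05, 0.95), 2*n) / (2*sum.x)     # 90% interval (cL, cU) for theta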
Section 6.7: Confidence intervals by simulation in R
Sometimes we do not have the distributional facts required to
construct an exact confidence interval.
Suppose we have a simple random sample X1 , . . . , Xn of size n from a
distribution in a parametric family with a density f (x; θ) for single
unknown parameter θ.
We can construct an approximate 100(1 − α)% confidence interval by
simulation as follows:
1. Calculate an estimate θ̂ for θ.
2. Simulate B simple random samples, each of the same size n as
the original sample, from f(x, θ̂).
3. Calculate the B estimates, θ1∗ , . . . , θB∗ , one from each simulated
sample, using the same estimation method as in step 1 above.
Confidence intervals by simulation in R
4. Calculate the B values of θk* − θ̂. If θ̂ is close to θ, then the
distribution of the values of θ* − θ̂ for samples from the
distribution with parameter θ̂ will be close to the distribution of
θ̂ − θ for samples from the distribution with parameter θ.
5. Identify values kL and kU such that Bα/2 of the values of θk* − θ̂
are < kL and Bα/2 are > kU. Then step 4 gives:
P(kL ≤ θ̂ − θ ≤ kU) ≃ P(kL ≤ θ* − θ̂ ≤ kU) ≃ 1 − α.
6. The event {kL ≤ θ̂ − θ ≤ kU} is equivalent to the event
{θ̂ − kU ≤ θ ≤ θ̂ − kL}, so for B large the interval (cL, cU) is an
approximate 100(1 − α)% confidence interval for θ, where
cL = θ̂ − kU and cU = θ̂ − kL.
Example - Earthquakes - 90% Confidence Interval
Example 6.15.
Consider again the quakes data set (Example 1.17 and 6.14).
Apply the following R commands.
> theta.hat <- 1/mean(quakes)
> xsamples <- matrix(rexp(62000,theta.hat), nrow=1000)
> xmean <- apply(xsamples,1,mean)
> theta.star <- 1/xmean
> diff.theta <- (theta.star - theta.hat)
> sort.diff <- sort(diff.theta)
> sort.diff[c(50,950)]
> cl <- theta.hat - sort.diff[950]
> cu <- theta.hat - sort.diff[50]
Interpreting these R commands
Example 6.15.
First we calculate θ̂ for the quakes data; then we generate 62,000
observations from the Exp(θ̂) distribution, and arrange them into a
matrix of 1000 samples each with n = 62 observations.
Next, xmean calculates a vector of means and hence theta.star a vector of
estimates θk* for the 1000 samples; diff.theta calculates the vector of
differences θk* − θ̂; sort.diff sorts these differences in order of increasing
value and puts the sorted values in a vector.
Finally, we want a 90% confidence interval, so α/2 = 0.05.
Thus the last three commands output the 50th and the 950th of the
1000 ordered values of θk∗ − θb (i.e. the 5th and the 95th quantiles of
the ordered differences); calculate cL = θb − kU ; and calculate
cU = θb − kL .
Example 6.15.
A histogram of the 1000 differences for a particular simulation is
given below.
Recall that the estimate of θ here is θb = 0.00229.
For this simulation the 5th and 95th quantiles were kL = −0.00039
and kU = 0.00058 respectively, so the 90% confidence interval
calculated from the simulation had end points cL = θb − kU = 0.00171
and cU = θb − kL = 0.00268.
This compares well with the exact 90% confidence interval, which has
end points (Example 6.14) cL = 0.001831 and cU = 0.002785,
calculated using R .
Example 6.15.
[Figure: histogram of diff.theta, the 1000 values of θk* − θ̂, with values roughly between −0.0005 and 0.0015.]
Section 7: Hypothesis Tests
Aims of this section:
A hypothesis test is a procedure for evaluating whether sample
data is consistent with one of two contrasting statements about
the value taken by one (or more) population parameters.
We will focus on the case when the data is in the form of a simple
random sample from a single Normal population and the
parameter of interest is the population mean.
Suggested reading: Rice Sections 9.1–9.5.
Objectives: by the end of this section you should be able to
Recall the definition of the following terms: null hypothesis,
alternative hypothesis, p-value, significance level, critical region,
type I and type II error, power.
Perform standard hypothesis tests on the value of the population
mean, based on a simple random sample from a Normal
distribution with either known or unknown variance.
Starting with an informal problem description, formulate
appropriate statements of any model assumptions and of the null
and alternative hypotheses of interest.
In standard cases, identify an appropriate test statistic and state
its distribution under the null hypothesis.
For each of the standard types of alternative hypothesis, identify
the set of values of the test statistic that are at least as extreme
as a given observed value.
Objectives: by the end of this section you should be able to
In standard cases, calculate the p-value corresponding to a given
alternative hypothesis and a given observed value of the test
statistic.
In standard cases, identify the form of the critical region for a test
with a given significance level for each of the standard types of
alternative hypothesis.
In standard cases, calculate the probability of a type II error for a
test with a given significance level.
In standard cases, calculate the power against a given simple
alternative hypothesis for a test with a given significance level.
Section 7.1: Introduction
Definition 7.1.
A hypothesis is a statement about a parameter – e.g. µ = 4.2 or
2 < σ < 5.
Establishing consistency with the data is posed as a competition
between the null hypothesis H0 and the alternative hypothesis H1,
although the two are not treated symmetrically.
We consider whether H0 is consistent with the data x1 , x2 , . . . , xn (or
if a value of θ allowed by H1 is preferable).
Ask: “Is the sample data an unlikely thing to observe under H0 ?”.
H1 is mainly present to define the direction of departures from H0
that are regarded as interesting.
For example, if testing whether broadband speeds are at least a
specified amount µ0 , we would test H0 : µ = µ0 against the
alternative H1 : µ < µ0 , since the consumer is happy to get too much!
Section 7.2: Hypothesis-testing procedure
Remark 7.2.
Hypothesis testing is not the same as deciding whether H0 is true or
not.
Data will often be consistent with two or more hypotheses that
contradict each other!
Definition 7.3.
At its simplest, a hypothesis-testing procedure requires the following steps:
1. Statement of any model assumptions,
2. Statement of the null hypothesis and the alternative hypothesis of
interest,
3. Calculation of the value of an appropriate test statistic,
4. Computation of the resulting p-value,
5. Report on any conclusions.
Stage 1: Model Assumptions
As with any statistical procedure, we start with a probability model
for the data.
We will assume that the data is a simple random sample from a
particular distribution in a known parametric family.
We will first focus on the case when the parameter of interest is the
population mean.
Example 7.4.
As in Example 6.1, we make the following assumptions
(a) x1 , . . . , xn are the observed values of a random sample X1 , . . . , Xn , . . .
(b) . . . from a population with the N(µ, σ 2 ) distribution, where µ is
unknown but the value of σ 2 is known – say σ 2 = σ02 .
Stage 2: Null Hypothesis H0
Often the null hypothesis is that of no difference or no effect.
e.g. the hypothesis that the current population looks like some previous
reference population.
Hence parameter values are similar for the current population and the
reference population.
That is why we call it the null hypothesis.
Denote the known mean for the previous population by µ0 and denote
the unknown mean for the current population by µ; then the null
hypothesis takes the form H0 : µ = µ0.
Stage 2: Alternative Hypothesis H1
We write the null hypothesis as H0 : µ = µ0 .
We restrict attention to three standard cases for H1:
(a) the current mean is greater than its previous value, i.e. H1 : µ > µ0
(b) the current mean is less than its previous value, i.e. H1 : µ < µ0
(c) the current mean differs from its previous value, i.e. H1 : µ ≠ µ0.
Note that for large sample sizes, a small difference between the
current parameter and the reference parameter may be statistically
significant but not of practical importance.
Example 7.4.
The reference value of the mean is some known value µ0 and we are
interested in whether the data leads us to conclude the population
has mean µ > µ0 .
Null hypothesis is H0 : µ = µ0 (no difference between the means)
Alternative hypothesis H1 : µ > µ0 (new mean being greater)
Stage 3: Test Statistic
To summarise the case that the data provides (for or against H0) we
use the value of a suitable test statistic T(X1, . . . , Xn), i.e. a
function of the data with the following properties:
(a) extreme values of the test statistic would be highly unlikely if H0 were
true and would suggest that H0 is in fact false,
(b) when µ = µ0 (i.e. when H0 is true) the distribution of T is known and
its distribution function can be easily calculated.
We write tobs for the observed value of the test statistic.
Stage 3: Test Statistic
Example 7.4.
Since X̄ is the natural estimator of µ, we base our test statistic on
X̄ − µ0.
Since the population standard deviation σ0 is assumed known, we can
take as our test statistic
T(X1, . . . , Xn) = √n(X̄ − µ0)/σ0.
Then from Theorem 5.9(ii), when H0 is true (i.e. when µ = µ0) we
have X̄ ∼ N(µ0, σ0²/n) and T ∼ N(0, 1).
We write the observed value of the test statistic as
tobs = √n(x̄ − µ0)/σ0.
Stage 4: p-value approach: Consistency with H0
If tobs = T(x1, . . . , xn) is relatively consistent with H0 then it provides
little or no reason to believe that H0 is untrue.
Thus, given a value t, we'd like to know the values of the test statistic
T which would be less consistent with H0 and more consistent with
H1 than t.
Obviously, this set of values depends on the particular form of H1 .
Statistics 2 explains these are the set of values whose relative
likelihood of occurring under H0 rather than H1 is less than that for t.
Stage 4 p-value
Definition 7.5.
Compute the probability (assuming H0 is true) of a test statistic value
less consistent with H0 (and more consistent with H1) than tobs.
We call this probability the p-value corresponding to tobs.
For each alternative, we calculate the p-value as follows:
(a) H1 : µ > µ0 ⇒ p-value = P(T ≥ tobs | H0 true)
(b) H1 : µ < µ0 ⇒ p-value = P(T ≤ tobs | H0 true)
(c) H1 : µ ≠ µ0 ⇒ p-value = P(|T| ≥ |tobs| | H0 true).
Small p-values make us more sceptical of H0.
Example 7.4.
Since the alternative of interest is H1 : µ > µ0, the values of T which
are less consistent with H0 than tobs are the set of values {T ≥ tobs}.
Also, when H0 is true, T has the same distribution as Z ∼ N(0, 1), so
p-value = P(T ≥ tobs | H0 true) = P(Z ≥ tobs) = 1 − Φ(tobs).
Stage 5: Conclusions – Interpretation of the p-value
If the p-value is very small – the level of consistency with H0 is very
small – we have reason to believe either that the null hypothesis
H0 : µ = µ0 is false or that something very unlikely has happened.
Thus, small p-values may well lead us to reject H0 in favour of H1.
Conversely, if the p-value is relatively large, then these observations
are relatively likely to occur when H0 is true, and we conclude that
there is no reason to reject H0.
In giving conclusions we should
(a) report the p-value
(b) interpret that to make practical conclusions about µ in the context of
the example. (i.e. does the test show that the sample data gives a
reason to reject H0 in favour of H1 or not?)
Section 7.3: Example - Normal distribution with known
variance
Example 7.6.
When patients with a certain chronic illness are treated with the
current standard medication, the mean time to recurrence of the
illness is µ0 = 53.3 days, with a standard deviation of σ0 = 26.4 days.
A new type of medication, that is thought to increase the time until
recurrence, was tried by a randomly chosen sample of 16 patients.
For this sample, the mean time to recurrence was x̄ = 65.8 days.
Assuming the variance of the recovery time is the same for the new
and current medication, we want to test whether the new medication
has increased the mean time to recurrence.
Stages 1. and 2. Model assumptions and hypotheses
Example 7.6.
(a) The recurrence times for the 16 patients are a random sample
from the population of recurrence times for all patients that will use
this new medication . . .
(b) . . . with distribution N(µ, σ²), where µ is unknown but the value
σ² = σ0² = (26.4)² is known.
We take H0 : µ = µ0 = 53.3 versus H1 : µ > 53.3.
The null hypothesis H0 corresponds to no difference between the mean
recurrence time µ for the new medication and the mean recurrence
time µ0 = 53.3 for the standard medication.
The alternative hypothesis H1 corresponds to the mean recurrence time
for the new medication being longer than for the standard medication.
Stage 3. Test Statistic
Example 7.6.
We base our test statistic on X̄ − 53.3.
Since the population standard deviation σ0 = 26.4 is assumed known
and n = 16, we can take as our test statistic
T(X1, . . . , Xn) = √n(X̄ − µ0)/σ0 = √16(X̄ − 53.3)/26.4,
where X̄ ∼ N(µ, σ0²/n) = N(µ, (26.4)²/16).
Thus, when H0 is true (i.e. when µ = µ0 = 53.3) we have
T = √16(X̄ − 53.3)/26.4 ∼ N(0, 1).
The data gives x̄ = 65.8, so the observed test statistic is
tobs = √16(65.8 − 53.3)/26.4 = 1.893.
Stages 4. and 5. p-value and Conclusions
Example 7.6.
The values of T which are less consistent with H0 than tobs are the set
of values {T ≥ tobs = 1.893}, so
p-value = P(T ≥ tobs | H0 true) = P(Z ≥ 1.893)
= 1 − Φ(1.893) = 0.0292.
The p-value of 0.0292 is quite small – if the mean for the new
medication was really 53.3 we would only observe data for which the
consistency with H0 was this small about 3 percent of the time.
Thus there is a reasonably strong case that H0 is not true.
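As a quick sketch (not part of the original notes), this calculation can be reproduced in R using the values of Example 7.6; pnorm is the standard Normal distribution function Φ:
> tobs <- sqrt(16)*(65.8 - 53.3)/26.4   # observed test statistic, approx 1.894
> 1 - pnorm(tobs)                       # p-value for H1: mu > mu0, approx 0.029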
Section 7.4: One sample t-test: µ, σ 2 unknown
Examples 7.4 and 7.6 correspond to Example 6.1 – that is, we make
the unrealistic assumption that σ 2 is known.
We now give an equivalent to Example 6.8 – that is, testing
hypotheses about µ when σ² is unknown.
Example 7.7.
Assume X1, . . . , Xn form a simple random sample from N(µ, σ²) where
µ and σ² are unknown.
Null hypothesis: H0 : µ = µ0.
Alternative hypothesis: one of the standard cases, either H1 : µ > µ0,
H1 : µ < µ0 or H1 : µ ≠ µ0.
For definiteness (and variety) take H1 : µ ≠ µ0.
Example 7.7.
The t-distribution results of Section 5 imply that when H0 is true
(i.e. when µ = µ0)
T = (X̄ − µ0)/(S/√n) ∼ tn−1,
where X̄ = (X1 + . . . + Xn)/n and S² = Σⱼ(Xj − X̄)²/(n − 1).
Because we take H1 : µ ≠ µ0, we are interested in values of T such
that {|T| ≥ |tobs|} (two-sided alternative).
Hence
p-value = P(|T| ≥ |tobs| | H0 true) = P(|tn−1| ≥ |tobs|).
If the p-value is large, there is no reason to reject H0.
If the p-value is small, this suggests inconsistency with H0.
Example of 1-sample t-test
Example 7.8.
To investigate the accuracy of DIY radon detectors, researchers
bought 12 such detectors and exposed them to exactly 105 picocuries
per litre of radon.
The 12 detector readings were:
91.9, 97.8, 111.4, 122.3, 105.4, 95.0, 103.8, 99.6, 96.6, 119.3, 104.8,
101.7.
This gives summary statistics Σᵢ xi = 1249.7, x̄ = 104.1417, n = 12,
Σᵢ xi² = 131096.44, S² = 86.4181.
Our question: does the mean for such detectors seem to differ from
105?
Example of 1-sample t-test
Example 7.8.
Assume these 12 observations are a random sample from N(µ, σ²)
where both µ and σ² are unknown.
H0 : µ = 105 vs. H1 : µ ≠ 105 (two-sided alternative).
T = (X̄ − µ0)/(S/√n) ∼ tn−1 when H0 is true.
We have n = 12 and µ0 = 105, so the observed value of T is
tobs = √12(104.142 − 105)/√86.4181 = −0.32 and |tobs| = 0.32.
The p-value is P(|T| ≥ 0.32) when T ∼ tn−1 = t11. Using R, we find
pt(0.32,11)=0.6225 so P(|T| ≥ 0.32) = 2(1 − 0.6225) = 0.755.
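The whole test can be checked in one R command (a sketch; radon is just an illustrative name for a vector holding the 12 readings above):
> radon <- c(91.9, 97.8, 111.4, 122.3, 105.4, 95.0,
+            103.8, 99.6, 96.6, 119.3, 104.8, 101.7)
> t.test(radon, mu = 105)   # two-sided one-sample t-test of H0: mu = 105
The output should report t ≈ −0.32 on 11 degrees of freedom with a p-value of about 0.755, matching the hand calculation.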
Section 7.5: Alternative approach: Critical region
An alternative approach (avoiding p-value calculations) is to define a
critical region.
In advance of the test, we define a threshold for the test statistic,
which if crossed will lead us to reject the null hypothesis.
Fixing (and publicising?) this value in advance can encourage
scientific honesty.
Thinking in terms of critical regions is useful philosophically e.g. when
thinking about optimal test procedures (see courses in later years).
In this context, we introduce the useful ideas of Type I and Type II
error.
The outcome of such tests is less informative than a p-value, but
easier to calculate without R, so this approach is often found in
earlier textbooks.
Critical regions provide an alternative for Stage 4 - the other stages of
the procedure are unchanged.
Type I and Type II error
One way of evaluating the performance of a test procedure is to focus
on some simple fixed alternative hypothesis H1 : µ = µ1 , and assume
that µ can only take one of the two values µ = µ0 or µ = µ1 .
In this simplified context, there are only two possible errors.
Definition 7.9.
A type I error is the error of deciding the null hypothesis H0 is false
when H0 is actually true.
A type II error is the error of deciding the null hypothesis H0 is true
when H1 is actually true.
There is a trade-off between type I and type II error.
A change to the test procedure that reduces the type I error will
usually increase the type II error, and vice-versa.
Control of the type I error is often thought of as being more
important, since H0 represents in some sense the status-quo.
Significance level
Definition 7.10.
Often we fix in advance some small acceptable threshold level (e.g.
α = 0.05 or 0.01) for the type I error. We call this the significance level
of the test, and speak of an α-level test.
Fixing the significance level α in turn fixes the critical region C (set of
values of T that would lead us to reject H0) and the critical value c∗.
For significance level α and the alternative H1 : µ > µ0, the critical
value c∗ satisfies:
P(T ≥ c ∗ | H0 true) = P(Reject H0 | H0 true) = P(Type I error) = α
Corresponding conditions hold for other two alternative hypotheses.
Remark 7.11.
By definition, an α-level test procedure will reject H0 if and only if
the calculated p-value is less than or equal to α.
e.g. P(T ≥ tobs | H0 true) ≤ α = P(T ≥ c ∗ | H0 true) iff tobs ≥ c ∗ .
Hence reporting if T ∈ C is less informative than the p-value.
Return to Example 7.4
Example 7.4.
Since H1 : µ > µ0, the values {T ≥ t} are those which are less
consistent with H0 than t.
Equivalently, the critical region of values for which the test would
reject H0 is of the form C = {T ≥ c∗}.
To find c∗ for a given significance level α, we recall that a test has
significance level α if P(Reject H0 | H0 true) = α.
Thus, for an α-level test, c∗ is defined by the condition
α = P(Reject H0 | H0 true) = P(T ≥ c∗ | H0 true) = P(Z ≥ c∗)
= 1 − Φ(c∗).
Hence c∗ = Φ⁻¹(1 − α). For α = 0.05, this gives c∗ = Φ⁻¹(0.95) = 1.645.
We reject H0 at 0.05 significance if T ≥ c∗ = 1.645, or equivalently if
X̄ ≥ µ0 + 1.645σ0/√n (since T = √n(X̄ − µ0)/σ0).
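This critical value can be verified in R (a minimal sketch):
> qnorm(0.95)   # Phi^{-1}(0.95), the critical value c* for a 0.05-level test
[1] 1.644854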
Further numerical examples
Example 7.12.
Example 7.6: the critical region of values C has the form
C = {T ≥ c∗}. Since T ∼ N(0, 1), and P(N(0, 1) ≥ 1.645) = 0.05,
we take as critical region C = {T ≥ 1.645}. Since tobs = 1.893 is in
C, the 0.05-level test would lead us to reject H0.
Example 7.7: Form of alternative hypothesis implies we are looking
for a critical region of the form C = {|T | ≥ c ∗ }. For α-level test, c ∗
is defined by
α = P(|T | ≥ c ∗ | H0 true) = P(|tn−1 | ≥ c ∗ ).
The relevant c ∗ is found in R as qt(1-alpha/2,n-1). As before,
tobs in critical region means we reject H0 (at a α significance level).
Example 7.8: For an α level test we use the critical region
C = {|T | ≥ t11;α/2 }. Let us take the significance level α = 0.05.
Using R gives qt(0.975,11)=2.201. The value of tobs = 0.32 is not
within C = {|T | ≥ 2.201}, so do not reject H0 at the 5% level.
P(Type II error) and Power
Definition 7.13.
For a fixed alternative hypothesis H1 : µ = µ1, we define the power
of the test to be 1 − P(Type II error).
It gives a measure of how powerful the procedure would be in
detecting that the alternative H1 is true.
Example 7.14.
Consider a simple fixed alternative of the form H1 : µ = µ1, with µ1 > µ0.
Under our test procedure for this alternative, we accept H0 as true if
and only if T < c∗.
Note that
P(Type II error) = P(Accept H0 | H1 true) = P(T < c∗ | µ = µ1).
Hence the power 1 − P(Type II error) = P(T ≥ c∗ | µ = µ1).
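As a sketch (not in the original notes): in the known-variance setting of Example 7.4, when µ = µ1 we have T ∼ N(√n(µ1 − µ0)/σ0, 1), so the power is 1 − Φ(c∗ − √n(µ1 − µ0)/σ0). In R, with illustrative values (µ1 = 70 is a hypothetical simple alternative):
> n <- 16; sigma0 <- 26.4; mu0 <- 53.3            # values as in Example 7.6
> mu1 <- 70                                       # hypothetical simple alternative
> cstar <- qnorm(0.95)                            # critical value of a 0.05-level test
> 1 - pnorm(cstar - sqrt(n)*(mu1 - mu0)/sigma0)   # power against mu = mu1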
t-tests in R
Example 7.15.
The R command:
> t.test(data, mu = 0, alternative="greater",
conf.level=0.9)
will compute a one-sample t-test on observations in an array data,
with null hypothesis H0 : µ = 0, with alternative hypothesis
H1 : µ > 0, and at significance level α = 0.1.
The numerical mean value 0 can be replaced by the value appropriate
for your data, the alternative hypothesis "greater" can be replaced by
the alternatives "less" or "two.sided" as desired, and the significance
level can be changed by setting conf.level to 1 − α (the default
value is α = 0.05).
Section 7.6: Confidence Intervals and Hypothesis Tests
Theorem 7.16.
Hypothesis tests are closely related to confidence intervals.
In particular, the α-level test of H0 : µ = µ0 versus the two-sided
alternative H1 : µ 6= µ0 will reject H0 if and only if the corresponding
two-sided 100(1 − α)% confidence interval for µ does not contain µ0 .
Confidence Intervals and Hypothesis Tests
Example 7.17.
If we know σ² = σ0², then we reject H0 if and only if
c∗ ≤ |T| = |√n(X̄ − µ0)/σ0|,
where α/2 = 1 − Φ(c∗), or c∗ = zα/2.
Rearranging, this means that we reject if and only if
µ0 ∉ (X̄ − zα/2 σ0/√n, X̄ + zα/2 σ0/√n),
which we recognise from Example 6.6 as the two-sided 100(1 − α)%
confidence interval.
Good exercise to check similar equivalence holds when σ 2 is unknown.
Similar results connect one-sided tests and one-sided confidence
intervals of the form (−∞, cU ) or (cL , ∞).
Section 8: Comparison of population means
Aims of this section:
In the last section we introduced hypothesis tests in the context
of samples from a single distribution.
In this section we will use hypothesis tests to compare samples
from distributions that differ in some qualitative factor.
In particular, we will investigate situations where it is thought
that the change in the qualitative factor may have had the effect
of increasing or decreasing the population mean.
Suggested reading: Rice, Sections 11.1–11.3
Objectives: by the end of this section you should be able to
Identify from an informal problem description situations where a
paired t-test or a two sample t-test would be appropriate, and, in
the latter case, identify whether or not a pooled two sample t-test
would be appropriate.
State appropriate model assumptions and formulate appropriate
null and alternative hypotheses for each type of two-sample or
paired t-test listed above.
Identify an appropriate test statistic and its distribution under the
null hypothesis for each type of two-sample or paired t-test above.
Use the methods of this section to compute appropriate p-values
or critical regions and report appropriate conclusions for each type
of two-sample or paired t-test listed above.
Use appropriate commands in R to perform each of the types of
two-sample or paired t-test listed above, and correctly interpret
and report the output of the procedures.
Section 8.1: Introduction
So far, we have modelled data as a simple random sample from a fixed
parametric distribution.
This model applies to experimental units of the same type, where
differences in the data values come from random variation.
In reality, data are usually collected in order to compare groups,
and/or to study how the main variable of interest (the response
variable) depends on one or more explanatory variables.
These are really versions of the same question – we can think of each
data item being accompanied by a label indicating its group (age
groups, different treatments etc.).
By studying how the response variable depends on this label, we are
comparing groups.
In this case, the explanatory variable is discrete or categorical, and
often referred to as a factor.
In Sections 9 and 10 we consider regression, where there is a single
numerical explanatory variable.
Section 8.2: Comparison of two groups
In this section, we suppose we have two distinct groups of data (so
the explanatory variable is binary), and suspect there are systematic
differences between the groups.
The groups might be defined by properties of the experimental units
(human subjects, etc.) or by different treatments (e.g. drug
therapies) applied to them.
We want to know if there are systematic differences in some quantity
of interest between the populations (corresponding to differences in
the factor).
The response variable will be influenced by both systematic and
random variation.
The statistician's task is to separate these two effects.
For simplicity, we only consider the case of normally distributed data
with the quantity of interest being the population mean.
We test whether observed differences are statistically significant.
Case 1: Independent samples
Sometimes, we can assume each data set is entirely independent of
the other.
In this case the data can be modelled as two independent random
samples from different population distributions.
Here the question of interest reduces to whether the means of the two
populations differ.
This type of model can be analysed using a two sample t-test (see
Section 8.3)
Case 2: Paired (or matched) samples
Alternatively, the data consist of pairs of observations on each of n
experimental units, with different treatments applied to each member
of the pair.
The first observation in each pair corresponds to one factor value and
the second corresponds to the other.
We may assume that the change in factor value is associated with a
common systematic change in the underlying distribution of the
variable being measured.
An appropriate model is often that the differences between observations
in each pair are independent observations from the same distribution,
whose mean corresponds to the systematic change.
The question of interest reduces to whether this mean change is zero.
This model can be analysed by a paired t-test (see Section 8.4).
Section 8.3: Two sample t-test
Example 8.1.
Model assumptions are that there are two independent samples.
X1, . . . , Xn is a random sample of size n from the N(µX, σX²)
distribution, with sample mean X̄ = (X1 + · · · + Xn)/n.
Y1, . . . , Ym is a random sample of size m from the N(µY, σY²)
distribution, with sample mean Ȳ = (Y1 + · · · + Ym)/m.
The null hypothesis of interest is H0 : µX − µY = 0.
The standard estimators of µX and µY are X̄ and Ȳ, so it is natural
to base our analysis on the value of X̄ − Ȳ.
From Theorem 5.9, X̄ ∼ N(µX, σX²/n) and Ȳ ∼ N(µY, σY²/m), so
from Lemma 5.8, X̄ − Ȳ ∼ N(µX − µY, σX²/n + σY²/m).
Moreover, µX − µY = 0 when H0 is true, and Lemma 5.8 gives
(X̄ − Ȳ)/√(σX²/n + σY²/m) ∼ N(0, 1).    (8.1)
Example 8.1.
Since σX² and σY² are unknown, we replace them by appropriate
estimates and use as test statistic
T = (X̄ − Ȳ)/√(σ̂X²/n + σ̂Y²/m).
We reject H0 if the value of the test statistic is significantly different
from zero, where the relevant direction depends on H1 .
The situation is slightly more complicated than the single sample
case, where the resulting test statistic has a standard t-distribution:
1. If we can assume the X and Y distributions have the same variance,
we combine the estimates σ̂X² and σ̂Y² into a single pooled estimate Sp²,
and the resulting test statistic has a t-distribution (see Example 8.2).
2. If we cannot make this assumption, then a result due to Welch shows
that the distribution of the test statistic can be approximated by a
t-distribution with non-integer degrees of freedom (see Example 8.3).
Pooled two sample t-test
Example 8.2 (Pooled two sample t-test).
Here we are prepared to make the extra model assumption that
σX² = σY² = (say) σ².
Denote the sample variances by SX² = Σᵢ(Xi − X̄)²/(n − 1) and
SY² = Σⱼ(Yj − Ȳ)²/(m − 1).
Since both of these are independent estimates of the common
variance σ², we can combine them into the pooled estimate
Sp² = ((n − 1)SX² + (m − 1)SY²)/(n + m − 2)
= (Σᵢ(Xi − X̄)² + Σⱼ(Yj − Ȳ)²)/(n + m − 2).
The test statistic then becomes
T = (X̄ − Ȳ)/(Sp √(1/n + 1/m)), where T ∼ tn+m−2 when H0 is true.
Example 8.2.
To see that T does have the claimed distribution, note that we can
write T = U/√(Sp²/σ²), where U = (X̄ − Ȳ)/(σ√(1/n + 1/m)).
From (8.1) and since σX² = σY² = σ², U ∼ N(0, 1) when H0 is true.
Further, from the results of Section 5 we have that (independently):
Σᵢ(Xi − X̄)²/σX² ∼ χ²n−1 and Σⱼ(Yj − Ȳ)²/σY² ∼ χ²m−1.
Since the samples are independent and σX² = σY² = σ², Lemma 5.14 gives
Sp²/σ² = (Σᵢ(Xi − X̄)² + Σⱼ(Yj − Ȳ)²)/(σ²(n + m − 2)) ∼ χ²n+m−2/(n + m − 2).
Thus, from Definition 5.16, T ∼ tn+m−2 when H0 is true.
Welch two sample t-test
Example 8.3 (Welch two sample t-test).
In the general case, when σX² ≠ σY², the natural estimators of the
population variances are the corresponding sample variances.
Put
σ̂X² = SX² = Σᵢ(Xi − X̄)²/(n − 1),
σ̂Y² = SY² = Σⱼ(Yj − Ȳ)²/(m − 1).
The test statistic is then:
T = (X̄ − Ȳ)/√(σ̂X²/n + σ̂Y²/m).
Example 8.3.
A result due to Welch shows that T ≈ tν when H0 is true.
The degrees of freedom are computed as:
ν = (SX²/n + SY²/m)² / [ (1/(n−1))(SX²/n)² + (1/(m−1))(SY²/m)² ].
Note that, when the X and Y distributions have similar unimodal
shapes, the approximation to the distribution of the test statistic is
reasonably good even for small sample sizes.
Note also that, when the sample sizes m, n and sample variances SX2 ,
SY2 are similar for the two samples, then the degrees of freedom ν will
be close to the value n + m − 2 used in the pooled test.
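A sketch of the degrees-of-freedom calculation in R (welch_df is an illustrative name, not a built-in function):
> welch_df <- function(sx2, n, sy2, m)
+   (sx2/n + sy2/m)^2 / ((sx2/n)^2/(n-1) + (sy2/m)^2/(m-1))
> welch_df(10.04855, 11, 6.660833, 12)   # cider data of Example 8.4 below: approx 19.35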
Cider maker: two sample t-test
Example 8.4.
A cider maker tests if an insecticide spray increases the total crop
per tree.
From 23 trees, he chooses 11 at random and sprays them, leaving the
12 remaining ones unsprayed, and otherwise treats them identically.
Write X1 , . . . , X11 for yields from 11 sprayed trees.
Write Y1 , . . . , Y12 for yields from 12 unsprayed trees.
These yields (in kg) are available in the files apple.sprayed and
apple.unsprayed in Stats1.RData.
The summary statistics are: n = 11, x = 42.56, sx2 = 10.04855,
m = 12, y = 40.49, sy2 = 6.660833, sp2 = 8.27403.
Assume two independent samples: X1 , . . . , Xn is a random sample of
size n from the N(µX , σX2 ) distribution and Y1 , . . . , Ym is a random
sample of size m from the N(µY , σY2 ) distribution.
Compare H0 : µX − µY = 0 vs H1 : µX − µY > 0.
Cider maker: pooled two sample t-test
Example 8.4.
If we assume σX2 = σY2 = σ 2 , as in Example 8.2, we can use Sp2 .
Substituting the values above, we obtain n + m − 2 = 21 and
tobs = (x̄ − ȳ)/(sp √(1/n + 1/m)) = 1.7256.
The form of H1 implies that we are interested in values of T ≥ tobs.
Since R gives pt(1.7256,21) = 0.95045,
p-value = P(T ≥ tobs | H0 true) = P(t21 ≥ 1.7256) = 0.05.
Critical region: need α = P(T ≥ c∗ | H0 true), so c∗ = t21,α.
e.g. for α = 0.05, using qt(0.95,21) in R gives
c∗ = t21,0.05 = 1.7207, so C = {T ≥ 1.7207}.
Hence the p-value is close to 0.05 and tobs is close to the edge of the
critical region: some evidence that spraying increases the mean yield.
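The pooled analysis can be reproduced directly in R (a sketch, using the data arrays named above):
> t.test(apple.sprayed, apple.unsprayed,
+        var.equal = TRUE, alternative = "greater")
The output should show t ≈ 1.726 on 21 degrees of freedom and a p-value of about 0.05.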
Cider maker: Welch t-test
Example 8.4.
If we do not assume σX2 = σY2 , as in Example 8.3, we can use the
Welch approximation.
In this case
tobs = (x̄ − ȳ)/√(sx²/n + sy²/m) = 1.7098.
The degrees of freedom are computed as:
ν = (sx²/n + sy²/m)² / [ (1/(n−1))(sx²/n)² + (1/(m−1))(sy²/m)² ] = 19.35.
Cider maker: Welch t-test
Example 8.4.
The form of H1 implies that we are interested in values of T ≥ tobs .
R gives pt(1.7098,19.35) = 0.94835, so
p-value = P(T ≥ tobs | H0 true) = P(t19.35 ≥ 1.7098) = 0.052.
Critical region: need α = P(T ≥ c∗ | H0 true), so c∗ = t19.35,α.
e.g. for α = 0.05, using qt(0.95,19.35) in R gives
c∗ = t19.35,0.05 = 1.7275, so C = {T ≥ 1.7275}.
Hence p-value close to 0.05 and tobs close to edge of critical region,
some evidence spraying increases mean.
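Since the Welch test is R's default two sample t-test, the corresponding command simply drops var.equal (a sketch):
> t.test(apple.sprayed, apple.unsprayed, alternative = "greater")
The output should report t ≈ 1.71 on ν ≈ 19.35 degrees of freedom and a p-value of about 0.052.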
Section 8.4: Paired t-test
Example 8.5.
In certain circumstances (e.g. twin studies in medicine), we may have
reason to pair up the sample data values as (X1 , Y1 ), . . . , (Xn , Yn ).
Denote the difference between values in each pair by Wi = Xi − Yi .
The model assumption is then that W1 , . . . , Wn are a random sample
from the N(δ, σ 2 ) distribution, where δ and σ 2 are unknown.
The null hypothesis is H0 : δ = 0.
The analysis follows as for the one sample t-test in Example 7.7.
Put W̄ = (W1 + · · · + Wn)/n and σ̂W² = SW² = Σᵢ(Wi − W̄)²/(n − 1).
The test statistic is then
T = √n W̄/σ̂W, where T ∼ tn−1 when H0 is true.
We reject H0 if the value of the test statistic is significantly different
from zero, where the direction of the difference depends on H1 .
Remark 8.6.
Note that the model assumptions do not necessarily require that
X1 , . . . , Xn all have the same distribution.
For example, suppose that each Xi ∼ N(µi + δ, τ²) and that each
Yi ∼ N(µi, τ²), where the underlying mean values µi may all be
different.
The systematic difference δ is the same for each pair, which is still
consistent with the model assumptions above, since (with Xi and Yi
independent within each pair) it implies that each
Xi − Yi ∼ N(δ, σ²), where σ² = 2τ².
This type of experimental design may be particularly appropriate if
the experimental units are quite variable.
The following example shows that small but consistent systematic
differences may show up in an experiment that uses paired
observations, but not in one using two independent samples, when
small differences in mean are masked by high variability between the
experimental units.
Example: paired t-test
Example 8.7.
To test two water-repellents, five garments of different materials were
cut in half.
One half was treated with repellent A, the other with repellent B.
They were placed in a wet environment, and the amount of water
absorbed in grams was as follows:

Garment          1     2     3     4     5
Treatment A Xi   1.7   4.3   14.6  5.0   2.2
Treatment B Yi   1.4   3.9   14.2  4.2   2.0
Differences Wi   0.3   0.4   0.4   0.8   0.2

We assume that W1, . . . , W5 is a random sample from N(δ, σ²),
where both parameters are unknown.
We test the hypothesis H0 : δ = 0 (no systematic difference) against
H1 : δ 6= 0 (some difference).
Example: paired t-test
Example 8.7.
Put W̄ = (W1 + · · · + Wn)/n and SW² = Σᵢ(Wi − W̄)²/(n − 1).
For the given data w̄ = 0.42, sw² = 0.052.
The observed test statistic is tobs = √5 w̄/√(sw²) = 4.118.
The test statistic T ∼ t4 when H0 is true.
The alternative is two-sided, so interested in the set {|T | ≥ tobs }.
In terms of p-value, need P(|t4 | ≥ 4.118) = 2(1 − P(t4 ≤ 4.118)).
pt(4.118,4) = 0.9926, so deduce P(|t4 | ≥ 4.118) = 0.015.
In terms of the critical region, we need α = P(|t4| ≥ c∗) = 2P(t4 ≥ c∗).
Hence for α = 0.05, we need c∗ = t4;0.025 = 2.776 and
C = {|T| ≥ 2.776}.
Either approach shows there is good evidence to reject H0 – there is
evidence of a significant difference between the two treatments.
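A sketch of the same analysis in R (trtA and trtB are illustrative names for vectors holding the two treatment rows of the table):
> trtA <- c(1.7, 4.3, 14.6, 5.0, 2.2)
> trtB <- c(1.4, 3.9, 14.2, 4.2, 2.0)
> t.test(trtA, trtB, paired = TRUE)   # two-sided paired t-test of H0: delta = 0
The output should reproduce t ≈ 4.12 on 4 degrees of freedom and a p-value of about 0.015.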
Two-sample and paired t-test
Remark 8.8.
The Xi values are not independent of the Yi values.
The variability in the Xi values is very large relative to the variability
in the Wi values.
If you compare the two datasets in the water repellent example above
(Example 8.7) using a non-paired t-test, you will not find anything
significant.
For example, using t.test in R (see below) gives a p-value of 0.901.
However, the paired test compares the differences Wi, and isolates
the treatment effect.
Section 8.5: t-test procedures in R
Assume the two random samples are in data arrays xdata and ydata.
A test of the null hypothesis H0 : µX − µY = 0 against the two sided
alternative H1 : µX − µY 6= 0 can be performed using the command
> t.test(xdata,ydata)
The output includes the value of the test statistic, the degrees of
freedom ν for the approximating t-distribution and the (approximate)
p-value.
The option alternative="less" can be used to test against the
alternative H1 : µX − µY < 0, as in the command
> t.test(xdata,ydata,alternative="less")
Similarly the option alternative="greater" can be used to test
against the alternative H1 : µX − µY > 0.
Specific forms of the two sample t-test
The default two sample t-test in R is the Welch test.
Under the model assumption that the population variances are equal,
a pooled t-test of the null hypothesis H0 : µX − µY = 0 against the
two sided alternative H1 : µX − µY ≠ 0 can be performed using the
command
> t.test(xdata,ydata,var.equal=T)
For the paired t-test, the data is assumed to be in equal-length data
arrays xdata and ydata, where each component of xdata will be
paired with the corresponding component of ydata.
A paired t-test of the null hypothesis H0 : δ = 0 against the two sided
alternative H1 : δ ≠ 0 can then be performed using the command
> t.test(xdata,ydata,paired=T)
Section 9: Linear regression
Aims of this section:
In this section we provide a brief introduction to the ideas and
methods of simple linear regression.
Suggested reading: Rice, Sections 14.1 and 14.2
Objectives: by the end of this section you should be able to
State the model assumptions under which a simple linear
regression model is appropriate for describing and analysing a set
of data consisting of predictor values and corresponding responses.
Produce a scatter plot of response values against predictor values,
both by hand and in R .
Compute least squares estimates of the slope and intercept of the
fitted regression line, by hand and in R , and add the line to a
scatter plot of the data.
Comment critically on any deviations from the assumptions of the
model that are apparent from the plot of the data values together
with the fitted regression line.
Compute the fitted values and the residual values, plot the
residual values against either the predictor values or the fitted
values, and comment critically on any deviations from the
assumptions of the model that are apparent from the plot.
Section 9.1: Introduction
Instead of discrete groups (as in Section 8), we compare populations
of potential Y values for different values of a quantitative predictor
variable x.
In this course, we only consider one-dimensional x, and a linear
dependence of Y on x, but the subject of Regression becomes much
more general later – see (Generalized) Linear Models course.
In this notation, the data consist of a set of n pairs of values
(x1 , y1 ), . . . , (xn , yn ), corresponding to the n members of our sample.
Example 9.1.
We have data on the heights x and weights y of a sample of students and
are interested in how well height can be used to predict weight.
Example 9.2.
We have data on debt y and parental income x for a sample of students,
and wish to investigate whether parental income can explain debt.
Types of variables
We are interested in whether the associated variable xi can help
explain or predict values of the variable yi of interest.
The two variables play different roles, in that the original variable y
often depends on the associated variable x.
For example, changes in weight do not usually cause changes in
height, but a change in height (through growth) is usually associated
with an increase in weight.
Definition 9.3.
For that reason the variable of interest (our Y variable) is called the
response variable (an old-fashioned term is the dependent variable).
The associated variable (our x variable) is known as the predictor
variable or the explanatory variable (formerly the independent
variable).
Random effects
We also need to take account of random variation in the relationship
between the x and Y values.
For example, if we took repeated samples, then (even if the x values
were kept the same) the y values obtained would usually vary from
sample to sample.
Thus an appropriate framework is to assume that for each value x of
the explanatory variable there is a corresponding population of values
of Y with its own x-dependent distribution.
We call the function g (x) given by g (x) = E(Y |x) the regression of
Y on x.
In this framework, we look for a simple expression for g (x) = E(Y |x)
which is plausible over an appropriate range of x values.
Linear regression model
Definition 9.4.
The simple linear regression model says the relationship of E(Y|x) to
x is of the form
E(Y|x) = α + βx.
For this model, the basic questions of interest are:
What are good estimates of the unknown parameters α and β
(assuming the model is correct)?
How well do the data fit the model – is there any evidence from
the data that the model is not correct?
What evidence is there that Y really does depend on x (i.e. that
β ≠ 0)?
Section 9.2: Model assumptions
Definition 9.5.
Let x1 , . . . , xn be the values observed of a predictor variable X .
For each i = 1, . . . , n assume the value yi of the response variable is
an observed value of a random variable Yi, where
Yi = α + βxi + ei.
Here α and β are unknown parameters.
The ei are random variables, which we think of as errors.
We assume that
Eei = 0,
Var(ei) = σ² (unknown),
Cov(ei, ej) = 0 for i ≠ j (errors are uncorrelated).
Equivalent model and summary statistics
Lemma 9.6.
Under our assumptions, for given x1, . . . , xn, it is equivalent to say that
EYi = α + βxi,
Var(Yi) = σ²,
Cov(Yi, Yj) = 0 for i ≠ j (responses are uncorrelated).
Definition 9.7.
To simplify notation, we introduce summary statistics of the data. Write
x̄ = Σᵢ xi/n,
ȳ = Σᵢ yi/n,
ssxx = Σᵢ(xi − x̄)² = Σᵢ xi² − nx̄², (cf sample variance)
ssyy = Σᵢ(yi − ȳ)² = Σᵢ yi² − nȳ²,
ssxy = Σᵢ(xi − x̄)(yi − ȳ) = Σᵢ xiyi − nx̄ȳ. ('sample covariance')
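As a sketch (assuming generic data arrays xdata and ydata, the names used in Section 9.6), these summary statistics can be computed directly in R:
> xbar <- mean(xdata); ybar <- mean(ydata)
> ssxx <- sum((xdata - xbar)^2)
> ssyy <- sum((ydata - ybar)^2)
> ssxy <- sum((xdata - xbar)*(ydata - ybar))
They feed straight into the least squares estimates of Theorem 9.9 below, via betahat <- ssxy/ssxx and alphahat <- ybar - betahat*xbar.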
Section 9.3: Least squares estimates
Definition 9.8.
For a given model, define the least squares estimates to be the
parameter values that minimise
Σᵢ(yi − g(xi))² = Σᵢ(yi − E(Yi|xi))².
That is, we minimise the sums of squares of (vertical) distances
between yi and its expected value under the model.
For the simple linear regression model, the least squares estimates of α
and β are the values α̂ and β̂ minimising
Σᵢ(yi − (α + βxi))².
Finding α̂ and β̂
Theorem 9.9.
For the simple linear regression model, the least squares estimates
are given by
β̂ = ssxy/ssxx and α̂ = ȳ − β̂x̄.
Proof.
We can rewrite the term inside the sum of squares as
(yi − (α + βxi)) = ((yi − ȳ) − (α − ȳ + βx̄) − β(xi − x̄)).
Finding α̂ and β̂
Proof.
Hence the sum of squares satisfies
Σᵢ(yi − (α + βxi))² = Σᵢ((yi − ȳ) − (α − ȳ + βx̄) − β(xi − x̄))²
= ssyy + n(α − ȳ + βx̄)² + β²ssxx − 2βssxy.    (9.1)
Here the fact that Σ(xi − x̄) = 0 and Σ(yi − ȳ) = 0 makes some of
the cross-terms vanish.
Given β̂, we can minimise the second bracket by taking α̂ = ȳ − β̂x̄.
This leaves us choosing β to minimise ssyy + β²ssxx − 2βssxy, and
differentiating with respect to β, we find 2β̂ssxx = 2ssxy.
Section 9.4: Fitted values, residuals and predictions
Definition 9.10.
1. The fitted values (estimated value under the model for the ith
observation) are ŷi = α̂ + β̂xi.
2. The residual values (difference between the observed and fitted value)
are êi = yi − ŷi = yi − α̂ − β̂xi.
3. Define the residual sum of squares by
RSS = Σᵢ êi² = Σᵢ(Yi − α̂ − β̂xi)².
4. The best predictor of the value Y that would be observed at some x
value for which we have no data is ŷ = α̂ + β̂x.
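A sketch of prediction in R, using the lm object introduced in Section 9.6 below (the new x value 5 is purely illustrative):
> xyoutput <- lm(ydata ~ xdata)                        # fitted regression (see Section 9.6)
> predict(xyoutput, newdata = data.frame(xdata = 5))   # yhat = alphahat + betahat*5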
Properties of these quantities
Lemma 9.11.
1. Substituting the optimal values in Equation (9.1) above, we deduce
the extremely useful formula that
RSS = ssyy − ssxy²/ssxx.
2. We estimate σ² by σ̂² = RSS/(n − 2).
There are only n − 2 independent values of êi (cf. n − 1 independent
values of xi − x̄ in Definition 1.11).
The model is chosen to minimise the sum of squares of residuals.
However, systematic patterns in the residuals can indicate lack of fit
in the model – see Section 9.7.
Note that prediction can be less accurate if x lies outside the range of
observed xi (i.e. extrapolating from data rather than interpolating).
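These quantities can be checked in R (a sketch continuing the notation of the earlier sketches; deviance extracts the residual sum of squares from an lm fit):
> n <- length(xdata)
> RSS <- ssyy - ssxy^2/ssxx            # formula in part 1 above
> sigma2hat <- RSS/(n - 2)             # estimate of sigma^2
> all.equal(RSS, deviance(xyoutput))   # deviance() returns sum(residuals^2)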
Section 9.5: Leaning Tower of Pisa Example
Example 9.12.
Studies by engineers on the Leaning Tower of Pisa between 1975 and
1987 recorded the following data on the tilt of the tower.
Each tilt value in the table represents the difference from being
vertical.
The data are coded in tenths of a millimetre in excess of 2.9 metres,
so the 1975 tilt of 642 represents an actual difference of 2.9642
metres.
Only the last two digits of the year are shown.
The data are contained in the Statistics 1 data sets pisa.year and
pisa.tilt respectively.
Leaning Tower of Pisa Example
Example 9.12.
The summary statistics for the data set are:
n = 13, Σxi = 1053, Σyi = 9018, Σxi² = 85475, Σyi² = 6271714,
Σyixi = 732154.
This gives x̄ = 81, ȳ = 693.6923, ssxx = 182, ssxy = 1696 and
ssyy = 15996.77.
Thus the least squares estimates are
β̂ = ssxy/ssxx = 9.3187,
α̂ = ȳ − β̂x̄ = −61.1209,
giving the fitted regression line y = α̂ + β̂x = −61.1209 + 9.3187x.
Leaning Tower of Pisa Example
Example 9.12.
From this the fitted values and the residuals in the table can be
calculated, using the formulas
ŷi = α̂ + β̂xi, êi = yi − ŷi, i = 1, . . . , n.
A scatter plot of the data is shown on the left on the next page,
together with the fitted regression line.
There seems to be a good fit of the straight line to the data.
A plot of the residuals against the corresponding year is shown on the
right.
As we’d hope, the residuals appear to be fairly random, with no
obvious systematic pattern or systematic trend in variability (see
Section 9.7).
Leaning Tower of Pisa Example
Example 9.12.
[Figure: left panel 'Pisa − scatter plot' – pisa.tilt (640–760) against pisa.year (76–86) with the fitted line; right panel 'Pisa − residuals' – residuals(pisa) (−6 to 6) against pisa.year.]
Section 9.6: Fitting linear regression models using R
Example 9.13.
R has a simple command, lm, for fitting linear regression models.
This command produces an R object, containing numerical outputs
which can be accessed by applying appropriate commands
Assume the predictor (x) values are in a data array xdata and the
response (y ) values are in a data array ydata.
We can perform an initial analysis with the commands:
> plot(xdata,ydata)
> xyoutput <- lm(ydata ~ xdata)
> coef(xyoutput)
The first line produces an initial scatter plot,
The second line tells R to perform a linear regression with the
response values in ydata and the predictor values in xdata and to
store the output in the object xyoutput. The third line produces a
b
vector containing the least squares estimates α
b and β.
Example 9.13.
> plot(xdata,ydata, abline(coef(xyoutput)))
will produce a scatter plot together with the fitted regression line –
i.e. the line whose intercept is the first value and whose slope is
the second value in the vector coef(xyoutput) .
> fitted(xyoutput)
> residuals(xyoutput)
will output vectors of fitted values and residual values.
Thus, for example, we can plot the residuals against the predictor
values with the command:
> plot(xdata,residuals(xyoutput))
In Section 10, we will look at other outputs such as
summary(xyoutput), which produces (among other things)
estimates of σ² and of Var(α̂) and Var(β̂).
Leaning Tower of Pisa Example
Example 9.14.
For the Leaning Tower of Pisa example in Section 9.5, I put the
predictor (year) values in the vector pisa.year and the response
(tilt) values in the vector pisa.tilt.
I used the commands:
> attach(pisa); pisafit <- lm(tilt ~ year)
to perform the linear regression analysis and store the output in the
object pisafit.
I then inspected the scatter plot and the fitted line with the
commands:
> plot(year,tilt, abline(coef(pisafit)))
and inspected the values of the least squares estimates with:
> coef(pisafit)
which gave output:
(Intercept)        year
 -61.120879    9.318681
Example 9.14.
Finally I inspected the fitted values and the values of the residuals
with the commands:
> fitted(pisafit)
> residuals(pisafit)
I plotted the residuals against the predictor (year) values with the
command:
> plot(year, residuals(pisafit))
For those who are interested, I used the segments command,
specifically
> segments(year,0,,residuals(pisafit)); abline(h=0)
to add the extra lines – see help(segments).
Section 9.7: Quality of fit of linear regression model
One way of assessing the fit of a model is by examining a plot of the
residuals ê1, ê2, . . . , ên. These can be plotted against the predictor
values x1, x2, . . . , xn or the fitted values ŷ1, ŷ2, . . . , ŷn.
If the model in Section 9.2 is correct, then e1 , e2 , . . . , en is a random
sample from a distribution with expectation 0 and variance σ 2 .
We cannot observe or calculate e1 , e2 , . . . , en , but we can look at their
estimates eb1 , eb2 , . . . , ebn instead.
In linear regression examples, you should always plot the points on a
scatter plot, draw in the estimated regression line, and plot the
residuals.
Diagnostics
What we should see:
no systematic pattern in the size or sign of the residuals (so the linear
model is correct)
and, if we assume normally distributed errors (see next chapter),
additionally:
a roughly symmetric distribution of the residuals about 0
very few extreme outliers (residuals ≥ 3σ̂ or ≤ −3σ̂, say)
If what we see departs from this ideal, we may be able to judge from
the pattern we can see how to change the model so that it does fit.
This information may not be at all apparent just from the summary
data values (see Example 9.15 below)
We might allow the error variance σ 2 to depend on x, or we could
include a quadratic term in the model, like E(Y |x) = α + βx + γx 2 .
This is beyond the scope of this unit.
Anscombe’s Quartet
Example 9.15.
The artificial example below, due to Anscombe, brings out this point.
It consists of four artificial data sets, each of 11 data pairs, with the
same values of the relevant summary statistics.
Thus each data set gives rise to exactly the same regression line and
exactly the same inferences for α, β and σ 2 .
The data are contained in the Statistics 1 data set anscombe.
The summary statistics for each data set are (approximately):
n = 11, Σxi = 99, Σyi = 82.5, Σxi² = 1001, Σyi² = 660, Σyixi = 797.5.
Anscombe’s Quartet
Example 9.15.
[Figure: four scatter plots – y1 against x1, y2 against x2, y3 against x3 and y4 against x4 (x axes 0–20, y axes 0–15) – each with the same fitted regression line.]
Anscombe’s Quartet
Example 9.15.
From the scatter plots with the fitted regression lines, we see
immediately that there is a lack of fit for data sets 2, 3 and 4:
in data set 2 the relationship between x and y is quadratic rather than
linear, so the simple linear model is incorrect;
in data set 3 the simple linear regression model is correct, but a very
clear regression line is distorted by the effect of a single outlier;
in data set 4, the regression line is particularly sensitive to the y value
for the single observation taken at x = 19, and it is impossible to tell
from this choice of x values whether or not a simple linear regression
model is suitable.
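Conveniently, a version of these data ships with base R as the data frame anscombe, so the quartet can be reproduced in a few lines (a sketch; the panel for data set 2 is shown, and the other panels follow the same pattern):
> plot(anscombe$x2, anscombe$y2)
> abline(lm(y2 ~ x2, data = anscombe))   # the common fitted line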
Section 10: Linear Regression: Confidence Intervals &
Hypothesis Tests
Aims of this section:
We continue with the simple linear regression model, under the
extra assumption of normality.
We use confidence intervals and hypothesis tests to investigate
the pattern of variation of the mean value of a response variable
Y with the corresponding value of a quantitative predictor
variable x.
Suggested reading: Rice, Sections 14.1–14.2
Objectives: by the end of this section you should be able to
State the model assumptions for the simple Normal linear
regression model.
Derive the mean, variance and distribution of the estimators of
the slope and intercept of the fitted regression line, and calculate
the corresponding standard errors.
Construct exact confidence intervals for the values of both the
slope and the intercept.
Perform standard hypothesis tests on the values of both the slope
and the intercept.
Use the summary() command in R to calculate confidence
intervals and perform hypothesis tests on the values of both the
slope and the intercept.
Section 10.1: Simple normal linear regression
We make an extra assumption on the distribution of e.
This lets us perform hypothesis tests and find confidence intervals.
Definition 10.1.
Let x1, . . . , xn be the n given values for the predictor variable X.
For i = 1, . . . , n, assume that the value yi of the response variable
is an observed value of the random variable Yi.
Further assume that
1. Yi = α + βxi + ei,
2. where the ei are independent identically distributed (IID) N(0, σ²),
3. and α, β and σ² are unknown parameters.
Checking assumption of normality
Remark 10.2.
Note that for given x1 , . . . , xn , the following are equivalent:
(i) ei are IID normal N(0, σ²).
(ii) E(ei) = 0, Var(ei) = σ², ei independent, ei normal.
(iii) E(Yi) = α + βxi, Var(Yi) = σ², Yi independent, Yi normal.
Note that we cannot check the assumption that Yi ∼ N(α + βxi, σ²)
from the data by simply making a histogram, stem-and-leaf plot, or
QQ plot of the data y1, y2, . . . , yn, since the observations all have
normal distributions with different means.
But we can carry out a check after the linear regression has been
fitted, by looking at the residuals.
Continuing the example in section 9.5, typing
> qqnorm(residuals(xyoutput))
shows a Normal Q-Q plot of the residuals and helps check for
non-Normality.
Section 10.2: Properties of α̂, β̂ and σ̂²
If we took repeated independent samples of the Yi, keeping the
predictor values fixed at x1, . . . , xn, the values of α̂ and β̂ would vary
from sample to sample as the values for y1, . . . , yn vary.
Theorem 10.3.
If {ei} are normally distributed (as in Definition 10.1) then:
(i) β̂ ∼ N(β, σ²/ssxx)
(ii) α̂ ∼ N(α, σ²(1/n + x̄²/ssxx)) = N(α, σ² Σᵢ xi²/(n ssxx)).
Remark 10.4.
1. In fact, these values of mean and variance hold without assuming the
ei are normal.
2. Under the assumption of normality, the least squares estimates (α̂, β̂)
are also the maximum likelihood estimates (so we have two good
reasons to think they will be reasonable estimates).
Proof of Theorem 10.3(i)
Proof.
Note that, as Σᵢ(xi − x̄) = 0, we have
ssxy = Σᵢ(xi − x̄)(yi − ȳ) = Σᵢ(xi − x̄)yi.
Thus, considered as a random variable,
β̂ = ssxy/ssxx = Σᵢ ((xi − x̄)/ssxx) Yi = Σᵢ bi Yi.
Here, for given fixed values of x1, . . . , xn, the coefficients
bi = (xi − x̄)/ssxx, i = 1, . . . , n, are fixed constants and the Yi are
independent N(α + βxi, σ²).
From Lemma 5.8 we can immediately deduce that βb has a normal
distribution, since it is a linear combination of independent normals.
Proof.
Note that
Σᵢ bi = (1/ssxx) Σᵢ(xi − x̄) = 0,
Σᵢ bi xi = Σᵢ(xi − x̄)xi/ssxx = ssxx/ssxx = 1,
Σᵢ bi² = (1/ssxx²) Σᵢ(xi − x̄)² = ssxx/(ssxx)² = 1/ssxx.
To calculate the mean and variance for β̂:
E(β̂) = E(Σᵢ bi Yi) = Σᵢ bi(α + βxi) = α Σᵢ bi + β Σᵢ bi xi = β.
Since the Yi are independent,
Var(β̂) = Var(Σᵢ bi Yi) = Σᵢ bi² Var(Yi) = σ² Σᵢ bi² = σ²/ssxx.
Proof of (ii)
Proof.
Derivation of the distribution for α̂ is very similar to that for β̂.
We start by noting that
α̂ = Ȳ − β̂x̄ = Σᵢ Yi/n − x̄ Σᵢ bi Yi = Σᵢ Yi(1/n − bi x̄) = Σᵢ ai Yi,
where ai = (1/n − bi x̄), i = 1, . . . , n.
This in turn means α̂ has a Normal distribution and gives
E(α̂) = α Σᵢ ai + β Σᵢ ai xi and Var(α̂) = σ² Σᵢ ai².
Finally, using the facts that Σᵢ bi = 0, Σᵢ bi xi = 1 and
Σᵢ bi² = 1/ssxx, one can easily deduce that
Σᵢ ai = 1, Σᵢ ai xi = 0 and Σᵢ ai² = 1/n + x̄²/ssxx = Σᵢ xi²/(n ssxx).
This means that E(α̂) = α and Var(α̂) = σ²(1/n + x̄²/ssxx).
t-distributions for α̂ and β̂
Although σ² is unknown, we can deal with it in the usual way, using the t-distribution.
We combine Lemma 10.5 below and Theorem 10.3.
Lemma 10.5.
(n − 2)σ̂²/σ² ∼ χ²n−2.
Key fact: this holds independently of α̂ and β̂.
It means that (since a χ²r random variable has mean r and variance 2r – see Remark 5.12):
E(σ̂²) = σ² and Var(σ̂²) = 2σ⁴/(n − 2).
Proof.
Not proved here.
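As a quick simulation sanity check of these mean and variance values (a sketch with illustrative parameter values; lm() and residuals() are the standard R functions used here):

set.seed(2)
x <- 1:12; alpha <- 1; beta <- 2; sigma <- 3; n <- length(x)
sigma2hat <- replicate(5000, {
  y <- alpha + beta * x + rnorm(n, 0, sigma)
  sum(residuals(lm(y ~ x))^2) / (n - 2)    # sigma-hat^2 for this simulated sample
})
mean(sigma2hat)    # close to sigma^2 = 9
var(sigma2hat)     # close to 2 * sigma^4 / (n - 2) = 16.2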
Form of t-distributions
Theorem 10.6.
Write sα̂ = √(σ̂²(1/n + x̄²/ssxx)) for the standard error for α̂;
sα̂ is an estimate of the standard deviation of α̂.
Write sβ̂ = √(σ̂²/ssxx) for the standard error for β̂.
If the assumptions of Definition 10.1 hold, then
(α̂ − α)/sα̂ ∼ tn−2 and (β̂ − β)/sβ̂ ∼ tn−2.
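As a sketch, assuming the data are held in R vectors xdata and ydata (the names used in Section 10.6 below), the estimates and both standard errors can be computed directly from these formulas, with σ̂² = (ssyy − ssxy²/ssxx)/(n − 2) as before:

n <- length(xdata)
xbar <- mean(xdata); ybar <- mean(ydata)
ssxx <- sum((xdata - xbar)^2)
ssxy <- sum((xdata - xbar) * (ydata - ybar))
ssyy <- sum((ydata - ybar)^2)
betahat   <- ssxy / ssxx
alphahat  <- ybar - betahat * xbar
sigma2hat <- (ssyy - ssxy^2 / ssxx) / (n - 2)          # estimate of sigma^2
s.alpha <- sqrt(sigma2hat * (1/n + xbar^2 / ssxx))     # standard error of alpha-hat
s.beta  <- sqrt(sigma2hat / ssxx)                      # standard error of beta-hat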
Distribution of α̂
Proof.
Theorem 10.3 gives that
(α̂ − α) / (σ √(1/n + x̄²/ssxx)) ∼ N(0, 1).
Lemma 10.5 gives that (independently)
σ̂²/σ² ∼ χ²n−2/(n − 2).
Hence, using Definition 5.16, we know that
(α̂ − α)/sα̂ = [(α̂ − α) / (σ √(1/n + x̄²/ssxx))] × [1/√(σ̂²/σ²)] ∼ N(0, 1)/√(χ²n−2/(n − 2)),
as required.
Distribution of β̂
Proof.
Theorem 10.3 gives that
(β̂ − β) / (σ √(1/ssxx)) ∼ N(0, 1).
Lemma 10.5 gives that (independently)
σ̂²/σ² ∼ χ²n−2/(n − 2).
Hence, using Definition 5.16, we know that
(β̂ − β)/sβ̂ = [(β̂ − β) / (σ √(1/ssxx))] × [1/√(σ̂²/σ²)] ∼ N(0, 1)/√(χ²n−2/(n − 2)),
as required.
Section 10.4: Confidence intervals for α and β
Example 10.7.
Theorem 10.6 shows that if the assumptions of Definition 10.1 hold then (α̂ − α)/sα̂ ∼ tn−2.
Hence we can obtain a 100(1 − γ)% confidence interval for α:
P(−tn−2;γ/2 ≤ (α̂ − α)/sα̂ ≤ tn−2;γ/2) = 1 − γ.
We can make α the subject of this in the usual way, to obtain
P(α̂ − tn−2;γ/2 sα̂ ≤ α ≤ α̂ + tn−2;γ/2 sα̂) = 1 − γ.
Hence, under the assumptions of Definition 10.1, taking
(cL(α), cU(α)) = (α̂ − tn−2;γ/2 sα̂, α̂ + tn−2;γ/2 sα̂),
(cL(β), cU(β)) = (β̂ − tn−2;γ/2 sβ̂, β̂ + tn−2;γ/2 sβ̂)
gives a 100(1 − γ)% confidence interval for α and for β respectively.
Example – the Leaning Tower of Pisa
Example 10.8.
We previously met this example in Example 9.12.
Some of the basic arithmetic:
n = 13, Σi xi = 1053, Σi yi = 9018, Σi xi² = 85475, Σi yi² = 6271714, Σi xi yi = 732154.
So x̄ = 81, ȳ = 693.6923, ssxx = 182, ssyy = 15996.77, ssxy = 1696.
Then β̂ = 9.319, α̂ = −61.121, σ̂² = 17.481, s²α̂ = 631.51, s²β̂ = 0.096047.
Finally, σ̂ = 4.181, sα̂ = 25.130, sβ̂ = 0.3099.
We make the standard assumptions of Definition 10.1, so Theorem 10.6 implies that (α̂ − α)/sα̂ ∼ tn−2 = t11.
Example – the Leaning Tower of Pisa
Example 10.9.
Example 10.7 shows that a 100(1 − γ)% confidence interval for α is
(cL, cU) = (α̂ − tn−2;γ/2 sα̂, α̂ + tn−2;γ/2 sα̂).
For example, if we want a 90% confidence interval, tn−2;γ/2 = t11;0.05 is 1.796, so
(cL, cU) = (−61.121 − 1.796 × 25.130, −61.121 + 1.796 × 25.130)
= (−106.25, −15.99).
Similarly t11;0.025 = 2.201, so a 95% confidence interval for β is
(cL, cU) = (β̂ − tn−2;γ/2 sβ̂, β̂ + tn−2;γ/2 sβ̂)
= (9.319 − 2.201 × 0.3099, 9.319 + 2.201 × 0.3099)
= (8.637, 10.001).
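These intervals can be reproduced in R (a short sketch; the numerical values are the ones computed in Example 10.8, and qt() gives the t-distribution percentage points):

n <- 13
alphahat <- -61.121; s.alpha <- 25.130
betahat  <- 9.319;   s.beta  <- 0.3099
alphahat + c(-1, 1) * qt(0.95, df = n - 2)  * s.alpha   # 90% CI for alpha: qt(0.95, 11) = 1.796
betahat  + c(-1, 1) * qt(0.975, df = n - 2) * s.beta    # 95% CI for beta:  qt(0.975, 11) = 2.201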
Section 10.5: Hypothesis tests for β
As in Definition 10.1, assume that the ei are IID N(0, σ²).
The model Yi = α + βxi + ei would be simpler if β = 0.
In particular, we might want to know if the expected value of the response variable varies systematically with the predictors.
If not, then we can simplify the model by removing β.
We can place all of this in our standard hypothesis testing framework.
Remark 10.10.
We can use a similar argument, based on the fact that (α̂ − α)/sα̂ ∼ tn−2, to test whether α = 0.
However, this corresponds to the regression line passing through the origin, which is often not an interesting hypothesis.
Hypothesis tests for β
Example 10.11.
Model: We make the model assumptions of Definition 10.1.
Hypotheses: we test H0: β = 0 vs H1: β ≠ 0.
Test statistic: Theorem 10.6 shows that (β̂ − β)/sβ̂ ∼ tn−2.
This suggests the use of T = β̂/sβ̂.
When H0 is true (i.e. β = 0) then T ∼ tn−2.
p-value: Since we consider a two-sided alternative, with tobs = β̂/sβ̂:
p-value = P(|T| ≥ |tobs| | H0 true) = P(|tn−2| ≥ |tobs|) = 2(1 − P(tn−2 ≤ |tobs|)).
Critical value: The critical region for a γ-level test is C = {|T| ≥ c*}, where c* is defined by c* = tn−2;γ/2, since
γ = P(reject H0 | H0 true) = P(|T| ≥ c* | H0 true) = P(|tn−2| ≥ c*) = 2P(tn−2 ≥ c*).
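A small R helper sketching this procedure; the function name and arguments are my own, for illustration only, and are not part of the course code:

# two-sided test of H0: beta = 0 from the estimate, its standard error and the sample size
beta.test <- function(betahat, s.beta, n, gamma = 0.05) {
  tobs <- betahat / s.beta
  list(tobs = tobs,
       p.value = 2 * (1 - pt(abs(tobs), df = n - 2)),   # 2(1 - P(t_{n-2} <= |tobs|))
       critical.value = qt(1 - gamma / 2, df = n - 2))  # c* = t_{n-2; gamma/2}
}
# e.g. beta.test(9.319, 0.3099, 13) reproduces the Pisa calculation in Example 10.12 below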
Example – the Leaning Tower of Pisa
Example 10.12.
We return to the setting of Examples 9.12 and 10.8.
We wish to test H0: β = 0 vs H1: β ≠ 0 (does the mean tilt vary with year?).
As in Example 10.11, we use T = β̂/sβ̂ ∼ tn−2.
In this case, tobs = 9.319/0.3099 = 30.071, to compare with a t11 distribution.
p-value: given by 2(1 − P(t11 ≤ 30.071)).
R gives pt(30.071,11) = 1 (to machine precision), so the reported p-value is zero.
In fact, using the lm() command gives a p-value of 6 × 10⁻¹².
Critical region: similarly, for a test with significance level γ = 0.05, we have tn−2;γ/2 = t11;0.025 = 2.201. Hence C = {|T| ≥ 2.201}.
The p-value is extremely small, and tobs lies well inside the critical region.
This is very strong evidence that H0 is not true – that the mean tilt does vary with year.
This is consistent with the fact that 0 ∉ (cL, cU) for the confidence interval for β in Example 10.9.
Section 10.6: Confidence Intervals and Hypothesis Tests using the summary command in R
Consider the simple Normal linear regression model Yi = α + βxi + ei, where the ei are i.i.d. N(0, σ²).
Assume the predictor values x1, . . . , xn are contained in an R data vector called xdata and the response values y1, . . . , yn are contained in an R data vector called ydata.
We have already seen how to produce the output using the R command xyoutput <- lm(ydata ~ xdata).
We can perform exploratory data analysis, estimation and assessment of fit using the follow-up commands plot, coef, fitted, and residuals.
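For reference, a short sketch of that workflow (standard R commands applied to the fitted object):

xyoutput <- lm(ydata ~ xdata)      # fit the simple linear regression
coef(xyoutput)                     # least squares estimates alpha-hat and beta-hat
fitted(xyoutput)                   # fitted values alpha-hat + beta-hat * x_i
residuals(xyoutput)                # residuals y_i minus fitted values
plot(xdata, ydata)                 # scatter plot of the data
abline(xyoutput)                   # add the fitted regression line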
Summary command in R
For confidence intervals and hypothesis tests, most of the necessary information can be obtained with the summary command.
For example summary(xyoutput) produces the following output, where the formulas shown in each box are replaced in the actual output by their numerical values.
The output:
(i) reproduces the formula used to produce the output, to show exactly which model is being analysed;
(ii) produces a five-number summary of the residual values (or lists in full the numerical values of the residuals if there are only a few of them);
(iii) lists the relevant values for confidence intervals and hypothesis tests – first for α (the intercept in the model) and then for β (the coefficient of xdata in the model);
(iv) lists numerical values relevant to estimating σ² (or, more precisely, σ);
(v) gives information on the R-squared and F-statistic values (not covered in this unit).
Call:
lm(formula = ydata ~ xdata)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
             Estimate            Std. Error                     t value     Pr(>|t|)
(Intercept)  α̂ = ȳ − β̂x̄        sα̂ = σ̂ √(1/n + x̄²/ssxx)      α̂/sα̂       2(1 − Ftn−2(|α̂/sα̂|))
xdata        β̂ = ssxy/ssxx      sβ̂ = σ̂/√ssxx                  β̂/sβ̂       2(1 − Ftn−2(|β̂/sβ̂|))
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: σ̂ = √((ssyy − ssxy²/ssxx)/(n − 2)) on n − 2 degrees of freedom
In particular, the values on the line beginning (Intercept) are:
(i) α̂ (the estimate of α),
(ii) sα̂ (the standard error, which estimates the standard deviation of α̂),
(iii) tobs = α̂/sα̂ (the observed test statistic for testing H0: α = 0 vs. H1: α ≠ 0),
(iv) P(|W| > |tobs|), where W ∼ tn−2 (the p-value of the data for the test).
The result of a hypothesis test of H0: α = 0 vs. H1: α ≠ 0 can then be deduced immediately from the corresponding p-value.
Moreover, the endpoints for a 100(1 − γ)% confidence interval for α can be calculated using the values of α̂, sα̂ and the appropriate t-distribution percentage point tn−2;γ/2.
The values on the line beginning xdata are the corresponding quantities for estimating, constructing confidence intervals or performing hypothesis tests on β:
i.e. (i) β̂, (ii) sβ̂, (iii) tobs = β̂/sβ̂, and (iv) P(|W| > |tobs|), where W ∼ tn−2.
A 100(1 − γ)% confidence interval for β can be obtained in a similar manner to that for α.
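For example, a confidence interval can be assembled from the summary output without retyping the numbers (a sketch; coef(summary(...)) is the standard way to extract the coefficient table as a matrix):

tab <- coef(summary(xyoutput))             # matrix with Estimate, Std. Error, t value, Pr(>|t|)
n <- length(xdata)
gamma <- 0.05
est <- tab["xdata", "Estimate"]            # beta-hat
se  <- tab["xdata", "Std. Error"]          # s_beta-hat
est + c(-1, 1) * qt(1 - gamma/2, df = n - 2) * se   # 95% confidence interval for beta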
Section 10.7: The Leaning Tower of Pisa example in R
Example 10.13.
In our previous analysis (Example 9.12) we had typed pisa<-lm(pisa.tilt~pisa.year) to carry out the linear regression.
Applying the summary(pisa) command using this previous result produces the output below.
You can (and should) check that the values shown correspond to the appropriate values calculated in your notes when we constructed confidence intervals and performed hypothesis tests on α and β.
From the output we can, for example, immediately read off the least squares estimate β̂ = 9.3187 and its standard error sβ̂ = 0.3099.
We can also see that the p-value for testing H0: β = 0 versus H1: β ≠ 0 is extremely small (6.5 × 10⁻¹²), so there is very strong evidence that β is not zero and that the mean tilt does vary significantly with the year.
Example 10.13.
Call:
lm(formula = pisa.tilt ~ pisa.year)

Residuals:
    Min      1Q  Median      3Q     Max
-5.9670 -3.0989  0.6703  2.3077  7.3956

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -61.1209    25.1298  -2.432   0.0333 *
pisa.year     9.3187     0.3099  30.069  6.5e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.181 on 11 degrees of freedom
Multiple R-Squared: 0.988,     Adjusted R-squared: 0.9869
F-statistic: 904.1 on 1 and 11 DF,  p-value: 6.503e-12
Parametric Families Summary Sheet
Each entry lists: family; parameter values; pmf or pdf; X values; mean; variance.

Bernoulli(θ): 0 < θ < 1; pX(x; θ) = θ^x (1 − θ)^(1−x); x = 0, 1; E(X; θ) = θ; Var(X; θ) = θ(1 − θ).

Binomial(K, θ) (K known): 0 < θ < 1; pX(x; θ) = (K choose x) θ^x (1 − θ)^(K−x); x = 0, 1, . . . , K; E(X; θ) = Kθ; Var(X; θ) = Kθ(1 − θ).

Geometric(θ): 0 < θ < 1; pX(x; θ) = θ(1 − θ)^(x−1); x = 1, 2, . . .; E(X; θ) = 1/θ; Var(X; θ) = (1 − θ)/θ².

Poisson(θ): 0 < θ < ∞; pX(x; θ) = e^(−θ) θ^x / x!; x = 0, 1, 2, . . .; E(X; θ) = θ; Var(X; θ) = θ.

Uniform(0, θ): θ > 0; fX(x; θ) = 1/θ; 0 < x < θ; E(X; θ) = θ/2; Var(X; θ) = θ²/12.

Exponential(θ): θ > 0; fX(x; θ) = θ e^(−θx); x > 0; E(X; θ) = 1/θ; Var(X; θ) = 1/θ².

Gamma(α, λ): α > 0, λ > 0; fX(x; α, λ) = λ^α x^(α−1) e^(−λx) / Γ(α); x > 0; E(X; α, λ) = α/λ; Var(X; α, λ) = α/λ².

Normal(μ, σ²): −∞ < μ < ∞, σ² > 0; fX(x; μ, σ²) = (1/√(2πσ²)) e^(−(x−μ)²/(2σ²)); −∞ < x < ∞; E(X; μ, σ²) = μ; Var(X; μ, σ²) = σ².
Examples of pdfs from the parametric families

[Figure: example probability density functions in the Exp(θ), Unif(0, θ), Gamma(α, λ) and Normal(μ, σ²) families, with each panel showing densities for several parameter values.]