Chapter 1: Introduction
1.1 What is Statistics?
Statistics involves collecting, analysing, presenting and interpreting data.
We frequently see statistical tools (such as bar charts, tables, plots of data, averages and percentages) on TV, in
newspapers and in magazines. Such methods used to organise and summarise data, so as to increase the
understanding of the data, are called descriptive statistics.
Statistics is also used in practice in many different walks of life, going beyond simple data summarisation to
answer a wide variety of questions such as:
• Medicine: Does a certain new drug prolong life for AIDS sufferers?
• Science: Is global warming really happening?
• Education: Are GCSE and A level examination standards declining?
• Psychology: Is the national lottery making us a nation of compulsive gamblers?
• Sociology: Is the gap between rich and poor widening in Britain?
• Business: Do Persil adverts really make us want to buy Persil?
• Finance: What will interest rates be in 6 months' time?
1.2 Populations and Samples
Suppose that we wanted to investigate whether smoking during pregnancy leads to lower birth weight of babies.
We use this example to illustrate the following definitions.
Definitions:
• Experimental unit: the object on which measurements are made. For the above example, we are measuring birth weights of newborn babies, so a unit is a newborn baby.
• Variable: a measurable characteristic of a unit. For the above example, the variable is birth weight.
• Population: the set of all units about which information is required. For the above example, the population is all newborn babies.
• Sample: a subset of units of the population for which we can observe the variable of interest. For the above example, a sample would be the observed birth weights for a set of newborn babies (which will be a subset of all newborn babies).
• Random sample: a sample such that each unit in the population has the same chance of being chosen, independently of whether or not any other unit is chosen.
To determine whether smoking during pregnancy leads to lower birth weight of babies, we would compare a random sample of weights of newborn babies whose mothers smoke with a random sample of weights of newborn babies of non-smoking mothers. By analysing the sample data, we would hope to be able to draw conclusions about the effects on birth weight of smoking during pregnancy for all babies (i.e. the population).
The process of using a random sample to draw conclusions about a population is called statistical inference.
If we do not have a random sample, then sampling bias can invalidate our statistical results. For example, birth
weights of twins are generally lower than the weights of babies born alone. So if all the non-smoking mothers in
the sample were giving birth to twins, whereas all the smoking mothers were giving birth to single babies, then
the conclusions we draw about the effects of smoking in pregnancy will not necessarily be correct as they are
affected by sampling bias.
Different units of the same population will have different values of the same variable; this is called natural variation. For example, obviously the weights of all newborn babies are not the same. So different samples will contain different data; this is called sampling variability. Therefore it is important to bear in mind that slightly different conclusions could be reached from different samples.
1.3 Types of Data
Different types of data require different types of analysis. The type of data set is determined by several factors:
• Type of variable:
  – quantitative data, i.e. numerical (e.g., heights of students, number of phone calls in an hour);
  – qualitative data, i.e. non-numerical (e.g., eye colour, M/F).
  Quantitative data can be subdivided further:
  – discrete: a discrete variable can take only particular values (e.g., number of phone calls received at an exchange);
  – continuous: a continuous variable can take any value in a given range (e.g., heights of students).
• Number of variables measured:
  – 1 variable: univariate data.
  – 2 variables: bivariate data. E.g., we may have both the heights and weights of a set of individuals. The data set then consists of pairs of observations on each unit, such as (1.7m, 65kg).
  – 3 or more variables: multivariate data. E.g., we have heights, weights, eye colour and gender for a group of individuals. In this case the data set consists of sets of 4 observations made on each unit, such as (1.7m, 65kg, blue, M).
• Number of samples: For example, when investigating the effects of smoking during pregnancy, we would observe two samples:
  – a sample of birth weights of babies born to smoking mothers;
  – a sample of birth weights of babies born to non-smoking mothers.
• Relationship of samples (if more than 1 sample):
  – Are the samples independent? E.g., the two birth weight samples should be independent.
  – Are the samples dependent?
• Example:
Suppose that a doctor would like to assess the effectiveness of changing to a low-fat diet in lowering cholesterol
for a group of patients. To do this the doctor might measure the cholesterol of the patients before starting on the
low-fat diet and then measure the cholesterol for the same patients after they have been on the low-fat diet. We
therefore have 2 samples of measured cholesterol:
 a sample before the diet
 a sample after the diet.
However, the 2 samples are not independent, since the cholesterol measurements for each sample were taken on
the same patients. Samples of this type are called matched pair data.
1.4 Recommended Books
You will need to use statistical tables for the course. The tables used in the exams are:
• Lindley, D.V. and Scott, W.F., New Cambridge Elementary Statistical Tables, C.U.P., 1984.
Statistical tables will be used throughout this course.
There are many books which cover the material in this course. Some good books are:
• Ross, S.M., Introduction to Probability and Statistics for Engineers and Scientists (with CD-ROM).
• Walpole, R.E., Myers, R.H., Myers, S.L. and Ye, K., Probability and Statistics for Engineers and Scientists, 7th edition, Prentice Hall, 2002.
• Clarke, G.M. and Cooke, D., A Basic Course in Statistics, Edward Arnold, 4th edition, 1999.
• Daly, F., Hand, D.J., Jones, M.C., Lunn, A.D. and McConway, K.J., Elements of Statistics, Open University, 1995. Goes beyond what's required for this course, but is quite clearly written with some real examples.
• Devore, J. and Peck, R., Introductory Statistics, West, 1990. Rather simplistic at times, but has lots of real examples. Especially good if you have not done any statistics before.
• Spiegel, M.R., Probability and Statistics, Schaum Outline Series, 1988.
In addition, you could browse in the library around QA276 and find a book which suits you. For starters you could try looking at some of the following:
• Anderson, D.R., Sweeney, D.J. and Williams, T.A., Introduction to Statistics: Concepts and Applications, West, 2nd edition, 1991.
• Bassett, E.E., Bremner, J.M., Jolliffe, I.T., Jones, B., Morgan, B.J.T. and North, P.M., Statistics: Problems and Solutions, Edward Arnold, 1986.
• Moore, D.S., The Basic Practice of Statistics, Freeman, 1995.
• Moore, D.S., Think and Explain with Statistics, Addison-Wesley, 1986.
• Moore, D.S., Statistics: Concepts and Controversies, Freeman, 1991, 1985, 1979.
There are many online books which could be useful. See for example http://www.statsoft.com/textbook/stathome.html
Chapter 2: Graphical and Numerical Statistics
2.1 Histograms
Histograms give a visual representation of continuous data. We consider two separate cases corresponding to
when (i) all the bars in the histogram have the same width; (ii) the intervals are of variable widths.
2.1.1 Histograms with equal class widths
• Example:
Mercury contamination can be particularly high in certain types of fish. The mercury content (ppm) in the hair of 40 fishermen in a region thought to be particularly vulnerable is given below. (From the paper "Mercury content of commercially imported fish of the Seychelles, and hair mercury levels of a selected part of the population", Environ. Research, (1983), 305-312.)
13.26  32.43  18.10  58.23  64.00  68.20  35.35  33.92
23.94  18.28  22.05  39.14  31.43  18.51  21.03   5.50
 6.96   5.19  28.66  26.29  13.89  25.87   9.84  26.88
16.81  38.65  19.23  21.82  31.58  30.13  42.42  16.51
21.16  32.97   9.84  10.64  29.56  40.69  12.86  13.80
The first step is to group the data. A reasonable choice of class intervals is:
0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70.
The frequency table that results from the use of these intervals is:
Interval   Frequency
0-10       5
10-20      11
20-30      10
30-40      9
40-50      2
50-60      1
60-70      2

N.B. By convention, any observation that is at a boundary of a class will be put into the higher class. For example, an observation of 10 above would be put into the 10-20 category.
To construct the histogram in this situation (i.e. all class widths equal):
• Mark boundaries of the class intervals on the horizontal axis.
• The height of the bars above each interval can be taken as the frequency for that interval.
[Figure: A histogram showing mercury contamination in hair. Horizontal axis: mercury content (ppm), 0-70; vertical axis: frequency.]
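The frequency table and histogram can be reproduced in software. Below is a minimal Python sketch (not part of the original notes; it assumes the matplotlib library is available) that counts the class frequencies, putting boundary observations into the higher class as above:

```python
# A minimal sketch (not from the notes) of the mercury frequency table and histogram.
import matplotlib.pyplot as plt

mercury = [13.26, 32.43, 18.10, 58.23, 64.00, 68.20, 35.35, 33.92,
           23.94, 18.28, 22.05, 39.14, 31.43, 18.51, 21.03, 5.50,
           6.96, 5.19, 28.66, 26.29, 13.89, 25.87, 9.84, 26.88,
           16.81, 38.65, 19.23, 21.82, 31.58, 30.13, 42.42, 16.51,
           21.16, 32.97, 9.84, 10.64, 29.56, 40.69, 12.86, 13.80]

edges = [0, 10, 20, 30, 40, 50, 60, 70]

# Count observations in each class; a boundary value goes into the higher class.
freq = [sum(lo <= x < hi for x in mercury) for lo, hi in zip(edges, edges[1:])]
print(freq)  # [5, 11, 10, 9, 2, 1, 2]

plt.hist(mercury, bins=edges)
plt.xlabel("Mercury content (ppm)")
plt.ylabel("Frequency")
plt.show()
```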
Instead of using frequencies to give the heights of the rectangles in a histogram, relative frequencies may be used.
The relative frequency for an interval is that interval's frequency divided by the total frequency.
So for the mercury example…
Interval   Frequency   Relative frequency
0-10       5           0.125
10-20      11          0.275
20-30      10          0.250
30-40      9           0.225
40-50      2           0.050
50-60      1           0.025
60-70      2           0.050
Total      40          1
The relative frequencies can be expressed as percentages (which is how Minitab produces a relative frequency
histogram):
[Figure: A relative frequency histogram for the mercury data. Horizontal axis: mercury content (ppm), 0-70; vertical axis: relative frequency (%), 0-30.]
Notice that the shape of the histograms, whether using frequencies or relative frequencies, is the same.
2.1.2 Histograms with unequal class widths
There is no hard and fast rule as to how many intervals should be used. Too many classes produce an uneven
distribution, but having too few loses information. Usually the number of classes is about 6-20. The more
observations we have, the more classes we will usually use.
The width of the intervals defining the histograms need not all be equal. It is often sensible to choose short
intervals where the data is quite dense but intervals with a longer width where the data is more sparse. This will
ensure that we don’t have too many intervals with zero frequency, yet keeps as much information about the
distributional shape of the data as possible.
When unequal interval widths are used, then the frequency density should be used on the vertical scale on the histogram, where

Frequency density = Frequency ÷ class width.
• Example:
The lengths (in metres) of 250 vehicles aboard a cross-channel ferry are summarised in the following table:
Vehicle length (m)   Class width   Frequency   Frequency density
3.0-4.0              1             90          90
4.0-4.5              0.5           80          160
4.5-5.0              0.5           40          80
5.0-5.5              0.5           24          48
5.5-7.5              2             16          8
[Figure: A histogram showing the lengths of 250 vehicles. Horizontal axis: vehicle length (m), 2-8; vertical axis: frequency density, 0-200.]
Notice that if we had simply defined the heights of the rectangles to be the frequencies, then the histogram
would exaggerate, for example, the incidence of cars between 3 and 4 metres in length.
An alternative way of producing a histogram in situations where not all class widths are equal is to set the bar height to be the relative frequency density. This is given by:

Relative frequency density = Relative frequency ÷ class width.

If the histogram is produced in this way, then the total area of all the bars is 1.
• Example (continued):
The relative frequency densities for the vehicle length data are as follows:
Vehicle length (m)   Class width   Frequency   Relative freq.   Rel. freq. density
3.0-4.0              1             90          0.36             0.36
4.0-4.5              0.5           80          0.32             0.64
4.5-5.0              0.5           40          0.16             0.32
5.0-5.5              0.5           24          0.096            0.192
5.5-7.5              2             16          0.064            0.032

The corresponding histogram can then be produced:
[Figure: A histogram showing vehicle lengths, drawn with relative frequency density (0-0.7) on the vertical axis against length (m), with class boundaries at 3.0, 4.0, 4.5, 5.0, 5.5 and 7.5.]
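As a rough illustration, the following Python sketch (not from the notes) computes the frequency densities and relative frequency densities from the table above and draws the bars with their true widths, so that bar area is proportional to frequency:

```python
# A minimal sketch (not from the notes), assuming the class edges and
# frequencies given in the vehicle-length table.
import matplotlib.pyplot as plt

edges = [3.0, 4.0, 4.5, 5.0, 5.5, 7.5]
freq = [90, 80, 40, 24, 16]
n = sum(freq)  # 250 vehicles

widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
freq_density = [f / w for f, w in zip(freq, widths)]          # 90, 160, 80, 48, 8
rel_freq_density = [f / n / w for f, w in zip(freq, widths)]  # 0.36, 0.64, 0.32, ...
print(freq_density, rel_freq_density)

# Bars drawn with their true widths, so bar *area* is proportional to frequency.
plt.bar(edges[:-1], rel_freq_density, width=widths, align='edge', edgecolor='black')
plt.xlabel("Vehicle length (m)")
plt.ylabel("Relative frequency density")
plt.show()
```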
2.1.3 Histogram shapes
Histograms are very useful for giving some idea of the shape of a density by approximating the histogram to a
smooth curve.
Densities can take many different shapes:
[Figure: sketches of density shapes – unimodal, bimodal and multimodal distributions; symmetric, positively skewed and negatively skewed distributions; normal, light-tailed and heavy-tailed distributions.]

2.1.4 Histograms for discrete data
Histograms for discrete data
Discrete data is usually illustrated using a bar-line chart (or a bar chart), whilst histograms are generally used for
continuous data. However, when the number of possible values for the observations is large, a bar diagram
would become uninformative. In this case it is acceptable to group the values into class intervals, much as you
would for continuous data.
• Example:
Suppose we have the following data:
 1   1   2   2   2   3   3   4   4   5   5   5   5   6   6   7   7   7
 8   9   9   9   9  10  10  10  10  10  11  11  11  11  12  12  12  12
13  13  13  13  14  14  14  14  14  14  15  15  15  15  15  16  16  16
17  17  17  18  18  19  19  20  21  21  22  22  23  23  24  26  27  29
As there are a large number of different values here, to get a better idea of the shape of the distribution, we can
group data into classes. Let's consider grouping all observations between 1 - 3, 4 - 6 and so on. To draw a
histogram we need a continuous scale and so we need to define our histogram intervals to be 0.5 - 3.5, 3.5 - 6.5,
and so on. (Remember: a histogram never has gaps between the bars).
We then get the following frequency distribution:
Interval      Frequency
0.5 - 3.5     7
3.5 - 6.5     8
6.5 - 9.5     8
9.5 - 12.5    13
12.5 - 15.5   15
15.5 - 18.5   8
18.5 - 21.5   5
21.5 - 24.5   5
24.5 - 27.5   2
27.5 - 30.5   1
The histogram can now be drawn in the normal way.
2.2 Stem-and-leaf plots
Stem-and-leaf plots are an effective way of providing a visual display of quantitative data with very little effort.
The idea of the plots is to separate each observation into 2 parts - the first part being the stem and the second the
leaf.
To construct a stem-and-leaf plot:
• Select one or more leading digits for the stem values. The following digit or digits become the leaves.
• List possible stem values in a vertical column.
• Record the leaf value for every observation beside the corresponding stem value.
• Indicate the units for stems and leaves.
• Example:
To investigate the efficiency of new air-conditioning equipment installed on Boeing 720 aircraft, the times (in
hours) to first failure of the equipment were obtained from 28 different aircraft:
79  90  10  60  61  49  14
24  56  20  84  44  25  59
46  37  32  76  26  35  29
53  75  25  44  23  27  33
For these data an obvious choice for the stems is the leading digit (tens) and the leaves are then the second digits
(units). So, for example, the first observation of 79 has stem 7 and leaf 9. The data values range from 10 up to 90,
so we have the stem values 1-9.
An unordered stem-and-leaf diagram for the Boeing data (the leaves should be written in columns):

1 | 0 4
2 | 4 0 5 6 9 5 3 7
3 | 7 2 5 3
4 | 9 4 6 4
5 | 6 9 3
6 | 0 1
7 | 9 6 5
8 | 4
9 | 0

Scale: Stem = 10s, Leaves = units.

The leaves can then be put in order, giving an ordered stem-and-leaf diagram for the Boeing data:

1 | 0 4
2 | 0 3 4 5 5 6 7 9
3 | 2 3 5 7
4 | 4 4 6 9
5 | 3 6 9
6 | 0 1
7 | 5 6 9
8 | 4
9 | 0

Scale: Stem = 10s, Leaves = units.
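A quick Python sketch (not part of the notes) that produces the ordered diagram above:

```python
# A minimal sketch (not from the notes) of an ordered stem-and-leaf plot
# for the Boeing air-conditioning failure times.
times = [79, 90, 10, 60, 61, 49, 14, 24, 56, 20, 84, 44, 25, 59,
         46, 37, 32, 76, 26, 35, 29, 53, 75, 25, 44, 23, 27, 33]

stems = {}
for t in sorted(times):
    stems.setdefault(t // 10, []).append(t % 10)  # stem = tens digit, leaf = units

for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem} | {leaves}")
```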
N.B. Rearranging the leaves in ascending order clarifies things and is useful for producing numerical summaries.
N.B.2 One advantage that stem-and-leaf diagrams have over histograms is that they retain the detail of the raw
data.
2.2.1 Use of stem-and-leaf plots
• Stem-and-leaf plots give a visual display of the rough shape of the distribution of the variable being measured. We can identify whether the density is (a) unimodal or multimodal; (b) symmetric, negatively or positively skewed; (c) normal, heavy- or light-tailed.
• Stem-and-leaf plots are useful for informal inference. We can find medians and quartiles easily from the diagrams and obtain estimates of probabilities. For example, in the Boeing data 10 pieces of equipment lasted under 30 hours, so we could estimate the probability of a new piece of equipment failing within the first 30 hours as 10/28.
• Stem-and-leaf plots are useful for identifying outliers (unusually large or small observations). For example, for the Boeing example, if there had been an extra observation of 119, then this might be an outlier:
1  | 0 4
2  | 0 3 4 5 5 6 7 9
3  | 2 3 5 7
4  | 4 4 6 9
5  | 3 6 9
6  | 0 1
7  | 5 6 9
8  | 4
9  | 0
10 |
11 | 9    (this could be considered an outlying value)

Scale: Stem = 10s, Leaves = units.
2.2.2 Choice of stem unit
Choice of stem unit can be important.
• Example:
To determine the age of a pre-historic settlement in North Wales, 24 small fragments from a wooden boat found at the settlement were independently radio-carbon dated. The radio-carbon determinations (in years) of the ages of the fragments are:
4969  5163  5052  5144  4965  5152
4967  4934  4895  5078  5019  4908
5009  5046  4912  5012  4889  5034
4914  5117  4931  5081  4984  4881
• Possibility 1: We could round each observation to the nearest one hundred years:
5000  5000  5200  5000  5100  4900
5100  5000  5000  4900  5200  5000
5000  4900  4900  5100  4900  4900
5100  5100  5000  5000  4900  4900
Taking the stem unit to be 1000 years gives the following diagram:
4 | 9 9 9 9 9 9 9 9
5 | 0 2 1 1 0 2 0 1 0 0 0 0 0 1 1 0

Scale: Stem = 1000's, Leaves = 100's

Because we have so few stem values here, we lose a lot of information. We can't say anything, for example, about the shape of the distribution.
• Possibility 2: Round observations to the nearest 10 years.
4970  5010  5160  5050  5050  4910
5140  5010  4970  4890  5150  5030
4970  4910  4930  5120  4900  4930
5080  5080  5020  4980  4910  4880
Taking the stem unit as 100 years gives:

48 | 9 8
49 | 7 1 7 7 1 3 0 3 8 1
50 | 1 5 5 1 3 8 8 2
51 | 6 4 5 2

Scale: Stem = 100's, Leaves = 10's

This plot is a little more informative, but we could still do with having slightly more stems.
• Possibility 3: Split the stems into high and low values. In each low category you put any 0s, 1s, 2s, 3s or 4s; in the high category you write any 5s, 6s, 7s, 8s or 9s.

48L |
48H | 9 8
49L | 1 1 3 0 3 1
49H | 7 7 7 8
50L | 1 1 3 2
50H | 5 5 8 8
51L | 4 2
51H | 6 5

Scale: Stem = 100's, Leaves = 10's

The diagram is now quite informative about the distribution: there is evidence of a positive skew.
[Note that if the stem unit was taken to be 10s, then the diagram we would get would be poor: we would then have too many stem values (a lot of the rows would have no values in them).]
2.2.3 Back-to-back displays for displaying two independent samples
If there are 2 sets of data which you wish to compare, then both of these can be put on the same stem-and-leaf
plot with the leaves for one dataset going to the right and the leaves of the other dataset going to the left.
• Example:
Using a technique involving chromium dioxide, the protein assimilation efficiencies (i.e. percentage of protein
intake actually absorbed) were measured on field mice and voles fed on their natural diets. The assimilation
efficiencies (in percentages) are given below:
A.E.'s of field mice:
61.3  65.4  57.8  70.6  71.7  70.5  62.6  68.9
63.6  62.6  76.3  69.7  67.8  74.6  61.9

A.E.'s of voles:
51.7  70.1  72.0  75.2  69.8  73.8  63.7  59.6  77.2  69.9
62.6  77.6  63.5  74.1  66.7  67.3  69.2  73.7  67.5
Rounding observations to the nearest integer gives us:
An unordered back-to-back stem-and-leaf diagram for the protein data

A.E.s for field mice |      | A.E.s for voles
                     |  5L  | 2    (outlier?)
                   8 |  5H  |
           2 3 4 3 1 |  6L  | 4 0 3 4
               8 9 5 |  6H  | 7 7 9 8
             0 1 2 1 |  7L  | 0 2 0 4 0 4 4
                 5 6 |  7H  | 5 7 8

Scale: Stem = 10's, Leaves = 1's

Then ordering the leaves we get…
An ordered back-to-back stem-and-leaf diagram showing the protein data

A.E.s for field mice |      | A.E.s for voles
                     |  5L  | 2
                   8 |  5H  |
           4 3 3 2 1 |  6L  | 0 3 4 4
               9 8 5 |  6H  | 7 7 8 9
             2 1 1 0 |  7L  | 0 0 0 2 4 4 4
                 6 5 |  7H  | 5 7 8

Scale: Stem = 10's, Leaves = 1's
2.2.4 Stem-and-leaf diagrams for matched-pair data
It is not a good idea to do a back-to-back plot if the 2 variates are not independent. Consider the following
example.
• Example:
Fifteen people participated on a short typing course. Their typing speeds (words/min) before and after the course
were recorded:
Subject:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Before:   15  18  23  27  36  12   8  19  32  22  17  21  16  15  33
After:    26  28  27  26  28  24  26  42  32  36  20  29  21  22  28
These data are an example of matched-pair data (there are two measurements recorded on each participant).
Matched-pair data are likely to be dependent (a person with a fast typing speed before the course is also likely to
have a fast typing speed after the course). By drawing a stem-and-leaf diagram you lose information about how
the measurements pair up. You could draw a scatter diagram (this would show the pairings). Alternatively, you
could produce a stem-and-leaf diagram of the differences:
Subject:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Change:   11  10   4  -1  -8  12  18  23   0  14   3   8   5   7  -5
A stem-and-leaf diagram showing the change in typing speeds after a short course

-0 | 1 8 5
 0 | 4 0 3 8 5 7
 1 | 1 0 2 8 4
 2 | 3

Scale: Stem = 10's, Leaves = units.
A slightly more informative diagram can be obtained by splitting each stem up into two parts (one for the lower
leaves and the other for higher leaves):
A stem-and-leaf diagram showing the change in typing speeds after a short course

-0H | 8 5
-0L | 1
 0L | 4 0 3
 0H | 8 5 7
 1L | 1 0 2 4
 1H | 8
 2L | 3

Scale: Stem = 10's, Leaves = units.

Each diagram could then be ordered.
2.2.5 Problems
Stem-and-leaf plots cannot be used for displaying qualitative data and they become impractical for large
numbers of observations.
2.3 Cumulative Frequency Plots
A cumulative frequency plot also uses classes and frequencies. The cumulative frequency for a class is the
number of observations with values less than the upper boundary for that class.
• Example:
Consider the mercury example again. The cumulative frequencies are given in the table below:
Interval   Frequency   Cumulative frequency
0-10       5           5
10-20      11          16
20-30      10          26
30-40      9           35
40-50      2           37
50-60      1           38
60-70      2           40
In a cumulative frequency polygon the cumulative frequencies are plotted against the upper class boundaries of
the classes. These points are then joined with a straight line.
• Example (continued):
For the mercury example we want to plot the points (0, 0), (10, 5), (20, 16),…, (70, 40) and then join these points:
[Figure: A cumulative frequency polygon for the mercury data. Horizontal axis: mercury level (ppm), 0-70; vertical axis: cumulative frequency, 0-40.]
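A minimal Python sketch (not from the notes) of the same polygon, starting from the point (0, 0):

```python
# A minimal sketch (not from the notes) of a cumulative frequency polygon
# for the grouped mercury data.
import matplotlib.pyplot as plt

edges = [0, 10, 20, 30, 40, 50, 60, 70]
freq = [5, 11, 10, 9, 2, 1, 2]

# Cumulative frequency at each upper class boundary, starting from (0, 0).
cum = [0]
for f in freq:
    cum.append(cum[-1] + f)

plt.plot(edges, cum, marker='o')
plt.xlabel("Mercury level (ppm)")
plt.ylabel("Cumulative frequency")
plt.show()
```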
A cumulative frequency plot is useful for giving us some idea of the shape of the distribution function of the variable. It can also be used to obtain estimates of the median and other quantiles for grouped data.
2.4 Scatter Plots
Scatter plots are useful for assessing relationships between 2 variables. To draw a scatter plot we represent one
of the variables by the horizontal axis and the other variable by the vertical axis. We then simply plot the pairs of
data points on the graph.
• Example:
Fifteen children were given a visual-discrimination (V) test during the first week at primary school and a
reading-achievement (R) test at the end of their first year of schooling. Scores out of 100 were calculated for
each test.
Child no.:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
V-score:    75  69  70  62  52  45  42  39  37  34  34  66  54  58  63
R-score:    95  90  82  69  58  49  38  35  30  20  31  75  61  64  77
To draw a scatter plot we now want to plot the points (75, 95), (69, 90), (70, 82), …, (63, 77).
[Figure: A scatter plot depicting primary school test results; R-score (20-100) plotted against V-score (30-100).]
The plot would suggest that there is a positive relationship between the V-score and the R-score.
2.4.1 Positive/negative correlation
The following graphs give illustrations of variables that are (a) positively and (b) negatively correlated with each
other. Correlation can also be categorised as strong or weak depending upon how close the points are to lying on
a straight line.
[Figure: four scatter plots of y against x illustrating weak positive, strong positive, weak negative and strong negative correlation.]

2.4.2 Correlation does not imply causation
It is important to realise that scatter plots point to associations between variables. They do not necessarily show
a causal relationship.
• Example:
Information about two variables (life expectancy and the number of people per television set) is available for 12
countries:
[Figure: life expectancy (40-80 years) plotted against number of people per TV (0-200) for 12 countries.]
It is clear that the two variables are negatively correlated. However, it clearly would be wrong to conclude that
simply sending more televisions to countries with low life expectancies would cause their inhabitants to live
longer.
This example illustrates the very important distinction between causation and association. Two variables may be strongly correlated without a cause-and-effect relationship existing between them. Often the explanation is that both variables are related to a third variable not being measured. In the example above, for instance, both life expectancy and the number of televisions in the population will be related to the country's wealth.
There is one further type of graph that we will consider later in the chapter (namely box-and-whisker plots). We
first however need to look at numerical summary measures for data.
2.5 Numerical summaries of data
In the next few sections we will look at some numerical ways of summarising data.
2.5.1 Some notation
Suppose that we would like to learn about the random variable X. To do this we will observe a random sample of n observations, $X_1, \dots, X_n$, such that each $X_i$ has the same distribution as X. The observed values of $X_1, \dots, X_n$ are then denoted $x_1, \dots, x_n$.
• Example:
Suppose we are interested in the number of units of alcohol students at UKC consumed last week. To do this we could randomly select 50 students to form a random sample $X_1, \dots, X_{50}$, where $X_i$ is the random variable representing the number of units of alcohol consumed by the ith student. The observed value of $X_i$ is denoted $x_i$.
Now suppose that we order the random sample $x_1, \dots, x_n$. We let:
• $x_{(1)}$ denote the smallest observation;
• $x_{(2)}$ denote the second smallest observation;
…
• $x_{(i)}$ denote the ith smallest observation;
…
• $x_{(n)}$ denote the largest observation.
Then $x_{(i)}$ is called the ith order statistic and the following relation holds:
$$x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}.$$
• Example:
Suppose that we have the observations
$$x_1 = 5, \quad x_2 = 10, \quad x_3 = 2, \quad x_4 = 7.$$
Then
$$x_{(1)} = x_3 = 2, \quad x_{(2)} = x_1 = 5, \quad x_{(3)} = x_4 = 7, \quad x_{(4)} = x_2 = 10.$$
When we have frequency data, we will denote the frequency of the kth class by $f_k$ for k = 1, …, K, where K is the number of classes. Then
$$\sum_{k=1}^{K} f_k = n.$$
• Example:
Consider the mercury example again. Here we have the frequency table given by:
Interval   Frequency
0-10       5
10-20      11
20-30      10
30-40      9
40-50      2
50-60      1
60-70      2

Here we have 7 classes, so that K = 7. Then $f_1 = 5$, $f_2 = 11$, and so on, such that $\sum_{k=1}^{7} f_k = 40 = n$.

2.5.2 Measures of location

• The Sample Mean
Let $X_1, \dots, X_n$ denote the random variables for a sample of size n. The sample mean, denoted $\bar{X}$, is defined by:
$$\bar{X} = \frac{X_1 + \dots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
The observed value of the sample mean for a particular sample is therefore:
$$\bar{x} = \frac{x_1 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
When the data are grouped by means of a frequency table, then the equivalent formula for $\bar{x}$ is given by:
$$\bar{x} = \frac{\sum_{k=1}^{K} x_k f_k}{\sum_{k=1}^{K} f_k},$$
where K is the number of classes or groups, and $x_k$ is the mid-point of class k.
• Example:
Consider the mercury example again.
Interval   Mid-point, $x_k$   Frequency, $f_k$
0-10       5                  5
10-20      15                 11
20-30      25                 10
30-40      35                 9
40-50      45                 2
50-60      55                 1
60-70      65                 2

The sample mean is therefore:
$$\bar{x} = \frac{(5 \times 5) + (15 \times 11) + (25 \times 10) + \dots + (65 \times 2)}{40} = \frac{1030}{40} = 25.75.$$
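A one-line check of this grouped-data calculation in Python (a sketch, not part of the notes):

```python
# A minimal sketch (not from the notes) of the grouped-data sample mean,
# using the class mid-points and frequencies from the mercury table.
midpoints = [5, 15, 25, 35, 45, 55, 65]
freq = [5, 11, 10, 9, 2, 1, 2]

n = sum(freq)
xbar = sum(x * f for x, f in zip(midpoints, freq)) / n
print(n, xbar)  # 40 25.75
```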
Note: The mean is probably the most useful measure of location. Its advantages are that it uses all the values in the data and is easy to manipulate mathematically. A disadvantage is that it is not robust: its value can be sensitive to the presence of outlying values. More robust measures of location (such as the median or trimmed mean) are increasing in popularity amongst statisticians.
• The Median
To find the median of a set of n data values, we must first rearrange them in order of size. The median is then equal to the middle observation if n is odd, and the average of the middle two observations if n is even.
More formally,
$$\text{median} = \begin{cases} X_{(0.5(n+1))} & \text{if } n \text{ is odd,} \\ \tfrac{1}{2}\left(X_{(0.5n)} + X_{(0.5n+1)}\right) & \text{if } n \text{ is even.} \end{cases}$$
• Example 1:
The values below are systolic blood pressures of patients admitted to a hospital:
112.1  138.6  115.9  109.5  108.2  110.9  159.6  115.8  122.3  122.4  123.8  117.5
To find the median value for the blood pressure, we must first list them in ascending order:
108.2  109.5  110.9  112.1  115.8  115.9  117.5  122.3  122.4  123.8  138.6  159.6
Here we have an even number of observations. So
$$\text{sample median} = \tfrac{1}{2}\left(X_{(6)} + X_{(7)}\right) = \tfrac{1}{2}(115.9 + 117.5) = 116.7.$$
For these data the sample mean is:
$$\bar{x} = \tfrac{1}{12}(108.2 + 109.5 + 110.9 + \dots + 159.6) = \tfrac{1456.6}{12} = 121.38,$$
which is somewhat larger than the sample median. The mean is influenced by the outlying value (159.6). The median is more robust than the mean and is not really affected by outliers.
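The median rule above translates directly into code. A minimal Python sketch (not from the notes):

```python
# A minimal sketch (not from the notes) of the median rule above.
def median(values):
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]                      # middle observation (n odd)
    return (s[n // 2 - 1] + s[n // 2]) / 2    # average of middle two (n even)

bp = [112.1, 138.6, 115.9, 109.5, 108.2, 110.9,
      159.6, 115.8, 122.3, 122.4, 123.8, 117.5]
print(median(bp))           # 116.7
print(sum(bp) / len(bp))    # 121.38...
```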
• Example 2:
A football team has scored the following number of goals in the last 44 matches:
Number of goals:   0   1   2   3   4
Frequency:         9   8  15   9   3
As n = 44, the median will lie halfway between the 22nd and 23rd observations. Since both $x_{(22)}$ and $x_{(23)}$ are 2, the median value is 2.
For grouped data, the most convenient way to estimate the median is by graphical methods. This is most easily
demonstrated via an example.
• Example:
Consider the mercury example once again. The cumulative frequency plot is given below. We have a total of 40
observations, so when the cumulative frequency is 20 we might expect the corresponding value of mercury read
off from the graph to be an estimate of the median. In this case we estimate the median as 23 approximately.
[Figure: the cumulative frequency polygon for the mercury data; reading across at cumulative frequency 20 gives an estimated median of about 23.]
Note:
The median is also often a better measure of location than the mean when data are highly skewed. The following
show the relative positions of the mean and median for 3 densities:
[Figure: sketches showing the relative positions of the mean and median for a symmetric, a positively skewed and a negatively skewed density.]
• Example:
Distributions of incomes are commonly positively skewed as there are typically a few very large salaries, which give the density a long right-hand tail. Therefore the median is often used to give a typical salary value, rather than the mean.
Disadvantages for the median:
There are two main disadvantages of using the median. It ignores the actual values of the data and uses only their
ranks (it effectively uses only the “middle” part of the data set). It is also not as easy to use mathematically in the
theory of statistics as the arithmetic mean.
• The Trimmed Mean
The trimmed mean can be viewed as a compromise between the mean and the median. To calculate a trimmed mean:
• order the data values;
• delete a selected number of values from each end of the ordered list;
• average the remaining values.
The trimmed mean avoids the disadvantages of the mean by excluding extreme observations and avoids that of
the median by taking some account of the observations other than the middle one. To calculate the 5% trimmed
mean for example, discard the top 5% and the bottom 5% of observations, and average those remaining.
• Example:
The body temperatures (°F) of 10 patients hospitalised with meningitis are as follows:
104.0  100.8  104.8  104.2  101.6  100.2  108.0  103.8  102.4  101.4
The sample mean for these data is:
$$\bar{x} = \frac{1031.2}{10} = 103.12.$$
To find the 10% trimmed mean, as we have 10 observations, we drop the smallest and largest data values:
$$10\% \text{ trimmed mean} = \frac{823.0}{8} = 102.875.$$
In this case the 10% trimmed mean is probably a better representation of the centre of the distribution as it ignores the (possible) outlier, 108.
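A minimal Python sketch (not from the notes) of the trimming recipe; the `fraction` argument (the proportion discarded from each end) is an illustrative choice, not notation from the course:

```python
# A minimal sketch (not from the notes) of a trimmed mean: drop a fraction
# of the observations from each end of the ordered data, then average the rest.
def trimmed_mean(values, fraction):
    s = sorted(values)
    k = int(len(s) * fraction)       # number of values to drop from each end
    trimmed = s[k:len(s) - k]
    return sum(trimmed) / len(trimmed)

temps = [104.0, 100.8, 104.8, 104.2, 101.6, 100.2, 108.0, 103.8, 102.4, 101.4]
print(trimmed_mean(temps, 0.10))  # 102.875
```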
• The Mode
The mode is a very simple measure of location. For discrete data, it is the value of x with the largest frequency. We cannot calculate a mode for ungrouped continuous data. For data grouped into classes we obtain a modal class.
• Example:
Consider again the family size data presented in the previous section. The numbers of children in the sampled
families are:
2, 6, 3, 2, 2, 7, 5, 4, 1, 4, 0, 5, 2, 4, 1.
Here the most commonly occurring value is 2 and so this is the mode.
• Quantiles
The median divides the data into two equal parts. In a similar way, quartiles divide the data into four equal parts, deciles divide the data into 10 equal parts and percentiles divide it into 100 equal parts.
The upper and lower quartiles can be found in the following way:
sample lower quartile = median of lower half of data
sample upper quartile = median of upper half of data
If n is odd, then the median of the entire sample is included in both halves.
Note that deciles and percentiles only tend to be used on very large data sets.
• Example:
The salinity values for 28 water specimens are as follows:
 7.6   7.7   4.3   5.9   5.0  10.5   6.5
 8.3   8.2  13.2  12.6  13.6  10.4  10.8
13.1  12.3  10.4  13.0   7.7  14.1  14.1
 9.5  13.5  15.1  12.0  11.5  12.6  12.0
To find the quartiles we first need to order the data:
 4.3   5.0   5.9   6.5   7.6   7.7   7.7
 8.2   8.3   9.5  10.4  10.4  10.5  10.8
11.5  12.0  12.0  12.3  12.6  12.6  13.0
13.1  13.2  13.5  13.6  14.1  14.1  15.1
We have 28 observations and so
$$\text{median} = \tfrac{1}{2}\left(x_{(14)} + x_{(15)}\right) = \tfrac{1}{2}(10.8 + 11.5) = 11.15.$$
To find the lower and upper quartiles we need to find the median of the lower 14 and upper 14 observations respectively:
$$\text{lower quartile} = \tfrac{1}{2}\left(x_{(7)} + x_{(8)}\right) = \tfrac{1}{2}(7.7 + 8.2) = 7.95,$$
$$\text{upper quartile} = \tfrac{1}{2}\left(x_{(21)} + x_{(22)}\right) = \tfrac{1}{2}(13.0 + 13.1) = 13.05.$$
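These quartile calculations can be automated. The sketch below (not from the notes) follows the course's rule that for odd n the overall median is included in both halves:

```python
# A minimal sketch (not from the notes) of the course's quartile rule:
# quartiles are the medians of the lower and upper halves of the ordered data.
def median(s):
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def quartiles(values):
    s = sorted(values)
    n = len(s)
    half = (n + 1) // 2   # size of each half; the median is shared if n is odd
    return median(s[:half]), median(s), median(s[n - half:])

salinity = [7.6, 7.7, 4.3, 5.9, 5.0, 10.5, 6.5, 8.3, 8.2, 13.2, 12.6, 13.6,
            10.4, 10.8, 13.1, 12.3, 10.4, 13.0, 7.7, 14.1, 14.1, 9.5, 13.5,
            15.1, 12.0, 11.5, 12.6, 12.0]
print(quartiles(salinity))  # (7.95, 11.15, 13.05)
```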
• Exercise:
Find the median, together with the lower and upper quartiles for the following examination marks:
68, 72, 31, 60, 90, 96, 45, 57, 54, 45, 16, 22, 82, 63, 52.
Just as with finding the median, we can estimate quantiles graphically.
• Example:
Consider again the cumulative frequency polygon for the mercury data. As the total number of observations is 40,
we can estimate the lower and upper quartiles by reading off the mercury values from the graph for a cumulative
frequency of 10 and 30, respectively.
[Figure: the cumulative frequency polygon for the mercury data, with mercury values read off at cumulative frequencies 10 and 30.]

We see UQ = 34 and LQ = 14 (approximately).
2.5.3 Measures of dispersion
Obviously specifying the central value of a set of data does not tell the whole story. We also need to consider the
variability (or spread or dispersion) of the data.
• The Range
The simplest measure of dispersion is the range, which is simply the difference between the largest and smallest values in the data set. If we have grouped data then we cannot calculate an exact range, only an upper limit.
• Example:
For the water salinity data, the largest observation is 15.1 and the smallest is 4.3. Therefore,
range = 15.1 − 4.3 = 10.8.
Note: The range is sensitive to the presence of one or two extremely large or small values in the data.
• Inter-quartile range
This is a more useful measure of dispersion than the range. It is simply the difference between the upper and lower quartiles. The inter-quartile range contains the middle half of the data set.
• Example:
We calculated the upper and lower quartiles for the water salinity data to be 13.05 and 7.95 respectively.
Therefore,
Inter-quartile range = 13.05 - 7.95 = 5.1.
• The Mean Deviation
The deviations in a sample are the differences
$$x_1 - \bar{x},\; x_2 - \bar{x},\; \dots,\; x_n - \bar{x}.$$
One possible idea for obtaining a summary measure of the dispersion in the sample would be to calculate the mean of these deviations. However, the mean of these deviations is always zero. [Think about why this should be.]
Instead we could take the absolute value of each of the deviations and calculate the mean of these. This gives the mean (absolute) deviation:
$$\text{mean absolute deviation} = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|.$$
For grouped data the equivalent formula is:
$$\text{mean absolute deviation} = \frac{1}{n}\sum_{k=1}^{K} f_k\,|x_k - \bar{x}|,$$
where $x_k$ is the midpoint of the kth class.
• Example:
Twelve students record their weight in kg, creating the following sample:
50, 51, 61, 75, 62, 73, 64, 86, 65, 58, 73, 59.
The mean of these 12 observations is:
$$\bar{x} = \tfrac{1}{12}(50 + 51 + \dots + 59) = \tfrac{777}{12} = 64.75 \text{ kg}.$$
The deviations of each value from the mean are:
−14.75, −13.75, −3.75, 10.25, −2.75, 8.25, −0.75, 21.25, 0.25, −6.75, 8.25, −5.75.
So the mean deviation is:
$$\text{mean deviation} = \tfrac{1}{12}(14.75 + 13.75 + 3.75 + 10.25 + \dots + 5.75) = \tfrac{96.5}{12} = 8.0417 \text{ kg}.$$
• The Sample Variance and Sample Standard Deviation
Instead of taking the absolute values of the deviations (so that the positive and negative deviations don't just cancel each other out), we could use the squares of the deviations. The sample variance (usually denoted by $s^2$) can be thought of as an 'average' of the squared deviations.
The sample variance is defined by:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$
Note that although we are summing n squared deviations, we divide through by n − 1. This is important! The reason why we use n − 1 and not n in the definition of the sample variance will become apparent later on in the course when we look at unbiased estimators.
The disadvantage of using the sample variance is that it is not measured in the units of measurement used for the data, but in squared units. This problem is overcome by using the standard deviation. The sample standard deviation is simply the square root of the sample variance, i.e.:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$
Note: For grouped data, we use the following definition for a sample s.d.:
$$s = \sqrt{\frac{1}{n-1}\sum_{k=1}^{K} f_k (x_k - \bar{x})^2}.$$
• Example:
Consider again the weights of the 12 students given above. The deviations from the mean were:
-14.75, -13.75, -3.75, 10.25, -2.75, 8.25, -0.75, 21.25, 0.25, -6.75, 8.25, -5.75.
So the sample variance is:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{11}\left[(-14.75)^2 + (-13.75)^2 + (-3.75)^2 + 10.25^2 + \dots + (-5.75)^2\right] = \frac{1200.25}{11} = 109.1136.$$
This means that the sample standard deviation is $s = \sqrt{109.1136} = 10.446$ kg.
• Result:
Using the above formula to calculate the sample variance can be complicated. In general it is better to use the expression:
$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right].$$
To calculate the variance using this expression we need to know the sum of the observations and the sum of the squares.
Proof:
We need to show that both formulae for the sample variance are equivalent. It suffices to show:
$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}.$$
Now,
$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}\left(x_i^2 - 2 x_i \bar{x} + \bar{x}^2\right) = \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2.$$
But $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, so
$$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{2}{n}\left(\sum_{i=1}^{n} x_i\right)^2 + \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n},$$
as required.
Note:
There is an equivalent expression for grouped data:
$$s^2 = \frac{1}{n-1}\left[\sum_{k=1}^{K} f_k x_k^2 - \frac{\left(\sum_{k=1}^{K} f_k x_k\right)^2}{n}\right].$$
• Example 1:
Consider again the student weight data:
50, 51, 61, 75, 62, 73, 64, 86, 65, 58, 73, 59.
We can check that the new formula for calculating the variance does in fact give us the same result:
$$\sum_{i=1}^{12} x_i^2 = 50^2 + 51^2 + 61^2 + \dots + 59^2 = 51511,$$
$$\sum_{i=1}^{12} x_i = 50 + 51 + 61 + \dots + 59 = 777.$$
So,
$$s^2 = \frac{1}{n-1}\left[\sum_i x_i^2 - \frac{\left(\sum_i x_i\right)^2}{n}\right] = \frac{1}{11}\left[51511 - \frac{777^2}{12}\right] = 109.1136,$$
as before.
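A quick check of the shortcut formula in Python (a sketch, not part of the notes):

```python
# A minimal sketch (not from the notes) of the shortcut formula for the
# sample variance, applied to the student weight data.
weights = [50, 51, 61, 75, 62, 73, 64, 86, 65, 58, 73, 59]

n = len(weights)
sum_x = sum(weights)                  # 777
sum_x2 = sum(x * x for x in weights)  # 51511

s2 = (sum_x2 - sum_x**2 / n) / (n - 1)
print(s2, s2 ** 0.5)  # 109.1136..., 10.446...
```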
• Example 2:
For an example of grouped data, consider the mercury data again:
Interval   Frequency   Mid-point, $x_k$
0 - 10     5           5
10 - 20    11          15
20 - 30    10          25
30 - 40    9           35
40 - 50    2           45
50 - 60    1           55
60 - 70    2           65
Here we have
$$\sum_{k=1}^{7} f_k x_k^2 = 5 \times 5^2 + 11 \times 15^2 + 10 \times 25^2 + \dots + 2 \times 65^2 = 35400,$$
$$\sum_{k=1}^{7} f_k x_k = 5 \times 5 + 11 \times 15 + 10 \times 25 + \dots + 2 \times 65 = 1030.$$
So,
$$s^2 = \frac{1}{n-1}\left[\sum_k f_k x_k^2 - \frac{\left(\sum_k f_k x_k\right)^2}{n}\right] = \frac{1}{39}\left[35400 - \frac{1030^2}{40}\right] = 227.6282.$$
The sample standard deviation is therefore $\sqrt{227.6282} = 15.09$.
• Exercise:
A sample of 50 adults were asked how many lottery tickets they purchased last week:
Number of lottery tickets:   0   1   2   3   4   5
Frequency:                  19  11  10   3   4   3
Find the sample standard deviation.
Note:
Find out how to use your calculator’s statistical mode to calculate s.d.s.
2.6 Box-and-whisker plots
Box-and-whisker plots aim to highlight a few important features of a data set. They are based on the following
location summaries: minimum, lower quartile, median, upper quartile and maximum. These 5 quantities are
sometimes referred to as the five-number summary.
• Simple Example:
The number of runs scored by a batsman on 14 occasions are as follows:
40, 22, 17, 50, 24, 48, 5, 0, 28, 19, 30, 25, 16, 37.
Ordering these values we get:
0, 5, 16, 17, 19, 22, 24, 25, 28, 30, 37, 40, 48, 50.
The five-number summary then is:
Minimum value = 0;  Lower quartile, Q1 = 17;  Median, Q2 = 24.5;  Upper quartile, Q3 = 37;  Maximum value = 50.
The box-and-whisker plot then looks like:
[Figure: a box plot showing the batsman's runs; horizontal axis: number of runs, 0-50.]
In the above diagram, the box indicates the interquartile range. The whiskers go from the lower and upper
quartiles to the smallest and largest observations respectively. The median is represented by a line within the box.
Note: the position of the median within the box gives an indication of whether the data are skewed:
• Symmetry: $Q_2 - Q_1 \approx Q_3 - Q_2$;
• positive skew: $Q_2 - Q_1 < Q_3 - Q_2$;
• negative skew: $Q_2 - Q_1 > Q_3 - Q_2$.
Box-and-whisker plots are especially useful for comparing two different data sets as they give a simple picture of
the locations and spreads of different distributions.
• Example:
The numbers of hysterectomies performed by 15 male doctors and 10 female doctors are given below:

Male doctors:    20  25  25  27  28  31  33  34  36  37  44  50  59  85  86
Female doctors:   5   7  10  14  18  19  25  29  31  33

First of all we need to find the five-number summaries for the two data sets.

Summary statistic   Male doctors   Female doctors
Minimum             20             5
Lower quartile      27.5           10
Median              34             18.5
Upper quartile      47             29
Maximum             86             33
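A minimal Python sketch (not from the notes) that reproduces both five-number summaries using the course's quartile rule:

```python
# A minimal sketch (not from the notes) computing five-number summaries with
# the course's quartile rule (median of each half, shared middle when n is odd).
def median(s):
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def five_number_summary(values):
    s = sorted(values)
    half = (len(s) + 1) // 2
    return s[0], median(s[:half]), median(s), median(s[-half:]), s[-1]

male = [20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86]
female = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]
print(five_number_summary(male))    # (20, 27.5, 34, 47, 86)
print(five_number_summary(female))  # (5, 10, 18.5, 29, 33)
```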
[Figure: box-and-whisker plots comparing the number of hysterectomies performed (0-90) by male and female doctors.]
• Exercise:
Consider again the protein assimilation efficiency data given in Section 2.2.3. We then had the following stem-and-leaf diagram:

An ordered back-to-back stem-and-leaf diagram showing the protein data

A.E.s for field mice |      | A.E.s for voles
                     |  5L  | 2
                   8 |  5H  |
           4 3 3 2 1 |  6L  | 0 3 4 4
               9 8 5 |  6H  | 7 7 8 9
             2 1 1 0 |  7L  | 0 0 0 2 4 4 4
                 6 5 |  7H  | 5 7 8

Scale: Stem = 10's, Leaves = 1's
Draw box-and-whisker plots for the field mice and voles and compare the shapes of these.
Note:
Minitab calculates the quartiles slightly differently to the method used in this course. Consequently, slightly
different values for the quartiles can arise when using Minitab.
Chapter 3: Common Distributions
In this chapter we examine four of the distributions that will be frequently encountered later in the course.
3.1 The Normal Distribution
3.1.1 Recap from MA304
The normal distribution is the most widely used distribution in statistics. Continuous data such as mass, length, etc., can often be modelled using a normal distribution.
The normal distribution has two parameters: the mean ($\mu$) and variance ($\sigma^2$). If a random variable X has a normal distribution then we write this as:
$$X \sim N[\mu, \sigma^2].$$
A normal distribution with $\mu = 0$ and $\sigma = 1$ is referred to as a standard normal distribution (and a random variable with this distribution is usually denoted Z).
Important result: If X is a random variable distributed as $N[\mu, \sigma^2]$, then
$$\frac{X - \mu}{\sigma} \sim N[0, 1].$$
The process of subtracting the mean and dividing by the standard deviation is referred to as standardisation: a general normal $X \sim N[\mu, \sigma^2]$ is converted to a standard normal $Z \sim N[0, 1]$ via
$$z = \frac{x - \mu}{\sigma}.$$
• Example:
The fully grown lengths (in mm) of a certain insect can be regarded as having the following normal distribution:
X ~ N[64, 16].
What is the probability that an insect has length less than 59 mm?
Applying the standardisation formula,
$$z = \frac{x - \mu}{\sigma} = \frac{59 - 64}{4} = -1.25.$$
Thus,
$$P(X < 59) = P(Z < -1.25) = P(Z > 1.25) = 1 - \Phi(1.25) = 1 - 0.8944 = 0.1056.$$

3.1.2 Percentage points
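Where tables are unavailable, the same probability can be computed in software. A minimal Python sketch (not from the notes), using SciPy's normal distribution rather than Lindley and Scott:

```python
# A minimal sketch (not from the notes) reproducing the insect-length calculation.
from scipy.stats import norm

mu, sigma = 64, 4  # X ~ N[64, 16], so the standard deviation is 4

z = (59 - mu) / sigma
print(z)                        # -1.25
print(norm.cdf(z))              # P(Z < -1.25) = 0.1056...
print(norm.cdf(59, mu, sigma))  # same probability, without standardising by hand
```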
Definition: Consider a random variable X with some distribution. The (upper) 100$\alpha$% point is the value of x such that:
$$P(X > x) = \alpha.$$
For the standard normal distribution, we will denote the (upper) 100$\alpha$% point by $z_\alpha$, i.e.:
$$P(Z > z_\alpha) = \alpha.$$
In statistical tables (e.g. Lindley and Scott), there is a separate percentage point table covering the most used values of $\alpha$. In Lindley and Scott,
• P represents 100$\alpha$,
• x(P) represents the value of $z_\alpha$.
Extract:

P = 100α       10%      5%       2%       1%       0.1%
α              0.1      0.05     0.02     0.01     0.001
x(P) = z_α     1.2816   1.6449   2.0537   2.3263   3.0902

For example, the 10% point for the standard normal is $z_{0.1} = 1.2816$.
• Example 1:
Let X ~ N[50, 16]. Find the value of x such that P(X > x) = 0.05, i.e. find the (upper) 5% point.
If X ~ N[50, 16], then $\frac{X - 50}{4} \sim N[0, 1]$. The 5% point for the standard normal is $z_{0.05} = 1.6449$. Thus, the 5% point for a N[50, 16] distribution can be obtained by solving
$$\frac{x - 50}{4} = 1.6449.$$
So, the 5% point is $x = 50 + 1.6449 \times 4 = 56.5796$.
• Example 2:
Let Z ~ N[0, 1]. Find the value of z such that P(Z < z) = 0.01 (i.e. find the lower 1% point).
The upper 1% point for a standard normal is $z_{0.01} = 2.3263$. Therefore, P(Z > 2.3263) = 0.01. By symmetry, we must also have P(Z < -2.3263) = 0.01. So, the lower 1% point is -2.3263.
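Percentage points are inverse c.d.f. values, so they can be checked with SciPy's ppf function. A sketch (not from the notes):

```python
# A minimal sketch (not from the notes): percentage points via the inverse c.d.f.
from scipy.stats import norm

print(norm.ppf(1 - 0.05))           # upper 5% point of N[0,1]: 1.6449...
print(50 + norm.ppf(1 - 0.05) * 4)  # upper 5% point of N[50,16]: 56.5796...
print(norm.ppf(0.01))               # lower 1% point of N[0,1]: -2.3263...
```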
3.2 The chi-squared distribution
3.2.1 Introduction
The chi-squared ($\chi^2$) distribution has a single parameter called the degrees of freedom; this can be any positive integer. The $\chi^2$ distribution with n degrees of freedom is denoted $\chi^2_n$.
Probability density function: If $X \sim \chi^2_n$, then the p.d.f. of X (for x > 0) is given by:
$$f(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2 - 1} e^{-x/2}.$$
For $x \le 0$, $f(x) = 0$.
This density is written in terms of the gamma function. Some of the key properties of this function are:
• $\Gamma(x) = (x - 1)\Gamma(x - 1)$;
• $\Gamma\left(\tfrac{1}{2}\right) = \sqrt{\pi}$;
• $\Gamma(x) = (x - 1)!$ if x is a natural number.
The degrees of freedom, n, define the shape of the $\chi^2$ density. For n < 3, the density has a mode at zero. For n ≥ 3, the mode moves further away from zero as n increases. The shapes of some specific densities are given below.
[Figure: graphs of several chi-squared densities, for n = 1, 2, 4, 8, on 0 ≤ x ≤ 12.]

3.2.2 Finding probabilities
Probabilities associated with the $\chi^2$ distribution can be looked up in probability tables. Lindley and Scott list the d.o.f. (which they denote $\nu$) along the top of each column. Then for each value x listed, the values in the table are the probability that X < x.
Extracts:
ν = 3.0:
x     P(X < x)
0.0   0.0000
0.5   0.0811
1.0   0.1987
1.5   0.3177
2.0   0.4276
2.5   0.5247
3.0   0.6084
3.5   0.6792
4.0   0.7385
etc.

ν = 7.0:
x      P(X < x)
1.0    0.0052
2.0    0.0402
3.0    0.1150
4.0    0.2202
5.0    0.3400
6.0    0.4603
7.0    0.5711
8.0    0.6674
9.0    0.7473
10.0   0.8114
• Example 1:
If $X \sim \chi^2_3$, then P(X < 2.5) = 0.5247.
• Example 2:
Suppose $X \sim \chi^2_7$. Find P(X > 10).
From tables we can find P(X < 10) = 0.8114, so P(X > 10) = 1 − 0.8114 = 0.1886.
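A sketch (not from the notes) checking these table look-ups with SciPy:

```python
# A minimal sketch (not from the notes) of the chi-squared calculations above.
from scipy.stats import chi2

print(chi2.cdf(2.5, df=3))     # P(X < 2.5) for chi-squared(3): 0.5247...
print(1 - chi2.cdf(10, df=7))  # P(X > 10) for chi-squared(7): 0.1886...
print(chi2.sf(10, df=7))       # same probability, via the survival function
```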
3.2.3 Percentage points
The 100$\alpha$% point for the $\chi^2_n$ distribution is denoted $\chi^2_{n,\alpha}$. Therefore, if $X \sim \chi^2_n$, then
$$P(X > \chi^2_{n,\alpha}) = \alpha.$$
The percentage points of the $\chi^2$ distribution are in a separate table in Lindley and Scott.
Extract:

P        99      95      10      5       1
ν = 1    0.000   0.004   2.706   3.841   6.635
ν = 2    0.020   0.103   4.606   5.991   9.210
ν = 3    0.115   0.352   6.251   7.815   11.34
ν = 4    0.297   0.711   7.779   9.488   13.28
ν = 5    0.554   1.145   9.236   11.07   15.09
ν = 6    0.872   1.635   10.64   12.59   16.81
ν = 7    1.239   2.167   12.02   14.07   18.48
ν = 8    1.646   2.733   13.36   15.51   20.09

In this table, the degrees of freedom for the distribution are listed going down the rows and P is 100$\alpha$. For example, $\chi^2_{5,0.1} = 9.236$, so P(X > 9.236) = 0.1 when $X \sim \chi^2_5$.
The chi-squared distribution is not symmetric (unlike the normal distribution). So if we want a lower percentage point (i.e. a value of x such that P(X < x) = $\alpha$), then we can't simply negate the corresponding upper percentage point. Instead we need to find $\chi^2_{n, 1-\alpha}$.
• Example 1:
Let $X \sim \chi^2_8$. Find the lower 1% point (i.e. the value of x such that P(X < x) = 0.01).
The lower 1% point is denoted $\chi^2_{8, 0.99}$, the value for which is 1.646.
• Example 2:
Suppose $X \sim \chi^2_{10}$. Find the value of t for which P(X > t) = 0.1321.
Here, t would be the 13.21% point for the distribution. But 0.1321 is a non-standard value of $\alpha$, so we need to use the distribution function table to find t:
$$P(X > t) = 0.1321 \implies P(X < t) = 1 - 0.1321 = 0.8679.$$
Going through the distribution table we find that t = 15.
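Both examples can be checked with SciPy's inverse c.d.f. A sketch (not from the notes):

```python
# A minimal sketch (not from the notes): chi-squared percentage points.
from scipy.stats import chi2

print(chi2.ppf(0.01, df=8))         # lower 1% point of chi-squared(8): 1.646...
print(chi2.ppf(1 - 0.1321, df=10))  # t with P(X > t) = 0.1321: about 15.0
```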
3.3 The Student t-distribution
3.3.1 Introduction
Definition: Suppose that we have two independent random variables Y and Z, such that Y ~ N[0, 1] and $Z \sim \chi^2_n$. Then the random variable X defined by
$$X = \frac{Y}{\sqrt{Z/n}}$$
has a t-distribution with n degrees of freedom, denoted $t_n$.
The t-distribution is symmetric about zero and its general shape is like the bell-shape of a normal distribution. However, the tails of the t-distribution approach zero much more slowly than those of the normal distribution, i.e. the t-distribution is more heavy-tailed than the normal. The degrees of freedom define how heavy-tailed the t-distribution is.
Note:
The t-distribution with n = 1 is sometimes referred to as the Cauchy distribution. This is so heavy tailed that its
mean and variance do not exist! (This is because the integrals specifying the mean and variance are not
absolutely convergent.)
Important note:
The density of a t-distribution converges to that of the standard normal as n → ∞.
The diagram below shows how the t-distribution varies for different degrees of freedom.
[Figure: densities of $t_2$, $t_5$ and $t_{20}$ compared with the standard normal, on −3 ≤ x ≤ 3.]

3.3.2 Probabilities
Probabilities associated with the t-distribution can be looked up in tables. In Lindley and Scott, the degrees of freedom are again denoted by $\nu$ and are listed along the top of the columns. Then for each value t listed, the values in the table are the probability that X < t.
• Example 1:
Let $X \sim t_3$. Then P(X < 2.5) = 0.9561.
• Example 2:
Let $X \sim t_{12}$. Find P(X > 2.5).
P(X > 2.5) = 1 − P(X < 2.5) = 1 − 0.986 = 0.014.
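A sketch (not from the notes) checking these values with SciPy:

```python
# A minimal sketch (not from the notes) of the t-distribution look-ups above.
from scipy.stats import t

print(t.cdf(2.5, df=3))       # P(X < 2.5) for t_3: 0.9561...
print(1 - t.cdf(2.5, df=12))  # P(X > 2.5) for t_12: 0.014...
```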
3.3.3 Percentage points
The 100$\alpha$% point for the $t_n$ distribution is denoted by $t_{n,\alpha}$. If $X \sim t_n$, then:
$$P(X > t_{n,\alpha}) = \alpha.$$
Percentage points for the t-distribution are tabulated separately. The degrees of freedom for the distribution are listed down the rows and P = 100$\alpha$.
• Example 1:
Find the 5% point for $t_6$.
Directly from tables, this is seen to be $t_{6, 0.05} = 1.943$. (Thus P(X > 1.943) = 0.05.)
As the t-distribution is symmetric, finding lower percentage points is simple.
• Example 2:
Let $X \sim t_{10}$. Find the value of t such that P(X < t) = 0.01 (i.e. find the lower 1% point).
The upper 1% point is $t_{10, 0.01} = 2.764$. But
$$P(X > 2.764) = 0.01 \implies P(X < -2.764) = 0.01.$$
So, the lower 1% point, t, is −2.764.
Note: To find non-standard percentage points (such as the 12.5% point, for example), we need to use the t-distribution function table.
3.4 The (Fisher's) F-distribution
3.4.1 Introduction
Definition: Consider two independent random variables Y and Z such that $nY \sim \chi^2_n$ and $mZ \sim \chi^2_m$. The random variable X defined by
$$X = \frac{Y}{Z}$$
is then said to have an F-distribution with n and m degrees of freedom, denoted $F_{n,m}$.
The F-distribution therefore has two parameters, both of which are degrees of freedom. The order of the degrees of freedom is important! The $F_{n,m}$ distribution is not the same as the $F_{m,n}$ distribution.
Note: The density for the F-distribution is only defined for positive values of x. The values of the two degrees of freedom define the shape of the distribution. Plots of the F-distribution for various values of n and m are shown below.
35
Graphs of several F distributions
1
n=2, m=2
n=4, m=4
n=8, m=8
n=20, m=20
0.9
0.8
0.7
Density
0.6
0.5
0.4
0.3
0.2
0.1
0
0
1
2
3
x
4
5
6
Graphs of several more F distributions
1
n=
n=
n=
n=
0.9
0.8
2, m = 4
4, m = 2
5, m = 10
10, m = 20
0.7
Density
0.6
0.5
0.4
0.3
0.2
0.1
0
0
0.5
1
1.5
2
2.5
x
36
3
3.5
4
4.5
5
Lindley and Scott do not have tables for looking up probabilities associated with the F-distribution.
3.4.2 Percentage points
Separate tables giving 10, 5, 2.5, 1, 0.5 and 0.1 percentage points for F-distributions with different combinations of degrees of freedom can be found in Lindley and Scott.
We will denote the (upper) 100$\alpha$% point for the $F_{n,m}$ distribution by $F_{n,m,\alpha}$. If $X \sim F_{n,m}$, then:
$$P(X > F_{n,m,\alpha}) = \alpha.$$
In the table of the 100$\alpha$ percentage points for the F-distribution, the first degrees of freedom is denoted $\nu_1$ and listed along the columns. The second degrees of freedom is denoted $\nu_2$ and listed down the rows.
Extract: 1% points of the F-distribution

ν₂ \ ν₁    1       2       3       4       5
1          4052    4999    5403    5625    5764
2          98.50   99.00   99.17   99.25   99.30
3          34.12   30.82   29.46   28.71   28.24
4          21.20   18.00   16.69   15.98   15.52
5          16.26   13.27   12.06   11.39   10.97

For example, the (upper) 1% point for an $F_{5,3}$ distribution is 28.24. We write $F_{5,3,0.01} = 28.24$.
• Example:
Find the 5% point for both the $F_{5,10}$ and the $F_{10,5}$ distributions.
From the 5% points table: $F_{5,10,0.05} = 3.326$ and $F_{10,5,0.05} = 4.735$. Notice that these are not the same.
The tables in Lindley and Scott give the upper percentage points only, i.e. they give the values of x such that P(X > x) = $\alpha$ for small values of $\alpha$. Since the F-distribution is not symmetric, to find lower percentage points we cannot simply use the negative of the corresponding upper percentage point:
$$P(X < x) \ne P(X > -x).$$
The density is in fact not even defined for x < 0.
3.4.3 Finding lower percentage points
Result: Suppose that $X = \frac{Y}{Z} \sim F_{n,m}$. Then
$$X^{-1} = \frac{Z}{Y} \sim F_{m,n}.$$
Proof: $X = \frac{Y}{Z} \sim F_{n,m}$ if $nY \sim \chi^2_n$ and $mZ \sim \chi^2_m$. But by the definition of the F-distribution, this means that $\frac{Z}{Y} \sim F_{m,n}$, as required.
We can use this result to find lower percentage points for F-distributions:
Important result:
The lower 100$\alpha$ percentage point for the $F_{n,m}$ distribution is the reciprocal of the upper 100$\alpha$ percentage point of the $F_{m,n}$ distribution.
Proof:
If $X \sim F_{n,m}$ and x represents the lower 100$\alpha$ percentage point for this distribution, then P(X < x) = $\alpha$. But
$$P(X < x) = P\!\left(\frac{1}{X} > \frac{1}{x}\right) = \alpha.$$
As $\frac{1}{X} \sim F_{m,n}$, $\frac{1}{x}$ is (by definition) the upper 100$\alpha$ percentage point of the $F_{m,n}$ distribution. So,
$$x = \frac{1}{F_{m,n,\alpha}}.$$
• Example 1:
Let $X \sim F_{5,10}$. Suppose we wish to find x such that P(X < x) = 0.05, i.e. we want to find the lower 5% point of the $F_{5,10}$ distribution.
The lower 5% point of the $F_{5,10}$ distribution is the reciprocal of the upper 5% point of the $F_{10,5}$ distribution. So,
$$x = \frac{1}{F_{10,5,0.05}} = \frac{1}{4.735} = 0.2112.$$
• Example 2:
Suppose $X \sim F_{4,7}$. Find the upper and lower 10% points.
The upper 10% point can be found directly from tables: $F_{4,7,0.1} = 2.961$. The lower 10% point is the reciprocal of the upper 10% point of the $F_{7,4}$ distribution:
$$\text{lower 10% point} = F_{4,7,0.9} = \frac{1}{F_{7,4,0.1}} = \frac{1}{3.979} = 0.2513.$$
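The reciprocal rule is easy to verify numerically. A sketch (not from the notes) using SciPy:

```python
# A minimal sketch (not from the notes) of F percentage points, including
# the reciprocal rule for lower points.
from scipy.stats import f

print(f.ppf(1 - 0.05, 5, 10))      # upper 5% point of F(5,10): 3.326...
print(f.ppf(0.05, 5, 10))          # lower 5% point, computed directly: 0.2112...
print(1 / f.ppf(1 - 0.05, 10, 5))  # same value via the reciprocal rule
```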
• Exercise:
Suppose $X \sim F_{2,4}$. Find the upper and lower 1% points.
3.5 Some additional facts about distributions
1) If $X_1, \dots, X_n$ are independent with $X_i \sim N[\mu_i, \sigma_i^2]$, i = 1, …, n, then
$$a_0 + \sum_{i=1}^{n} a_i X_i \sim N\!\left[\,a_0 + \sum_{i=1}^{n} a_i \mu_i,\; \sum_{i=1}^{n} a_i^2 \sigma_i^2\,\right].$$
2) If $X_1, \dots, X_n$ are i.i.d. as N[0, 1], then
(a) $X_i^2 \sim \chi^2_1$, for i = 1, 2, …, n;
(b) $\sum_{i=1}^{n} X_i^2 \sim \chi^2_n$.
3) If $X_1, \dots, X_n$ are independent with $X_i \sim \chi^2_{k_i}$, i = 1, …, n, then
$$\sum_{i=1}^{n} X_i \sim \chi^2_k, \quad \text{where } k = k_1 + \dots + k_n.$$
4) If $X \sim t_n$, then $X^2 \sim F_{1,n}$.
These results are not proved in this course.
Chapter 4: Sampling Distributions
4.1 Parameters
The purpose of many statistical investigations is to learn about the distribution of some random variable X. Many
aspects about X's distribution may be of interest, but attention often focuses on one or two particular population
characteristics.
• Example 1:
A bakery needs to decide how many loaves of fresh bread it should put out on its shelves each day. If they put out too many, then they will lose money as stale bread will not sell, and if they put out too few, then they will lose potential sales. Therefore, to help the bakery make its order, interest might focus on the mean number of loaves, $\mu$, usually sold on a particular day.
• Example 2:
Suppose that a company has the job of packing a certain breakfast cereal into boxes, so that each box contains approximately 500 g of cereal. The weight of cereal in each box varies around 500 g due to the variability of the cereal product. The company wants to check that the amount going into each box doesn't vary too much about 500 g: weights greater than 500 g will lose the company money and weights less than 500 g could lead to customer dissatisfaction. In this case, attention may focus on the variability of weights in the boxes as described by σ, the standard deviation of weights.
• Example 3:
When testing a new drug, a doctor might not be interested so much in the number of people cured by the drug, but rather the proportion, π, of people who are cured by the drug.

We call μ, σ and π population parameters. To learn about such parameters, we can observe a random sample of n observations, x₁,…,xₙ, and then use these data to calculate estimates for the parameter(s) of interest. For example, the sample mean could be used to estimate μ.
Definition: Any quantity computed from values in a sample is called a (sample) statistic.
• Example:
All the numerical summaries introduced in Chapter 2 are statistics, as they are all calculated from values in the random sample. This includes statistics such as the sample mean (which utilises all the observations in its calculation) and the sample median (which only takes account of the middle observations).
It is important to realise that there is a difference between population parameters and sample statistics. The
population parameter is a characteristic of the distribution of the random variable, is typically unknown and
cannot be observed. By contrast, a statistic is a characteristic of the sample and can be observed. For example,
the population mean μ has some fixed (but unknown) value. On the other hand, the sample mean, X̄, can be observed and therefore can be known for a particular sample. The observed value of X̄, however, can vary from sample to sample (as different samples will give different values of x₁,…,xₙ). The value of a statistic, therefore,
is subject to sampling variability.
Definition: As a statistic is a function of the random variables X₁,…,Xₙ, it is itself a random variable. The distribution of a statistic is called its sampling distribution.
The sampling distribution of a statistic describes the long-run behaviour of the statistic's values when many
different samples, each of size n, are obtained and the value of the statistic is computed for each sample.
4.2 The sampling distribution of the sample mean

To investigate the sampling distribution of X̄, we will consider several experiments.

Experiment 1: We generate 500 random samples (each of size n) from N[100, 400]. For each of these 500 samples we calculate x̄, so we have a random sample of 500 observations from the sampling distribution of X̄. This was repeated for n = 5, 20, 50.
[Histograms of the 500 observed sample means for n = 5, 20 and 50; horizontal axis: sample mean, vertical axis: frequency.]
Observations: In each case the distribution seems roughly normal and it is clear that each of these histograms is
centred roughly at 100 (the mean of the normal distribution from which the samples were generated). We can
also see that as the sample size n increases, the variability in the sampling distributions decreases (look carefully
at the scales on the horizontal axes).
These points can also be seen if we look at some statistics relating to each histogram above:

  Sample size            n = 5     n = 20    n = 50
  Mean                   100.07    99.83     100.05
  Standard deviation     8.17      4.40      2.81
We will do a similar set of experiments to see what the sampling distribution of X̄ is like when we are not sampling from the normal distribution.

Experiment 2: We generate 500 random samples (each of size n) from a uniform U[0, 1] distribution. Again, for each of these 500 samples we calculate x̄, so we have a random sample of 500 observations from the sampling distribution of X̄. This was repeated for n = 5, 10, 20, 50.
Note: If X ~ U[0, 1], then E[X] = 0.5 and Var[X] = 1/12 (so s.d. = 0.289).
[Histograms of the 500 observed sample means for n = 5, 10, 20 and 50; horizontal axis: sample mean, vertical axis: frequency.]
Observations: The shapes of the histograms relating to the sample means look increasingly like normal distributions as n increases, even though the data were sampled from a uniform distribution. The histograms in each case seem to centre on 0.5 (the mean of the U[0, 1] distribution). Also, the variability of the sampling distributions decreases as the sample size becomes larger.
The mean and standard deviation for the data in the four situations above are given below:
  Sample size            n = 5    n = 10   n = 20   n = 50
  Mean                   0.491    0.504    0.502    0.499
  Standard deviation     0.133    0.095    0.068    0.042
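Experiments of this kind are easy to reproduce. Here is a minimal sketch (assuming NumPy is available; 500 samples, as in the text) that regenerates the two tables above; the standard deviations should be close to σ/√n (20/√n and 0.289/√n respectively):

import numpy as np

rng = np.random.default_rng(0)

def sample_means(draw, n, reps=500):
    # one row per sample; each row mean is one observation of X-bar
    return draw((reps, n)).mean(axis=1)

for n in (5, 20, 50):      # Experiment 1: N[100, 400], i.e. s.d. 20
    m = sample_means(lambda size: rng.normal(100, 20, size), n)
    print(n, round(m.mean(), 2), round(m.std(ddof=1), 2))

for n in (5, 10, 20, 50):  # Experiment 2: U[0, 1]
    m = sample_means(lambda size: rng.uniform(0, 1, size), n)
    print(n, round(m.mean(), 3), round(m.std(ddof=1), 3))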
Important Result:
For an independent random sample X₁,…,Xₙ from a distribution with mean μ and variance σ², the sampling distribution of X̄ has the following properties:
1. E[X̄] = μ.
2. Var[X̄] = σ²/n. The standard deviation of X̄ (often called the standard error) is therefore σ/√n.
3. If each Xᵢ ~ N[μ, σ²], then X̄ ~ N[μ, σ²/n], regardless of the size of n.
4. If X₁,…,Xₙ are not normally distributed, then when n is large (say at least 30) the distribution of X̄ is approximately N[μ, σ²/n].
Proof:
1. E[X̄] = E[(1/n) Σᵢ Xᵢ] = (1/n) Σᵢ E[Xᵢ] = (1/n) · nμ = μ, as required.
2. Because we are assuming that the random variables are independent, we can also write
   Var[X̄] = Var[(1/n) Σᵢ Xᵢ] = (1/n²) Σᵢ Var[Xᵢ] = (1/n²) · nσ² = σ²/n, as required.
3. A linear combination of normally distributed random variables also has a normal distribution. The mean and variance are as given above.
4. Not proved here.
Note:
Part (4) of the above result is the Central Limit Theorem, an extremely powerful and useful result in Statistics.
• Example 1:
X₁,…,X₂₀ are independently and identically distributed N[30, 5]. Find the sampling distribution of X̄.

Solution: Here n = 20 and so X̄ ~ N[30, 5/20] = N[30, 0.25].
• Example 2:
X₁,…,X₄₀ are i.i.d. Po(10) random variables. What, approximately, is the sampling distribution of X̄?

Solution: The sample size can be considered large enough for the Central Limit Theorem to be applied, so the sampling distribution can be considered approximately normal. A Po(10) distribution has mean and variance equal to 10, therefore X̄ ~ N[10, 10/40] = N[10, 0.25] (roughly).
4.3 Sampling distribution of the sample proportion
In many statistical investigations we are interested in learning about the proportion of individuals, or objects, in a
population that possess a specified property. For example, we might be interested in what proportion of patients
are alive 5 years after diagnosis of a particular cancer, or we might be interested in the proportion of UK adults
who would like a ban on blood-sports. Denote the true population proportion of interest by π. Note that π is a population parameter.
To learn about π, we could observe a random sample in which each of the n observations is either a “success” or a “failure”. The sample proportion, p, is given by
p = (number of successes) ÷ n.
The sample proportion is clearly a sample statistic. It makes sense to use p to learn about π. We are therefore interested in the sampling distribution of p.
To investigate the sampling distribution for p, we will look at 2 experiments in which we generate random
samples of observed values of p.
Experiment 1:
Suppose that we generate 500 samples of size n, where each sampled value is either a success (with probability π = 0.25) or a failure (with probability 1 − π = 0.75). We then calculate the observed proportion of “successes” in each of the 500 samples. We will do this for n = 5, 10, 20 and 50.
[Histograms of the 500 observed sample proportions for n = 5, 10, 20 and 50; horizontal axis: sample proportion p, vertical axis: frequency.]
Observations:
For a sample of size 5, the possible values of p are 0, 0.2, 0.4, 0.6, 0.8 and 1. The sampling distribution for p
gives the probability of each of these 6 values. The histogram for the case n = 5 is positively skewed.
As n increases, the histograms become more and more symmetrical and in fact when n = 50 the histogram
clearly resembles a normal curve centred on 0.25. In addition, increasing the sample size decreases the range of
observed values for p.
Experiment 2:
Once again we will generate 500 samples, but this time we will have the sample sizes n = 10, 25, 50 and 100, and we will take the true proportion of successes, π, to be 0.07. So once again each observation in each sample is either a success (S) with probability 0.07, or a failure (F) with probability 0.93.
[Histograms of the 500 observed sample proportions for n = 10, 25, 50 and 100; horizontal axis: sample proportion p, vertical axis: frequency.]
Observations:
When n = 10, the possible values for p are 0, 0.1, 0.2, …, 1. The histogram for the 500 samples is very positively skewed and no values greater than 0.4 were observed for p. [Notice how in the previous experiment, the density for p was not very skewed when n = 10.]
As n increases to 25 and 50, the histograms still look positively skewed. However, when the sample size reaches
100, the histogram is beginning to look slightly more normal. Therefore we note that in this experiment we need
larger sample sizes than in Experiment 1 before the sampling distribution for p looks approximately normal.
We also note that increasing the sample size again results in a narrowing in the range of observed values for p.
Thus to summarise the observations from this experiment:
• Densities are roughly centred about π = 0.07.
• The variance of p decreases as n increases.
• As the sample size increases, the density for p becomes approximately normal. However, the density tends to normality much more slowly than when we had π = 0.25. Therefore, it appears that the rate at which the sampling distribution for p tends to normality depends not only on the sample size n, but also on the value of π.
Important result:
If p is the sample proportion of successes in a random sample of size n, where π is the true proportion of successes, then the following results hold:
• The expected value of p is π.
• The standard error (i.e. s.d.) of p is √(π(1 − π)/n).
• When n is sufficiently large, the sampling distribution for p is approximately normal.

Note: The further the value of π is from 0.5, the larger the value of n must be in order for the normal approximation of the sampling distribution for p to be accurate.

Rule of thumb:
If both nπ ≥ 5 and n(1 − π) ≥ 5, then we may use the normal approximation for p.
Proof:
Let X = total number of successes in the sample. Then X ~ Bi[n, π] and so
E[X] = nπ and Var[X] = nπ(1 − π), giving sd[X] = √(nπ(1 − π)).
But, by definition, the sample proportion is p = X/n, and so
E[p] = E[X/n] = (1/n) E[X] = (1/n) nπ = π.
Also,
Var[p] = Var[X/n] = (1/n²) Var[X] = (1/n²) nπ(1 − π) = π(1 − π)/n.
Taking square roots, we get the required standard error for p.
The proof of the normality approximation is simply an application of the Central Limit Theorem, so that for large n,
p ~ N[π, π(1 − π)/n] approximately.
• Example 1:
Suppose that the proportion of women who believe that they are underpaid is 0.55.
a) If we had a random sample of size 10, could we assume that the sampling distribution for p is approximately normal?
b) For a random sample of 400, what are the mean value and standard deviation of p?
c) In a sample of size 400, what is the probability that we observe the proportion of women who believe they are underpaid to be greater than 0.6?

Solution:
a) π = 0.55 and n = 10, so nπ = 5.5 and n(1 − π) = 4.5. As these are not both ≥ 5, we cannot assume that the distribution of p is normal with a sample size of only 10.
b) n = 400, so:
E[p] = π = 0.55;
Var[p] = π(1 − π)/n = (0.55 × 0.45)/400 = 0.000619, so sd[p] = 0.0249.
For n = 400, nπ = 220 and n(1 − π) = 180, and so p's distribution can be considered approximately normal. Therefore
p ~ N[0.55, 0.000619].
c) P(p > 0.6) = P(Z > (0.6 − 0.55)/0.0249) = P(Z > 2.008) = 1 − Φ(2.008) = 1 − 0.9778 = 0.0222 approximately.
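The same calculation can be scripted. A minimal sketch, assuming SciPy is available for the normal CDF:

from math import sqrt
from scipy.stats import norm

pi, n = 0.55, 400
se = sqrt(pi * (1 - pi) / n)            # standard error: ~0.0249
print(1 - norm.cdf((0.6 - pi) / se))    # P(p > 0.6): ~0.022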

• Example 2:
Suppose that the true proportion of individuals with a particular disease is 0.02. What minimum sample size would be needed before p's distribution can be assumed to be approximately normal?

Solution: For approximate normality we need nπ ≥ 5 and n(1 − π) ≥ 5. Now
n(0.02) ≥ 5 ⟹ n ≥ 250 and n(0.98) ≥ 5 ⟹ n ≥ 5.102.
Therefore, to assume approximate normality for p, we would need a sample size of at least 250.
• Exercise:
90% of the population are right-handed. In a sample of 200 people, what is the probability that the sample
proportion who are right-handed is less than 0.86?
4.4 Sampling distribution of the sample variance

When we want to learn about the variance, σ², of a population, it is natural to first look towards the sample variance, S². We are therefore interested in the sampling distribution of S².
In general, the sampling distribution of S² does not follow any fixed rules, and so here we will only look at the case when X₁,…,Xₙ are i.i.d. N[μ, σ²].

Important result:
If X₁,…,Xₙ are i.i.d. N[μ, σ²], where μ is unknown, then
(n − 1)S²/σ² ~ χ²_{n−1}.
Proof: The proof will not be given in this course.

Experiment:
To demonstrate that this result does in fact hold in practice, 500 samples were generated from N[100, 400] for various sample sizes n, and the value of
(n − 1)S²/σ² = (n − 1)S²/400
was calculated for each of the 500 samples. Histograms of these values then demonstrate what the sampling distribution of (n − 1)S²/σ² looks like in each case.

[Histograms of the statistic (n − 1)S²/400 for n = 3, 5, 10 and 20; horizontal axis: statistic, vertical axis: frequency.]
Observations:
In the case when n = 3, the histogram for the sample of 500 observations of (n − 1)S²/σ² is heavily positively skewed and resembles a χ²₂ distribution. The histograms for the other cases, where n = 5, 10 and 20, also resemble chi-squared distributions (the respective degrees of freedom should be 4, 9 and 19).
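This experiment, too, can be reproduced in a few lines. A sketch assuming NumPy and SciPy; the mean of the simulated statistic should be close to n − 1, the mean of a χ²_{n−1} distribution:

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n, sigma2 = 10, 400
s2 = rng.normal(100, 20, size=(500, n)).var(axis=1, ddof=1)  # 500 sample variances
stat = (n - 1) * s2 / sigma2
print(stat.mean(), chi2.mean(n - 1))   # both should be ~9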
Chapter 5: Point Estimation

Definition:
A (point) estimator, θ̂, is a statistic (some function of the sample X₁,…,Xₙ) used to produce a single-value estimate of a parameter θ. An estimate is the value an estimator takes for a particular sample.
  Statistic                                   Estimator for (parameter)
  Sample mean, median, trimmed mean, …        Population mean, μ
  Sample variance, S²                         Population variance, σ²
  Sample proportion, p                        Population proportion, π
There will be a range of possible estimators for a population parameter, θ. However, some estimators will be sensible to use and some will not. To help us decide whether θ̂ is good to use, we look at its sampling distribution.
Suppose that the sampling distribution of θ̂ (an estimator for θ) looks like:
[diagram]
In this case, the true value of θ is to the right of the sampling distribution of θ̂, so θ̂ is a poor estimator, as it will always underestimate θ. Ideally, the distribution of θ̂ should be concentrated around θ, i.e. we want:
[diagram]
Definition:
θ̂ is an unbiased estimator of θ if E[θ̂] = θ.
So, on average, the observed value of an unbiased estimator will be the true value of the parameter it is trying to estimate.
Result 1: X̄ is an unbiased estimator of μ.
Proof:
E[X̄] = E[(1/n) Σᵢ Xᵢ] = (1/n) Σᵢ E[Xᵢ] = (1/n) · nμ = μ.
Therefore, as E[X̄] = μ, X̄ is an unbiased estimator of μ.
Result 2: S² is an unbiased estimator of σ².
Proof:
Recall that S² = (1/(n − 1)) Σᵢ (Xᵢ − X̄)². But
Σᵢ (Xᵢ − X̄)² = Σᵢ ((Xᵢ − μ) − (X̄ − μ))²
             = Σᵢ (Xᵢ − μ)² − 2(X̄ − μ) Σᵢ (Xᵢ − μ) + n(X̄ − μ)²
             = Σᵢ (Xᵢ − μ)² − 2n(X̄ − μ)² + n(X̄ − μ)²
             = Σᵢ (Xᵢ − μ)² − n(X̄ − μ)²,
using Σᵢ (Xᵢ − μ) = n(X̄ − μ). So
E[S²] = (1/(n − 1)) E[Σᵢ (Xᵢ − μ)² − n(X̄ − μ)²] = (1/(n − 1)) (Σᵢ E[(Xᵢ − μ)²] − n E[(X̄ − μ)²]).
But, by definition,
E[(Xᵢ − μ)²] = Var[Xᵢ] = σ² and E[(X̄ − μ)²] = Var[X̄] = σ²/n.
Therefore, we have
E[S²] = (1/(n − 1)) (nσ² − n · σ²/n) = (1/(n − 1)) (n − 1)σ² = σ².
Therefore, as E[S²] = σ², S² is an unbiased estimator of σ².

Note: This is why we choose n − 1 rather than n as the divisor in the definition of the sample variance.
Result 3: Suppose that X is the number of successes in n trials, so that X ~ Bi[n, π]. Then the sample proportion p = X/n is an unbiased estimator of π.
Proof:
As X has a binomial distribution, and therefore mean nπ, we have
E[p] = E[X/n] = (1/n) E[X] = (1/n) nπ = π.
Therefore p is an unbiased estimator of π.
Definition:
The bias of θ̂ is defined as
B(θ̂) = E[θ̂] − θ.

• Example:
Find the bias of the estimator
σ̂² = (1/n) Σᵢ (Xᵢ − X̄)².

Solution: We know that
E[(1/(n − 1)) Σᵢ (Xᵢ − X̄)²] = σ².
So
E[σ̂²] = E[(1/n) Σᵢ (Xᵢ − X̄)²] = ((n − 1)/n) E[(1/(n − 1)) Σᵢ (Xᵢ − X̄)²] = ((n − 1)/n) σ².
Therefore,
B(σ̂²) = E[σ̂²] − σ² = ((n − 1)/n) σ² − σ² = −σ²/n,
indicating that (1/n) Σᵢ (Xᵢ − X̄)² will, on average, underestimate σ².
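The size of this bias is easy to see by simulation. A minimal sketch, assuming NumPy; with n = 5 and σ² = 400, the divide-by-n estimator should average about (n − 1)/n × 400 = 320:

import numpy as np

rng = np.random.default_rng(3)
samples = rng.normal(100, 20, size=(100_000, 5))   # many samples of size n = 5
print(samples.var(axis=1, ddof=0).mean())   # divide by n: biased, ~320
print(samples.var(axis=1, ddof=1).mean())   # divide by n - 1: unbiased, ~400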
Just because an estimator is unbiased, it doesn't necessarily mean that it is a good estimator (it just means that on average it will produce a value that is the true value of θ).
Illustration:
Suppose that we have two possible estimators θ̂₁ and θ̂₂, with sampling distributions as sketched below:
[diagram]
Here θ̂₁ is unbiased whereas θ̂₂ is biased. However, in this case we would prefer θ̂₂ to θ̂₁. This is because the observed value of θ̂₂ is likely to be closer to the true value of θ than that of θ̂₁ (θ̂₂ has a smaller standard error). So, by choosing θ̂₂ rather than θ̂₁, we are maximising our chance that the observed value of our estimator will be close to the true value of θ.
Ideally we want an estimator with small bias and small standard error.
• Example:
Suppose that X₁,…,Xₙ, n > 1, is a random sample from N[μ, σ²]. Show that X₁, the first observation, is an unbiased estimator of μ. If you were given a choice of using X₁ or X̄ as your estimator for μ, which would you prefer?

Solution: X₁ ~ N[μ, σ²], so E[X₁] = μ. Therefore X₁ is an unbiased estimator for μ.
Both X₁ and X̄ are unbiased estimators, so we'll choose the one with the smaller standard error:
s.e.[X₁] = s.d.[X₁] = σ, whereas s.e.[X̄] = σ/√n.
So, as n > 1, s.e.[X̄] < s.e.[X₁], and we would prefer to use X̄ as an estimator of μ.
Chapter 6: Interval Estimation

6.1 Introduction

The heights (cm) of a random sample of 12 primary school children of a certain age were as follows:
114, 137, 132, 140, 125, 116, 110, 118, 136, 131, 122, 128.
We might be interested in learning about the mean height, μ, of all children of that age. We know that the sample mean can be used as a point estimate for μ: here x̄ = 125.75 cm. However, because of sampling variability, the true value of μ may be quite different from this estimated value.
It would be more useful if we could use the data to identify an interval within which we believe the true mean μ would lie. We call this a confidence interval.
We can show the above data diagrammatically on a dotplot:
[Dotplot of the children's heights. Annotations: the true value of μ is likely to be somewhere in the centre of the data; it is unlikely that μ would lie beyond either extreme of the data (if the sample were random).]
In Statistics, the degree of confidence we have that an interval contains the parameter we are trying to estimate is
expressed as a percentage. For example, if a 95% confidence interval were produced then we would be 95%
confident that the resulting interval would contain the true value of the parameter. Alternatively, we could produce a 99% confidence interval; this would be wider than a 95% confidence interval.
6.1.1 Definitions

Definition:
An interval [T₁, T₂] is a 100(1 − α)% confidence interval for a parameter θ if it contains θ with probability (1 − α).
An alternative way of thinking of this is as follows: if the method for deriving, for example, a 95% confidence interval were to be repeated a large number of times, then approximately 95% of the intervals produced would contain the true value of θ.

Note: We have to be very careful when talking about confidence intervals. It is not acceptable, for example, to refer to θ having a given probability of lying in a confidence interval. This is because, by attaching a probability to θ lying within the interval, you are creating the impression that θ is not a fixed quantity. It is the end-points of the confidence interval that are random quantities, varying from sample to sample.
6.1.2 Example (continued)

Returning to the simple introductory example about the heights of primary school children, a 95% confidence interval for the population mean μ is shown in the diagram below.
[Dotplot of the heights of children, with the 95% t-confidence interval for the mean marked around x̄.]
Later in the chapter (Section 6.3), you will find out how to calculate this interval for yourselves. You will also discover how to find confidence intervals for a population variance and for a population proportion.
We start with the most basic situation, namely finding a confidence interval for a population mean when the population variance is known.
6.2 Confidence Intervals for μ (Known Population Variance)

6.2.1 Confidence intervals when data follow a normal distribution

Background: Consider a random sample X₁,…,Xₙ drawn from a N[μ, σ²] distribution, where we assume that the population variance, σ², is known.

Problem: Suppose that we wish to calculate a 100(1 − α)% confidence interval for μ. Then we want to find two statistics T₁ and T₂ such that
P(T₁ ≤ μ ≤ T₂) = 1 − α.

Note 1: T₁ and T₂ are the random variables, not μ.
Note 2: (1 − α) is usually taken to be 0.9, 0.95 or 0.99. The higher the value of (1 − α), the more confident we are that the confidence interval does in fact contain μ. However, the higher (1 − α) is, the wider the interval becomes and therefore the less informative it is about μ's location. So there exists a trade-off.
Derivation of confidence interval: We know that if X₁,…,Xₙ are normally distributed then
X̄ ~ N[μ, σ²/n].
Thus, by applying the standardisation formula,
(X̄ − μ)/(σ/√n) ~ N[0, 1].
Therefore,
P(−z_{α/2} ≤ (X̄ − μ)/(σ/√n) ≤ z_{α/2}) = 1 − α ⟹ P(−z_{α/2} σ/√n ≤ X̄ − μ ≤ z_{α/2} σ/√n) = 1 − α.
Rearranging further gives:
P(X̄ − z_{α/2} σ/√n ≤ μ ≤ X̄ + z_{α/2} σ/√n) = 1 − α.
We therefore have the following result:

Result: When we have a sample X₁,…,Xₙ from a N[μ, σ²] distribution with known variance σ², a 100(1 − α)% confidence interval for μ is given by
X̄ ± z_{α/2} σ/√n.
• Example:
A biologist selects 15 beetles at random from a colony she is studying. The weights of these beetles (in g) are as follows:
5.7, 4.9, 5.3, 5.0, 5.4, 5.1, 5.2, 5.2, 5.3, 5.4, 5.7, 5.1, 5.6, 5.0, 5.3.
Assuming that the weights follow a normal distribution with known population standard deviation 0.2 g, calculate a 95% confidence interval for the population mean weight.

Solution: Sample mean = (5.7 + 4.9 + 5.3 + … + 5.3)/15 = 79.2/15 = 5.28 g.
From normal percentage point tables, z_{0.025} = 1.96. Thus, the 95% confidence interval is
x̄ ± z_{α/2} σ/√n = 5.28 ± 1.96 × 0.2/√15 = (5.179, 5.381).
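This interval is quickly checked in code. A minimal sketch, assuming NumPy and SciPy are available:

import numpy as np
from scipy.stats import norm

weights = np.array([5.7, 4.9, 5.3, 5.0, 5.4, 5.1, 5.2, 5.2,
                    5.3, 5.4, 5.7, 5.1, 5.6, 5.0, 5.3])
sigma, n = 0.2, len(weights)
half = norm.ppf(0.975) * sigma / np.sqrt(n)           # z_{0.025} * sigma / sqrt(n)
print(weights.mean() - half, weights.mean() + half)   # ~(5.179, 5.381)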
• Exercise:
A new drug to lower blood pressure is given to 20 volunteers and their fall in BP is recorded. From previous work the standard deviation of the change in BP is known to be 8 mmHg, and the falls are believed to follow a normal distribution. The mean fall in the sample is 6 mmHg. Find a 99% confidence interval for the mean fall in BP.
6.2.2 Confidence intervals when the sample size n is large

The assumption that the data follow a normal distribution can be relaxed if the sample size, n, is large (rule of thumb: n > 30). This is because in such situations the Central Limit Theorem can be applied, ensuring that the sample mean, X̄, will be approximately normally distributed. Thus we have the result:

Result: When we have a sample X₁,…,Xₙ from any distribution with mean μ and known variance σ², then, if the sample size n is large, a 100(1 − α)% confidence interval for μ is approximately given by
X̄ ± z_{α/2} σ/√n.

• Example:
Michael is a keen cyclist and rides his bicycle every day. On a random sample of 44 days he averages 18 miles per day. The standard deviation for all days is known to be 5 miles. Find a 90% confidence interval for his mean daily mileage.

Solution: In this example, the sample size is n = 44, i.e. large enough for the Central Limit Theorem to apply. From tables, the 5% point for a standard normal distribution is z_{0.05} = 1.645. Therefore, the 90% confidence interval is
x̄ ± z_{α/2} σ/√n = 18 ± 1.645 × 5/√44 = (16.76, 19.24).
So we are 90% confident that the mean number of miles travelled per day lies in the interval (16.76, 19.24).
Important: It is the interval which varies from sample to sample, not μ. So, for example, if we generated a 95% confidence interval for each of 100 different samples, we would expect 95 of them to contain μ.
6.3 Confidence Intervals for μ (Unknown Population Variance)

In most situations the population variance is not known. In such situations, some amendment is needed to the formulae presented in Section 6.2. We deal first with the case where the sample size, n, is small.

6.3.1 Confidence intervals when σ is unknown and n is small

The formulae presented in the previous section for a confidence interval for μ are written in terms of the population variance, σ². When this is unknown, the formulae cannot be applied.
A confidence interval is instead derived from the following result:

Important result: Suppose X₁,…,Xₙ is a random sample from a N[μ, σ²] distribution, where both parameters are unknown. If S² denotes the sample variance, then
(X̄ − μ)/(S/√n) ~ t_{n−1}.
Proof (see also Example Sheet 4):
We know that
Y = (X̄ − μ)/(σ/√n) ~ N[0, 1] and Z = (n − 1)S²/σ² ~ χ²_{n−1}.
Therefore, by definition of the t distribution,
Y/√(Z/(n − 1)) ~ t_{n−1}.
But
Y/√(Z/(n − 1)) = [(X̄ − μ)/(σ/√n)] / √(S²/σ²) = (X̄ − μ)/(S/√n).
Derivation of confidence interval:
From the above result, we know that
P(−t_{n−1,α/2} ≤ (X̄ − μ)/(S/√n) ≤ t_{n−1,α/2}) = 1 − α.
We now need to rearrange the inequality so that μ is in the centre:
P(−t_{n−1,α/2} S/√n ≤ X̄ − μ ≤ t_{n−1,α/2} S/√n) = 1 − α
⟹ P(X̄ − t_{n−1,α/2} S/√n ≤ μ ≤ X̄ + t_{n−1,α/2} S/√n) = 1 − α.
Thus the upper and lower end-points of the 100(1 − α)% confidence interval are given by
X̄ ± t_{n−1,α/2} S/√n.

Result: Suppose that X₁,…,Xₙ ~ N[μ, σ²], where σ² is unknown. Then a 100(1 − α)% confidence interval for μ is given by
X̄ ± t_{n−1,α/2} S/√n,
where S is the sample standard deviation.
Comparing this result to that given in the previous section, where σ was assumed known, two changes can clearly be seen:
• a percentage point from a t distribution is used in place of a normal percentage point;
• the population standard deviation is replaced by the sample standard deviation, S.
• Example:
The numbers of hours spent by 10 randomly chosen computer science students completing their assessed coursework were as follows:
5.5, 1.5, 3.6, 7.2, 2.4, 3.8, 4.0, 1.9, 5.3, 2.7.
Calculate a 99% confidence interval for the mean time spent on the coursework in the population of all students.

Solution: Here, the population variance is unknown. So we must begin by finding the sample mean and variance:
x̄ = Σᵢ xᵢ/10 = 37.9/10 = 3.79 hours.
S² = (1/9)(Σᵢ xᵢ² − (Σᵢ xᵢ)²/10) = (1/9)(172.49 − 37.9²/10) = 3.205, so S = 1.790 hours.
The appropriate percentage point here comes from a t distribution with 10 − 1 = 9 degrees of freedom:
t_{9,0.005} = 3.250.
Thus the 99% confidence interval for the population mean μ is
x̄ ± t_{9,0.005} S/√n = 3.79 ± 3.250 × 1.79/√10 = (1.95, 5.63).
Note that in producing this confidence interval we need to assume that the data are normally distributed.
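A sketch of the same t-interval in code, assuming SciPy's t distribution stands in for the tables:

import numpy as np
from scipy.stats import t

hours = np.array([5.5, 1.5, 3.6, 7.2, 2.4, 3.8, 4.0, 1.9, 5.3, 2.7])
n, xbar, s = len(hours), hours.mean(), hours.std(ddof=1)
half = t.ppf(0.995, n - 1) * s / np.sqrt(n)   # t_{9,0.005} = 3.250
print(xbar - half, xbar + half)               # ~(1.95, 5.63)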
• Exercise:
A tennis player wishes to examine his service performance in a particular match. The speeds (in mph) of 8
randomly selected serves were as follows:
98, 92, 101, 80, 94, 99, 88, 96.
Calculate a 95% confidence interval for this player’s mean service speed in this match.
6.3.2 Confidence intervals for μ when σ is unknown and n is large

We noted in Chapter 3 that a t distribution looks very much like the standard normal distribution when the degrees of freedom are large. Therefore, in producing a confidence interval for μ in situations when
i. the population variance is unknown, and
ii. the sample size, n, is large (e.g. n > 30),
we can approximate the percentage point t_{n−1,α/2} that occurs in the formula by z_{α/2}.
Moreover, when the sample size is large, the assumption that the data follow a normal distribution is less critical (because the Central Limit Theorem can then be applied). We therefore have the following result:

Given a large (n > 30) sample X₁,…,Xₙ drawn from a distribution with mean μ and (unknown) variance, a 100(1 − α)% confidence interval for μ is (approximately) given by
X̄ ± z_{α/2} S/√n.
• Example:
A nursery is growing a large number of tomato plants. A sample of 45 plants was taken at random and their heights were found. If the sample mean and standard deviation were 5.2 cm and 1.3 cm respectively, calculate a 90% confidence interval for the mean height of the tomato plants in the nursery.

Solution: Here the sample size is n = 45, so it can be considered large. Consequently, we can take our percentage points from the standard normal distribution rather than from t₄₄ (which incidentally does not appear in tables). The population standard deviation is unknown, so we use the sample standard deviation as an estimate.
From statistical tables, the appropriate 5% point is z_{0.05} = 1.645. Therefore, the 90% confidence interval is
x̄ ± z_{0.05} S/√n = 5.2 ± 1.645 × 1.3/√45 = (4.88, 5.52).
• Exercise:
75 randomly selected smokers were asked how many cigarettes they had smoked the previous day. The sample
mean and variance were 20 and 196 respectively. Calculate a 95% confidence interval for the population mean.
6.4 Confidence Intervals for the Population Variance

Let us assume that we have a random sample, X₁,…,Xₙ, drawn from a normal distribution, N[μ, σ²], with both parameters unknown. In the previous section, we learnt how to produce a confidence interval for μ. We now look at producing a 100(1 − α)% confidence interval for σ².
Idea: We need to find T₁ and T₂ such that P(T₁ ≤ σ² ≤ T₂) = 1 − α. These values can be obtained by making use of the sampling distribution of S².

Derivation of confidence interval:
We know that (n − 1)S²/σ² ~ χ²_{n−1}. Therefore,
P(χ²_{n−1,1−α/2} ≤ (n − 1)S²/σ² ≤ χ²_{n−1,α/2}) = 1 − α,
where χ²_{n−1,α/2} and χ²_{n−1,1−α/2} denote the upper and lower percentage points respectively. Inverting throughout gives
P(1/χ²_{n−1,α/2} ≤ σ²/((n − 1)S²) ≤ 1/χ²_{n−1,1−α/2}) = 1 − α.
Multiplying throughout by (n − 1)S² gives:
P((n − 1)S²/χ²_{n−1,α/2} ≤ σ² ≤ (n − 1)S²/χ²_{n−1,1−α/2}) = 1 − α.
We therefore have the following result:

Result: Given a random sample X₁,…,Xₙ from N[μ, σ²], a 100(1 − α)% confidence interval for σ² is given by
((n − 1)S²/χ²_{n−1,α/2}, (n − 1)S²/χ²_{n−1,1−α/2}).
• Example:
The blood cholesterol levels in a sample of 11 people are as follows:
270, 256, 330, 324, 291, 279, 329, 344, 308, 297, 310.
Calculate 95% confidence intervals for the population mean and standard deviation.

Solution: We first need to calculate the sample mean and variance:
x̄ = (270 + … + 310)/11 = 3338/11 = 303.45.
S² = (1/10)(Σᵢ xᵢ² − (Σᵢ xᵢ)²/11) = (1/10)(1020584 − 3338²/11) = 765.273, so S = 27.66.
Confidence intervals for μ and σ can only be produced if the data are normally distributed, so we need to make this assumption.
A 95% confidence interval for μ is then
x̄ ± t_{10,0.025} S/√n = 303.45 ± 2.228 × 27.66/√11 = (284.87, 322.03),
using t_{10,0.025} = 2.228.
The upper and lower 2.5% points of a χ²₁₀ distribution are 20.48 and 3.247 respectively. The 95% confidence interval for σ² is therefore
((n − 1)S²/χ²_{10,0.025}, (n − 1)S²/χ²_{10,0.975}) = (10 × 765.273/20.48, 10 × 765.273/3.247) = (373.67, 2356.86).
Taking square roots of the upper and lower end-points results in the following confidence interval for σ:
(19.33, 48.55).
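Both intervals can be checked with a short script. A sketch assuming NumPy and SciPy (note that the upper chi-squared point goes in the denominator of the lower limit):

import numpy as np
from scipy.stats import chi2

chol = np.array([270, 256, 330, 324, 291, 279, 329, 344, 308, 297, 310])
n, s2 = len(chol), chol.var(ddof=1)
lo = (n - 1) * s2 / chi2.ppf(0.975, n - 1)   # ~373.7
hi = (n - 1) * s2 / chi2.ppf(0.025, n - 1)   # ~2356.9
print(lo, hi)                       # CI for the variance
print(np.sqrt(lo), np.sqrt(hi))     # CI for the s.d.: ~(19.3, 48.6)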
• Exercises:
1. A machine puts rice into 400g packets and the standard deviation over a long period is 2.5g. A new machine
is evaluated by means of a random sample of 21 packets whose sample standard deviation is 3.2g. Find a
90% confidence interval for the standard deviation of the new machine.
2. The speeds in mph of 15 randomly selected cars passing a police speed checkpoint were as follows:
27, 31, 34, 30, 32, 38, 26, 30, 32, 34, 31, 29, 41, 35, 33.
Calculate a 99% confidence interval for the population mean and variance.
6.5 Confidence Intervals for a Population Proportion (with large n)

We often want to make inferences about a proportion. For example, we might want to estimate the proportion of people who currently support the Conservative Party.
Suppose we denote the population proportion by π. Then, following our previous method, to find a confidence interval for π, we need to find T₁ and T₂ such that P(T₁ ≤ π ≤ T₂) = 1 − α.
To do this we make use of the sampling distribution of the sample proportion, p:
p ~ N[π, π(1 − π)/n] (approximately).
Recall that this result was appropriate if the sample size is large (nπ ≥ 5 and n(1 − π) ≥ 5). Standardising gives the approximate result
(p − π)/√(π(1 − π)/n) ~ N[0, 1].
Derivation of confidence interval:
Using the above result, we can write
P(−z_{α/2} ≤ (p − π)/√(π(1 − π)/n) ≤ z_{α/2}) = 1 − α ⟹ P(−z_{α/2} √(π(1 − π)/n) ≤ p − π ≤ z_{α/2} √(π(1 − π)/n)) = 1 − α.
Rearranging so that π is alone in the centre of this inequality gives
P(p − z_{α/2} √(π(1 − π)/n) ≤ π ≤ p + z_{α/2} √(π(1 − π)/n)) = 1 − α.
But the limits of this confidence interval are functions of π, which is unknown. So to calculate the limits, π must be estimated. As long as the sample size is large, the value of √(p(1 − p)/n) should be close to √(π(1 − π)/n) and can be used in its place.
This result thus follows:

Result: A 100(1 − α)% confidence interval for π when n is large (np ≥ 5 and n(1 − p) ≥ 5) is given by
p ± z_{α/2} √(p(1 − p)/n).
• Example:
120 university students were randomly selected. Of these, 11 had taken one or more years off between leaving school and entering university. Calculate a 95% confidence interval for π, the proportion of all students entering university on this basis.

Solution: From the question, the sample proportion is p = 11/120 = 0.0917. As np = 11 ≥ 5 and n(1 − p) = 109 ≥ 5, the confidence interval for π can be calculated as
p ± z_{α/2} √(p(1 − p)/n) = 0.0917 ± 1.96 × √(0.0917 × 0.9083/120) = (0.040, 0.143).
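In code, a minimal sketch assuming SciPy:

from math import sqrt
from scipy.stats import norm

n, successes = 120, 11
p = successes / n
half = norm.ppf(0.975) * sqrt(p * (1 - p) / n)
print(p - half, p + half)   # ~(0.040, 0.143)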
• Exercise:
The paper “Worksite smoking cessation programs: a potential for national impact” (Amer. J. of Public Health, 1983, pp. 1395-96) investigated the effectiveness of smoking cessation programs at work. The program tested involved group meetings and monetary incentives for attending meetings and for not smoking. Of those who participated in the experiment, 91% successfully stopped smoking and were still abstinent 6 months later. Suppose a representative sample of 70 people were involved in the experiment. Let π denote the success rate of the program (π = population proportion of participants who would still be non-smokers 6 months after completing the program). Find a 99% confidence interval for π.
6.6 Choosing the Sample Size

All the confidence intervals we've looked at depend on the sample size n. For example, the confidence interval for μ with σ known is
x̄ ± z_{α/2} σ/√n.
As n gets larger, the width of the confidence interval decreases, which means that the interval becomes more informative about the unknown parameter.
• Example:
There is interest in learning about the mean I.Q. of students at UKC. If the standard deviation of I.Q.s can be assumed to be 20, find the sample size that will ensure that the width of a 99% confidence interval for μ is less than 4 units.

Solution: As the population s.d. is known, the appropriate formula for the confidence interval for μ is
x̄ ± z_{α/2} σ/√n.
The width of this interval is 2 z_{α/2} σ/√n. Because the appropriate percentage point is z_{0.005} = 2.5758, to find n we need to solve
2 × 2.5758 × 20/√n < 4 ⟹ √n > 103.032/4 = 25.758 ⟹ n > 663.5.
We would therefore need around 664 students in the sample.
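This kind of calculation is easily wrapped in a small helper. A sketch, assuming SciPy; required_n is a hypothetical helper name, not from the notes:

from math import ceil, sqrt
from scipy.stats import norm

def required_n(sigma, width, conf=0.99):
    # smallest n with CI width 2 * z * sigma / sqrt(n) below `width`
    z = norm.ppf(1 - (1 - conf) / 2)
    return ceil((2 * z * sigma / width) ** 2)

print(required_n(20, 4))   # 664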
6.7 One-sided Confidence Intervals
Up until now we have been calculating two-sided confidence intervals for parameters. In other words, we have
been setting a lower and upper confidence limit on the parameter in question.
For example, consider again the following data relating to the heights (in cm) of primary school children of a
certain age:
114, 137, 132, 140, 125, 116, 110, 118, 136, 131, 122, 128.
Assuming normality, we can construct a 95% confidence interval for the population mean, μ. This is shown in the diagram below. (Note that because the population variance is unknown, the confidence interval must be calculated using a percentage point from a t distribution.)
[Dotplot of height (cm), with the 95% t-confidence interval for the mean marked around x̄.]
We can therefore express 95% confidence that the mean lies between the indicated lower and upper limits (i.e. there is a 5% probability that the interval will not contain μ).
However, we might only be interested in finding a lower (or upper) limit for μ.
[Dotplot of the children's heights with a one-sided confidence interval; lower limit = 120.65.]
In producing the one-sided confidence interval above, we are putting just a lower limit on μ. Here, a 95% one-sided confidence interval has been calculated as [120.65, ∞). Note that there is a 5% probability that the lower limit will be greater than μ.
• Example:
Consider again the blood pressure example that was presented in Section 6.2.1. Here there were 20 volunteers in the sample, the population standard deviation was known to be 8 mmHg and the sample mean was 6 mmHg. Suppose all that matters is the least possible fall in blood pressure. We would then calculate a one-sided confidence interval. For example, for a one-sided 99% confidence interval the lower limit would be
x̄ − z_{0.01} σ/√n = 6 − 2.3263 × 8/√20 = 1.84 mmHg.
The 99% one-sided confidence interval therefore is [1.84, ∞).
• Exercise:
Consider again the following cholesterol data taken from 11 volunteers:
270, 256, 330, 324, 291, 279, 329, 344, 308, 297, 310.
Suppose that we are only interested in an upper limit for the population mean. Calculate this one-sided interval
with a 95% confidence coefficient.
Chapter 7: Hypothesis testing

7.1 Introduction
Introductory scenario
Proponents of a particular dieting regime claim that people will, on average, lose 14 pounds if the plan is followed for six weeks. A nutritionist wishes to test this claim; she suspects it to be false.
In order to test the plausibility of the claim (or hypothesis) some data are needed. Relevant data here would be
the weight losses from a sample of say 50 people who followed the diet over 6 weeks. We then need to assess
how consistent the hypothesis is with the observed data.
It is important to note that we cannot absolutely prove or disprove a hypothesis, only gather evidence for or
against it.
Other examples:
• a manufacturer might claim that the mean lifetime of a brand of battery is 110 hours;
• a political party might claim that the proportion of voters who will vote for them in the next general election is 45%.
In each case, a sample could be taken and the sample values used to determine whether or not the hypothesised population value is reasonable.
To introduce hypothesis testing we shall use a specific example of testing hypotheses about a mean μ when the population variance (σ²) is known.
7.2 Testing hypotheses for μ (known population variance)

7.2.1 Terminology

In hypothesis testing, we wish to choose between two competing hypotheses. These are called the null hypothesis (denoted H₀) and the alternative hypothesis (denoted H₁). Generally, the null hypothesis is the one that we suspect could be false, and the alternative hypothesis is the one that we usually hope to be true.
We illustrate this terminology through two examples.
• Example 1:
An IQ test is designed so that the average score in the population as a whole is 100 with s.d. 20, and so that the scores follow a normal distribution.
A random sample of 25 children at a school under investigation takes the test. The sample mean score is x̄ = 108.3. Is there any evidence that this school has children with an IQ different from the general population?
Let μ denote the mean I.Q. for all children at that school. The null and alternative hypotheses would then be as follows:
• H₀: μ = 100 (i.e. the school has the same mean IQ as the whole population);
• H₁: μ ≠ 100 (i.e. the school has a different mean IQ from the general population).
The null hypothesis here is the cautious hypothesis which we initially assume to be true; i.e. without any sample data, given any group of children we would initially assume that their average IQ is the same as the general population's.
Note that we here have a two-sided alternative hypothesis, as we are testing whether the mean IQ differs from the hypothesised value of 100. If we were looking to test whether the mean IQ was greater (or smaller) than this value, we would need to specify a one-sided alternative hypothesis (see later).
• Example 2:
Researchers have postulated that, due to differences in diet, Japanese children have a different mean blood cholesterol level compared with British children. Suppose that the mean level for British children is known to be 170. Let μ represent the mean blood cholesterol level for Japanese children. What hypotheses should the researchers test?
The null hypothesis represents what we initially assume to be true. So without any sample information about Japanese children we'd initially assume that μ = 170, and so
H₀: μ = 170.
The alternative hypothesis is that the cholesterol level of Japanese children differs from that of British children, and so
H₁: μ ≠ 170.
7.2.2 General formulation (two-sided test)

In general, suppose that we have the hypotheses
H₀: μ = μ₀ versus H₁: μ ≠ μ₀.
Background:
To test H₀ against H₁, we initially assume that H₀ is true. We then see how plausible data at least as extreme as our observed data would be under this assumption. If the probability of observing our sample result is small under H₀'s distribution, then we are unlikely to have observed what we actually have if H₀ were true; in this case it therefore looks like H₀ is not true. On the other hand, if the probability of observing our sample result is large, then we could plausibly have observed the sample we in fact got, and therefore H₀ could be true.
The sample mean, X̄, is a good estimator of μ, so it makes sense to use our observed sample mean, x̄, to test hypotheses about μ.
Theory:
Consider the case where a sample X₁,…,Xₙ is obtained from a N[μ, σ²] distribution (where σ² is assumed to be known). The hypotheses we are interested in testing are
H₀: μ = μ₀ versus H₁: μ ≠ μ₀.
We know from Chapter 4 that
X̄ ~ N[μ, σ²/n].
If the null hypothesis is true, then
X̄ ~ N[μ₀, σ²/n], so that Z = (X̄ − μ₀)/(σ/√n) ~ N[0, 1].
We will use the distribution of Z to decide whether our sample of data (summarised by the sample mean) could plausibly have been obtained from a normal distribution with mean μ₀. Z is referred to as the test statistic. The observed value of the test statistic is
z = (x̄ − μ₀)/(σ/√n).
If H₀ is true, a sampled value z of Z will have come from N[0, 1]. In this case we are most likely to observe a value of z which lies in the main body of the distribution (as these values would be the most probable values of Z to observe). Therefore, if we observe z in the main body of N[0, 1], this sampled value would support H₀. We would then have no evidence to reject H₀, or equivalently we could say that we “accept” H₀ to be true. Note that this is not the same thing as saying that H₀ is true, only that we have no evidence to say that it is false.
Suppose now that the observed value z of Z lies in the tails of the standard normal distribution. Such a value would have been unlikely to occur if H₀ were true. So if we observe z outside the main body of N[0, 1], then this sample value would not support H₀. We would therefore reject H₀.
The range of values of the test statistic that would lead us to reject the null hypothesis is called the critical region. The next problem, then, is to decide how to specify the exact values of our critical region.
In carrying out a hypothesis test there are two types of error we can make.
• Type 1 error: H₀ is rejected when in fact it is true.
• Type 2 error: we fail to reject H₀ when in fact it is false.
P(type 1 error) is usually denoted α, and we call it the size of the test. We can use this value to find a suitable critical region for the test.
A type 1 error is usually thought to be the more serious, and we therefore define our test so that we have a suitably low value of α. The values of α that are acceptable will vary from situation to situation; the most usual values are 0.1, 0.05, 0.01 or 0.001. Now
α = P(reject H₀ when it is true) = P(observe z in the tails of N[0, 1]).
So, by setting a value for α, we can find our critical region.
For example, if α = 0.05, then we will reject H₀ if we observe z ≥ z_{0.025} or z ≤ −z_{0.025}, i.e. if z ≥ 1.96 or z ≤ −1.96. So our critical region is {z : z ≤ −1.96 or z ≥ 1.96}. We then say that we have a test at the 5% significance level.
To test between the hypotheses
H₀: μ = μ₀ versus H₁: μ ≠ μ₀
when X₁,…,Xₙ is a random sample from a normal distribution with known population variance:
• use the test statistic Z = (X̄ − μ₀)/(σ/√n);
• reject H₀ at the 100α% level if |z| ≥ z_{α/2}.
• Example:
Consider again the IQ example. Here we had a random sample of 25 children from a particular primary school. The mean IQ in the sample was 108.3. The hypotheses of interest were
H₀: μ = 100 versus H₁: μ ≠ 100.
The population standard deviation is known to be 20 and we can assume that IQs follow a normal distribution.

Solution: The test statistic in this situation is given by
Z = (X̄ − μ₀)/(σ/√n).
The observed value of this test statistic is
z = (108.3 − 100)/(20/√25) = 8.3/4 = 2.075.
For a 5% test, the critical values for the test statistic are ±z_{0.025} = ±1.96. As the observed value of the test statistic lies in the critical region, we can reject H₀ at the 5% significance level.
For a 1% test, the critical values would be ±z_{0.005} = ±2.5758. Since z = 2.075 < 2.5758, we would not be able to reject H₀ at the 1% significance level.
We interpret these test results as follows: the data provide some evidence (but not strong evidence) to suggest that the mean IQ of children in this primary school differs from the general population.
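A sketch of this z-test in code, assuming SciPy; the final line also computes the two-sided p-value discussed in Section 7.2.4:

from math import sqrt
from scipy.stats import norm

xbar, mu0, sigma, n = 108.3, 100, 20, 25
z = (xbar - mu0) / (sigma / sqrt(n))
print(z)                              # 2.075: beyond 1.96, inside 2.5758
print(2 * (1 - norm.cdf(abs(z))))     # two-sided p-value: ~0.038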
Important notes:
• Always state the level of significance you are using when rejecting or accepting H₀.
• Rejection of H₀ means there is definite evidence to reject H₀. “Acceptance” of H₀ means that there is insufficient evidence to reject H₀, i.e. H₀ may still be untrue, but we do not have enough data to reject it. This is regularly misunderstood.
• Significance tests are commonly conducted at the following levels: 5%, 1% and 0.1%. These significance levels provide varying degrees of evidence against H₀:
  5% level: some evidence against H₀;
  1% level: strong evidence against H₀;
  0.1% level: very strong evidence against H₀.
• Exercise:
A machine is designed to produce bolts with a mean length of 25 mm. The standard deviation of the length of the bolts is known to be 0.23 mm. After a routine service, a random sample of bolts were measured and the lengths (in mm) were found to be:
25.5, 25.3, 25.1, 25.6, 24.9, 25.0, 25.4, 25.3, 25.0, 24.8, 25.2, 25.4.
Test to see whether the servicing of the machine has altered the mean length of the bolts it produces. Assume that the standard deviation is unchanged and that the data can be assumed to follow a normal distribution.
Note:
If the sample size is large, the assumption that the data are normally distributed is less critical. This is because the Central Limit Theorem ensures that X̄ ~ N[μ, σ²/n] (approximately), whatever the distribution of X₁,…,Xₙ, when n is large.
• Example:
The manager of a telesales department claims that the average time that an operator spends talking to a potential client is 70 seconds. The managing director of the company doubts this claim and times a random sample of 40 telephone calls. The sample mean was 62 seconds. If the population standard deviation is known to be 45 seconds, carry out a hypothesis test at the 5% significance level.

Solution: If μ denotes the population mean call length, the hypotheses are
H₀: μ = 70 versus H₁: μ ≠ 70.
The sample size here is large (n > 30) and so we do not need to assume that the call times follow a normal distribution (the Central Limit Theorem ensures that the distribution of X̄ is roughly normal).
The test statistic is Z = (X̄ − μ₀)/(σ/√n), giving an observed value
z = (62 − 70)/(45/√40) = −1.124.
For a 5% test, the critical values would be ±z_{0.025} = ±1.96. So there is no evidence to reject H₀ at this level, i.e. it is plausible that the mean call length is 70 seconds.
7.2.3 Link between hypothesis tests and confidence intervals

Consider our usual hypotheses:
H₀: μ = μ₀ versus H₁: μ ≠ μ₀.
We will be able to reject the null hypothesis at the 5% significance level if a 95% confidence interval for μ excludes μ₀:
[Diagram: the 95% confidence interval lies entirely to one side of μ₀. Interpretation: reject the null hypothesis at the 5% level.]
If a 95% confidence interval for μ includes μ₀, then we do not have sufficient evidence at the 5% level to reject the null hypothesis:
[Diagram: the 95% confidence interval contains μ₀. Interpretation: “accept” the null hypothesis at the 5% level (μ₀ is a plausible value for the population mean).]
In general, we can reject H₀ at the 100α% level if and only if a 100(1 − α)% confidence interval for μ excludes μ₀.
7.2.4 p-values

Specifying the size of the test, together with the conclusion about whether the result was statistically significant at that level, is one way in which a hypothesis test can be carried out. A more informative way of giving the strength of evidence against a null hypothesis is to calculate a p-value.
The p-value gives the exact observed significance of the data, i.e. it specifies the probability of observing a result at least as extreme as our sample result, given that H₀ is true. The p-value is often simply denoted by p.

• Example (IQ example continued):
In the IQ example we observed z = 2.075. To calculate the p-value we need to calculate the probability of observing a result which is at least as extreme as this:
p = P(Z ≥ 2.075 or Z ≤ −2.075) = P(Z ≥ 2.075) + P(Z ≤ −2.075) = 2 × (1 − Φ(2.075)) = 2 × 0.019 = 0.038.
The observed level of significance is therefore 0.038.
This value is consistent with our earlier conclusions: we can reject the null hypothesis at the 5% level but not at the 1% level.
7.2.5 One-sided tests

• Example (continued):
Consider again the earlier example concerning whether Japanese children have a different mean blood cholesterol level than British children. Because a Japanese diet has less saturated fat than a British diet, researchers might postulate that the mean cholesterol level for Japanese children is in fact lower than that of British children (whose mean level is 170). They may then want to test this via a hypothesis test.
Once again the null hypothesis represents what we initially assume to be true, and so again we'd set
H₀: μ = 170.
However, the alternative hypothesis is now that the cholesterol level of Japanese children is less than that of British children, and so
H₁: μ < 170.
To test these hypotheses, we'll reject H₀ in favour of H₁ only if we observe small values of z. We wouldn't reject H₀ if we observe large values of z this time, because large values of z are now more consistent with H₀ than H₁. We'll therefore reject H₀ only if we observe z in the lower tail of N[0, 1]. As we're only rejecting H₀ if z falls in one of the tails of the distribution, we call this a one-tailed test.

Note:
A one-tailed test is appropriate only when it is known that deviations from the null hypothesis will be in a particular direction.
• Example:
The average mark in an A-level examination paper has traditionally been 58%. After a change in the syllabus, it is suspected that the A-level paper will now be easier. The marks of 10 randomly chosen candidates sitting the new syllabus are as follows:
64, 67, 35, 46, 78, 59, 53, 84, 60, 56.
If the population variance is known to be 225, perform a hypothesis test to see whether marks are now significantly higher than before.

Solution: The hypotheses we wish to test are as follows:
H₀: μ = 58 versus H₁: μ > 58.
To carry out this (one-sided) test, we need to assume that the data are normally distributed. The sample mean is
x̄ = Σᵢ xᵢ/n = 602/10 = 60.2.
Therefore, the observed value of the test statistic is
z = (x̄ − μ₀)/(σ/√n) = (60.2 − 58)/√(225/10) = 0.464.
For a 5% test, we would reject the null hypothesis if z ≥ z_{0.05} = 1.6449. The conclusion then must be to “accept” the null hypothesis at this level. The data provide no evidence to support the view that the examination marks are on average higher than before the syllabus change.
Incidentally, the p-value associated with this test can be found as follows:
p = P(Z ≥ 0.464) = 1 − Φ(0.464) = 1 − 0.6772 = 0.3228 (approximately).
(The alternative hypothesis is one-sided, so we find only the probability of values larger than the one we observed.)
• Exercise:
Suppose that the mean systolic blood pressure for white males aged 35-44 is 127.2. A random sample of 13
diabetic males aged 35-44 was taken and their systolic blood pressure was measured. The results are given below.
119.2, 130.2, 134.4, 120.1, 137.6, 128.0, 136.9, 129.1, 130.6, 127.9, 136.8, 135.4, 142.0.
Suppose that you are told that the standard deviation of systolic blood pressure for white males aged 35-44 is
6.726 and that it can be assumed that the data roughly follow a normal distribution. Investigate whether there is
evidence to suggest that the systolic blood pressure is
i. different
ii. higher
for diabetic 35-44 year old males than for the general population. Calculate the p-value in each case.
7.1.6 Calculating the probability of type 1 and 2 errors
 Example:
A coin is tossed 7 times. Suppose that we want to test the hypotheses:
H0: the coin is fair
versus
H1: the coin is biased in favour of heads.
A test is proposed which rejects H0 if 6 or more heads are observed.
a) What is the probability of a type 1 error?
b) What is the probability of a type 2 error if the coin is in fact biased so that P(heads) = 0.6?
Solution:
a) P(type 1 error) = P(reject H0 | H0 true) = P(6 or more heads | coin fair)
= ⁷C₆ × (1/2)⁶ × (1/2) + (1/2)⁷ = 0.0546 + 0.0078 = 0.0624.
b) P(type 2 error) = P(accept H0 | P(heads) = 0.6) = 1 − P(reject H0 | P(heads) = 0.6)
= 1 − P(6 or more heads | P(heads) = 0.6)
= 1 − [0.6⁷ + ⁷C₆ × 0.6⁶ × 0.4] = 1 − (0.028 + 0.131) = 0.841.
7.1.7 Power function
Definition: The power function of a test, which we'll denote π(μ), is defined as
π(μ) = P(reject H0 | μ).
So for each value of μ, we will have a different value for the power of the test.
 Example:
Consider the one-tailed examination marks example again. For a 5% test we reject H0 when
z > 1.6449,
or equivalently when
(x̄ − 58)/√(225/10) > 1.6449.
So, for example, the power for μ = 62 is calculated as follows.
π(62) = P(Z > 1.6449 | μ = 62) = P(X̄ > 58 + 1.6449 × √22.5 | μ = 62) = P(X̄ > 65.80 | μ = 62).
Now, if μ = 62, then
X̄ ~ N[62, 225/10], so that (X̄ − 62)/√22.5 ~ N[0,1].
Therefore,
P(X̄ > 65.80 | μ = 62) = P(Z > (65.80 − 62)/√22.5) = P(Z > 0.80) = 1 − 0.7881 = 0.2119
using standard normal tables.
Note:
We will want a test to have large power for values of μ in H1 and small power for values of μ in H0, i.e. we want
to maximise the chance of coming to the correct conclusion.
Notice that
P(type 1 error) = P(Reject H0 | H0 true) = P(Reject H0 | μ = μ0) = π(μ0)
and when constructing our test we have already set this value to be suitably small.
Further, if our hypotheses are simply
H0: μ = μ0 versus H1: μ = μ1
then
π(μ1) = P(Reject H0 | μ = μ1) = 1 − P(Accept H0 | μ = μ1) = 1 − P(Accept H0 | H0 false) = 1 − P(type 2 error).
So in this case, if we have a large power, then we have a small probability of making a type 2 error.
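To make the idea concrete, the sketch below (illustrative Python, assuming scipy) evaluates the power function π(μ) of the one-tailed examination-marks test at a few values of μ:

```python
# Power function pi(mu) for the one-sided 5% test with sigma = 15, n = 10, mu0 = 58.
from math import sqrt
from scipy.stats import norm

sigma, n, mu0 = 15.0, 10, 58.0
se = sigma / sqrt(n)
cutoff = mu0 + 1.6449 * se          # reject H0 when xbar > 65.80

def power(mu):
    # pi(mu) = P(Xbar > cutoff | mu), with Xbar ~ N[mu, sigma^2/n]
    return 1 - norm.cdf((cutoff - mu) / se)

for mu in (58, 60, 62, 64, 66):
    print(f"pi({mu}) = {power(mu):.4f}")
# pi(58) = 0.05 (the size of the test) and pi(62) is about 0.212,
# matching the hand calculation above.
```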
7.2 Hypothesis tests for μ (unknown population variance)
Recall that when finding a confidence interval for μ when σ² is unknown we made use of the result:
(X̄ − μ)/(S/√n) ~ t_{n−1}
where S is the sample standard deviation. We will use this result again in hypothesis testing.
7.2.1 One sample t-test
Consider the situation where we have a random sample of observations, X_1, ..., X_n, drawn from a normal
distribution with unknown mean and variance. We wish to use these data to compare the following hypotheses
about the population mean μ:
H0: μ = μ0 versus H1: μ ≠ μ0.
The relevant test statistic would now be
T = (X̄ − μ0)/(S/√n)
which we know follows a t distribution with n − 1 degrees of freedom if the null hypothesis is true.
As in the previous section, if H0 is true then we would expect to observe values of T in the main body of the t_{n−1}
distribution. On the other hand, if H0 is not true, then we might expect to observe t in the tails of this t-distribution.
The critical values that we use as cut-off points between accepting and rejecting the null hypothesis are the
α/2 points from the t_{n−1} distribution. Hence we reject H0 if we observe
t ≤ −t_{n−1,α/2} or t ≥ t_{n−1,α/2}.
One-sample t-test: To test the hypotheses
H0: μ = μ0 versus H1: μ ≠ μ0
when X_1, ..., X_n follow a normal distribution with unknown variance:
• use the test statistic T = (X̄ − μ0)/(S/√n);
• reject H0 at the 100α% level if |t| ≥ t_{n−1,α/2}.
This test can easily be adjusted if the alternative hypothesis is one-tailed. For example, if H1 took the form
H1: μ > μ0
then we reject the null hypothesis if t ≥ t_{n−1,α}.
Note: In performing a one-sample t-test we assume that the data are independently distributed as a normal
distribution. This assumption is less critical if the sample size is large (see later).
 Example:
Ten randomly selected ‘pints’ pulled from a campus bar are measured accurately. The amount of beer (fl.oz) in
these ‘pints’ was as follows:
19.96, 19.97, 19.94, 20.01, 19.99, 19.97, 19.95, 19.97, 20.00, 19.98.
Test between the hypotheses H0: μ = 20 versus H1: μ ≠ 20. Find the associated p-value.
• Here, we begin by finding the sample mean and variance:
x̄ = (19.96 + ... + 19.98)/10 = 199.74/10 = 19.974;
S² = (1/(n−1)) [Σ xᵢ² − (Σ xᵢ)²/n] = (1/9) [3989.611 − 199.74²/10] = 0.000471, so S = 0.0217.
For testing between the hypotheses
H0: μ = 20 versus H1: μ ≠ 20,
we use the following test statistic:
T = (X̄ − μ0)/(S/√n).
This test statistic follows a t_9 distribution if the null hypothesis is true. Here, its observed value is
t = (19.974 − 20)/(0.0217/√10) = −3.79.
The relevant critical values for different sizes of test are:
5% test: t_{9,0.025} = 2.262
1% test: t_{9,0.005} = 3.250
0.1% test: t_{9,0.0005} = 4.781.
Conclusion: We can reject the null hypothesis at the 1% level. There is strong evidence that the average beer
contents are not 20 fl.oz. (i.e. a pint).
Note that in performing this test it is necessary to assume that the measurements follow a normal distribution.
The p-value associated with this test is:
p = P(T ≤ −3.79 or T ≥ 3.79) = P(T ≤ −3.79) + P(T ≥ 3.79).
But,
P(T ≥ 3.79) = 1 − P(T ≤ 3.79) = 1 − 0.9979 = 0.0021.
So, the p-value is 0.0021 × 2 = 0.0042 (or 0.42%).
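The same test is available directly in Python; a minimal sketch, assuming scipy is installed:

```python
# One-sample t-test for the beer ('pints') example.
from scipy.stats import ttest_1samp

pints = [19.96, 19.97, 19.94, 20.01, 19.99,
         19.97, 19.95, 19.97, 20.00, 19.98]

t, p = ttest_1samp(pints, popmean=20)   # two-sided by default
print(f"t = {t:.2f}, p = {p:.4f}")      # t = -3.79, p close to 0.0042
```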
 Example:
The widths (in mm) of a sample of 7 beetles, chosen from a particular island, were measured and found to be:
29, 34, 26, 31, 38, 33, 36.
The mean width of the beetles on the island is usually 36 mm, but due to recent adverse weather conditions it is
believed that their growth may have been stunted. Perform a hypothesis test to assess whether the data provide
any evidence to support this view.
• We must again assume that the widths follow a normal distribution. The hypotheses we wish to test are
H0: μ = 36 versus H1: μ < 36.
It can be shown that the sample mean and standard deviation are 32.4286 mm and 4.1173 mm respectively.
Therefore the observed test statistic is:
t = (32.4286 − 36)/(4.1173/√7) = −2.29.
Because this is a one-sided test, the appropriate 5% critical value is t_{6,0.05} = 1.943. Since t = −2.29 < −1.943, we can reject the null hypothesis at the 5% level. There is some evidence to suggest that the beetles' average width has decreased.
 Exercise:
The mean weight (in kg) of British children of a certain age is 32 kg. A random sample of American children of
this same age gave the following set of weights:
38, 34, 35, 43, 47, 40, 31, 39, 37, 42, 36, 35, 29, 38.
Perform a test (stating the necessary distributional assumptions) to assess whether there appears to be a
difference in the mean weights of American and British children at this age.
7.2.2 Hypothesis tests for μ for large samples (unknown population variance)
When the sample size is large (say n > 30) the distribution of the sample mean should be approximately normal
whatever the distribution of the original data. We therefore do not need to make the assumption of normality in
the one-sample t-test for large n.
Further, when the sample size is large, the distribution of the test statistic
T = (X̄ − μ0)/(S/√n)
will be approximately standard normal. [Recall: the t-distribution becomes approximately N[0, 1] as the
degrees of freedom increase.]
• Example:
A machine putting cereal into boxes should be set so that the average content of each box weighs 510 g. The
machine is serviced, after which the weight of cereal in a random sample of 38 boxes is checked. The sample
mean was 513.4 g and the sample variance was 67.8 g².
Test to see if there has been a change to the average content of the boxes.
• The hypotheses here are:
H0: μ = 510 versus H1: μ ≠ 510.
The observed value of the test statistic is:
z = (x̄ − μ0)/(S/√n) = (513.4 − 510)/√(67.8/38) = 2.545.
As the sample size is large, the critical points for this test should be approximately those from a standard normal.
Thus, z_{0.025} = 1.96 and z_{0.005} = 2.5758. We can see that we can reject the null hypothesis at the 5% level, but
that there is not quite enough evidence to reject it at the 1% level.
7.3 Hypothesis tests for the population variance
We here assume that X_1, ..., X_n follow a normal distribution, N[μ, σ²], where μ is unknown. We are now
interested in testing hypotheses about σ².
When finding a confidence interval for σ² when μ was unknown, we used the fact that
(n − 1)S²/σ² ~ χ²_{n−1}.
We will use this fact to define a hypothesis test for σ².
Suppose that the null hypothesis is H0: σ² = σ0². Our test statistic then is
Y = (n − 1)S²/σ0²
which has a chi-squared distribution with n − 1 degrees of freedom under H0.
Then, if H0 is true we would expect to observe values of Y in the main body of a χ²_{n−1} distribution, and if H0 is
not true, then we might expect to observe Y in the tails of this distribution.
So, if we have H1: σ² ≠ σ0², then we'll reject H0 if we observe
y ≥ χ²_{n−1,α/2} or y ≤ χ²_{n−1,1−α/2}.
One-sided alternative hypotheses can be tested by using the critical points χ²_{n−1,α} or χ²_{n−1,1−α}, as appropriate.
Result: To test the hypotheses
H0: σ² = σ0² versus H1: σ² ≠ σ0²
when μ is unknown and X_1, ..., X_n follow a normal distribution:
• use the test statistic Y = (n − 1)S²/σ0²;
• reject H0 at the 100α% level if y ≥ χ²_{n−1,α/2} or y ≤ χ²_{n−1,1−α/2}.
Adjust for 1-tailed tests accordingly.
 Example:
Historically it is known that the journey time between 2 points is normally distributed with a standard deviation
of 6 minutes. After roadworks a sample of 10 journey times is found to have a sample standard deviation of 5
mins. Is there evidence of a change in the population variance?
• Want to test
H0: σ² = 36 versus H1: σ² ≠ 36.
We've observed
y = (9 × 25)/36 = 6.25.
If H0 is true, Y ~ χ²_{n−1} = χ²_9. For a 5% test, the appropriate critical points are
χ²_{9,0.975} = 2.70 and χ²_{9,0.025} = 19.02.
Our observed test statistic lies between these critical points. Therefore, there is no evidence at the 5%
significance level to reject H0. So no evidence for a change in the population variance.
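There is no single built-in function for this test in scipy, but the statistic and critical points are easily computed; a minimal sketch, assuming scipy is available:

```python
# Chi-squared test for a variance: journey-time example.
from scipy.stats import chi2

n, s2, sigma0_sq = 10, 5.0**2, 6.0**2
y = (n - 1) * s2 / sigma0_sq            # 6.25

lower = chi2.ppf(0.025, df=n - 1)       # 2.70
upper = chi2.ppf(0.975, df=n - 1)       # 19.02
print(f"y = {y:.2f}; reject if y < {lower:.2f} or y > {upper:.2f}")
# 2.70 < 6.25 < 19.02, so H0 is not rejected at the 5% level.
```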
7.4 Hypothesis tests for a proportion (with large n)
Let the unknown population proportion of interest be denoted by π. Recall from Section 4.3 that if p is our
sample proportion, then for large sample size, n, we have the approximate result:
(p − π)/√(π(1 − π)/n) ~ N[0,1].
Now suppose that we have H0: π = π0. To test this we can use the statistic:
W = (p − π0)/√(π0(1 − π0)/n).
If the null hypothesis holds, then W ~ N[0, 1] when the sample size is large.
We therefore have the following result:
Result: To test between the hypotheses H0: π = π0 versus H1: π ≠ π0 for large sample sizes,
• use the test statistic W = (p − π0)/√(π0(1 − π0)/n);
• reject H0 at the 100α% level if w ≤ −z_{α/2} or w ≥ z_{α/2}.
We adjust the test accordingly for one-sided alternative hypotheses. For example, if the alternative hypothesis is
H1: π > π0, then we would reject the null hypothesis only when w ≥ z_α.
 Example:
In a survey of 588 doctors, 365 believed that it was sometimes right to agree to hasten a patient's death. Based on
this information, would you conclude that more than 60% of all doctors feel that it is sometimes appropriate to
help a seriously ill person die? Carry out a test at the 5% level.
• Our hypotheses here would be:
H0: π = 0.6 versus H1: π > 0.6.
We've observed p = 365 ÷ 588 = 0.62 and so
w = (0.62 − 0.6)/√(0.6 × (1 − 0.6)/588) = 0.989.
But z_{0.05} = 1.6449 > 0.989 and so we cannot reject H0 using a test at the 5% significance level. There is
insufficient evidence to suggest that the proportion of doctors who think mercy killings are sometimes
appropriate is greater than 0.6.
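A quick computational check (illustrative Python, assuming scipy; note that the hand calculation above rounds p to 0.62, which is why it gives 0.989 rather than 1.027):

```python
# Large-sample one-sided test for a proportion: doctors example.
from math import sqrt
from scipy.stats import norm

n, successes, pi0 = 588, 365, 0.6
p = successes / n                             # 0.6207
w = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)     # about 1.03
p_value = 1 - norm.cdf(w)                     # upper tail, about 0.15

print(f"w = {w:.3f}, p-value = {p_value:.3f}")
# w < 1.6449, so H0: pi = 0.6 is not rejected at the 5% level.
```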
 Additional example:
A coin is thrown 140 times resulting in 85 heads. Test whether the data suggest that the coin is biased.
Chapter 8: Two sample problems
8.1 Introduction
There are many situations where we wish to compare the characteristics of two different populations, on the
basis of a sample drawn from each.
8.1.1 Introductory example
Daily protein intake (in grams) is measured on a sample of individuals living below the poverty level and
another sample living above the poverty level with the results:
Below poverty level:
51.4, 49.7, 72.0, 76.7, 65.8, 55.0, 73.7, 62.1, 79.7, 66.2, 75.8, 65.4, 65.5, 62.0, 73.3
Above poverty level:
86.0, 69.0, 59.7, 80.2, 68.6, 78.1, 98.6, 69.8, 87.7, 77.2.
Given these data we might be interested in seeing whether we conclude that poverty influences diet. The sample
mean and standard deviation for the protein intakes for each group are as follows:

                  Below poverty level   Above poverty level
Sample mean       x̄1 = 66.29            x̄2 = 77.49
Sample s.d.       S1 = 9.17             S2 = 11.34
Sample size       n1 = 15               n2 = 10
A box-and-whisker plot showing the protein intakes in the two groups is given below:
[Boxplots of Below and Above (means are indicated by solid circles); protein intake shown on a vertical scale from 50 to 100.]
This shows that on average individuals above the poverty level seem to have a higher daily protein intake than
those living below the poverty level.
What we need though is a formal test to see whether there is a significant difference between the two groups.
8.1.2 Notation and preliminary work
Suppose that in general we have two populations and we select a random sample from each:
Sample from population 1: X_{11}, X_{12}, ..., X_{1n1}
Sample from population 2: X_{21}, X_{22}, ..., X_{2n2}
We will consider the case in which the populations are normally distributed. In this case:
X_{1i} ~ N[μ1, σ1²], i = 1, ..., n1
X_{2j} ~ N[μ2, σ2²], j = 1, ..., n2
We aim to use the sample data to make inferences about the difference in population means μ1 − μ2.
To answer this question we need to examine the sampling distribution of the estimator of μ1 − μ2. Now,
X̄1 ~ N[μ1, σ1²/n1] and X̄2 ~ N[μ2, σ2²/n2].
As the two samples are independent, X̄1 and X̄2 are also independent, so
X̄1 − X̄2 ~ N[μ1 − μ2, σ1²/n1 + σ2²/n2].
So this is the sampling distribution of X̄1 − X̄2 and we can use it to find confidence intervals or carry out
hypothesis tests involving μ1 − μ2. Just as before we can distinguish two different cases: when the population
variances are known and when they are not.
8.2 Inferences for μ1 − μ2 (variances known)
When σ1² and σ2² are known, then we can use the sampling distribution of X̄1 − X̄2, namely
X̄1 − X̄2 ~ N[μ1 − μ2, σ1²/n1 + σ2²/n2],
to define a hypothesis test or confidence interval relating to μ1 − μ2.
8.2.1 Hypothesis test
Suppose that we wish to test the null hypothesis:
H0: μ1 − μ2 = k.
Usually, we wish to test whether both populations have the same mean value; in this case, k = 0.
Using the same reasoning as for one-sample hypothesis testing, we have the following result:
Result: To test between the hypotheses:
H0: μ1 − μ2 = k versus H1: μ1 − μ2 ≠ k
when
a) σ1² and σ2² are known
b) both samples come from a normal distribution
then:
• use the test statistic Z = (X̄1 − X̄2 − k)/√(σ1²/n1 + σ2²/n2);
• reject H0 at the 100α% level if z ≤ −z_{α/2} or z ≥ z_{α/2}.
Note: If the alternative hypothesis is one-sided, we adjust the rejection criteria in the usual way.
 Example:
A consumer magazine is interested in testing the time (in hours) that two types of battery last. The following
data were obtained:
Type A:
2116, 2347, 2215, 2098, 2156, 2108, 2073, 2205, 2271.
Type B:
2067, 2102, 2090, 2017, 1996, 2114, 2088, 2053.
It is known that the standard deviations of the lifetimes of batteries of type A and type B are 90 hours and 45 hours
respectively. Test the hypothesis that batteries of type A and B have the same mean life. (You may
assume that the observations are normally distributed.)
• Our hypotheses here are:
H0: μ1 − μ2 = 0 (i.e. μ1 = μ2) versus H1: μ1 − μ2 ≠ 0 (i.e. μ1 ≠ μ2).
The sample means for the two groups of observations can be shown to be:
x̄1 = 2176.6 and x̄2 = 2065.9.
Substituting the values of the (known) population variances into the formula for the test statistic gives:
z = (x̄1 − x̄2 − k)/√(σ1²/n1 + σ2²/n2) = (2176.6 − 2065.9 − 0)/√(90²/9 + 45²/8) = 3.26.
The critical points should be obtained from a standard normal distribution:
5%: z_{0.025} = 1.96
1%: z_{0.005} = 2.5758
0.1%: z_{0.0005} = 3.2905
We can therefore reject the null hypothesis at the 1% level (and very nearly at the 0.1% level). There is strong
evidence to suggest that the mean lives of the two types of battery are different.
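A sketch of the same calculation in Python (illustrative only, assuming scipy):

```python
# Two-sample z-test with known variances: battery example.
from math import sqrt
from scipy.stats import norm

type_a = [2116, 2347, 2215, 2098, 2156, 2108, 2073, 2205, 2271]
type_b = [2067, 2102, 2090, 2017, 1996, 2114, 2088, 2053]
sigma1, sigma2 = 90.0, 45.0

x1 = sum(type_a) / len(type_a)     # 2176.6
x2 = sum(type_b) / len(type_b)     # 2065.9
se = sqrt(sigma1**2 / len(type_a) + sigma2**2 / len(type_b))
z = (x1 - x2) / se                 # about 3.26
p = 2 * (1 - norm.cdf(abs(z)))     # two-sided p-value, about 0.001

print(f"z = {z:.2f}, p = {p:.4f}")
```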
8.2.2 Confidence interval
Now,
Z = (X̄1 − X̄2 − (μ1 − μ2))/√(σ1²/n1 + σ2²/n2) ~ N[0,1]
which means that
P(−z_{α/2} ≤ (X̄1 − X̄2 − (μ1 − μ2))/√(σ1²/n1 + σ2²/n2) ≤ z_{α/2}) = 1 − α.
So when the population variances are known, just as before we can rearrange this so that we are left with just
μ1 − μ2 in the middle of the inequality. The two outer limits will then be our 100(1 − α)% confidence interval
for μ1 − μ2. Rearranging the above we get:
P(X̄1 − X̄2 − z_{α/2}√(σ1²/n1 + σ2²/n2) ≤ μ1 − μ2 ≤ X̄1 − X̄2 + z_{α/2}√(σ1²/n1 + σ2²/n2)) = 1 − α.
Result: The 100(1 − α)% confidence interval for μ1 − μ2 when
i) σ1² and σ2² are known
ii) the data follow normal distributions
is given by:
X̄1 − X̄2 ± z_{α/2}√(σ1²/n1 + σ2²/n2).
• Example (continued):
Consider the battery example again. Suppose we want to find a 95% confidence interval for μ1 − μ2. This
interval has limits:
x̄1 − x̄2 ± z_{α/2}√(σ1²/n1 + σ2²/n2) = 2176.6 − 2065.9 ± 1.96 × √(90²/9 + 45²/8) = (44.14, 177.26).
Notice that this interval does not contain 0. This is to be expected as we know that the null hypothesis of equal
means can be rejected at the 5% significance level.
The case in which σ1² and σ2² are known is unlikely to occur in practice.
8.3 Inferences for μ1 − μ2 (population variances unknown but sample sizes large)
Recall that when we were making inferences from a single sample when the population variance was unknown, we
had two different approaches, depending on whether n was large or not. Here we have similar results.
If n1 and n2 are large (rule of thumb: n1, n2 > 30) then S1² will be a good estimator of σ1², and S2² will be a good
estimator of σ2², and then
Z = (X̄1 − X̄2 − (μ1 − μ2))/√(S1²/n1 + S2²/n2) ~ N[0,1] (approximately).
We can use this distribution for the statistic Z and directly extend the results for inferences about μ1 − μ2 when
the variances are known.
Result: To test between the hypotheses:
H0: μ1 − μ2 = k versus H1: μ1 − μ2 ≠ k
when
a) σ1² and σ2² are unknown
b) both samples come from a normal distribution
c) n1 and n2 are both large (i.e. > 30)
then:
• use the test statistic Z = (X̄1 − X̄2 − k)/√(S1²/n1 + S2²/n2);
• reject H0 at the 100α% level if z ≤ −z_{α/2} or z ≥ z_{α/2}.
Adjust for 1-tailed tests accordingly.
Result: The 100(1 − α)% confidence interval for μ1 − μ2 when
a) σ1² and σ2² are unknown
b) both samples follow normal distributions
c) n1 and n2 are both large (i.e. > 30)
is given (approximately) by:
X̄1 − X̄2 ± z_{α/2}√(S1²/n1 + S2²/n2).
• Example:
A number of studies have focused on the question of whether children born to women smokers differ
physiologically from children born to non-smokers. The paper “Placental transfer of lead, mercury, cadmium and
carbon monoxide in women” (Environ. Research, 1978, 494 - 503) reported on results from one such
investigation. Blood-lead concentration (µg/l) was measured in new-born children of 109 smokers and 333 non-smokers. The results are given below.

Sample                    Sample size   Sample mean   Sample s.d.
Mothers who smoke         109           8.9           3.3
Mothers who don't smoke   333           8.1           3.5

Is there evidence to suggest that the blood-lead concentrations are different for smokers' babies than for non-smokers' babies?
 Let  1 denote the mean for smokers' babies and  2 denote the mean for non-smokers' babies. Want to test:
H 0 : 1   2  0 versus H1 : 1   2  0.
We have observed
z
8.9  8.1
 2.16.
3.3 2 3.5 2

109
333
If H0 is true, then Z ~ N[0, 1] approximately. Now, z0.025 = 1.96 < 2.16 so we'll reject H0 at the 5% level and
conclude that there is some evidence to suggest that blood-lead concentration is higher for smokers' than nonsmokers' babies.
Notice also that
p-value = P(Z > 2.16 or Z < −2.16) = 0.015 + 0.015 = 0.03.
In addition, the 95% confidence limits for this example are given by:
x̄1 − x̄2 ± z_{α/2}√(S1²/n1 + S2²/n2) = 8.9 − 8.1 ± 1.96 × √(3.3²/109 + 3.5²/333) = (0.075, 1.525).
As this interval is entirely positive, it suggests that the average lead concentration is higher for babies of smoking
mothers than for babies of non-smoking mothers.
• Exercise:
An agricultural scientist believes that plants of a particular species tend to be taller if grown in a greenhouse
rather than outdoors. To test his theory, he performs an experiment. He grows 45 plants from seed in a
greenhouse and 64 plants from seed outside. The heights of these plants were later measured. The results can be
summarised as:

                  Greenhouse   Outdoors
Sample mean       18.6 cm      17.3 cm
Sample variance   4.9 cm²      6.2 cm²
Perform a hypothesis test to see whether the data provide evidence to suggest that the plants grown in a
greenhouse tend to be taller than those grown outside.
8.4 Two-sample t-test
When the sample sizes are small and the population variances are unknown, there is no simple way of making inferences about
μ1 − μ2. There is a solution, however, when it can be assumed that the separate population variances, although
unknown, are equal; that is, σ1² = σ2² = σ², say. Note that we don't just casually assume that the
variances are equal: we need to check that this is a reasonable assumption. We can assess how reasonable such
an assumption is using the F-test (see later).
Given equality of variances,
X̄1 ~ N[μ1, σ²/n1] and X̄2 ~ N[μ2, σ²/n2].
Therefore
X̄1 − X̄2 ~ N[μ1 − μ2, σ²/n1 + σ²/n2]
⇒ (X̄1 − X̄2 − (μ1 − μ2))/(σ√(1/n1 + 1/n2)) ~ N[0, 1].
As in the one-sample case, we now have to replace σ by a suitable estimator S.
8.4.1 Obtaining the pooled sample variance
For each sample we have the sample variance S_i², i = 1, 2, with which we can estimate σ². However, if we can
combine these two estimators in some way, we should be able to get a better estimate of σ² than if we just used
one of the single sample variances. An intuitive estimate to use would be (S1² + S2²)/2. However, if n1 is larger
than n2, then we'd expect S1² to be a better estimator of σ² than S2². Therefore, instead of taking a straight average
of S1² and S2², we'll take a weighted average (taking account of the relative magnitudes of the two sample sizes).
The pooled estimate S² of σ² is therefore defined by:
S² = [(n1 − 1)S1² + (n2 − 1)S2²]/(n1 + n2 − 2).
So, for example, if we have two samples of the same size, then
S² = (S1² + S2²)/2
which is the straight average of the two. On the other hand, if for example n1 > n2, then we'll give more weight to S1².
Note:
The formula S² = [(n1 − 1)S1² + (n2 − 1)S2²]/(n1 + n2 − 2) results in an unbiased estimate of σ², whereas
(n1 S1² + n2 S2²)/(n1 + n2) is a biased estimator for the population variance.
8.4.2 Sampling distributions
Recall that when we have a single sample, X_1, ..., X_n, drawn from a normal distribution with unknown mean and
variance, then
(X̄ − μ)/(S/√n) ~ t_{n−1}.
Here we have a similar set-up. This time we have
X̄1 − X̄2 ~ N[μ1 − μ2, σ²/n1 + σ²/n2]
where σ is unknown. An intuitive statistic on which to base our hypothesis tests and confidence intervals will be
T = (X̄1 − X̄2 − (μ1 − μ2))/(S√(1/n1 + 1/n2)).
We need to find the distribution of T.
We know that (n1 − 1)S1²/σ² ~ χ²_{n1−1} and (n2 − 1)S2²/σ² ~ χ²_{n2−1}, and therefore
(n1 − 1)S1²/σ² + (n2 − 1)S2²/σ² ~ χ²_{n1+n2−2}
(sum of 2 independent chi-squared random variables). But the pooled sample variance S² is:
S² = [(n1 − 1)S1² + (n2 − 1)S2²]/(n1 + n2 − 2)
and so
(n1 − 1)S1²/σ² + (n2 − 1)S2²/σ² = (n1 + n2 − 2)S²/σ² ~ χ²_{n1+n2−2}.
We also know that
(X̄1 − X̄2 − (μ1 − μ2))/(σ√(1/n1 + 1/n2)) ~ N[0, 1].
Recall that if Y ~ N[0, 1] and Z ~ χ²_m are independent, then:
Y/√(Z/m) ~ t_m.
Therefore, with
Z = (n1 + n2 − 2)S²/σ² and Y = (X̄1 − X̄2 − (μ1 − μ2))/(σ√(1/n1 + 1/n2))
we get:
Y/√(Z/(n1 + n2 − 2)) = (X̄1 − X̄2 − (μ1 − μ2))/(S√(1/n1 + 1/n2)).
Therefore, we have
T = (X̄1 − X̄2 − (μ1 − μ2))/(S√(1/n1 + 1/n2)) ~ t_{n1+n2−2}.
We will use T and its distribution to define hypothesis tests and confidence intervals for μ1 − μ2 when we have
small normal samples with unknown variances.
8.4.3 Hypothesis test: 2-sample t-test
Suppose we want to test the hypotheses:
H0: μ1 − μ2 = k versus H1: μ1 − μ2 ≠ k.
Then if H0 is true, μ1 − μ2 = k, so
T = (X̄1 − X̄2 − k)/(S√(1/n1 + 1/n2)) ~ t_{n1+n2−2}.
Therefore (using exactly the same reasoning as before) we'll reject H0 if we observe T in the tail regions of the
t_{n1+n2−2} distribution.
Two-sample t-test:
To test the hypotheses
H0: μ1 − μ2 = k versus H1: μ1 − μ2 ≠ k
when
a) σ1² and σ2² are unknown but assumed equal
b) both samples come from normal distributions,
then
• use the test statistic T = (X̄1 − X̄2 − k)/(S√(1/n1 + 1/n2));
• reject H0 at the 100α% level if |t| ≥ t_{n1+n2−2,α/2}.
Adjust for 1-tailed tests appropriately.
 Example:
Recall the example that we used to introduce this chapter. Here we had daily protein intake measurements
recorded on two sets of individuals:
Below poverty level:
51.4, 49.7, 72.0, 76.7, 65.8, 55.0, 73.7, 62.1, 79.7, 66.2, 75.8, 65.4, 65.5, 62.0, 73.3
Above poverty level:
86.0, 69.0, 59.7, 80.2, 68.6, 78.1, 98.6, 69.8, 87.7, 77.2.
Suppose we wish to see whether these two groups differ in their mean protein intake. Our hypotheses would
then be
H0: μ1 − μ2 = 0 versus H1: μ1 − μ2 ≠ 0
where μ1 and μ2 represent the mean protein intake of those below and above the poverty level respectively.
We first must find the sample mean and s.d. for each sample. These are:

                  Below poverty level   Above poverty level
Sample mean       x̄1 = 66.29            x̄2 = 77.49
Sample s.d.       S1 = 9.17             S2 = 11.34
Sample size       n1 = 15               n2 = 10
The pooled sample variance is therefore given by:
S² = [(n1 − 1)S1² + (n2 − 1)S2²]/(n1 + n2 − 2) = (14 × 9.17² + 9 × 11.34²)/(15 + 10 − 2) = 101.50, so S = 10.07.
The observed value of the test statistic therefore is:
t = (x̄1 − x̄2 − 0)/(S√(1/n1 + 1/n2)) = (66.29 − 77.49)/(10.07 × √(1/15 + 1/10)) = −2.72.
The appropriate critical points are found from a t distribution with 15 + 10 – 2 = 23 degrees of freedom:
5% test: 2.069
1% test: 2.807
0.1% test: 3.768.
We can see that we can reject the null hypothesis at the 5% level. There is some evidence to suggest that the
mean protein intakes in the two groups differ.
Note that the p-value for this test is
p = P(T ≤ −2.72 or T ≥ 2.72) = 2 × P(T ≥ 2.72) = 2 × (1 − 0.994) = 0.012.
Note: In performing the analysis two assumptions have been made: equal population variances and normality.
We will consider techniques that can be used to assess how reasonable these assumptions are in later sections.
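The pooled two-sample t-test is built into scipy; a minimal sketch for the protein data (illustration only):

```python
# Pooled two-sample t-test: protein intake example.
from scipy.stats import ttest_ind

below = [51.4, 49.7, 72.0, 76.7, 65.8, 55.0, 73.7, 62.1,
         79.7, 66.2, 75.8, 65.4, 65.5, 62.0, 73.3]
above = [86.0, 69.0, 59.7, 80.2, 68.6, 78.1, 98.6, 69.8, 87.7, 77.2]

# equal_var=True gives the pooled-variance test described above
t, p = ttest_ind(below, above, equal_var=True)
print(f"t = {t:.2f}, p = {p:.4f}")      # t = -2.72, p close to 0.012
```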
8.4.4 Confidence intervals
Now,
T = (X̄1 − X̄2 − (μ1 − μ2))/(S√(1/n1 + 1/n2)) ~ t_{n1+n2−2}
and so
P(−t_{n1+n2−2,α/2} ≤ (X̄1 − X̄2 − (μ1 − μ2))/(S√(1/n1 + 1/n2)) ≤ t_{n1+n2−2,α/2}) = 1 − α.
Rearranging this so that we only have μ1 − μ2 in the middle of the inequality, we get
P(X̄1 − X̄2 − t_{n1+n2−2,α/2} · S√(1/n1 + 1/n2) ≤ μ1 − μ2 ≤ X̄1 − X̄2 + t_{n1+n2−2,α/2} · S√(1/n1 + 1/n2)) = 1 − α.
Result: A 100(1 − α)% confidence interval for μ1 − μ2 when
a) σ1² and σ2² are unknown but assumed equal
b) both samples come from normal distributions,
is given by
(X̄1 − X̄2) ± t_{n1+n2−2,α/2} · S√(1/n1 + 1/n2).
• Protein intake example (continued):
Suppose that we also require a 95% confidence interval for the difference in mean daily protein intakes between
those below and above the poverty level. This would be given by:
x̄1 − x̄2 ± t_{n1+n2−2,α/2} · S√(1/n1 + 1/n2) = (66.29 − 77.49) ± 2.069 × 10.07 × √(1/15 + 1/10).
Hence, the 95% confidence interval is: (−19.7, −2.7).
 Exercise:
A car hire firm is trying to decide which kind of tyre to use. It has narrowed the choice down to two types, A
and B. Randomly selected samples of tyres of each type were tested to destruction on a machine. The numbers
of hours to failure were:
Tyre A:
Tyre B:
3.82, 3.11, 4.21, 2.64, 4.16, 3.91, 2.44, 4.52.
4.16, 3.02, 3.94, 4.22, 4.15, 4.92, 4.11, 5.45, 3.65.
Test to see whether there appears to be a significant difference between the mean time to failure for tyres of type
A and B. Find also a 99% confidence interval for the difference in population means.
8.5 Inferences about the ratio of two variances
Recall that before we can carry out the 2-sample t-test, we need to make the assumption that σ1² = σ2². In this
section we'll look at inferences about σ1²/σ2². In particular we'll be interested in whether σ1²/σ2² = 1, i.e. whether
σ1² = σ2². Once again we will assume that both samples come from normal distributions.
8.5.1 Testing for equality of variances
Suppose we want to test the hypotheses: H0: σ1² = σ2² versus H1: σ1² ≠ σ2².
Recall from Chapter 3 that if Y and Z are independent random variables with
k1·Y ~ χ²_{k1} and k2·Z ~ χ²_{k2}
then
Y/Z ~ F_{k1,k2}.
We know that
(n1 − 1)S1²/σ1² ~ χ²_{n1−1} and (n2 − 1)S2²/σ2² ~ χ²_{n2−1}.
So if we let
Y = S1²/σ1² and Z = S2²/σ2²
then
Y/Z = (S1² σ2²)/(σ1² S2²) ~ F_{n1−1,n2−1}.
So if H0 is true, then σ1² = σ2² = σ², say, and so
S1²/S2² ~ F_{n1−1,n2−1}.
We will therefore reject H0 if we observe S1²/S2² in the tail ends of the F_{n1−1,n2−1} distribution, i.e. if
S1²/S2² ≤ F_{n1−1,n2−1,1−α/2} or if S1²/S2² ≥ F_{n1−1,n2−1,α/2}.
Recall from Chapter 3 that we cannot find lower percentage points for the F-distribution directly from tables, and that
F_{n1−1,n2−1,1−α/2} = 1/F_{n2−1,n1−1,α/2}.
F-test: To test the hypotheses
H0: σ1² = σ2² versus H1: σ1² ≠ σ2²
given two normally distributed samples:
i) use the test statistic S1²/S2²;
ii) reject H0 at the 100α% level if
S1²/S2² ≥ F_{n1−1,n2−1,α/2} or if S1²/S2² ≤ 1/F_{n2−1,n1−1,α/2}.
Adjust for 1-tailed tests appropriately.
• Example:
Suppose we have two samples (assumed normally distributed), the first with 13 observations and S1² = 16.37,
and the second with 11 observations and S2² = 12.98. We want to test
H0: σ1² = σ2² versus H1: σ1² ≠ σ2².
• The test statistic is:
S1²/S2² = 16.37/12.98 = 1.26.
Suppose that we wish to perform a test at the 10% level of significance. We then would want to compare the test
statistic with the 5% upper and lower percentage points from the F_{12,10} distribution.
Looking up in F tables we find F_{12,10,0.05} = 2.913.
To find the lower percentage point:
F_{12,10,0.95} = 1/F_{10,12,0.05} = 1/2.753 = 0.363.
Our observed test statistic is 1.26 which is neither smaller than 0.363 nor larger than 2.913. Therefore we do not
reject H0 at the 10% level and so we find no evidence to suggest that the population variances differ.
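scipy has no ready-made F-test for two variances, but the test is easy to assemble from the F distribution; a minimal sketch under that assumption:

```python
# F-test for equality of variances from summary statistics.
from scipy.stats import f

s1_sq, n1 = 16.37, 13
s2_sq, n2 = 12.98, 11

ratio = s1_sq / s2_sq                      # 1.26
upper = f.ppf(0.95, n1 - 1, n2 - 1)        # F_{12,10,0.05} = 2.913
lower = 1 / f.ppf(0.95, n2 - 1, n1 - 1)    # 1/F_{10,12,0.05} = 0.363

print(f"ratio = {ratio:.2f}; reject if < {lower:.3f} or > {upper:.3f}")
# 0.363 < 1.26 < 2.913: no evidence of unequal variances at the 10% level.
```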
 Exercise:
Look back to the example at the end of Section 8.4.4 relating to the two types of car tyre. Test to see whether
there appears to be any evidence to suggest that the population variances are different. (Use a 5% level of
significance).
8.5.2 Confidence interval
We know that
(S1²/σ1²)/(S2²/σ2²) = (S1² σ2²)/(σ1² S2²) ~ F_{n1−1,n2−1}
and therefore
P(1/F_{n2−1,n1−1,α/2} ≤ (S1² σ2²)/(σ1² S2²) ≤ F_{n1−1,n2−1,α/2}) = 1 − α.
We want to find a confidence interval for σ1²/σ2² and so we will rearrange this equation until we just have
σ1²/σ2² in the middle. We get:
P((S2²/S1²) × 1/F_{n2−1,n1−1,α/2} ≤ σ2²/σ1² ≤ (S2²/S1²) × F_{n1−1,n2−1,α/2}) = 1 − α
⇒ P((S1²/S2²) × 1/F_{n1−1,n2−1,α/2} ≤ σ1²/σ2² ≤ (S1²/S2²) × F_{n2−1,n1−1,α/2}) = 1 − α.
Result: A 100(1 − α)% confidence interval for σ1²/σ2² when both samples come from normal distributions is
given by:
[ (1/F_{n1−1,n2−1,α/2}) × S1²/S2² , F_{n2−1,n1−1,α/2} × S1²/S2² ].
• Example:
Returning again to the earlier example in which we had two samples with 13 and 11 observations respectively
and S1² = 16.37, S2² = 12.98. The 90% confidence interval is:
[ (1/F_{12,10,0.05}) × 16.37/12.98 , F_{10,12,0.05} × 16.37/12.98 ] = [ 1.26/2.913 , 2.753 × 1.26 ]
so that the 90% confidence interval is given by [0.43, 3.47].
8.6
Assessing Normality
The statistical tests that we have developed in this and the previous chapters have often relied upon the
assumption that the data follow a normal distribution. In this section we look at some techniques which can be
used to assess whether such an assumption appears reasonable.
8.6.1
Graphical methods
Graphical techniques provide a very simple way of gauging whether a set of data looks roughly normally
distributed. For example, a histogram of normally distributed data should be roughly symmetrical and unimodal.
Further, it should show most of the observations near the mean and steadily fewer as we move further away
from the mean.
A probability plot is a slightly more sophisticated plot that is used for assessing normality. Probability plots are
also now widely available in statistical software packages (such as Minitab). To produce a probability plot for a
set of data x_1, ..., x_n (ordered so that x_1 ≤ x_2 ≤ ... ≤ x_n), we plot y_i against x_i (for i = 1, ..., n), where
y_i = Φ⁻¹(z_i),
Φ is the normal cumulative distribution function, and z_i = (i − 0.5)/n.
The points should roughly lie on a straight line if the assumption of normality is appropriate.
Note: Minitab produces its probability plots by plotting z_i against x_i and using a special scale on the vertical
axis. This has the same effect as applying the inverse normal cdf.
Note 2: Other formulas for calculating z_i exist.
Example:
Consider the data introduced at the start of this chapter relating to the daily protein intakes of two groups of
people. We focus here just on those that are below the poverty level:
Below poverty level:
51.4, 49.7, 72.0, 76.7, 65.8, 55.0, 73.7, 62.1, 79.7, 66.2, 75.8, 65.4, 65.5, 62.0, 73.3
The probability plot produced by Minitab for these data is:
[Minitab normal probability plot of protein intake, with ML estimates mean 66.2867 and st. dev. 8.85934; the
vertical axis shows normal percentage points from 1% to 99% and the horizontal axis shows the data from 36 to 96.]
This probability plot also contains a 95% confidence interval (as shown by the broken lines): we would expect
about 95% of the points to fall within these limits if the assumption of normality is valid.
For these data, all the points are contained within the confidence band. The probability plot therefore does not
cast any doubt on the appropriateness of a normal assumption.
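Outside Minitab, a similar plot can be produced in Python; an illustrative sketch, assuming scipy and matplotlib are available (scipy uses a slightly different plotting-position formula from the (i − 0.5)/n one above):

```python
# Normal probability plot for the below-poverty protein intakes.
import matplotlib.pyplot as plt
from scipy.stats import probplot

below = [51.4, 49.7, 72.0, 76.7, 65.8, 55.0, 73.7, 62.1,
         79.7, 66.2, 75.8, 65.4, 65.5, 62.0, 73.3]

probplot(below, dist="norm", plot=plt)   # ordered data vs normal quantiles
plt.title("Normal probability plot of protein intake")
plt.show()
# Points lying close to the fitted line support the normality assumption.
```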
8.6.2 Formal methods for assessing normality
A variety of more formal techniques can be used to assess how well a normal distribution fits a set of data:
• Shapiro-Wilk test;
• Kolmogorov-Smirnov test;
• Anderson-Darling test;
etc. These tests can be performed in Minitab.
8.7 Inferences for the difference between two proportions
Suppose that we have two populations for which the proportion of “successes” for population 1 is π1 and the
proportion of “successes” for population 2 is π2. Suppose that we observe a sample from each population:

Population          1     2
Sample size         n1    n2
Sample proportion   p1    p2

Based upon these samples, we might be interested in drawing inferences about π1 − π2 (for example, to test
whether π1 = π2). To do this, we need to know the sampling distribution of p1 − p2.
Now, we know that for large sample size n_i (n_i π_i ≥ 5 and n_i(1 − π_i) ≥ 5), i = 1, 2,
p_i ~ N[π_i, π_i(1 − π_i)/n_i] (approximately).
So, for large sample sizes:
p1 − p2 ~ N[π1 − π2, π1(1 − π1)/n1 + π2(1 − π2)/n2]
⇒ W = (p1 − p2 − (π1 − π2))/√(π1(1 − π1)/n1 + π2(1 − π2)/n2) ~ N[0, 1].
8.7.1 Hypothesis tests
We shall consider here only the simplest hypothesis test concerning π1 − π2, namely where we wish to test the
following null hypothesis:
H0: π1 − π2 = 0.
Now if H0 is true, then π1 = π2 = π, say, and
W = (p1 − p2)/√(π(1 − π)/n1 + π(1 − π)/n2).
We can't, however, use this as our test statistic because we cannot compute it: H0 says that we have a common
value of π, but it doesn't specify an actual value. So to find a test statistic we first estimate π from the sample
data and then use this estimate in W.
When π1 = π2 = π we get our best estimator for π by making use of both sample proportions, p1 and p2, and
pooling these suitably. Following on from the method we used to find the pooled sample variance, we'll use a
weighted average. The combined estimate of the population proportion is therefore
p = (n1 p1 + n2 p2)/(n1 + n2).
Using this pooled estimate in the test statistic gives:
W = (p1 − p2)/√(p(1 − p)/n1 + p(1 − p)/n2) ~ N[0, 1].
We therefore get the following test:
Result: To test the hypotheses
H0: π1 − π2 = 0 versus H1: π1 − π2 ≠ 0
when n_i is large, i.e. n_i π_i ≥ 5 and n_i(1 − π_i) ≥ 5, i = 1, 2:
i) use the test statistic
W = (p1 − p2)/√(p(1 − p)/n1 + p(1 − p)/n2);
ii) reject H0 at the 100α% level if w ≤ −z_{α/2} or if w ≥ z_{α/2}.
Adjust for 1-tailed tests accordingly.
 Example:
Two drugs are used to treat patients with a certain type of cancer. In order to compare their effectiveness, a
clinical trial was planned. 75 patients were given drug A whilst 60 patients were assigned to drug B. The
number of patients who survived for one year beyond diagnosis in each group was as follows:
Drug A: 49
Drug B: 34.
Test whether both drugs appear to be equally effective.
 Let  1 denote the proportion of people with this type of cancer who would survive for one year beyond
diagnosis if treated with drug A. Similarly for  1 .
We then wish to test:
H 0 :  1   2  0 versus H1 :  1   2  0 .
We have observed:
p1 = 49/75 = 0.6533 and p2 = 34/60 = 0.5667.
If the null hypothesis is true, then the pooled estimate of π, the common population proportion, is:
p = (n1 p1 + n2 p2)/(n1 + n2) = (49 + 34)/(75 + 60) = 0.6148.
The test statistic is:
w = (p1 − p2)/√(p(1 − p)/n1 + p(1 − p)/n2) = (0.6533 − 0.5667)/√(0.6148 × 0.3852/75 + 0.6148 × 0.3852/60) = 1.027.
For a 5% test, we reject H0 when w < -1.96 or w > 1.96. Our test statistic does not lie in the rejection region. We
therefore have no evidence to reject the null hypothesis at this level. It is plausible that both drugs are equally
effective at treating this form of cancer.
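A sketch of the pooled two-proportion test in Python (illustration only, assuming scipy):

```python
# Pooled two-proportion z-test: cancer drug example.
from math import sqrt
from scipy.stats import norm

n1, x1 = 75, 49
n2, x2 = 60, 34

p1, p2 = x1 / n1, x2 / n2               # 0.6533 and 0.5667
p = (x1 + x2) / (n1 + n2)               # pooled estimate, 0.6148

w = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
p_value = 2 * (1 - norm.cdf(abs(w)))    # two-sided, about 0.30

print(f"w = {w:.3f}, p-value = {p_value:.3f}")
```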
• Exercise:
“Predictors of driving while intoxicated among teenagers” (J. of Drug Issues, 1988, 367 - 84) investigated how
common it is for teenagers to drive while intoxicated. The following results were obtained:

                                   Boys   Girls
Number surveyed                    100    100
Number driven while intoxicated    28     17

Use a p-value to decide whether there is sufficient evidence to suggest that the number of girls who have driven
while intoxicated is smaller than the number of boys.
8.7.2 Confidence interval
Now,
W = (p1 − p2 − (π1 − π2))/√(π1(1 − π1)/n1 + π2(1 − π2)/n2) ~ N[0, 1]
and so
P(−z_{α/2} ≤ (p1 − p2 − (π1 − π2))/√(π1(1 − π1)/n1 + π2(1 − π2)/n2) ≤ z_{α/2}) = 1 − α.
We'll find our confidence interval by rearranging this so that we have π1 − π2 in the middle:
P(−z_{α/2}√(π1(1 − π1)/n1 + π2(1 − π2)/n2) ≤ p1 − p2 − (π1 − π2) ≤ z_{α/2}√(π1(1 − π1)/n1 + π2(1 − π2)/n2)) = 1 − α
⇒ P(p1 − p2 − z_{α/2}√(π1(1 − π1)/n1 + π2(1 − π2)/n2) ≤ π1 − π2 ≤ p1 − p2 + z_{α/2}√(π1(1 − π1)/n1 + π2(1 − π2)/n2)) = 1 − α.
But we do not know the values of π1 and π2 which appear in our limits. However, p_i is a good estimator of π_i, i = 1, 2,
and so we can find a confidence interval by substituting p_i for π_i in the limits.
Result: A 100(1 − α)% confidence interval for π1 − π2 when n_i π_i ≥ 5 and n_i(1 − π_i) ≥ 5, i = 1, 2, is given
approximately by:
(p1 − p2) ± z_{α/2}√(p1(1 − p1)/n1 + p2(1 − p2)/n2).
• Example (continued):
Returning to the cancer drug example, a 90% confidence interval for π1 − π2 is given by:
(p1 − p2) ± z_{α/2}√(p1(1 − p1)/n1 + p2(1 − p2)/n2) = (0.6533 − 0.5667) ± 1.6449 × √(0.6533 × 0.3467/75 + 0.5667 × 0.4333/60)
(as z_{0.05} = 1.6449).
The interval is therefore (−0.052, 0.225).
8.8 Paired data
So far, we have considered the case of two independent samples. Sometimes we have two sets of observations
that are made on the same group of individuals. For example, we could have blood pressure measurements that
are recorded on one group of women before and after the birth of their child. Such data are called paired.
Matched pairs are often used in experiments as the resulting data can yield more accurate inferences (by
reducing variability).
Suppose that we have the following paired data:
Sample 1: X_{11}, ..., X_{1i}, ..., X_{1n}
Sample 2: X_{21}, ..., X_{2i}, ..., X_{2n}
where each column (X_{1i}, X_{2i}) forms a pair of observations.
To compare the means of the two populations we look at the differences:
D_i = X_{1i} − X_{2i} for i = 1, ..., n.
Then D_1, ..., D_n are a random sample from N[μ1 − μ2, σ_d²], where σ_d² is some variance which is generally
unknown. Our problem has therefore reduced to a one-sample problem and so, denoting μ1 − μ2 by μ_d, say,
we can use a t-test to test the hypothesis
H0: μ_d = k.
Paired t-test: To test the hypotheses
H0: μ_d = k versus H1: μ_d ≠ k
when we have a sample of matched pairs, we use a one-sample t-test applied to the differences.
Adjust for 1-tailed tests accordingly.
 Example:
Ten athletes ran a 400 m race at sea level and at a later meeting ran another 400 m race at high altitude. Their
times in seconds were as follows:
Athlete         1     2     3     4     5     6     7     8     9     10
Sea level       48.3  47.9  50.2  51.7  46.5  44.9  45.2  47.7  48.4  49.1
High altitude   48.7  49.2  50.1  51.9  48.2  45.8  48.0  47.3  50.2  51.5
Test whether the athletes are performing equally well at sea level and at high altitude.
• The data here are clearly paired (two measurements are recorded on each athlete). We let μ_d denote the
mean difference in times (μ1 − μ2). The hypotheses we wish to test are as follows:
H0: μ_d = 0 versus H1: μ_d ≠ 0.
The differences in times for each athlete are:
−0.4, −1.3, 0.1, −0.2, −1.7, −0.9, −2.8, 0.4, −1.8, −2.4.
The sample mean and variance for these differences are given by:
x̄_d = (1/10)[(−0.4) + (−1.3) + 0.1 + ... + (−2.4)] = −1.1
S_d² = (1/9)[22.6 − (−11)²/10] = 1.1667.
This gives the following value for the test statistic:
t = (x̄_d − 0)/(S_d/√n) = −1.1/√(1.1667/10) = −3.22.
We compare this with critical points from a t distribution with 9 degrees of freedom. As t_{9,0.005} = 3.25, we just
fail to reject the null hypothesis at the 1% level (though we can reject it at the 5% level, since t_{9,0.025} = 2.262).
We have some evidence to suggest that the athletes' performance differs at the different altitudes.
[Note we need to assume here that the differences follow a normal distribution].
To find a confidence interval for the differences, simply use the corresponding one-sample results.
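The paired test is also available directly in scipy; a minimal, illustrative sketch:

```python
# Paired t-test: athletes example.
from scipy.stats import ttest_rel

sea_level     = [48.3, 47.9, 50.2, 51.7, 46.5, 44.9, 45.2, 47.7, 48.4, 49.1]
high_altitude = [48.7, 49.2, 50.1, 51.9, 48.2, 45.8, 48.0, 47.3, 50.2, 51.5]

# equivalent to a one-sample t-test applied to the differences
t, p = ttest_rel(sea_level, high_altitude)
print(f"t = {t:.2f}, p = {p:.4f}")      # t = -3.22, p just above 0.01
```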
 Exercise:
“Effects of alcohol on hypoxia” (J of Amer. Med. Assoc., 1965, 135) examined the relationship between alcohol
intake and the time of useful consciousness during high-altitude flight. Ten men were taken to a simulated
altitude of 25,000 ft and given several tasks to perform. The time (in seconds) at which useful consciousness was
lost, due to lack of oxygen, was recorded. The experiment was repeated 3 days later, after the subjects had drunk
0.5 cc of 100-proof whiskey per pound of body weight. The time of useful consciousness was again recorded.
Does the alcohol intake reduce the average time of useful consciousness?
Time of useful consciousness (seconds):

Subject   No alcohol   Alcohol   Difference
1         261          185       76
2         565          375       190
3         900          310       590
4         630          240       390
5         280          215       65
6         365          420       -55
7         400          405       -5
8         735          205       530
9         430          255       175
10        900          900       0
Chapter 9: Introduction to Non-Parametric Tests
We use sample data to make inferences about the population from which they were drawn. In the one-sample and
two-sample problems covered in the previous chapters, we assumed that the samples come from normal
distributions (or at least that the sample size is so large that the central limit theorem applies). We then make
inferences about the parameters of the normal distribution, i.e. the means and variances.
If it turns out that the distributions are not normal, then our inferences may not be valid. If normality
is not a reasonable assumption, we may decide to use tests that do not assume a specific form for the
population distribution. These are known as nonparametric (or distribution-free) tests.
9.1 The sign test
This may be used for testing hypotheses about the median of a distribution, i.e. the centre of the distribution. In
particular, with matched pairs, we may test that the median of the distribution of differences is zero. In this
context, the sign test is a nonparametric equivalent of the paired t-test.
Procedure
• Calculate the differences D_i = X_i − Y_i.
• Discard all zero differences.
• Count the number of positive differences.
• If the median is zero, then we would expect half of our values to be > 0 and half of them to be < 0. So test the
null hypothesis that p, the population proportion of positive differences, is 0.5.
The test is based on the fact that the number of positive observations in a sample of n non-zero differences,
S say, has a binomial distribution B[n, p]. So if H0 is true, then
S ~ B[n, 0.5].
We can then use this binomial distribution to find critical values for the test or p-values. Note that in this
case you probably won't be able to find critical values which give a significance level of exactly
α, because S is discrete. It is therefore usually simpler to just find the p-value here.

Alternatively, if the sample size is large, we may use a normal approximation so that the number of positive
observations, S, is then distributed:
n n
S ~ N , .
2 4
Our test statistic is then
s  12 n 2s  n
z

~ N[0, 1]
n
n
4
approximately if H0 is true.
If the alternative hypothesis is that the median difference is different from 0, then our decision rule is to reject H0 if
we observe:
z ≤ −z_{α/2} or z ≥ z_{α/2}.
Note: If the alternative hypothesis is one-sided, then we would adjust this critical region as appropriate.
 Example:
To determine whether two tests are equally effective in evaluating job applicants for a certain position, the test
questions are randomly intermixed and a combined test is given to each of 14 applicants. The answers to the two
sets of test questions are then separated and the scores below were obtained. Using the sign test, test the
hypothesis that the two tests produce the same score distributions.
Test 1:  78  84  65  98  56  28  70  66  55  87  90  61  70  83
Test 2:  74  81  73  98  60  13  58  74  59  88  93  66  88  90
 We wish to test:
H0: median for the 2 tests is the same vs H1: medians are different (i.e. 2-tailed test).
We first need to count how many of the non-zero differences (Test 1 − Test 2) are positive and negative.

Test 1:             78  84  65  98  56  28  70  66  55  87  90  61  70  83
Test 2:             74  81  73  98  60  13  58  74  59  88  93  66  88  90
Sign of (T1 − T2):  +   +   −   0   −   +   +   −   −   −   −   −   −   −
Of the 13 non-zero differences, 4 are positive and 9 negative, so s = 4.
If H0 is true, S ~ B[13, 0.5]. The p-value when s = 4 for a test against a two-sided alternative can be found by
calculating 2 × P(S ≤ 4) assuming the null hypothesis to be true (we need to multiply by 2 because we have a
two-tailed test). So the p-value is:
p = 2 × Σ_{i=0}^{4} ¹³Cᵢ × 0.5^i × 0.5^{13−i} = 0.2668.
(Note that this can easily be found from binomial distribution tables in Lindley and Scott.) As the p-value
is > 0.1, we would not reject H0 even at the 10% significance level.
Alternatively, we can use a normal approximation and use the corresponding test statistic or a p-value. The test
statistic is
z = (2s − n)/√n = (8 − 13)/√13 = −1.387.
For a 5% test we reject H0 if
z ≤ −1.96 or z ≥ 1.96.
We therefore do not reject H0 at the 5% level and conclude that the two tests produce the same score distribution.
To calculate the p-value, the normal approximation gives
S ~ N[13/2, 13/4] = N[6.5, 3.25].
The p-value is then
p = 2 × P(S ≤ 4.5) (making use of a continuity correction)
⇒ p = 2 × P(Z ≤ (4.5 − 6.5)/√3.25) = 2 × P(Z ≤ −1.11) = 0.267.
Notice that, as expected, the approximate p-value is almost exactly the same as the exact one.
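Both versions of the sign test calculation can be checked in Python; a minimal sketch, assuming scipy:

```python
# Sign test: exact binomial p-value and normal approximation.
from math import sqrt
from scipy.stats import binom, norm

n, s = 13, 4                           # 13 non-zero differences, 4 positive

p_exact = 2 * binom.cdf(s, n, 0.5)     # 2 * P(S <= 4) = 0.2668

# normal approximation with continuity correction: S ~ N[n/2, n/4]
z = (s + 0.5 - n / 2) / sqrt(n / 4)
p_approx = 2 * norm.cdf(z)             # about 0.267

print(f"exact p = {p_exact:.4f}, approximate p = {p_approx:.4f}")
```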
9.2 Mann-Whitney (or Wilcoxon Rank Sum) Test
This is the nonparametric equivalent of the 2-sample t-test. It is used to compare two samples of data and
doesn't make the assumption of either normally distributed observations or equal population variances.
We explain the procedure for performing this test in relation to the following example:
• Example:
There is interest in finding out whether stroke patients make a more successful recovery if they receive treatment
within 24 hours of the stroke occurring. The data below are the results of a mobility test and are scores on a
0-100 scale. Patients with low scores are unable to do a lot of things for themselves. The test was performed one
week after the stroke occurred.
Treated within 24 hours: 63, 39, 77, 80, 59, 41, 55, 71, 84, 75.
Treated after 24 hours: 44, 31, 58, 60, 47, 51, 68, 52, 34, 49, 26, 50.
We are interested in the hypotheses:
H0: no difference in scores between the two groups;
H1: the scores from the two groups differ.
• Step 1:
Combine the two samples of data and rank the observations:

Observation   26  31  34  39  41  44  47  49  50  51  52
Rank           1   2   3   4   5   6   7   8   9  10  11
Group          2   2   2   1   1   2   2   2   2   2   2

Observation   55  58  59  60  63  68  71  75  77  80  84
Rank          12  13  14  15  16  17  18  19  20  21  22
Group          1   2   1   2   1   2   1   1   1   1   1

Here, Group 1 represents those that received prompt treatment.
• Step 2:
Calculate the sum of ranks for each group.
Group 1: T1 = sum of ranks = 4 + 5 + 12 + ... + 21 + 22 = 151;
Group 2: T2 = sum of ranks = 1 + 2 + 3 + 6 + ... + 15 + 17 = 102.
Note that T1 + T2 = 253 = 0.5 × 22 × 23 (i.e. the sum of the numbers 1, 2, 3, ..., 22).
• Step 3:
Calculate the Mann-Whitney U statistic in the following way:
U = min(U1, U2)
where
U1 = T1 − 0.5 n1(n1 + 1)
U2 = T2 − 0.5 n2(n2 + 1)
and n1 and n2 are the number of observations in Group 1 and Group 2 respectively.
Here,
U1 = 151 − 0.5 × 10 × 11 = 96 and U2 = 102 − 0.5 × 12 × 13 = 24,
so U = 24.
• Step 4:
Compare the value of U with statistical tables and draw conclusions.
Here we have n1 = 10 and n2 = 12 and our test is two-sided. From tables, the critical values for various test sizes are:

Size of test     5%   1%
Critical value   29   21

We reject the null hypothesis if our value of U is smaller than the critical value. We can see that we
can reject the null hypothesis at the 5% level (but not at the 1% level).
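For comparison, recent versions of scipy (1.7 or later) implement this test directly; an illustrative sketch:

```python
# Mann-Whitney U test: stroke treatment example.
from scipy.stats import mannwhitneyu

treated_early = [63, 39, 77, 80, 59, 41, 55, 71, 84, 75]
treated_late  = [44, 31, 58, 60, 47, 51, 68, 52, 34, 49, 26, 50]

res = mannwhitneyu(treated_early, treated_late,
                   alternative="two-sided", method="exact")
print(f"U = {res.statistic}, p = {res.pvalue:.4f}")
# scipy reports U1 = 96; min(U1, U2) = 24 is the value compared with tables.
```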
Notes:
1) The critical values can be found from statistical tables if the two sample sizes are fairly small. If the two
sample sizes are large (rule of thumb: both greater than 10) then the distributions of T1 and T2 can be
taken as normal with
E[T1] = (1/2) n1(n1 + n2 + 1); E[T2] = (1/2) n2(n1 + n2 + 1);
Var[T1] = Var[T2] = (1/12) n1 n2 (n1 + n2 + 1).
2) When ties are involved, it is usual to replace the ties with the average rank of all observations involved
in the tie. For example, if the observations are
12, 14, 14, 17, 19
then the corresponding ranks would be
1, 2.5, 2.5, 4, 5.
3) Note that U1 can be defined as the total number of times each observation from sample 1 comes before
each observation from sample 2.
9.3 Goodness-of-fit tests
In this section we'll look at how we can check (or test) whether a given distribution is plausible. This is called
goodness-of-fit testing.
• Example:
The numbers of thunderstorms reported in one summer month by 100 meteorological stations were as follows:

Number      0   1   2   3   4   5
Frequency  22  37  20  13   6   2

If thunderstorms occur at random (in a Poisson process), we would expect the number observed in a month to
have a Poisson distribution. Therefore the question of interest which we'd like to test would be: “Does the
Poisson distribution fit the data?” and so we'd want to test:
H0: data follow a Poisson distribution versus H1: data do not follow a Poisson distribution.
One of the problems here is that we are trying to test whether the data follow any Poisson distribution, as opposed to
testing whether the data follow a specific Poisson distribution, for example one with mean 2, say. Before we
look at how we might test the hypotheses above, we'll first look at how we might test a distribution which is fully
specified.
9.3.1
Goodness of fit tests for fully specified null hypotheses
• Example:
Two dice are thrown 180 times and the number of sixes, X, which occur is counted. The results are displayed in
the table below.

X           0    1   2   Total
Frequency  105  70   5   180

Given these data, is there evidence to suggest that the dice are loaded?
• We want to test:
H0: dice not loaded versus H1: dice loaded.
If they are not loaded, then the probability of throwing a six with each die will be 1/6. On the other hand, if the
dice are loaded, then the probability of obtaining a six will not be 1/6. We therefore want to test:
H0: X ~ B[2, 1/6] versus H1: X is not B[2, 1/6].
To do this, we'll calculate how many times we would expect to observe X = 0, 1, 2 and compare these expected
values with the number of times we actually did observe these values. If the expected frequencies are close to the
observed frequencies, then this would suggest that H0 might be true and so we'll accept H0. On the other hand, if
the observed frequencies are very different to those which would be expected if H0 were true, then this would
cast doubt as to whether H0 were true and so we'd reject H0.
Our first task then is to calculate the expected frequencies. We know that under the null hypothesis:
P(X = 0) = (5/6)² = 25/36
P(X = 1) = 2 × (5/6) × (1/6) = 10/36
P(X = 2) = (1/6)² = 1/36.
Since the dice are thrown 180 times, our expected frequencies are as follows:

x                     0                     1                    2
Expected frequency    (25/36) × 180 = 125   (10/36) × 180 = 50   (1/36) × 180 = 5
Observed frequency    105                   70                   5
Does this provide evidence on which we should reject H0?
Some general theory
Let us first consider a general problem in which we have several categories for which we have observed
frequencies. Suppose that we have calculated expected frequencies for these categories from our distribution
under H0 and we have compiled everything into a table:

Category     1         2         ...   i         ...
Observed     O1        O2        ...   Oi        ...
Expected     E1        E2        ...   Ei        ...
Difference   O1 − E1   O2 − E2   ...   Oi − Ei   ...
The smaller the differences are, the more plausible H0 is. We use a test statistic which measures how large these
discrepancies are.
Define
C = Σ (Oi − Ei)²/Ei.
As long as H0 is true, then, regardless of the distribution being fitted,
C ~ χ²_ν.
The following general rule can be used to find the appropriate number of degrees of freedom for this chi-squared
distribution:
Degrees of freedom ν = number of categories − number of restrictions.
What are the restrictions? Well, we always ensure that Σ Oi = Σ Ei: this is one restriction that always applies.
Note: We shall see later that when a parameter is unknown, we match the distribution to the data by estimating
parameters. We then get one restriction per parameter.
So we will reject H0 if our observed value of the statistic C lies above the upper α% point of this chi-squared
distribution, i.e. this would indicate that the discrepancies between the Oi and Ei were larger than expected if H0
were true.
Note that we always use a one-tailed test here. Further, if C = 0, the observed and expected frequencies are
identical, and so we then have a perfect fit.
• Example (continued):
Returning to the dice example we have:

x     O    E    O − E
0    105  125    −20
1     70   50     20
2      5    5      0

So
C = (105 − 125)²/125 + (70 − 50)²/50 + (5 − 5)²/5 = 11.2.
If H0 is true, then this test statistic should have a chi-squared distribution with 3 − 1 = 2 degrees of freedom. We
can therefore reject H0 at the 1% level if we observe C ≥ χ²_{2,0.01} = 9.21. We therefore reject H0 at the 1% level
and conclude that there is strong evidence to suggest that the dice are not fair.
9.3.2 Goodness of fit tests for more general H0
In the dice example, the distribution B[2, 1/6] was specified precisely. In many cases, we just want to test the
hypothesis that the data come from a general family of distributions.
• Example:
Consider the earlier storms example. Here we wanted to test the hypotheses:
H0: data follow a Poisson distribution versus H1: data do not follow a Poisson distribution.
So we're interested in testing whether the probability function
P(X = x) = e^{−λ} λ^x / x!
fits the data, for some value of λ.
We don't have a specific distribution under H0 this time with which to calculate our expected frequencies. We
therefore need to identify the Poisson distribution that is likely to fit best, and then we'll use this to find our
expected frequencies.
Now, the Poisson distribution which is likely to fit best will be the one whose mean λ is the sample mean x̄.
So we will calculate the frequencies which we'd expect to observe if the data followed a Poisson distribution
with mean x̄, and then we will see how well our expected frequencies match our observed frequencies.
For the storm data:
150
 1.5
100
and we will use this value to calculate our expected frequencies:
P(X = 0) = e^{−1.5} × 1.5⁰/0! = 0.2231.
Therefore, out of a sample of 100 observations we would expect to observe X = 0 on
100 × 0.2231 = 22.31 occasions.
The other values are found similarly. We therefore end up with the following table:
x          0        1        2        3        4       5 or more
O          22       37       20       13       6       2
P(X = x)   0.2231   0.3347   0.2510   0.1255   0.0471  0.0186
E          22.31    33.47    25.10    12.55    4.71    1.86
O − E      −0.31    3.53     −5.10    0.45     1.29    0.14
Then
C = Σ (O − E)²/E = (−0.31)²/22.31 + 3.53²/33.47 + … + 0.14²/1.86 = 1.793.
If the null hypothesis is true, we would expect C to have a chi-squared distribution. As before, the degrees of freedom are given by:
Degrees of freedom = number of categories − number of restrictions.
Here, we have two restrictions (the totals of the observed and expected frequencies must agree, and we are estimating the mean λ from the data).
The general rule is:
Degrees of freedom = number of categories − 1 − number of parameters estimated.
In our example, there are 6 − 1 − 1 = 4 degrees of freedom. We would therefore be able to reject H0 at the 10% significance level if we observed C > χ²_{4,0.1} = 7.779. Our observation of 1.793 is nowhere near the critical region and so we conclude that the Poisson distribution appears to fit the data.
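A sketch of the same calculation in Python (assuming NumPy/SciPy), using the sample mean 1.5 found above as the estimate of λ; passing ddof=1 to scipy.stats.chisquare removes one extra degree of freedom for the estimated parameter:

```python
import numpy as np
from scipy import stats

obs = np.array([22, 37, 20, 13, 6, 2])        # counts for x = 0, 1, 2, 3, 4, 5+
lam = 1.5                                     # sample mean, estimating the Poisson mean

probs = stats.poisson.pmf(np.arange(5), lam)  # P(X = 0), ..., P(X = 4)
probs = np.append(probs, 1 - probs.sum())     # P(X >= 5) by subtraction
expected = 100 * probs

# ddof=1 gives df = 6 - 1 - 1 = 4, reflecting the estimated mean
C, p_value = stats.chisquare(obs, expected, ddof=1)
print(C, p_value)                             # C = 1.79, p = 0.77: no evidence against H0
```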
9.3.3  General format of test
A goodness of fit test considers the hypotheses:
H0: data follow some distribution
H1: data do not follow that distribution.
 Step 1: Specify a specific distribution for H0. Substitute in estimated parameters if they aren't already specified.
 Step 2: Calculate the expected frequencies for each category assuming the distribution under H0 is true.
 Step 3: Use the test statistic C = Σ_i (O_i − E_i)² / E_i.
 Step 4: Reject H0 if C > χ²_{ν,α}, where the degrees of freedom, ν, are found according to the above rule.
9.3.4  Combining categories
The result that
C = Σ_i (O_i − E_i)² / E_i ~ χ²_ν
is an asymptotic result. Mathematically it depends on the expected frequencies E_i being large.
General rules exist for combining categories:
 Old rule of thumb: ensure all E's are > 5.
 Modern rule of thumb: ensure all E's are > 1 and almost all are > 5.
 If the rule is contravened: combine adjacent categories until all E's are acceptable.
9.3.5  Fitting a geometric distribution
Reminder:
A geometric distribution can be used to model situations where a count is made of the number of trials
performed until a success occurs. The conditions that give rise to the geometric distribution are:
 There is a sequence of (Bernoulli) trials;
 Only two outcomes, success and failure, are possible at each trial;
 The trials are independent;
 There is a constant probability p of success at each trial;
 The variable is the number of trials taken for the first success to appear.
If X has a geometric distribution, then
P(X = x) = p(1 − p)^{x−1} for x = 1, 2, 3, …
Note: The expected value of X is E[X] = 1/p.
 Example:
An infertility clinic records the number of treatment sessions (x) required by 100 patients until pregnancy results:
x                     1    2    3    4
Observed frequency    57   24   10   9
a) Test whether a geometric distribution with p = 0.4 provides an adequate fit to these data.
b) Test whether the data can be modelled well by any geometric distribution.
a) The hypotheses to be tested are:
Null: the data follow a Ge(0.4) distribution;
Alternative: the data are not Ge(0.4).
With p = 0.4, the probabilities are:
P(X = 1) = 0.4
P(X = 2) = 0.4 × 0.6 = 0.24
P(X = 3) = 0.4 × 0.6² = 0.144
P(X ≥ 4) = 1 − 0.4 − 0.24 − 0.144 = 0.216.
Expected frequencies are found by multiplying by the total frequency (i.e. 100):
x                     1    2    3      4
Observed frequency    57   24   10     9
Expected frequency    40   24   14.4   21.6
The test statistic is:
C = Σ_i (O_i − E_i)²/E_i = (57 − 40)²/40 + (24 − 24)²/24 + (10 − 14.4)²/14.4 + (9 − 21.6)²/21.6 = 15.92.
From tables, the 1% point of χ²_3 is 11.34 and the 0.1% point is 16.27. So we can reject the null hypothesis at the 1% level (but not at the 0.1% level). Consequently, there is strong evidence that the data are not Ge(0.4).
b) The hypotheses to be tested now are:
Null: the data follow a geometric distribution;
Alternative: the data are not geometric.
The mean of the data is
x̄ = ((1 × 57) + (2 × 24) + (3 × 10) + (4 × 9))/100 = 1.71.
So a good estimate of p would be 1/x̄ = 1/1.71 = 0.585.
The expected probabilities under the null hypothesis are then:
P(X = 1) = 0.585
P(X = 2) = 0.585 × 0.415 = 0.243
P(X = 3) = 0.585 × 0.415² = 0.101
P(X ≥ 4) = 0.071 (by subtraction).
The table of observed and expected frequencies is:
x                     1      2      3      4
Observed frequency    57     24     10     9
Expected frequency    58.5   24.3   10.1   7.1
Therefore, the test statistic is:
C = Σ_i (O_i − E_i)²/E_i = (57 − 58.5)²/58.5 + (24 − 24.3)²/24.3 + (10 − 10.1)²/10.1 + (9 − 7.1)²/7.1 = 0.55.
If H0 is true, C should be from a chi-squared distribution with 4 − 1 − 1 = 2 degrees of freedom. As the 5% point for this distribution is 5.991, we are unable to reject the null hypothesis at the 5% level. There is therefore no evidence to suggest that a geometric model is unsuitable.
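Part (b) can be scripted in the same way; a sketch assuming NumPy/SciPy, in which the final category is treated as "4 or more", as in the hand calculation (which used rounded probabilities, so the statistic differs slightly):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4])
obs = np.array([57, 24, 10, 9])
p_hat = 1 / np.average(x, weights=obs)        # 1 / 1.71 = 0.585

probs = p_hat * (1 - p_hat) ** (x[:3] - 1)    # P(X = 1), P(X = 2), P(X = 3)
probs = np.append(probs, 1 - probs.sum())     # P(X >= 4) by subtraction
expected = 100 * probs

C = ((obs - expected) ** 2 / expected).sum()  # about 0.51 (0.55 with the rounded E's)
df = 4 - 1 - 1                                # one parameter (p) estimated
print(C, stats.chi2.sf(C, df))                # p = 0.77, so do not reject H0
```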
9.4  An additional example
The marital-status distribution of the US adult population is given by:
Marital status    Single   Married   Widowed   Divorced
Percentage        21.5     63.9      7.7       6.9
A random sample of 750 US 25-29 year old males yielded the following frequencies:
Marital status    Single   Married   Widowed   Divorced
Frequency         289      408       0         53
Does it appear that the marital-status distribution of all 25-29 year old US males is different from that of the US
adult population as a whole?
Solution:
Firstly we need to identify our hypothesised distribution. We want to test whether the marital-status distribution of all 25-29 year old US males is different from that of the US adult population as a whole, i.e. we want to test the null hypothesis that the distribution for 25-29 year olds is:
Marital status    Single   Married   Widowed   Divorced
Probability       0.215    0.639     0.077     0.069
against the alternative that the distribution for 25-29 year olds is different to this.
The next stage is to calculate the expected frequencies assuming that H0 is true. Of the 750 males sampled, we'd expect to observe 750 × 0.215 = 161.25 of them to be single.
We calculate the other expected frequencies similarly to get:
Marital status    Observed   Expected
Single            289        161.25
Married           408        479.25
Widowed           0          57.75
Divorced          53         51.75
We can now calculate the test statistic:
(289  161.25) 2 (408  479.25) 2 (0  57.75) 2 (53  51.75) 2



161.25
479.25
57.5
51.75
 101.2  10.59  57.75  0.03  169.57
C
We will reject H0 at the 0.1% level if we observe C > χ²_{3,0.001} = 16.27. It is clear that we should reject H0 at this
level and conclude that the marital-status distribution for US 25-29 year old males is different from the
US adult population as a whole.
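Since the hypothesised probabilities are fully specified (no parameters estimated), scipy.stats.chisquare with its default df = 4 − 1 = 3 reproduces this test directly; a minimal sketch, assuming SciPy is available:

```python
import numpy as np
from scipy import stats

observed = np.array([289, 408, 0, 53])
expected = 750 * np.array([0.215, 0.639, 0.077, 0.069])

C, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(C, p_value)   # C = 169.6 with p far below 0.001: reject H0
```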
Chapter 10: Association Between Variables
Consider two random variables. These may be related to each other; for example, heights and weights of people are related. This chapter will look at ways of measuring the strength of the relationship between two random variables (i.e. the strength of association or correlation).
10.1  Product-moment Correlation Coefficient
When considering the nature of the relationship between two variables we might be interested in the following questions:
 Is there a negative or positive relationship (or some other form of relationship)?
 Is the relationship linear?
 Example:
Consider the following scatterplots showing (hypothetical) data from 20 school children:
[Diagram (a): mark in maths exam against mark in mock maths paper. Diagram (b): mark in maths exam against mark in English exam. Diagram (c): mark in maths exam against height (in cm).]
We can see that:
 In Diagram (a): the points are not scattered far from a straight line; there is a strong positive relationship between the mark in the maths exam and the mark in the mock paper;
 In Diagram (c): the points are very scattered; there appears to be no relationship between height and maths mark;
 In Diagram (b): the relationship comes somewhere between (a) and (c), i.e. there is a weak positive relationship between English and maths marks.
The product-moment correlation coefficient, r, (also known as Pearson's correlation coefficient) gives a
summary measure of the strength of (linear) association between two random variables. r can take values in the
range [-1, 1]. If r is positive, this indicates a positive relationship between the variables. If r is negative, it
indicates a negative relationship. The further r is from 0, the stronger the association between the two random
variables.
 r = +1 ⇔ exact straight line relationship with positive slope.
 r = −1 ⇔ exact straight line relationship with negative slope.
Note: The value of r does not imply anything about the slope of the straight line fit; it just says something about the quality of the fit.
Definition
The formula for calculating the product-moment correlation coefficient r from bivariate data (x_1, y_1), …, (x_n, y_n) is
r = S_xy / (S_x S_y)
where:
 S_x is the sample standard deviation of x_1, x_2, …, x_n;
 S_y is the sample standard deviation of y_1, y_2, …, y_n;
 S_xy is the sample covariance between the two variables, calculated using
S_xy = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) = (1/(n − 1)) [ Σ x_i y_i − (1/n)(Σ x_i)(Σ y_i) ].
 Example:
Blood pressure was measured (in mm Hg) for 15 patients who had moderately raised blood pressure.
Patient number             1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
Systolic blood pressure    210  169  187  160  167  176  185  206  173  146  174  201  198  148  154
Diastolic blood pressure   130  122  124  104  112  101  121  124  115  102  98   119  106  107  100
Let X denote the systolic and Y the diastolic blood pressure. Then, we have:
n = 15, Σ x_i = 2654, Σ y_i = 1685, Σ x_i² = 475502, Σ y_i² = 190817.
Also,
Σ x_i y_i = (210 × 130) + … + (154 × 100) = 300137.
Therefore:
S_x² = (1/14)(475502 − 2654²/15) = 422.92
S_y² = (1/14)(190817 − 1685²/15) = 109.67
S_xy = (1/14)(300137 − (2654 × 1685)/15) = 143.17
So, the correlation coefficient is:
r = S_xy / (S_x S_y) = 143.17 / √(422.92 × 109.67) = 0.665.
As r is positive, the two measures of blood pressure are positively correlated. The value of r is not particularly
close to either 0 or 1, implying that the strength of association between the two variables is moderate. A plot of
the two variables is given below.
[Scatterplot of diastolic BP (mm Hg) against systolic BP (mm Hg).]
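For data sets of this size the sums are tedious by hand; the correlation can be checked with NumPy (a sketch, with the arrays simply repeating the table above):

```python
import numpy as np

systolic = np.array([210, 169, 187, 160, 167, 176, 185, 206,
                     173, 146, 174, 201, 198, 148, 154])
diastolic = np.array([130, 122, 124, 104, 112, 101, 121, 124,
                      115, 102, 98, 119, 106, 107, 100])

r = np.corrcoef(systolic, diastolic)[0, 1]   # product-moment correlation
print(r)                                     # 0.665
```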
Note: It is important to remember that the product-moment correlation coefficient is a measure of linear
association and can give misleading results for data which don't display a linear relationship.
 Example:
Suppose we have two variables with the following scatter plot:
[Scatterplot: a cloud of points plus one outlying point far from the rest.]
This has r = 0.81, indicating quite a strong relationship. But if you take away the outlying data point, then there is in fact no linear relationship.
So when interpreting r, it's always a good idea to have a look at a plot of the data.
The product-moment correlation coefficient is most useful when a plot of the data has an oval pattern:
[Scatterplot: an oval cloud of points.]
It is less appropriate when the data are curvilinear:
[Scatterplot: a curved, non-linear relationship.]
or when some data points lie away from most of the data:
[Scatterplot: a main cluster of points with a few outlying points.]
10.2  The Spearman rank correlation coefficient
The Spearman rank correlation coefficient, rS , is more general than the product-moment correlation coefficient
as it measures the strength of the monotonic (i.e. always moving in a consistent direction) association. For
example, suppose that we have:
[Two scatterplots: (a) an approximately linear increasing relationship; (b) an increasing but clearly non-linear relationship.]
The product-moment correlation would have problems with (b), but Spearman's correlation can handle both.
To find r_S:
1. Find the ranks of the X's and Y's separately. If two values are tied, then give both values the same average rank.
2. Calculate the product-moment correlation coefficient using the ranks.
r_S = +1 ⇔ perfect monotonic increasing relationship:
[Two scatterplots, each with r_S = +1: one an exact increasing straight line (so r = 1 too), the other a curved increasing relationship (r < 1).]
r_S = −1 ⇔ perfect monotonic decreasing relationship:
[Two scatterplots, each with r_S = −1: one an exact decreasing straight line (so r = −1 too), the other a curved decreasing relationship (r > −1).]
 Example:
A study used a new method of measuring body composition and the age and body fat percentage of 14 women
were obtained.
Age (years)    23     39     41     49     50     53     53     54     56     57     58     58     60     61
Body fat (%)   27.9   31.4   25.9   25.2   31.1   34.7   42.0   29.1   32.5   30.3   33.0   33.8   41.1   34.5
A scatter plot of the data is given below:
[Scatterplot of body fat (%) against age (years).]
To calculate Spearman's rank coefficient, we first need to rank the data for each of the variables.
Age (years)    23     39     41     49     50     53     53     54     56     57     58     58     60     61
Rank           1      2      3      4      5      6.5    6.5    8      9      10     11.5   11.5   13     14
Body fat (%)   27.9   31.4   25.9   25.2   31.1   34.7   42.0   29.1   32.5   30.3   33.0   33.8   41.1   34.5
Rank           3      7      2      1      6      12     14     4      8      5      9      10     13     11
Now we need to find the product-moment correlation using the ranks.
n = 14, Σ x_i = 105, Σ y_i = 105, Σ x_i² = 1014, Σ y_i² = 1015, Σ x_i y_i = 921.5.
Then,
S_x² = (1/13)(1014 − 105²/14) = 17.423
S_y² = (1/13)(1015 − 105²/14) = 17.5
S_xy = (1/13)(921.5 − (105 × 105)/14) = 10.308.
So,
r_S = 10.308 / √(17.423 × 17.5) = 0.590.
So the two variables are moderately positively correlated.
Note that in this example using the product-moment correlation would not have been a problem and we could
have simply used that.
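The ranking, including the average-rank treatment of ties, is automated by scipy.stats.spearmanr; a sketch for this example, assuming SciPy is available:

```python
import numpy as np
from scipy import stats

age = np.array([23, 39, 41, 49, 50, 53, 53, 54, 56, 57, 58, 58, 60, 61])
fat = np.array([27.9, 31.4, 25.9, 25.2, 31.1, 34.7, 42.0,
                29.1, 32.5, 30.3, 33.0, 33.8, 41.1, 34.5])

r_s, p_value = stats.spearmanr(age, fat)   # ties are given average ranks
print(r_s)                                 # 0.590
```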
10.3  Testing correlations
The product-moment and Spearman correlation coefficients measure the correlation between two variables for our sample, i.e. they are sample statistics. However, we are often interested in making inferences about the correlations in the population. In particular, we are often interested in testing whether there is really no association between the variables in the population.
Consider testing the hypotheses:
H0: No association between the variables
versus
H1: Association between the variables.
It can be shown that when there is no association, i.e. when H0 is true, the distribution of the product-moment correlation R is such that:
R √(n − 2) / √(1 − R²) ~ t_{n−2}.
So we will use the test statistic:
R √(n − 2) / √(1 − R²)
and reject H0 at the 100α% level if we observe:
r √(n − 2) / √(1 − r²) > t_{n−2, α/2}
or
r √(n − 2) / √(1 − r²) < −t_{n−2, α/2}.
 Example (continued):
For the blood pressure example, we have the test statistic:
r √(n − 2) / √(1 − r²) = 0.665 × √13 / √(1 − 0.665²) = 3.21.
We want to compare this with t_{13, 0.025} = 2.16 < 3.21, and so we reject H0 at the 5% level and conclude that there is some evidence of a positive association between the variables.
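A sketch of the same test in Python; the two-tailed p-value from the t distribution should match what scipy.stats.pearsonr reports when given the raw data:

```python
import numpy as np
from scipy import stats

r, n = 0.665, 15
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)   # observed test statistic, 3.21
p_value = 2 * stats.t.sf(abs(t), df=n - 2)     # two-tailed p-value
print(t, p_value)                              # p = 0.007, so reject H0 at the 5% level
```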
 Exercise:
The data below refer to a sample of 12 children suffering from cystic fibrosis. The two variables are a measure of resistance to breathing, x, and height, y (in cm).
x   13.8   8.2   9.0   12.5   21.1   6.8   17.0   11.0   8.2   12.7   8.5   10.0
y   89     93    92    101    95     89    97     97     111   102    103   108
Calculate the (product-moment) correlation coefficient connecting the two variables and test to see whether it is
significantly different from 0.
When testing the above hypotheses using Spearman's rank correlation, we use exactly the same idea and use the test statistic:
R_S √(n − 2) / √(1 − R_S²).
However, the distribution of the test statistic when H0 is true is very complicated and it is best to carry out the test on a computer.
10.4  Contingency Tables
Suppose that we have a random sample and we categorise the sample according to which of two characteristics
each sample member has. We will then use these data to investigate whether the two characteristics are
associated.
 Example:
A national survey was conducted in the USA to obtain information about alcohol consumption and marital status.
1772 US adults were selected randomly and the results are displayed in the table below:
                  Drinks per month
Marital status    Abstain   1-60   Over 60   Total
Single            67        213    74        354
Married           411       633    129       1173
Widowed           85        51     7         143
Divorced          27        60     15        102
Total             590       957    225       1772
This is called a two-way contingency table ("two-way" because we are categorising in terms of two variables). The question of interest here is: "Is there an association between the amount a person drinks and their marital status?", i.e. are the variables marital status and alcohol consumption statistically independent? We will define a test to answer this question.
10.4.1 Chi-squared test of independence
Consider a two-way contingency table. The chi-squared test of independence tests whether there is any
association between the two variables in the table. To introduce the test, we will consider the specific example
above.
 Example:
Consider again the contingency table of marital status versus alcohol consumption. We want to investigate
whether the two variables are associated and so we want to test:
H0: Marital status and alcohol consumption are statistically independent
versus H1: Marital status and alcohol consumption are statistically dependent
Now if H0 is true, then the two variables are independent. This would mean that we would expect to observe the
same proportion in each of the alcohol categories across the marital status categories. For example, we'd then
expect to observe the same proportion of single people who abstain as married who abstain, etc.
The total number who abstain is 590 out of a total of 1772 people sampled. So if H0 were true, we'd expect the proportion of people of each marital status who abstained to be 590/1772 = 0.333. Now, a total of 354 of the sample were single, and so we would expect to observe (590/1772) × 354 = 117.9 of the people sampled to fall in the single/abstain category, if H0 is true. Similarly, a total of 1173 people in the sample were married, and so we would expect to observe (590/1772) × 1173 = 390.6 of the people sampled to fall in the married/abstain category under the null hypothesis.
By using these arguments we can build up a table of the frequencies which we would expect to observe if H0 were true.
Table of Expected Frequencies
                  Drinks per month
Marital status    Abstain   1-60    Over 60   Total
Single            117.9     191.2   44.9      354
Married           390.6     633.5   148.9     1173
Widowed           47.6      77.2    18.2      143
Divorced          34.0      55.1    13.0      102
Total             590       957     225       1772
We can now compare these expected frequencies with what we did observe. If what we observed is close to
what we'd expect, then this would give us no reason to reject H0 (as the data are consistent with H0). On the
other hand, if what we observed is very different from what we expected, then this would cast doubt on H0 and
we'd reject it.
The test statistic we'll use is exactly the same as we used for the goodness of fit test, namely:
C = Σ (O − E)²/E
(adding over all cells in the table).
Then if H0 is true, C ~ χ²_ν (just as before). We now need to define our degrees of freedom:
Degrees of freedom = number of expected values in the table that can be chosen freely.
In this example we have 4 marital status categories and 3 for alcohol consumption, and so we have 12 categories altogether. However, when calculating the expected frequencies we kept the totals fixed for each category. That leaves us with (4 − 1) × (3 − 1) = 6 values that can be chosen freely, and so the degrees of freedom is 6 (see below):
Expected frequencies that can be chosen freely
                  Drinks per month
Marital status    Abstain   1-60    Over 60   Total
Single            117.9     191.2   –         354
Married           390.6     633.5   –         1173
Widowed           47.6      77.2    –         143
Divorced          –         –       –         102
Total             590       957     225       1772
(The cells marked – can be deduced as we know the row/column totals.)
We can now carry out the test formally. We have observed
C = Σ (O − E)²/E = (67 − 117.9)²/117.9 + … + (15 − 13)²/13
  = 21.952 + 2.489 + 18.776 + 1.07 + 0 + 2.67 + 29.358 + 8.908 + 6.856 + 1.427 + 0.438 + 0.324
  = 94.269.
We will reject H0 at the 1% level if C > χ²_{6,0.01} = 16.81. We therefore reject H0 at the 1% level and conclude that marital status and alcohol consumption are associated.
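The marginal totals, expected frequencies and test statistic are all produced in one call by scipy.stats.chi2_contingency; a minimal sketch for this table, assuming SciPy is available:

```python
import numpy as np
from scipy import stats

table = np.array([[ 67, 213,  74],    # Single
                  [411, 633, 129],    # Married
                  [ 85,  51,   7],    # Widowed
                  [ 27,  60,  15]])   # Divorced

C, p_value, df, expected = stats.chi2_contingency(table)
print(C, df, p_value)   # C = 94.3, df = 6, p far below 0.001: reject H0
```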
Test details in general:
Suppose now that we have a general contingency table with r rows and c columns. Denote the frequency in row i and column j by n_ij. To test whether variables 1 and 2 are independent, we first need to calculate the marginal totals. Let R_i and C_j denote the row total for row i and the column total for column j respectively, and let the total of all the observations be n. We then have:
                        Variable 2
Variable 1    1      2      …    c      Total
1             n_11   n_12   …    n_1c   R_1
2             n_21   n_22   …    n_2c   R_2
…             …      …      …    …      …
r             n_r1   n_r2   …    n_rc   R_r
Total         C_1    C_2    …    C_c    n
Then the expected frequency in row i and column j is:
E_ij = (C_j/n) × R_i, or E_ij = (R_i × C_j)/n.
We now have the expected frequencies and the observed frequencies, so we can calculate the test statistic
C = Σ_{i,j} (O_ij − E_ij)² / E_ij
where O_ij is the observed frequency in the (i, j)th cell and where we sum over all r × c cells.
The degrees of freedom is the number of expected values in the table which can be chosen freely. Again we have fixed the marginal totals, and so we can choose (r − 1) × (c − 1) values freely; this is our degrees of freedom:
                        Variable 2
Variable 1    1           2           …    c−1           c      Total
1             n_11        n_12        …    n_{1,c−1}     –      R_1
2             n_21        n_22        …    n_{2,c−1}     –      R_2
…             …           …           …    …             –      …
r−1           n_{r−1,1}   n_{r−1,2}   …    n_{r−1,c−1}   –      R_{r−1}
r             –           –           …    –             –      R_r
Total         C_1         C_2         …    C_{c−1}       C_c    n
(The cells marked – can be deduced from the row/column totals.)
Summary:
A chi-squared test of independence uses an r × c contingency table to test the hypotheses:
H0: the two variables in the table are independent
H1: the two variables are not independent.
Step 1: Find the marginal row and column totals.
Step 2: Calculate the expected frequencies for each category using E_ij = (R_i × C_j)/n.
Step 3: Use the test statistic C = Σ_{i,j} (O_ij − E_ij)² / E_ij, summed over the r × c cells.
Step 4: Reject H0 if C exceeds the χ²_{(r−1)(c−1), α} upper percentage point.
Note:
Just as for the chi-squared goodness of fit tests, we must have reasonably large expected frequencies before we
can use the chi-squared distribution. We use the same rule of thumb as for the goodness of fit test.
 Example:
A case-control study was carried out among swimmers to investigate the possible association between exposure
to chlorinated swimming pool water and erosion of dental enamel. Among 49 swimmers with enamel corrosion
(the cases) 32 reported swimming 6 or more hours per week, compared with 118 out of 245 swimmers without
enamel corrosion.
Observed frequencies:
Amount of swimming   Erosion of enamel   No erosion of enamel   Total
per week             (cases)             (controls)
≥ 6 hours            32                  118                    150
< 6 hours            17                  127                    144
Total                49                  245                    294
Hypotheses:
H0: Amount of swimming per week and the occurrence of dental enamel erosion are independent.
H1: The two variables are associated.
Expected frequencies:
Amount of swimming   Erosion of enamel   No erosion of enamel   Total
per week             (cases)             (controls)
≥ 6 hours            25                  125                    150
< 6 hours            24                  120                    144
Total                49                  245                    294
For example, the expected frequency in the (< 6 hours, cases) cell is (49 × 144)/294 = 24.
So the value of the test statistic is:
C = Σ (O_ij − E_ij)²/E_ij = (32 − 25)²/25 + (118 − 125)²/125 + (17 − 24)²/24 + (127 − 120)²/120 = 4.802.
We have to compare this test statistic with percentage points from a chi-squared distribution with 1 degree of freedom. As χ²_{1,0.05} = 3.841 and χ²_{1,0.01} = 6.635, we can reject the null hypothesis at the 5% level (but not at the 1% level). So there is some evidence of an association between the amount of swimming and erosion of enamel.
Note 1:
We have not demonstrated a causal relationship (i.e. that swimming a lot in chlorinated swimming pools increases your chance of eroding tooth enamel). It may be that people who swim more are those who take more care of their body and perhaps spend more time brushing their teeth (perhaps brushing the tooth enamel away). In other words, tooth enamel erosion and swimming time may be associated because they are both related to a third variable (e.g. degree of health consciousness and personal hygiene).
Note 2:
It is important that expected frequencies are not too small. To improve the test statistic's approximation to a chi-squared distribution, a continuity correction is sometimes used (due to Yates). This is done by reducing each difference (observed minus expected) by ½ in absolute value before squaring. The test statistic therefore becomes:
C = Σ (|O_ij − E_ij| − ½)² / E_ij.
Note 3:
This question could be solved by examining the difference in the two proportions:
Amongst the cases, the proportion who swim 6 or more hours per week is p_1 = 32/49 = 0.653. Amongst the controls, this proportion is p_2 = 118/245 = 0.482.
Hypotheses:
H0: π_1 − π_2 = 0 (i.e. π_1 = π_2 = π)
H1: π_1 − π_2 ≠ 0
The pooled estimate of π is
p = (n_1/(n_1 + n_2)) p_1 + (n_2/(n_1 + n_2)) p_2 = (49/294) × (32/49) + (245/294) × (118/245) = 0.510.
Therefore, the test statistic is
w = (0.653 − 0.482) / √(0.51 × 0.49/49 + 0.51 × 0.49/245) = 2.19.
Comparing this with percentage points of N[0, 1], we can again reject the null hypothesis at the 5% level.
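A sketch of the same 2×2 analysis with SciPy; with correction=False (no Yates' correction, see Note 2) the chi-squared statistic equals w² from the two-proportion test above:

```python
import numpy as np
from scipy import stats

table = np.array([[32, 118],    # >= 6 hours: cases, controls
                  [17, 127]])   # <  6 hours: cases, controls

# correction=True (the default, applied when df = 1) would use Yates' correction
C, p_value, df, expected = stats.chi2_contingency(table, correction=False)
print(C, p_value)   # C = 4.80 = 2.19**2, p = 0.028: reject H0 at the 5% level
```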
 Exercise:
A random sample of accident reports was taken in a large city. Safety officials know that males are expected to
have more accidents than females and they were interested to know whether the types of accidents differ
between the sexes. The data obtained are displayed in the following table.
Accident circumstance   Male   Female
While at work           18     4
Home                    26     28
Motor vehicle           4      6
Other                   36     24
Do the data provide sufficient evidence to conclude that in this city, accident circumstance and sex are
statistically dependent?