Contents

1 Statistical Distributions
  1.1 Introduction
  1.2 Some basic concepts
  1.3 Frequency Distributions
    1.3.1 Binomial distribution
    1.3.2 Poisson Distribution
  1.4 The Normal Distribution
    1.4.1 Mean and Variance
    1.4.2 Standard Normal Curve

2 Sampling
  2.1 Sample Mean and Variance
    2.1.1 Degrees of Freedom
  2.2 Interval Estimation
    2.2.1 The Sampling Distribution of x̄
  2.3 Confidence Limits
  2.4 Central Limit Theorem
  2.5 Coefficient of Variation

3 Comparing Two Samples
  3.1 Hypotheses
  3.2 One and Two-tailed Tests
  3.3 Unpaired t-Test
    3.3.1 Theory
    3.3.2 Example of Student's t-test
    3.3.3 Checking equality of variance
    3.3.4 Calculating Student's t
  3.4 Paired t-Test
  3.5 Non-parametric tests
  3.6 The Mann-Whitney U test
    3.6.1 Example 1
    3.6.2 Procedure for small samples
    3.6.3 Large samples (N1 or N2 > 20)
  3.7 Wilcoxon signed-rank test
    3.7.1 Example 2
    3.7.2 Large Samples

4 Analysis of Variance
  4.1 Introduction
  4.2 Within Sample Variance and Between Sample Variance
    4.2.1 Example
    4.2.2 Exercise
  4.3 Comparison of Means
    4.3.1 Fisher's Least Significant Difference (LSD)
    4.3.2 Tukey's Honestly Significant Difference
    4.3.3 Dunnett's Multiple Range Test Using Minitab

5 Correlation and Regression
  5.1 Introduction
  5.2 Some useful formulae
  5.3 Product-Moment Correlation Coefficient
    5.3.1 Correlation coefficient: Example
  5.4 Spearman's Rank Correlation Coefficient - rs
    5.4.1 Procedure
  5.5 Linear Regression Analysis
    5.5.1 Regression by hand
    5.5.2 Assumptions involved in linear regression by least squares
    5.5.3 Regression Example 1
    5.5.4 Regression Example 2
    5.5.5 Regression Example 3
    5.5.6 Exercise

6 Analysis of Counts in Contingency Tables
  6.1 Introduction
  6.2 Example
  6.3 Small expected values
  6.4 2 × 2 Contingency Tables
    6.4.1 Example of 2 × 2 analysis
  6.5 Exercise 1

7 Statistical Tables
  7.1 Table 1. Critical values for the Wilcoxon signed-rank test
  7.2 Table 2a. Critical values for the Mann-Whitney U test. Two-tailed test, 5% significance level
  7.3 Table 2b. Critical values for the Mann-Whitney U test. Two-tailed test, 1% significance level
  7.4 Table 2c. Critical values for the Mann-Whitney U test. One-tailed test, 5% significance level
  7.5 Table 2d. Critical values for the Mann-Whitney U test. One-tailed test, 1% significance level
  7.6 Table 3. Critical values for Spearman's rank correlation coefficient
Chapter 1
Statistical Distributions
1.1 Introduction
This short course on statistics is intended to provide an understanding of distributions, sampling, errors, and significance testing. A number of standard tests used in
pharmaceutical and medical science will be introduced. Emphasis will be placed
on a general understanding of statistical tests and acquiring the ability to select and
interpret the appropriate test for given experimental data.
Whenever possible the understanding and use of statistical tests will be demonstrated and carried out with your own data, acquired in other parts of the course, using a software package (MINITAB) typical of the type you might be expected to use in industry or a research institution.
1.2 Some basic concepts
We are going to start off this course on statistics by considering a very simple
experiment. The data from this experiment will be used to illustrate several fundamental statistical principles and hopefully give some idea why statistics is useful
right from the start.
Suppose we are interested in comparing two possible feeding regimes in laboratory rats. One is more expensive but is claimed to lead to faster growth. We
would like to know whether the extra expense is justified.
To investigate this we take 11 rats which have been fed on regime 1, which we shall call "cheap", and 11 rats which have been fed on regime 2, "expensive". We weigh the rats before and after a period of time and record the change in weight for each rat.
The data are listed below.
Cheap feeding regime:
29.59  31.69  30.50  35.63  36.08  30.55  37.81  34.47  27.32  36.94  24.48
Mean = 32.28 g

Expensive feeding regime:
39.56  29.01  35.60  36.48  36.75  40.13  43.46  36.63  44.71  36.15  31.20
Mean = 37.24 g
What do these data tell us? The first thing to do is look at the average weight increase for each feeding regime. For the "cheap" rats, adding all the weight increases and dividing by 11 gives an average of 32.28 g. The average increase for the "expensive" rats is 37.24 g. Thus the "expensive" rats showed a 4.9 g greater increase than the "cheap" rats.
So, “expensive” rats have put on more weight, but how do we know that this
difference is real and not just a matter of chance? After all we would not expect
the two samples to have exactly the same average weight increase. Differences in
the characteristics of individual rats and differences in their living conditions will
lead to some variation which has nothing to do with feeding regime. If we repeated
the experiment we might get the opposite result. Looking at the two averages (or
means) does not answer this question. We have to look at the data in more detail.
Here are the same data summarised as box plots. Box plots show the minimum, the maximum and the median value (marked by the cross, +) of a set of data. The portion of the sample enclosed by the box includes half of the items. The median is the central value. The box plot is a good way to get a quick visual feel for the data.
    ----------------------(   +   )---------------------   *
                                  ---------(   +   )------------------
You can see that the two samples are well separated. We would intuitively
feel fairly confident that the difference in mean weight increase between the two
treatments was real. Suppose, however, that the box plots looked like this:
    -------------------------------------(   +   )-------------------------------------
          -------------------------------------(   +   )---------------------------------------
These two medians are the same as in the previous boxplot but the variation
between rats within each treatment is much larger. In this case we might feel that
the observed difference is just a chance event.
This example should make it clear that it is the size of the observed effect
in relation to the variation within each experimental group that is important in
making a statistical inference. In the second case the difference in medians looks
rather small in comparison to the large variation between rats.
In fact, a statistical test on these data will not tell us whether the "expensive" food is better than the "cheap" food; what it will tell us is how likely the observed difference is on the assumption that the two feeding regimes are identical in their
effect on weight increase. If the observed difference (4.9g) is very unlikely to occur
by chance we can infer that the effect is real.
The decision as to how unlikely an effect we are willing to accept as real is up
to us (although there are commonly accepted levels, as you will see).
The appropriate statistical test on the rat data tells us that the probability of the
observed difference of 4.9 g is 0.0085. In other words, assuming that the difference is a chance event, we would expect a difference as large as 4.9 g only about 1 time in 118 (1/0.0085) similar experiments. This would be a very rare event so we would
certainly conclude that the “expensive” regime was better.
The action you take after a statistical test depends on the situation. In this case we might well decide that the small increase in weight does not justify the extra expense even if we are convinced that the difference is real.
These sorts of practical decisions are ultimately up to us; statistics gives an objective assessment of the available evidence which we can use to make rational decisions.
The above discussion has introduced some important statistical concepts. First
of all, information on the underlying biological variation was fundamental to the
statistical test. This variation is often referred to as error or residual variation. An
estimate of the residual variation is achieved by replicating the basic experimental
unit (in this example, a rat) within each treatment. If we had only taken two rats,
one for each treatment, we would have no idea whether an observed difference
was due to the treatment or just error. We can only calculate a probability for the
observed result because we had replicates for each treatment.
You will have noticed that the probability that is calculated is for the observed
result, given that the treatments are identical. This assumption represents the so
called null hypothesis. All statistical tests have a null hypothesis which is adopted
temporarily. If the experimental results are considered to be improbable under the
null hypothesis it will be rejected in favour of an alternative hypothesis. In the
example the null hypothesis is that the two feeding regimes are equivalent in terms
of their effect on weight increase. This might be rejected in favour of the alternative
hypothesis that the “expensive” regime is better than the “cheap” regime.
Finally if, as in the example, a difference in means is considered to be too large
to have occurred by chance, it is said that the difference is statistically significant.
1.3 Frequency Distributions
The sample of 11 rats could be plotted on a histogram, as in Fig 1.1, with suitable weight increments along the horizontal axis and the number of rats in each weight class on the vertical axis. Fig 1.1 shows such histograms for larger and larger
samples from the same population of rats. As the samples get larger the frequency
histogram starts to take on a characteristic bell shape. Most of the rats are clustered around a mean value, with fewer and fewer rats at the extreme weights as
represented by the tails of the distribution.
This kind of bell shaped distribution is very common and is well represented
by a mathematical model called the normal distribution. The normal curve is superimposed on the frequency histogram for sample size 2000 in Fig 1.1.
The normal distribution is the most important of the theoretical distributions
you will come across because it applies so often in practice, but it is not the only
one. Two others, the binomial and the Poisson are going to be introduced below
before a more detailed look at the normal distribution.
Figure 1.1: Frequency histograms of weight for samples of size 20, 50, 300 and 2000 from the same population of rats.
Figure 1.2: Binomial distributions for n = 6, p = 0.33 (number of brown eggs in a box of six) and for n = 10, p = 0.5 (number of heads in 10 throws of a coin).
1.3.1 Binomial distribution
Frequency distributions fall into two categories, continuous and discontinuous. The
normal distribution is an example of the former. In a continuous distribution your
measurements can theoretically take any value within a particular range. In other
situations your measurements can only take one of a limited number of values,
giving rise to a discontinuous distribution.
Suppose you work at the egg marketing board and you are looking at the number of brown eggs in a box of six. The number of brown eggs can take any integer
value from 0 to 6. If you sample many boxes and record the number of brown eggs
per box on a bar chart with seven categories (Fig 1.2) you will build up a discontinuous frequency distribution. This type of discontinuous distribution is modelled
mathematically by the binomial distribution.
The binomial distribution applies whenever you have a number of trials (say
n) each of which has a fixed chance of being in one of two categories (hence binomial). The distribution for a particular number of trials (n) gives the chance of
getting each of the n + 1 outcomes. That is the chance of 0 successes in n trials, 1
success in n trials up to n successes in n trials.
If this seems confusing, consider the egg example. The six eggs in a box represent six trials. Each trial can be a white egg or a brown egg. If the average number
of brown eggs per packet of six is two, then the probability of getting a brown egg
in a single trial would be 0.33 (2/6). The binomial gives the chance of getting 0
brown eggs in 6 trials, 1 brown egg in 6 trials and so on up to a box full of brown
eggs.
A simpler example would be the number of heads you would get if you tossed
a coin 10 times (Fig 1.2). There are 10 trials and two possible outcomes for each
trial, heads or tails. Again the binomial gives the chance of getting 0 heads in 10
throws (not very likely) up to all heads (also unlikely). The most common category,
as you would expect, is 5.
Note that the vertical axis in Fig 1.2 gives the probability for each category, and hence the heights of all the bars must sum to 1.
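As a sketch of these calculations (in Python rather than the MINITAB package used in this course), the binomial probabilities above can be computed directly from the standard formula:

```python
from math import comb

def binomial_pmf(k, n, p):
    # chance of exactly k successes in n trials, each with success probability p
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Egg example: n = 6 trials, p = 2/6 chance that any one egg is brown
egg_probs = [binomial_pmf(k, 6, 2 / 6) for k in range(7)]
print(round(sum(egg_probs), 6))            # 1.0 -- the bars sum to one

# Coin example: chance of exactly 5 heads in 10 throws
print(round(binomial_pmf(5, 10, 0.5), 4))  # 0.2461
```

The 0.2461 figure for 5 heads in 10 throws is the value that can be read off the corresponding bar of Fig 1.2.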
1.3.2 Poisson Distribution
The Poisson distribution is named after Siméon Denis Poisson, who first described it. It crops up in situations where you are counting the number of random events which occur in a given time period, for example the number of radioactive decays per second. The classic example in the textbooks is the number of soldiers in the Prussian army kicked to death by horses in each year.
The Poisson also occurs where you are counting items in a unit volume. Thus
the number of bacteria in a millilitre of medium will follow a Poisson distribution
if the culture is well mixed.
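A sketch of the corresponding probability calculation in Python (the mean rate of 3 bacteria per millilitre is an invented figure, used purely for illustration):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    # chance of observing exactly k random events when the mean count is lam
    return exp(-lam) * lam ** k / factorial(k)

# If a well-mixed culture averages 3 bacteria per ml, the chance
# that a 1 ml sample contains none at all:
print(round(poisson_pmf(0, 3), 4))  # 0.0498
```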
1.4 The Normal Distribution
The Binomial distribution and the Poisson distribution have been mentioned because you are likely to come across them from time to time and it is just as well
to realise that not all distributions are bell shaped, symmetrical and continuous.
However the Normal distribution is so central to statistics that it must be described
in more detail.
Let’s go back to the frequency histogram in Fig 1.1. The vertical axis gives the
numbers of rats in each of the weight classes. If the number of rats in each class
is added up this will give the total number of rats measured. It would be possible
to divide the number of rats in each class by the total number measured so that the
total summed to one. The vertical axis would now give the probability of a rat,
selected at random from those measured, being in that weight class. Thus such a
histogram would be equivalent to the discontinuous distributions of Fig 1.2. When
the smooth normal curve is scaled in this way, so that the area under the curve is
unity, it is known as a probability curve.
1.4.1 Mean and Variance
The mathematical formula which describes the normal curve (don’t worry, you will
not even see this) has two parameters, one which gives the position of the peak, the
mean; and one which describes how spread out the curve is, known as the variance.
If you know these two parameters you can fully describe the normal curve.
Given a large series of measurements, how do we specify the normal curve
which best approximates the frequency distribution of our measurements? We must
calculate the two parameters of the normal distribution from our data. First the
mean. This is simply the average of our measurements. Mathematically this is
written:
    µ = Σx / n

If you have not seen the summation symbol (Σ) before, it means "add together all instances of whatever occurs inside it" (here x). So the mean (µ) is the sum of all the measurements (Σx) divided by the number of measurements (n).
The second parameter, the variance, is less obvious. A large value should indicate a spread out distribution. The measure should not depend on the sample size.
A sample of 10 from a particular population should give about the same value as a
sample of 100, since the underlying curve we are trying to specify is the same in
both cases. It would be a good idea to pause and consider how you would devise a measure of the spread of a data set.
One way to measure the spread would be to take each individual and see how
far it departs from the mean (i.e. the centre of the distribution). If we add all these
up they would sum to zero, because all the negative differences would cancel out
the positive differences. We could just knock off the negative signs and then add
them up. This would give a large value for spread out distributions and a small
value for narrow distributions, but it would not be independent of sample size. In
fact a sample of 100 would have a value roughly 10 times that of a sample of 10.
The solution is to divide by the sample size to get the average absolute deviation
from the mean. The following formula describes this measure:

    mean absolute deviation = Σ|µ − x| / n
The vertical bars | mean “the absolute value of”. This is a reasonable measure
of spread which has been used in the past.
The measure which is actually used is called the standard deviation. This is calculated in a very similar way, except that the deviations from the mean are squared
before they are added. The sum of the squared deviations is divided by n to give
the mean squared deviation or mean square, also known as the variance. The variance, which is given the symbol σ², is square-rooted to give the standard deviation, which is given the symbol σ. Hence the standard deviation is on the same scale as
the original measurements.
    σ = √( Σ(µ − x)² / n )
Notice that once you square the deviations they all become positive and there
is no need to take the absolute value. This is not the only reason that this measure
of spread was adopted; there are "good theoretical reasons" beyond the scope of
this course.
The conventional way of describing a particular normal distribution in mathematical notation uses the form N(µ, σ²). Thus a normal distribution with a mean of 10 and a variance of 5 would be written as N(10, 5).
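As a sketch of the two formulae above (in Python, with a small made-up data set rather than anything from the rat experiment):

```python
from math import sqrt

def population_mean(xs):
    return sum(xs) / len(xs)

def population_sd(xs):
    mu = population_mean(xs)
    # squaring makes every deviation positive; average them (divide by n),
    # then square-root to get back to the original measurement scale
    return sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # invented illustrative values
print(population_mean(data))  # 5.0
print(population_sd(data))    # 2.0
```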
There is an important difference between a probability graph for a continuous
distribution and the equivalent probability graph for a discontinuous distribution.
Figure 1.3: Area of the normal curve larger than µ (the shaded right-hand half of the standard normal probability curve).
For example, in the binomial shown in Fig 1.2 the probability of a particular outcome (say 5 heads in 10 throws) can be read off the graph, as the height of the bar
for 5 heads (p = 0.2461). You cannot treat the normal probability curve this way,
however.
Suppose x is a measurement on an individual chosen at random from a normal
population. A continuous variable can take any value between certain limits and
not just particular integer values. So if x is measured to sufficient decimal places
it will never be exactly the same as a particular chosen value of x. Put another way, the probability of getting a particular value exactly will be zero. However, it is
possible to use probability graphs to determine the chance of x lying between certain limits, in which case the area bounded by the two limits gives the appropriate
probability. For example, what is the probability that x is greater than the mean?
The region of the normal probability curve which corresponds to values larger than
the mean is shown shaded in Fig 1.3.
The area of this shaded region gives the probability that we are after. Bearing
in mind that the area under the whole curve is 1 and that the curve is symmetrical
it should be clear that the area (and hence the probability) of the shaded region is
0.5.
1.4.2 Standard Normal Curve
Normal distributions differ in their mean and variance. We can reduce any normal
distribution to a standard form by first centering it about zero. This amounts to
moving the whole normal curve along the horizontal axis until it straddles zero. We
can achieve this by subtracting the mean from all measurements. Next, if we scale
our measurements by dividing each by the standard deviation we will end up with
a variance of one. The original distribution has now been transformed to N(0, 1), which is known as the standard normal curve.
The normal curve has some properties which you will come across again and
again.
• The area of the curve which lies between one standard deviation on either
side of the mean is 68% of the total area of the curve; that is, 68% of the
individuals in the population which the curve represents lie within 1 standard
deviation of the mean.
• 95% of the area lies within 1.96 standard deviations of the mean.
• Nearly all of the normal probability curve lies within 3 standard deviations
on either side of the mean.
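These properties can be checked numerically (a Python sketch; the notes themselves use MINITAB). The error function erf gives the area of the standard normal curve lying within k standard deviations of the mean:

```python
from math import erf, sqrt

def area_within(k):
    # area of the standard normal curve between -k and +k standard deviations
    return erf(k / sqrt(2))

print(round(area_within(1), 3))     # 0.683 -> about 68%
print(round(area_within(1.96), 3))  # 0.95
print(round(area_within(3), 4))     # 0.9973 -> nearly all of the curve
```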
This is the kind of information we need in order to assess the likelihood of our
results under a particular null hypothesis.
Suppose you take a single measurement on an individual who has undergone
some kind of treatment. You have a very large number of measurements on people
who have not had the treatment, so you know the mean and variance of the distribution accurately. You would like to know whether there is any evidence that the
treated person differs from the untreated patients.
Our null hypothesis is that the treated patient does not differ from the untreated
and is in effect a random individual drawn from our normal distribution. We should
decide beforehand what level of probability we will accept as evidence that there is
a difference. A common level used in science is a probability of 0.05 (1 in 20). To put this another way: if a result as different from the mean as ours would only occur by chance one time in twenty, or less, under the null hypothesis, then we will
reject the null hypothesis and conclude that we have a significant effect.
Suppose our mean is 112 and standard deviation 4. In Fig 1.4 we have shaded
in 2.5% of the total area in each tail.
If we select at random from the population 1 in 20 will fall within the shaded
area, so if our observed value lies in this area we will reject the null hypothesis at
the 5% level. The shaded area represents our rejection region or critical region.
We have said that 95% of the normal curve is enclosed by 1.96 standard deviations. This is an important figure because 5% is the commonly adopted level of significance. So in this example any measurement which deviates from 112 (the mean) by more than 1.96 standard deviations, i.e. 7.84 (1.96 × 4 = 7.84), will lie in the rejection region.

Figure 1.4: 2.5% of the area of the normal curve shaded in each tail (mean 112, standard deviation 4).

Let's suppose we have an individual with a value of 102. This deviates from the mean by 112 − 102 = 10, which is more than 7.84, so this individual lies in our rejection region and we would certainly conclude that this individual does not belong to our population.
To find the exact probability of a value of 102 we can standardise it so that it is
expressed as a standard normal deviate and then refer to tables of normal deviates.
To standardise 102 we express it as a deviation from the mean and then divide
by the standard deviation, as follows.
    z = (x − µ) / σ = (102 − 112) / 4 = −2.5
It is conventional to refer to standard normal deviates by the letter z. We could now refer to tables of normal deviates (z tables), but it is simpler to use MINITAB (see Exercise 1 at the end of the next Chapter).
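For readers without MINITAB or printed z tables, the same probability can be sketched in Python using the error function:

```python
from math import erf, sqrt

def z_score(x, mu, sigma):
    # express x as a deviation from the mean, in standard-deviation units
    return (x - mu) / sigma

def normal_cdf(z):
    # P(Z <= z) for a standard normal variable
    return 0.5 * (1 + erf(z / sqrt(2)))

z = z_score(102, 112, 4)
print(z)  # -2.5

# chance of a value at least this far from the mean, in either direction
p = 2 * normal_cdf(-abs(z))
print(round(p, 4))  # 0.0124
```

So a value of 102 (or one equally extreme on the other side) would arise by chance only about 1 time in 80 under the null hypothesis, well inside the 5% rejection region.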
Chapter 2
Sampling
2.1 Sample Mean and Variance
In Chapter 1 we dealt with the normal distribution and its two parameters, the mean and variance (µ and σ²). We have given formulae with which µ and σ² can be calculated from the population of measurements. In practice we are almost always
faced with a different situation, in which we are estimating population parameters
from a sample.
Suppose we want to find out the systolic blood pressure of students at Bath
University. We would not be able to determine the blood pressure of every single
student, so we would choose a “representative sample”, measure them and estimate
the mean student blood pressure from these.
Population The population consists of all items which are of interest.
Sample The sample is the subset of the population for which we have measurements.
Leaving aside the problem of how to choose a representative sample, how can
we best estimate the population mean and variance from our sample?
Clearly the sample mean is the best estimate we have of the population mean.
It should be obvious that the larger the sample, the more accurate this estimate will
be.
    x̄ = Σx / n
This is the same as the formula for the population mean except that the symbol x̄ is used to distinguish it from the population mean µ. Although the sample mean is subject to some error it is said to be an unbiased estimate of µ. This means that if many samples were taken their means would vary about the true mean, with no systematic tendency to over- or under-estimate it.
What about the variance? If we use the same formula that we used for the
population variance we do not get an unbiased estimate of σ². If we could calculate
all our deviations from the mean by subtracting µ, the true population mean, our
estimate would be unbiased, but we do not know µ, we only have an estimate of
it in x̄. Since x̄ is calculated from our sample the deviations from this mean are
bound to be less than they would be from µ. It follows that our estimate of the
variance will be too low, especially for small samples.
It can be proved mathematically that if we divide the sum of squares by n − 1 instead of n this gives us an unbiased estimate of the variance. Hence the formula for calculating the sample variance is:

    s² = Σ(x − x̄)² / (n − 1)
Again notice that the Roman letter s is used for the sample standard deviation rather than σ. It is a convention in statistics to use Greek letters for population parameters (e.g. µ and σ) and Roman letters for estimates of parameters (e.g. x̄ and s). Estimates of parameters are known as statistics.
2.1.1 Degrees of Freedom
The term n − 1, in the formula above, is called the degrees of freedom. Remember that when calculating a statistic you lose one degree of freedom for each term which is calculated from the sample. Thus the calculation of s involves subtracting each item from the term x̄, which has itself been calculated from the sample, hence one degree of freedom is subtracted from the total n.
You might find an alternative explanation of degrees of freedom easier to grasp. Consider a sample of 8 students who have been measured for systolic blood pressure. The 8 deviations from the mean must, by definition, sum to zero. Hence as soon as we know 7 of the deviations the last one is determined. In other words only 7 deviations can vary independently, hence 7 degrees of freedom.
Exercise: Calculation of Mean and Standard Deviation
Eight students were selected at random and measured for systolic
blood pressure. The values (in mmHg) were 130, 141, 120, 110,
118, 124, 146, 128.
• Estimate the mean systolic blood pressure of the student
population.
• Estimate the standard deviation of systolic blood pressures.
• Estimate the proportion of the student population with systolic blood pressures over 135.
2.2 Interval Estimation
The estimate of mean student systolic blood pressure in the example above is a
single value sometimes called a point estimate. Stated more formally – The sample
mean (x̄) is an unbiased point estimate of the population mean (µ). The sample
standard deviation (s) is a point estimate of the population standard deviation (σ), provided the sum of squares is divided by the degrees of freedom, n − 1.
2.2.1 The Sampling Distribution of x̄
In published work you will usually see quoted means accompanied by a measure of reliability, x̄ ± a. These are commonly so-called 95% confidence limits, which mean "we are 95% certain that the range x̄ − a to x̄ + a encloses the true population mean". Such a statement is an interval estimate of the mean.
You may also see means quoted as a confidence interval. For example a mean
and confidence limits of 4.5 ± 0.4 could be expressed as 4.1 – 4.9. This is just an
alternative to confidence limits.
How are confidence limits calculated? Imagine that the above experiment (take
8 students at random and calculate the mean) was repeated many times. We would
end up with a collection of means which would themselves be normally distributed
but with a much smaller variance than that of the individuals in the population. The
distribution of the means is called the sampling distribution of the mean.
We would correctly guess that the larger the samples that our means were based
on, the less would be the spread of our means, and hence the more confident we
could be of the accuracy of our estimate of the true mean.
Have a look at Fig 2.1. The top figure represents the population that we are
sampling from. The figure below shows the sampling distribution of samples of
size 10 from this population. This has a much narrower spread than the parent
population above. If we took lots of samples of size 10, this is how the means
would be distributed. The bottom figure shows how the sampling distribution of
samples of size 20 would look.
The quantitative relationship between sample size, the population variance and
the variance of the sampling distribution of the mean is investigated in the next
exercise.
Confidence limits for a statistic are derived from the sampling distribution of
the statistic, so once we know how to estimate the variance of the sampling distribution of our mean we will go on to show how confidence limits are calculated.
The variance of the mean is a function of the population variance and the
sample size, so for a single sample we can predict how much error to expect on
our estimate. In the next exercise the relation between the variance of the sample
mean, population variance and sample size will be investigated by using MINITAB
to simulate the taking of many samples of a particular size from a population and
looking to see how our sample means vary.
Figure 2.1: Sampling Distribution of Means. [Three panels on a common 100 to 140 scale: the population, with mean = 120 and standard deviation = 10; the sampling distribution of the means of samples of size 10 from the population above; and the sampling distribution of the means of samples of size 20 from the population above.]
Exercise: Sampling Distributions
• Enter MINITAB
• Try the following command :
RANDom 10 C1;
NORM 120 10.
MINITAB will generate a random sample of size 10 from the distribution specified in the sub-command, in this case a normal distribution with a mean of 120 and a standard deviation of 10; that is N(120, 100). HELP RAND will give more details of what this command does.
• Generate 10 samples of size 40 from N(120, 100) and store them
in columns C1 to C10.
RAND 40 C1-C10;
NORM 120 10.
We have filled 40 rows and 10 columns with values sampled at random from N(120, 100). We are going to investigate how the means of samples of size 10 vary compared with the variance of the population from which the samples have been taken (σ² = 100). Each of the 40 rows represents a sample of size 10.
• The following command is a quick way of calculating the necessary statistics.
RMEAN C1-C10 into C21
This collects the means of the 40 samples and puts them into C21.
• Look at C21 and compare its spread of values with, say, C1 and C2.
DOTP C1 C2 C21;
SAME.
• Calculate the standard deviation and variance of C21.
LET K1 = STDEV(C21)
LET K2 = K1**2
PRINT K1 K2
Mean of the 40 means =
Standard deviation of the 40 means =
Variance of the 40 means =
The mean of the means should be close to 120 (it is based on a sample of
40 x 10 = 400!).
• We are interested in the sampling distribution of the 40 means.
Compare the variance of the sample means with the population
variance.
• Repeat this for samples of size 20 and 2. Fill in the table below
sample size   overall mean   variance of means
     2
     5
    10
    15
    20
• What would the variance of the means be for samples of size 1?
• Try to deduce the relationship between variance of the sample
means, population variance and the sample size. You might find
it helpful to plot variance of the means against 1/n and try some
intermediate sample sizes such as 5 or 15. Remember that your
estimated variances will be subject to sampling error; this is especially true for small sample sizes, so if you are going to repeat
any sample size, concentrate on the small ones.
• If you can suggest a formula for calculating the standard deviation
of the mean (call this sem for now) in terms of the sample size n
and the sample variance (s²), then so much the better.
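The sampling experiment in this exercise can also be sketched outside MINITAB. The snippet below is an illustrative Python version (standard library only; the seed and trial counts are arbitrary choices), repeating the experiment for several sample sizes:

```python
import random
import statistics

random.seed(0)  # arbitrary seed so the run is repeatable

def variance_of_means(n, trials=2000):
    """Draw `trials` samples of size n from N(120, 100) and return the
    variance of the resulting sample means."""
    means = [statistics.mean(random.gauss(120, 10) for _ in range(n))
             for _ in range(trials)]
    return statistics.variance(means)

# The spread of the means shrinks as the sample size grows.
v2, v10, v20 = variance_of_means(2), variance_of_means(10), variance_of_means(20)
```

Plotting these variances against 1/n, as the exercise suggests, makes the pattern hard to miss.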
2.3 Confidence Limits
If random samples of size n were to be taken from a distribution with mean µ and standard deviation σ, then the sample means would form a distribution having the same mean µ but with a smaller standard deviation √(s²/n) (where s is an estimate of σ).
The standard deviation of the sampling distribution of x̄ is often called the standard error of x̄ (sem), to distinguish it from the standard deviation (s) of the sample. It can also be written as s/√n.
If our sampling distribution is normal and the sample size is large (over about 30), we know that 95% of the sampling distribution of x̄ lies within 1.96 standard deviations of the mean. Since the standard deviation of x̄ is s/√n, the 95% confidence interval for x̄ is calculated as x̄ − 1.96 × s/√n to x̄ + 1.96 × s/√n.
To put this another way: given the sample mean x̄ we are 95% confident that the interval calculated above will enclose the "true" population mean µ. The two end points of the confidence interval are the confidence limits, which are often written as x̄ ± 1.96 × s/√n.
For small samples s is an unreliable estimate of σ, so we need to take a larger interval than 1.96 × s/√n. The appropriate multiplier can be obtained from the Student's t-distribution. As n gets larger, t approaches 1.96 (see below), but it increases quite rapidly for small samples. For example, for a sample of size 3 (degrees of freedom 2) a t of 4.30 would be used.
df (n − 1)   t95%
     1       12.71
     2        4.30
     3        3.18
     4        2.78
     7        2.36
    10        2.23
    20        2.09
     ∞        1.96
The formula for calculating confidence limits is:

x̄ ± t × √(s²/n)
Where you will need to look up the appropriate value of t for the level of confidence
you want (usually 95%) and the sample size. For sample sizes over 20 a value for
t of 2 is a good enough approximation.
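The formula can be turned into a short function. The sketch below assumes Python with scipy is available (the text itself uses MINITAB); the sample readings are made up for illustration:

```python
import math
from scipy.stats import t

def confidence_limits(sample, confidence=0.95):
    """Confidence limits for the mean: x̄ ± t × √(s²/n), with t taken from
    the Student distribution on n − 1 degrees of freedom."""
    n = len(sample)
    x_bar = sum(sample) / n
    s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)  # divide by df
    t_val = t.ppf(1 - (1 - confidence) / 2, df=n - 1)     # two-tailed t
    half_width = t_val * math.sqrt(s2 / n)
    return x_bar - half_width, x_bar + half_width

# hypothetical readings
low, high = confidence_limits([24.1, 24.6, 23.8, 24.4, 24.9, 24.0])
```

For large degrees of freedom t.ppf(0.975, df) settles down to 1.96, matching the table above.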
Exercise: Calculation of Confidence Limits
Measurements of systolic blood pressure were made in 8 students,
the values (in mmHg) were: 130, 141, 120, 110, 118, 124, 146,
128.
You have already found the mean and standard deviation of these
data. Now calculate the following.
sem =
95% Confidence limits of population mean =
95% Confidence interval =
Exercise: Confidence Limits for Titration Results.
• Go into M INITAB and open the worksheet with your titration
results
RETR ’GROUPA’
• You calculated the mean and variance for the class data last time. Calculate the 95% confidence limits for this mean using the normal deviate 1.96. Do this for the "before instruction" and the "after instruction" sets of results. Do not use the TINT command.
95% confidence limits for class mean before instruction =
95% confidence limits for class mean after instruction =
• What is the appropriate normal deviate for calculating the 99% confidence interval? The MINITAB command INVCDF will help (Inverse Cumulative Distribution Function).
• Does the “correct value” as given by the chemists lie within
the class confidence intervals? If not, why not?
• Check your results using the MINITAB TINT command.
Confidence Intervals with MINITAB
You can find confidence intervals easily with MINITAB. If your data are in C1 type:
TINTerval C1
2.4 Central Limit Theorem
In calculating confidence limits we assume that the sampling distribution of the
statistic is normal. The central limit theorem means that this is usually a safe
assumption to make. It may still be necessary to check on normality with small
samples.
The Central Limit Theorem: for reasonably large samples x̄ is approximately normally distributed whatever the parent distribution.
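The theorem is easy to demonstrate by simulation. This Python sketch (standard library only; the exponential parent and the counts are arbitrary choices) draws many samples of size 30 from a strongly skewed distribution:

```python
import random
import statistics

random.seed(1)  # repeatable run

# Parent distribution: exponential with mean 1 and variance 1 (far from normal).
n, trials = 30, 5000
means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(trials)]

centre = statistics.mean(means)      # close to the parent mean, 1
spread = statistics.variance(means)  # close to variance/n = 1/30
```

A histogram of `means` looks convincingly bell-shaped even though the parent is anything but.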
2.5 Coefficient of Variation
As we have seen, the standard deviation is measured in the same units as the individual measurements. Quite frequently in experimental procedures large measurements tend to have large errors and low measurements low errors. If in fact the errors are proportional to the mean it would be valid to calculate a quantity known as the coefficient of variation (cov), which is simply the standard deviation expressed as a proportion of the mean. This is often expressed as a percentage.
The coefficient of variation provides a measure of variability which can be used
to compare samples with very different means. The coefficient of variation is also
known as the coefficient of error. This is calculated by:

cov = (s / x̄) × 100%
For example a sample with mean 109 and standard deviation 9.8 has a coefficient of
variation of 8.9%. A sample mean 11.4 and standard deviation 3 has a coefficient of
variation of 26.3%, thus the second sample has a larger error associated with it than
the first. Remember that we have assumed that the error increases in proportion to
the mean. This is not always the case.
One benefit of using the coefficient of variation is that it is independent of the units of measurement. Whether you measure a sample of heights in centimetres or yards the coefficient of variation will be the same. One warning: you cannot use the coefficient of variation if measurements can take negative values. Thus you would not use it where temperatures are in Fahrenheit or centigrade.
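As a quick sketch in plain Python (the two cases are the ones worked in the text):

```python
def coefficient_of_variation(sd, mean):
    """cov = (s / x̄) × 100%. Only meaningful when the measurement scale
    has a true zero, so the mean must be positive."""
    if mean <= 0:
        raise ValueError("coefficient of variation needs a positive mean")
    return 100.0 * sd / mean

cv1 = coefficient_of_variation(9.8, 109)   # first sample in the text
cv2 = coefficient_of_variation(3, 11.4)    # second sample
```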
Exercise: Coefficient of Variation
In an assay for serum ACTH, a reference sample was assayed 12
times in laboratory A and 12 times in laboratory B. Laboratory
A returned a mean of 17.5 pg/ml with a standard deviation of 2.5
pg/ml, whereas laboratory B found a mean of 19 pg/ml with an s
of 5 pg/ml.
• Thus the coefficient of variation for laboratory A would be
?
• for laboratory B ?
• Whose results are more reliable ?
• What assumption have you made ?
Chapter 3
Comparing Two Samples
3.1 Hypotheses
Whenever you are assessing whether an individual value has come from a population of known mean and standard deviation or whether a treatment with a new drug
is having a significantly better effect than an old one, you are testing a hypothesis.
By their nature hypotheses cannot be proven but they can be rejected or accepted
on the basis of available evidence. For example, we might observe an improvement in recovery times in a sample of patients treated with a new drug compared
to a sample treated with the usual drug. However, it is always possible that there
is, in fact, no difference between the two drugs and that our apparent improvement
is simply due to chance. The larger the difference the less likely it is to be a chance
event, but we can never prove, in the strict sense, that it is not due to chance.
What we can do is calculate the probability of obtaining the observed improvement by chance, since we can find the mean and variance of the two samples.
The general procedure in a statistical test is to formulate what is known as a
null hypothesis. In the case of an experiment to test whether a new drug is better
than the usual drug, the null hypothesis would be stated in the following form.
The new drug is no better than the usual drug
An alternative hypothesis in this case might be
The new drug is better than the usual drug.
The important property of a null hypothesis is that we can calculate the chance
of the observed measurement assuming that the null hypothesis is true. We can use
this information, which is expressed as a probability, to decide whether we have
evidence that would lead us to reject the null hypothesis in favour of an alternative
hypothesis. There is an arbitrary element to this decision but it is conventional to
reject the null hypothesis if the probability of the observed result is less than 0.05,
or in other words, if the observed result would occur less than once every twenty
experiments in the long run.
[Figure: the distribution of the difference between two means under the null hypothesis, centred on zero. For the one-tailed alternative hypothesis the critical region is a single tail containing 5% of the area; for the two-tailed alternative the critical region is split between the two tails, 2.5% in each.]
A result which is considered too unlikely to have occurred under the null hypothesis is said to be statistically significant.
The statistical test which is used for comparing two means is called Student’s
t-test. The appropriate test to use when there are more than two means to be compared is Analysis of Variance which is dealt with in the next chapter.
The abbreviation H0 is used for the null hypothesis and H1 for the alternative
hypothesis.
3.2 One and Two-tailed tests
Look again at the alternative hypothesis in the example above.
The new drug is better than the usual drug.
The implication here is that we would only consider a result to be significant
if the new drug had a higher mean than the usual drug. Under the null hypothesis
(no difference in mean) the observed difference in means for two samples of any
given size will form a normal distribution with a mean of zero. The variance of
this distribution will depend on the sample sizes. The critical region of this distribution, for a 5% significance level, will be in one tail and form 5% of the total
area. Any observed difference in means which falls in this region will be considered
evidence of a significant difference between the two samples. When the alternative
hypothesis is in this form we are carrying out a one-tailed test.
In scientific research it is far more common to carry out two-tailed tests, where
the alternative hypothesis is in the form:
The new drug differs in effect from the usual drug.
In this case both tails of the distribution form the critical region and both have an
area of 2.5% of the total (summing to 5%).
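The two critical values come straight from the normal distribution. A short Python sketch (assuming scipy; in MINITAB the INVCDF command does the same job):

```python
from scipy.stats import norm

# Critical values of the standard normal at the 5% significance level.
one_tailed_critical = norm.ppf(0.95)    # all 5% in one tail: about 1.64
two_tailed_critical = norm.ppf(0.975)   # 2.5% in each tail: about 1.96
```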
3.3 Unpaired t-Test
Let us first consider the t-test which is used when data are unpaired. The effects of two anti-hypertensive drugs are studied on the blood pressure of two different groups of subjects. One group is given the first drug and the other group the second. After measuring the blood pressure of each subject, we wish to compare the
mean blood pressure of one group with the mean of the other group. If there is
a difference in mean we need to know whether this constitutes evidence that the
two drugs differ in effect, or whether the observed difference can be attributed to
chance. This is a very common experimental design and Student’s t-test is the appropriate test providing that certain assumptions, which are discussed below, are
valid.
A similar common experimental design, which requires a different test, involves subjects being matched in pairs in some way, with each member of the pair being given one of the drugs. Indeed the same subject may be given both drugs at different times. This requires a t-test for paired data (Section 3.4).
3.3.1 Theory
Suppose x̄1 and x̄2 are the means of our two samples and s1, s2 are the standard deviations. The standard errors of the two sample means would be s1/√n1 and s2/√n2.
Our null hypothesis is µ1 = µ2. In other words µ1 − µ2 = 0. What we would like to know then is the sampling distribution of x̄1 − x̄2 so that we can calculate the probability of any observed difference in means under the null hypothesis. We know the standard error of both x̄1 and x̄2, so what is the standard error of one subtracted from the other? In general if two normal distributions are added the variance of the new distribution is the sum of the two variances. Perhaps surprisingly, if we subtract two distributions the variance of the new variable is also the sum of the two variances. The following short exercise should convince you.
Enter MINITAB and generate two random samples from the N(50, 100) distribution.
RAND 50 C1 C2;
NORM 50 10.
Now add the two columns into C3 and subtract them into C4.
LET C3 = C1 + C2
LET C4 = C1 - C2
Now check the variances of the 4 columns.
LET K1=STDEV(C1)**2
LET K2=STDEV(C2)**2
LET K3=STDEV(C3)**2
LET K4=STDEV(C4)**2
PRINT K1-K4
The variances of C3 and C4 should both be approximately twice
those of C1 and C2.
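The same demonstration can be sketched in Python (standard library only; seed and sample sizes are arbitrary):

```python
import random
import statistics

random.seed(42)

# Two independent samples from N(50, 100), i.e. standard deviation 10.
a = [random.gauss(50, 10) for _ in range(5000)]
b = [random.gauss(50, 10) for _ in range(5000)]

var_sum = statistics.variance([x + y for x, y in zip(a, b)])
var_diff = statistics.variance([x - y for x, y in zip(a, b)])
# Both are close to 100 + 100 = 200: variances add whether we add
# or subtract the two variables.
```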
So we can find the standard error of the difference in means by adding the variances of the two means and taking the square root. For large samples, if the difference between the means is more than 1.96 standard errors, it is significant at the 5% level. For smaller samples we use the appropriate value of t instead.
In the formula below a pooled sample variance is formed by combining the sample variances of the two groups into a single estimate. The pooled variance, sp², is given by:

sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)

Each sample variance is multiplied by its degrees of freedom to give the sums of squares. The sums of squares are added and divided by the total degrees of freedom to give a pooled sample variance.

t is then calculated by:

t = (x̄1 − x̄2) / (sp × √(1/n1 + 1/n2))
The procedure above, in which a pooled estimate of the variance was calculated,
assumes that the variance of the two samples is the same.
It is important to check this assumption before carrying out a t-test. A formal method is the variance ratio test, also known as Fisher's F-test. Simply calculate s1²/s2² and look up the value of this ratio in the appropriate table. Alternatively use a statistics package.
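The pooled-variance calculation can be written out directly and checked against a library routine. A sketch assuming Python with scipy (the two groups here are hypothetical):

```python
import math
from scipy import stats

def pooled_t(sample1, sample2):
    """Unpaired t with a pooled variance estimate (assumes equal variances)."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = sum(sample1) / n1, sum(sample2) / n2
    ss1 = sum((x - m1) ** 2 for x in sample1)   # sums of squares
    ss2 = sum((x - m2) ** 2 for x in sample2)
    sp2 = (ss1 + ss2) / (n1 + n2 - 2)           # pooled variance
    se = math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2)
    return (m1 - m2) / se

g1 = [18, 14, 16, 11, 21, 24, 19, 20]           # hypothetical groups
g2 = [22, 18, 31, 38, 26, 28, 29, 40]
t_manual = pooled_t(g1, g2)
t_scipy, p = stats.ttest_ind(g1, g2, equal_var=True)  # agrees with t_manual
```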
3.3.2 Example of Student's t-test
In order to test the effectiveness of a new analgesic drug, two groups of mice were
used. The first group received saline as a control, whereas the second group received a dose of the drug. To test for analgesia the animals were placed in turn on a
hot-plate maintained at 56◦ C. The time taken for the mouse to rear on its hind legs
and lick its forepaws was taken as an indicator that the animal was aware of the
pain stimulus, and was recorded. As soon as the response was observed the mouse
was removed from the hot-plate.
We have designed the experiment to answer the question: does the analgesic significantly affect the animals' response to pain? In this case we are prepared to accept the possibility that the drug could actually make the mice more sensitive to pain, so we are going to use a two-tailed test.
H0: the analgesic does not affect the animals' response to pain.
H1: the drug does affect the animals' response to pain.
The data are shown below.

Analgesia (Seconds)
Saline control (x1): 18, 14, 16, 11, 21, 24, 19, 20
Test Drug (x2): 22, 18, 31, 38, 26, 28, 29, 40, 24, 15
In MINITAB type the data into C1 and C2. Have a look at the data using the
describe and dotplot commands.
DESC c1 c2
DOTP c1 c2;
SAME.
Or use a boxplot as shown below.
[Boxplot of response times, roughly 10 to 40 seconds, for the drug and control groups.]
The spread (variance) of the two samples is fairly similar. The control group
has a mean time of 18.20 whilst the treatment group has a mean time of 29.00.
Is this increase in response of 10.8 seconds significant? Looking at the boxplot,
this difference certainly looks real but to answer this objectively we can carry out
a t-test.
3.3.3 Checking equality of variance
If you are worried that the variances might be different, a formal check can be made. The ratio of the two variances is calculated by dividing the larger variance by the smaller. Here C2 has the larger variance; if this is not the case, reverse C1 and C2. This gives an F-ratio, as the ratio of two variances is known, of 3.03 with 7 and 9 degrees of freedom.
To find the significance of this value you can use Minitab’s CDF command
(Cumulative Distribution Function) which gives the area of the distribution up to
the specified value (see the diagram).
Figure 3.1: The area of the F(7, 9) distribution given by the CDF command: 93.76% of the distribution lies below 3.03, leaving 6.24% in the upper tail.
We are interested in the probability of values of F greater than 3.03, so we must subtract the value given by CDF from 1. Furthermore this would need to be doubled for a two-tailed test. The complete set of MINITAB commands to carry out this F test is given below.
test is given below.
LET K1 = STDEV(C2)**2 / STDEV(C1)**2
CDF K1 K2;
F 7 9.
LET K3 = (1 - K2)*2
PRINT K1 K3
CDF K1 K2 gives the probability for a value in K1 and stores it in K2. The
subcommand F 7 9 specifies an F distribution with 7 and 9 degrees of freedom.
This gives an F-ratio of 3.03 with 7 and 9 degrees of freedom, the two-tailed probability of which is 0.124. Nothing to worry about.
A value of K1 significantly different from 1 (K2 < 0.05) indicates that the two
samples differ in variance, in which case the POOLED subcommand should not be
used (see below). Another solution is to use a non-parametric test (section 3.5) in preference to a t-test. If the F-ratio is sufficiently close to 1 we can proceed to calculate the value of t.
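The same F test can be sketched in Python with scipy (the data here are hypothetical; the text's own route is the CDF command above):

```python
import statistics
from scipy.stats import f

def variance_ratio_test(sample1, sample2):
    """Two-tailed F test: larger variance over smaller, probability doubled."""
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    n1, n2 = len(sample1), len(sample2)
    if v1 < v2:                       # put the larger variance on top
        v1, v2, n1, n2 = v2, v1, n2, n1
    ratio = v1 / v2
    p = 2 * (1 - f.cdf(ratio, n1 - 1, n2 - 1))
    return ratio, p

ratio, p = variance_ratio_test([7.4, 8.1, 9.0, 6.2, 7.7],
                               [5.1, 5.9, 6.3, 5.5, 6.0, 5.4])
```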
3.3.4 Calculating Student's t
Assuming the variances are similar, the following MINITAB command will carry out Student's t-test. If the variances differ, omit the POOLED subcommand or use a non-parametric test such as the Mann-Whitney U test.
TWOSample C1 C2;
POOLED.
The output from these commands is shown below.
Twosample T for control vs drug

           N     Mean    StDev   SE Mean
control   10    18.20     4.26       1.3
drug       8    29.00     7.43       2.6

95% C.I. for mu control - mu drug: ( -16.7, -4.9)
T-Test mu control = mu drug (vs not =): T= -3.88  P=0.0013  DF=16
Both use Pooled StDev = 5.86
The output gives the 95% Confidence Interval for the observed difference between the means as -16.7 to -4.9. This range does not include zero so we can
conclude that the difference is real. The probability of observing this difference
between the means by chance if H0 is true is only 0.0013; so the difference is
highly significant.
3.4 Paired t-Test
In many experimental situations control and experimental measurements are carried out on the same subject, e.g. when studying the action of a drug on blood
pressure, we may first measure the patient’s normal blood pressure, administer the
drug and then measure the pressure again to see whether the drug has had an effect.
In such situations, where both control and test measurements are carried out on the
same or similar subjects, the data are said to be paired. In such cases it is the difference between the measurements that we are interested in. For each subject we have two values, x1i and x2i, and the difference di = x1i − x2i.
Before Treatment   After Treatment   Difference
      x11                x21              d1
      x12                x22              d2
       .                  .                .
       .                  .                .
       .                  .                .
      x1n                x2n              dn
The null hypothesis tested is that the mean difference is zero. The mean of the set (di) is found along with the standard deviation of (di), thus:

Mean d̄ = (Σ di) / n

Standard deviation sd = √( Σ (di − d̄)² / (n − 1) )

The t value (for n − 1 degrees of freedom) is given by:

t = d̄ / (sd / √n)
A paired t-test carried out with MINITAB
A paired t-test is carried out with MINITAB by first forming a column of differences between the two sets of data. Then the null hypothesis that the mean of this column of differences is zero can be tested as follows.
LET C3 = C1 - C2
TTEST 0 C3
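The hand formula and a library routine can be compared directly. A sketch assuming Python with scipy (the paired readings are invented for illustration):

```python
import math
from scipy import stats

def paired_t(before, after):
    """Paired t: t = d̄ / (sd / √n), computed on the per-subject differences."""
    d = [x - y for x, y in zip(before, after)]
    n = len(d)
    d_bar = sum(d) / n
    s_d = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))
    return d_bar / (s_d / math.sqrt(n))

before = [120, 132, 128, 141, 118, 125]   # hypothetical paired readings
after = [115, 130, 125, 135, 119, 121]
t_manual = paired_t(before, after)
t_scipy, p = stats.ttest_rel(before, after)   # agrees with t_manual
```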
Paired t-Test
The table shown gives the heart rates of 8 subjects (beats/min) before and after treatment with a test drug.
Subject      1    2    3    4    5    6    7    8
Before      75   81   68   70   85   76   70   73
After       73   78   69   64   75   71   63   72
Difference   2    3   -1    6   10    5    7    1

Calculate
d̄ =
sd =
Then
t =
Is this significant? If so, at what level?
3.5 Non-parametric tests
The variables we have dealt with so far have been measured on a so-called interval scale; that is, the magnitude of the difference between any two measurements is important. Sometimes we can only meaningfully place a series of items in a rank order, but the size of the difference between successive items has no meaning. This is an ordinal scale.
Where precise numerical measurements can be made on the observations, and
where the samples are large enough to ensure that the central limit theorem is
applicable (and hence that the distribution of the difference between the means is
normal), t-tests are used to test the null hypothesis (H0 ). Where measurements
are not very precise, but can at least be arranged meaningfully in a rank order, or
where there is a danger that the central limit theorem is violated, perhaps because
the sample sizes are very low, non-parametric tests are used. Most non-parametric
tests only make use of the rank order of the data. It follows that whereas parametric
tests are based on arithmetic means, non-parametric tests concentrate on median
values. Since non-parametric tests make no assumptions about the form of the
underlying distributions they are often referred to as distribution free methods.
Since non-parametric methods use less of the information in the data it is only
to be expected that they are less powerful than the corresponding parametric technique. That is, they are less likely to detect a real effect should it exist.
3.6 The Mann-Whitney U test
This is the non-parametric counterpart of the unpaired t-test and can be used to test
whether two independent groups have been drawn from the same population. The
statistic calculated is U .
There is an equivalent test called the Two-Sample Wilcoxon Test for Independent Data which uses the sum of ranks (T ) as the test criterion. However the
Mann-Whitney U test is probably more widely used and is described here.
Although MINITAB will carry out these tests for you, non-parametric tests are often quite simple to carry out by hand, and doing so will help to make the procedure clear. The ‘long hand’ procedure is given below.
Procedure
1. Rank the data taking both groups together, giving rank 1 to the lowest score, etc. Algebraic size is considered: the lowest ranks are assigned to the largest negative numbers (if any are present). Tied ranks are given the average of the tied ranks.
2. Find the sum of the ranks (T ) for both samples.
3. Calculate U for each sample. For sample 1, with number of scores N1 and sum of ranks T1:

U = N1 × N2 + N1 (N1 + 1) / 2 − T1        (3.1)
4. Find the critical value of U with N1 and N2 from Table 7.2 and compare this
with the smaller of the two calculated U values. If the observed U is less
than, or equal to, the tabulated value the result is significant.
Where N1 or N2 is more than 20 see section 3.6.3.
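Equation 3.1 is easy to code. The sketch below assumes Python with scipy's rankdata for the tie-averaged ranking; the sample values are placeholders:

```python
from scipy.stats import rankdata

def mann_whitney_u(sample1, sample2):
    """Return (U1, U2) using U = N1·N2 + N1(N1 + 1)/2 − T1 (equation 3.1)."""
    n1, n2 = len(sample1), len(sample2)
    ranks = rankdata(list(sample1) + list(sample2))  # ties share average ranks
    t1, t2 = sum(ranks[:n1]), sum(ranks[n1:])
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - t1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - t2
    return u1, u2   # note u1 + u2 always equals n1 × n2

u1, u2 = mann_whitney_u([3.1, 4.5, 2.2], [5.0, 6.1, 4.8, 7.2])
```

The smaller of the two values is the one compared with the table.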
3.6.1 Example 1
Five rats were trained to imitate leader rats in a maze. They were trained to follow
the leader when hungry. Then the 5 rats were transferred to a shock-avoidance situation, where imitation of leader rats would enable them to avoid electric shock. The
experiment was designed to see whether rats could generalise learned behaviour
when placed under a new drive and in a new situation.
The number of trials it took for each rat to reach a criterion of 10 correct responses in 10 trials was recorded on the 5 trained rats and on 4 untrained controls.
The hypothesis was that the 5 rats who had already been trained to imitate a leader
(E rats) would transfer this training to the new situation and reach the criterion in
fewer trials than the control (C rats). The data are tabulated below.
Trials to criterion
E rats: 75, 63, 70, 45, 81
C rats: 110, 85, 64, 77
The data (number of trials to reach criterion) are probably only on an ordinal
scale of measurement and the sample sizes are small so that it is possible that the
difference in means would not follow a normal distribution. A non-parametric test
was used.
Null hypothesis H0: the number of trials to the criterion in the shock-avoidance situation is the same for rats previously trained to follow a leader through a maze for food as for untrained rats.
Alternative hypothesis H1: previously trained rats will reach the criterion in fewer trials. Hence a one-tailed test was used.
We arrange these scores in order of size, retaining the identity of each.

rank     1    2    3    4    5    6    7    8    9
score   45   63   64   70   75   77   81   85  110
group    E    E    C    E    E    C    E    C    C
Find the sum of ranks (T ) for each group. Tc = 26 and Te = 19. Find U from
equation 3.1. Uc = 4 and Ue = 16. Take the smaller of these two (4) and look in
Table 2c. The tabulated value of U is 2 for N1 = 4 and N2 = 5. Our observed
value of 4 is not less than or equal to this, so we accept the null hypothesis. There
is no evidence from the data that rats can generalise learned behaviour when placed
under a new drive and in a new situation.
3.6.2 Procedure for small samples
For small samples the following “short-cut” method can be used:
1. Rank the data as above
2. Consider the control group and count the number of E scores that precede (i.e. fall to the left of) each score in the control group. Thus,

U = 2 + 4 + 5 + 5 = 16
3. Repeat this for the number of C scores that precede each E score (U = 4).
Look up the smaller of the two U values.
3.6.3 Large samples (N1 or N2 > 20)
As N1 and N2 increase in size, the sampling distribution of U rapidly approaches the normal distribution, with

mean µU = N1 N2 / 2

and standard deviation σU = √( N1 N2 (N1 + N2 + 1) / 12 )

Thus when N1 or N2 > 20, the significance of an observed value of U is determined by:

z = (U − µU) / σU = (U − N1 N2 / 2) / √( N1 N2 (N1 + N2 + 1) / 12 )
which is practically normally distributed with zero mean and unit variance; i.e., the probability associated with the occurrence under H0 of values as extreme as an observed z may be determined by reference to tables of z. If a two-tailed test is being used, the observed z is significant at p = 0.05 if |z| > 1.96 (one-tailed test, p = 0.05: z > 1.64).
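As a sketch in Python (scipy assumed for the normal probabilities):

```python
import math
from scipy.stats import norm

def mann_whitney_z(u, n1, n2):
    """Normal approximation to U for large samples (N1 or N2 over about 20)."""
    mu_u = n1 * n2 / 2
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu_u) / sigma_u
    p_two_tailed = 2 * (1 - norm.cdf(abs(z)))
    return z, p_two_tailed
```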
Mann-Whitney U test with MINITAB
Type the data into two columns, say C1 and C2. Then give the command
MANNU C1 C2
The output from this command includes the Mann-Whitney U statistic and the exact probability of the observed ranking under a one-tailed null hypothesis. Double
the probability for a two-tailed test.
3.7 Wilcoxon signed-rank test
This is the non-parametric counterpart of the paired t-test for equality of means.
It is used to compare medians of two samples when each observation in the first
sample is paired with an observation in the second sample.
Procedure
1. Obtain the difference between each pair of readings taking sign into account.
Eliminate cases with no difference and reduce N accordingly.
2. Rank these differences, ignoring the sign, giving rank 1 to the smallest absolute difference.
3. Calculate T , the sum of ranks for the less frequent sign.
4. Consult Table 1. If the observed T is equal to, or less than the tabulated
value, then there is a significant difference between the two conditions and
H0 should be rejected.
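The four steps translate directly into code. This Python sketch uses scipy's rankdata for the tie handling (an assumption; MINITAB's WTEST does the same job):

```python
from scipy.stats import rankdata

def signed_rank_T(sample1, sample2):
    """T = sum of ranks carrying the less frequent sign of the differences."""
    d = [x - y for x, y in zip(sample1, sample2) if x != y]  # drop zero diffs
    ranks = rankdata([abs(x) for x in d])   # rank 1 = smallest |difference|
    t_pos = sum(r for r, x in zip(ranks, d) if x > 0)
    t_neg = sum(r for r, x in zip(ranks, d) if x < 0)
    return min(t_pos, t_neg)

T = signed_rank_T([12, 15, 9, 14], [10, 16, 9, 11])   # placeholder data
```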
3.7.1 Example 2
Eight pairs of twins were exposed to complex reaction time tests: one of each pair
of twins was tested after 3 double whiskies, the other while completely sober.
H0 : drink does not affect reaction time.
Should we use a one-tailed test or a two-tailed test? In this example a case could
be made for either. We might expect that drink would adversely affect reaction time
and that our alternative hypothesis is therefore that the median of the Sober group
is less than the median of the Whiskey group. In other words we could carry out a
one-tailed test.
If you are uncertain whether a one-tailed or two-tailed test is appropriate ask
yourself the following question. If the reaction times of the Sober group turn out
to be greater than that of the Whiskey group will you reject this as chance or will
you consider this a potentially interesting result and carry out a test? If the latter,
and this is probably the most common situation, you should carry out a two-tailed
test.
In this example we will adopt H1: the median of the Sober group differs from the median of the Whiskey group, and carry out a two-tailed test.
Reaction time

Sober Group   Whiskey Group      d    Rank of d   Rank with less frequent sign
    310            300         -10        1                   1
    340            320         -20        2                   2
    290            360          70        5
    270            320          50        4
    370            540         170        6
    330            360          30        3
    320            680         360        7
    320           1120         800        8
                                                           T = 3
Table 1 shows that for N = 8 the critical value of T for a two-tailed test is 3.
Hence H0 can be rejected.
Conclude H0 is incorrect: drink does affect reaction time (it increases it).
3.7.2 Large Samples
For samples with N > 20 you cannot use Table 1. However, it can be shown that the sum of the ranks, T, is normally distributed with:

mean µT = N (N + 1) / 4

and standard deviation σT = √( N (N + 1)(2N + 1) / 24 )

so that

z = (T − µT) / σT = (T − N (N + 1) / 4) / √( N (N + 1)(2N + 1) / 24 )
is approximately normally distributed with zero mean and unit variance. Thus,
tables of z give the probabilities associated with the occurrence under H0 of various
values as extreme as an observed z computed from the above formula.
Wilcoxon signed-rank test with MINITAB
Type the data into two columns, say C1 and C2. Form a column of differences.
LET C3 = C2 - C1
WTEST 0 C3
The output from this command includes T and the exact probability of the observed ranking under a one-tailed null hypothesis. Double the probability for a two-tailed test.
Chapter 4
Analysis of Variance
4.1 Introduction
So far, we have looked at tests of statistical significance designed to compare two
sample means.
Analysis of variance (or ANOVAR) is used when we want to compare more than two sample means, or to investigate the way in which several treatments interact with each other. For example, if we extend the example used to illustrate the Student's t-test to five drugs and hence five groups of subjects, instead of just two, we would use one-way ANOVAR to analyse the five means.
Why can’t we just carry out t-tests between each pair of means to find out
which differ from which? Consider the number of tests you would make. There
are ten pairwise comparisons that you could make amongst five means. Let us
suppose that in fact there are no significant differences between the means. Each
of the t-tests would have a chance of 0.05 of showing a significant result by chance
(a so-called type 1 error). But we are going to carry out ten such tests in this single
experiment; so the chance of at least one of them being significant by chance is
considerably more than 0.05. The correct procedure is to carry out a single analysis
of variance on all the data and only after that start to look at individual comparisons.
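The inflation of the type 1 error rate is easy to quantify. Treating the ten comparisons as independent tests at α = 0.05 (an idealisation, since the comparisons share data), the chance of at least one false positive is:

```python
# Chance of at least one spurious "significant" result when ten
# independent tests are each run at the 0.05 level.
alpha = 0.05
k = 10                                 # pairwise comparisons among five means
p_at_least_one = 1 - (1 - alpha) ** k  # about 0.40
```

So roughly a 40% chance of declaring at least one spurious difference, eight times the nominal 5% level.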
4.2 Within Sample Variance and Between Sample Variance
Let us consider a simple example, that of three samples taken from normally distributed populations with the same variance. Our null hypothesis is that the samples
come from the same population with mean µ.
For sample 1, n = 4 and the sample mean = x̄1
For sample 2, n = 5 and the sample mean = x̄2
For sample 3, n = 5 and the sample mean = x̄3
The total variance of the 14 observations making up the three samples is made up
of two components:
1. The variance due to the difference between three sample means, (x̄1 , x̄2 , x̄3 )
and the population mean, µ, i.e. the between sample variance
2. The variation due to differences within the samples, i.e. the deviation of the
four values of x in sample 1 from x̄1 , and the deviation of the values of x in
sample 2 from x̄2 and the deviation of the values of x in sample 3 from x̄3 ,
the within sample variance.
If all three samples are drawn from the same population we would expect the
within and between sample variances to be approximately equal. Our test statistic is thus the between sample variance divided by the within sample variance.
This variance-ratio is known as the F -ratio after R. A. Fisher who developed
ANOVAR. The expected value of the F -ratio under the null hypothesis is unity.
If the three sample means differ then the between sample variance will exceed the
within sample (error) variance and the F -ratio will be more than one.
In the previous chapter we used an F -test to compare the within sample variance of one sample with the within sample variance of another sample in order
to check that they were the same, before carrying out a t-test. Here, although we
are also comparing variances, a high F -ratio indicates that there is a difference
between the sample means. This is because we are comparing the pooled within
sample variance with the between sample variance, rather than two within sample
variances.
If the difference between the two variances is such that the F -ratio exceeds a
critical value we will conclude that the between sample variance is greater than we
would expect from the observed within sample variance. We might therefore be
led to reject the null hypothesis i.e. to conclude that the three samples were not in
fact drawn from the same population and differed in their means.
4.2.1 Example
Fourteen rats from three strains were given the same dose of a hypnotic drug, and
the sleeping time of each animal was measured. Do the strains differ in sleeping
time?
Sleeping Time (min)

strain 1 (x1)   strain 2 (x2)   strain 3 (x3)
     13              20              14
     16              17              21
     19              23              16
     16              26              19
                     19              13
Σx1 = 64        Σx2 = 105       Σx3 = 83
x̄1 = 16.0       x̄2 = 21.0       x̄3 = 16.6

Overall mean, x̄ = (64 + 105 + 83)/14 = 18.0
Total Sum of Squares
We can compute the total sum of squares about x̄:
x1   x̄ − x1   (x̄ − x1)²     x2   x̄ − x2   (x̄ − x2)²     x3   x̄ − x3   (x̄ − x3)²
13      5        25          20     −2         4          14      4        16
16      2         4          17      1         1          21     −3         9
19     −1         1          23     −5        25          16      2         4
16      2         4          26     −8        64          19     −1         1
                             19     −1         1          13      5        25
    Σ(x̄ − x1)² = 34              Σ(x̄ − x2)² = 95              Σ(x̄ − x3)² = 55

Total sum of squares about x̄ = 34 + 95 + 55 = 184
Degrees of freedom = No. of observations − 1 = 14 − 1 = 13
Within Sample Sum of Squares (also called the error sum of squares)
1. Find the squared deviation of each observation from its sample mean.
2. Sum the squared deviations from each sample.
x1   x̄1 − x1   (x̄1 − x1)²    x2   x̄2 − x2   (x̄2 − x2)²    x3   x̄3 − x3   (x̄3 − x3)²
13     3.0        9.0        20     1.0        1.0        14     2.6        6.76
16     0.0        0.0        17     4.0       16.0        21    −4.4       19.36
19    −3.0        9.0        23    −2.0        4.0        16     0.6        0.36
16     0.0        0.0        26    −5.0       25.0        19    −2.4        5.76
                             19     2.0        4.0        13     3.6       12.96
   Σ(x̄1 − x1)² = 18.0           Σ(x̄2 − x2)² = 50.0           Σ(x̄3 − x3)² = 45.2
Within sample SS = 18.0 + 50.0 + 45.2 = 113.2

Degrees of freedom = No. of observations − No. of groups = 14 − 3 = 11
Between Sample Sum of Squares
1. In effect each individual is replaced by its sample mean (thus removing
within sample variation).
2. The between sample sum of squares is the difference of all these values from
the overall mean (x̄), squared and summed.
                   x̄ − x̄m         (x̄ − x̄m)²   N(x̄ − x̄m)²
strain 1 (x1)   18 − 16.0 =  2.0     4.00     4 × 4.00 = 16.0
strain 2 (x2)   18 − 21.0 = −3.0     9.00     5 × 9.00 = 45.0
strain 3 (x3)   18 − 16.6 =  1.4     1.96     5 × 1.96 =  9.8
Thus, Between sample SS = 16.0 + 45.0 + 9.8 = 70.8
Degrees of Freedom = No. of groups−1 = 3 − 1 = 2
We can then summarise the analysis of variance in the following table, in which the
within and between sample variances are calculated by simply dividing the sum of
squares by the degrees of freedom.
Source                     SS      df   variance     F     p ≥ F
Between samples            70.8     2    35.4       3.4    0.069
Within samples (error)    113.2    11    10.291
Total                     184.0    13

Then

F = 35.4/10.291 = 3.4   (df = 2, 11)
The probability of F2,11 ≥ 3.4 is 0.069. Since this exceeds 0.05 we cannot reject
the null hypothesis: the three strains could have been drawn from the same population,
and the difference between the three means may be due to chance.
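The whole calculation above can be checked in a short Python sketch (an illustration only, not part of the course's Minitab material):

```python
# One-way ANOVAR by hand on the rat sleeping-time data above.
samples = {
    "strain 1": [13, 16, 19, 16],
    "strain 2": [20, 17, 23, 26, 19],
    "strain 3": [14, 21, 16, 19, 13],
}
all_x = [x for s in samples.values() for x in s]
grand = sum(all_x) / len(all_x)                       # overall mean, 18.0

total_ss = sum((x - grand) ** 2 for x in all_x)       # 184.0
within_ss = sum(
    sum((x - sum(s) / len(s)) ** 2 for x in s) for s in samples.values()
)                                                     # 113.2
between_ss = total_ss - within_ss                     # 70.8 (by difference)

df_between = len(samples) - 1                         # 2
df_within = len(all_x) - len(samples)                 # 11
F = (between_ss / df_between) / (within_ss / df_within)   # about 3.44
```

Note the between sample sum of squares is obtained by difference, exactly as the Note below suggests for hand calculation.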
Note: Since the within sample and between sample sum of squares add up to
the total sum of squares it is only necessary to calculate the total sum of squares
and one of the others. If you were doing these calculations by hand it would be
easiest to obtain the between sample sum of squares, obtaining the error sum of
squares by difference.
Graphical Illustration of One-Way ANOVAR

• Choose ANOVA and regression from the Statistics menu.
• Choose ONEWAY.
• Answer Rats to filename? This loads the data of the example you have just worked through. Answer 1 to Number of response variable?. Answer 2 to Number of factor variable?. Answer strain 1 for level 1, strain 2 for level 2 and strain 3 for level 3.
• The data are displayed. You will see the data for the three
strains displayed in different boxes, with their respective
means shown as a horizontal yellow line.
• Now go through the previous section again in conjunction
with the graphical demonstration. Just press Enter to start.
Note that in this demonstration ms stands for mean square,
an alternative term for variance.
• Press Escape followed by any key to exit.
4.2.2 Exercise
The irritant activity of 5 drugs, A, B, C, D and E was tested by applying each drug
in solution to the eye of a rabbit. The number of blinks occurring over the following
minute was recorded. The table shows the number of blinks obtained in 6 different
animals for each drug.
          Drug
 A    B    C    D    E
 3    8    2    6    7
 5    6    3    3    2
 5    7    4    5    4
 2    8    3    5    5
 4    9    3    2    6
 5    7    5    4    6
Carry out an analysis of variance on these results. Use a calculator to help, but
carry out each step separately, and record your answer at each stage.
Total sum of squares

1. Σx/N – find the grand mean, x̄, for all the observations
2. (x̄ − x) – find the difference between each observation and the grand mean
3. (x̄ − x)² – square each of these differences
4. Σ(x̄ − x)² – add all the squared values
Group    x    (x̄ − x)   (x̄ − x)²
  A      3
         5
         5
         2
         4
         5
  B      8
         6
         7
         8
         9
         7
  C      2
         3
         4
         3
         3
         5
  D      6
         3
         5
         5
         2
         4
  E      7
         2
         4
         5
         6
         6

Total sum of squares: Σ(x̄ − x)² =
Between sample sum of squares

1. x̄m – find the mean for each sample
2. (x̄ − x̄m) – find the difference between each sample mean and the grand mean
3. (x̄ − x̄m)² – square each of these differences
4. N(x̄ − x̄m)² – multiply the squared difference by the number of observations in each sample
5. ΣN(x̄ − x̄m)² – add the values from each of the groups
                  A    B    C    D    E
                  3    8    2    6    7
                  5    6    3    3    2
                  5    7    4    5    4
                  2    8    3    5    5
                  4    9    3    2    6
                  5    7    5    4    6
x̄m =
(x̄ − x̄m) =
(x̄ − x̄m)² =
N(x̄ − x̄m)² =
ΣN(x̄ − x̄m)² =
Summary of Analysis of Variance:

Source                   SS    df    Variance estimate    F
Between samples
Within sample (error)
Total

F = between samples variance / within sample variance =

df =

Is the Null hypothesis accepted or rejected?
4.3 Comparison of Means
You will see that analysis of variance tells us whether or not a difference exists,
but not where the difference is. To locate the significant differences (if they are not
obvious by eye) a further test must be applied, e.g.
• Fisher’s Least Significant Difference (LSD)
• Tukey’s Honestly Significant Difference (HSD)
• Dunnett’s Multiple Range Test
4.3.1 Fisher's Least Significant Difference (LSD)
Fisher’s LSD is equivalent to carrying out a t-test between each mean and every
other mean, except that the error variance used in the calculations is taken from the
ANOVAR and is therefore more accurate, since it is based on all the data, not just
the data from two treatments.
At the start of this chapter it was pointed out that there is a danger of identifying false significant differences if lots of t-tests are carried out. However if the
ANOVAR shows that there are at least some significant differences (ie the F-ratio
is significant) Fisher’s LSD can be used to identify those differences, provided that
unclear conclusions are treated with caution.
Fisher’s LSD is the standard error of the difference between two means multiplied by t. Any two means which differ by more than this can be considered
significantly different.
The variance of the difference between two means is twice the variance of the
means (see Chapter 3). And the variance of any mean is s 2 /n, where s 2 is the
residual variance and n is the sample size. So Fisher’s LSD is calculated as follows
LSD = t √( 2s²/n )
An alternative form which allows for means based on different sized samples
is

LSD = t s √( 1/ni + 1/nj )

4.3.2 Tukey's Honestly Significant Difference
Fisher’s LSD is easy to understand and easy to apply but some people feel that is
not conservative enough. A very conservative alternative is Tukey’s Honestly Significant Difference. This test takes into account the number of possible comparisons you might make. The more groups you have the wider will be the difference
51
between any two means which will be considered significant. If you have many
means and you are “trawling” for all and any significant difference, this test will
guard against false positive conclusions.
√
Tukey’s HSD replaces t with Q/ 2. This gives the same result if there are
only two means but gives a larger value as the number of groups increases.
R = (Q/√2) s √( 1/ni + 1/nj )
Where
• R = a minimum significant range
• Q = a preliminary factor obtained from tables, and related to the number of
samples (a) and number of degrees of freedom (f ) for error variance.
• s = error standard deviation (Within sample standard deviation)
• n = number of observations per sample
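As a numerical sketch of Fisher's LSD (purely to illustrate the arithmetic, since in the rat example the F-ratio was not in fact significant at the 0.05 level): take the error variance 10.291 on 11 df from the ANOVAR table, t = 2.201 being the tabulated value for p = 0.05 with 11 df, and compare strains 1 and 2.

```python
import math

# Fisher's LSD for strain 1 vs strain 2 in the rat example,
# using the unequal-sample-size form LSD = t * s * sqrt(1/ni + 1/nj).
s2 = 10.291          # error (within sample) variance, 11 df
t_crit = 2.201       # t from tables, p = 0.05, 11 df
n_i, n_j = 4, 5      # strain 1 and strain 2 sample sizes

lsd = t_crit * math.sqrt(s2 * (1 / n_i + 1 / n_j))   # about 4.74
diff = abs(16.0 - 21.0)                              # difference of the two means
```

The 5.0-minute difference between the two strain means just exceeds the LSD of about 4.74, which illustrates why LSD comparisons should only follow a significant F-ratio.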
4.3.3 Dunnett's Multiple Range Test Using Minitab

The third Multiple Range test, Dunnett's Test, can be applied when the only comparisons to be made are between a number of treatments and a control.
The change in blood pressure of volunteers was measured following the administration of new experimental drugs. Are any likely to be dangerous if given to
patients with high blood pressure? Which might be beneficial?
drug 1 (placebo)   drug 2   drug 3   drug 4
       3             −5       20       3
      −2            −20       22       1
      −1            −17       18       2
       5            −12       25       0
       0            −21       19       5

Table 4.1: Change in blood pressure (mm Hg)
Enter Minitab and type the data into columns C1–C4
read c1-c4
Now stack C1–C4 into column C5 and create a column C6 which indicates which
rows of C5 belong to which treatment.
stack c1-c4 c5;
subscripts c6.
First have a look at the data using the boxplot command.
boxp c5;
by c6.
The following plot is produced.
[Character boxplots of C5 (change in blood pressure, scale −20 to 30 mm Hg) for each level of C6: level 2 lies well below the placebo, level 3 well above it, and levels 1 and 4 sit close to zero.]
It is clear from this that drug 2 has the effect of reducing blood pressure compared to the placebo. Drug 3 increases it and drug 4 has no effect. There is no need
to carry out a full Analysis of variance where the conclusions are as clear as this,
but we will do it anyway.
Minitab has several Multiple Comparison Tests, including fisher, tukey
and dunnett as subcommands of the oneway command. We are comparing
the other drugs to drug 1, which is a placebo, hence Dunnett's Test is appropriate.
The Minitab subcommand dunnett 1 indicates that level 1 of C6 (ie drug 1) is
the control.
oneway c5 c6;
dunnett 1.
The dunnett subcommand gives the following output.
Dunnett’s intervals for treatment mean minus control mean
Family error rate = 0.0500
Individual error rate = 0.0196
Critical value = 2.59
Control = level 1 of C6
Level     Lower      Center     Upper
  2      -22.521    -16.000    -9.479
  3       13.279     19.800    26.321
  4       -5.321      1.200     7.721
Drug 2 differs from the placebo by −16 mmHg. The 95% confidence interval
for this difference is −22.521 to −9.479, clearly a significant reduction in blood
pressure. Similarly drug 3 significantly increases blood pressure. However the interval for
drug 4 includes zero, so drug 4 is not significantly different from the placebo.
Chapter 5
Correlation and Regression
5.1 Introduction
Correlation and Regression analysis are both concerned with the relationship between variables but they are used in different circumstances. Correlation is less
useful but is simpler and we will discuss it first.
Consider a group of students for each of which we have a school exam mark
and a university exam mark, for a particular subject. Suppose we are interested in
knowing whether there is any relationship between the two sets of grades. We can
calculate a quantity called the correlation coefficient which measures the strength
of the relationship between the two sets of results.
The value of the correlation coefficient can range from 0, if there is no relationship between the two sets, to 1, if there is a perfect positive correlation between the
two, or −1, if there is a perfect negative correlation; that is, students with the best
school grades have the worst university results.
More often the calculated correlation coefficient lies between −1 and +1 and
we have to decide whether the observed value is just a matter of chance, or, in
statistical jargon, whether the correlation coefficient is statistically significant.
Here is a second example of a situation where we are interested in the relationship between two variables.
We wish to calibrate an ultra-violet spectrometer. We have made up a set of
solutions of known concentration and have measured the absorbance at each concentration. We would like to use this information to estimate the concentration of
an unknown solution from its absorbance.
There are important differences between this situation and the previous example. To start with, the information we want from the data is different. We are not
just interested in whether there is a significant relationship between absorbance
and concentration (we would hope there was). We want to know the form of this
relationship and we would like to describe it mathematically. This would allow us
to substitute the measured absorbance of an unknown sample into the formula to
obtain a predicted concentration.
The other difference is the control we have over the values of one of the variables. We have made up the standard solutions to known concentrations. We would
probably choose concentrations at regular intervals over the range we are interested
in so that we could estimate the concentration of any unknown falling in that range
with equal precision. In practice such a calibration might be more complicated
than this but the important point is that we have control over one variable and we
can measure it accurately. This variable is known as the independent variable. The
other variable (here absorbance) is dependent on concentration and is called the
dependent variable. We do not know the value of our dependent variable until we
do the experiment.
The mathematical technique used to quantify the relationship between a dependent variable and an independent variable is known as regression analysis. It
results in a mathematical model of the relationship which, in linear regression, is
the equation of a straight line. The analysis furnishes estimates of one or more
parameters which characterise the equation.
Other examples of regression are i) dose of drug and response and ii) the extent
of a chemical reaction with time.
Compare these examples with the exam result example. There we had a sample
of people on which we had taken two measurements. We had no control over the
values of either variable. Moreover there would be considerable error associated
with each exam result. If it were possible to give the same person the same exam
many times we would get a different result each time. This contrasts with regression where one variable is usually under our control and is measured accurately.
It is important to be clear about the difference between correlation and regression because it is usually invalid to calculate a correlation coefficient and quote its
significance when regression is properly involved. This is because, as we will see,
the calculation of a correlation coefficient involves the assumption that both variables are normally distributed. Where one variable is under our control this will
not usually be the case.
It is also dangerous to use a regression method where the independent variable
is subject to considerable error of measurement, although regression can often be
used where the independent variable is not strictly under our control if we are happy
that it is measured accurately.
Summary
Correlation The value of neither variable is under the control of the experimenter. Both variables are usually associated with error of measurement. Correlation
analysis is used to assess whether the two variables are associated.
Regression One variable, the independent variable, is measured accurately and
its value is often chosen by the experimenter. The other variable, called the dependent variable, may be subject to experimental error, and its value depends on the
value chosen for the independent variable. The line fitted to the points is called a
regression line, and is described by a regression equation which contains one or
more parameters. Regression analysis is used to predict one variable from another.
5.2 Some useful formulae
In this chapter the following abbreviations will be used.

ssx  Total sum of squares for x, Σ(x − x̄)²
ssy  Total sum of squares for y, Σ(y − ȳ)²
sxy  Sum of cross products, Σ(x − x̄)(y − ȳ)
Each of these has an associated computational formula which provides a quicker
and less error-prone method of calculating these quantities if you are using a calculator. These are as follows:

ssx = Σx² − (Σx)²/n
ssy = Σy² − (Σy)²/n
sxy = Σxy − (Σx)(Σy)/n          (5.1)

5.3 Product-Moment Correlation Coefficient
Suppose we have a sample of n individuals each of which has been measured for
two variables x and y. If the variables have been measured on an interval scale
and both are normally distributed then the degree of association between the two
variables can be assessed by calculating the correlation coefficient (r).

The basic quantities we need to calculate to do this are the following:

n, Σx, Σx², Σy, Σy², Σxy

Substituting these into equations 5.1 gives us the sums of squares for x and
y and the sum of cross products.

The correlation coefficient, r, is then calculated as:

r = sxy / √( ssx × ssy )
Fisher showed that r is related to Student's t as follows

t = r √( (n − 2)/(1 − r²) )          (5.2)

with n − 2 degrees of freedom. Here n is the number of data pairs.

So the exact probability of a value of r can be found by converting to t and
looking up the probability of t. Alternatively, tables of critical values of r for a
range of values of degrees of freedom are published.
5.3.1 Correlation coefficient: Example
The data in table 5.1 were collected in a medical study on the blood concentrations of different proteins in a group of males aged between 40 and 60. We are
interested in whether there is a link between the levels of these proteins. The existence of a link would help to identify particular biochemical reactions which may
be taking place in these patients. All measurements were made in mg ml−1 and
then converted to a loge scale.
Testosterone   SHBG
   5.85        3.50
   5.91        3.81
   6.20        3.89
   6.39        3.14
   6.63        3.14
   6.63        3.09
   6.32        2.64
   6.30        3.37
   6.20        3.40
   6.41        3.26
   6.40        2.94
   5.89        3.30
   6.43        3.00
   6.48        3.00
   5.83        3.81
   6.12        3.47
   6.23        3.58
   6.36        3.53
   6.20        3.33
   6.49        3.56
   5.96        3.64

Table 5.1: Protein Data
• Enter MINITAB.
• Retrieve ’TESTOST’
• The data from table 5.1 have been entered in columns C1 - C2.
• Plot C1 against C2.
This scatter plot is very useful for assessing the presence of a relationship between
two variables visually. Any gross departure from normality, in either variable, will
also be obvious from this plot. In this case the scatter plot suggests that there
might be some negative correlation between the two variables, but this is partially
obscured by random variation.
• Calculate the correlation coefficient . . .
CORR C1 C2
• You should get a value of −0.591.
• Using equation 5.2 this can be converted to t . . .

t = −0.591 √( 19/(1 − 0.591²) ) = −3.1935
• In MINITAB this can be calculated as follows . . .
LET K1 = -0.591
LET K2 = K1 * SQRT(19/(1 - K1**2))
PRINT K2
Giving a t value of −3.19 with 19 degrees of freedom, the probability of
which can be found
CDF 3.19 K1;
T 19.
LET K1=1-K1
PRINT K1
Showing that the probability of t > 3.19 is 0.0024. So the probability of t > 3.19
or t < −3.19 is 2 × 0.0024 = 0.0048. This is strong evidence of a negative correlation
between Testosterone and SHBG.
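The conversion from r to t (equation 5.2) can also be checked directly in Python (an illustration, not part of the Minitab session):

```python
import math

# Converting the correlation coefficient to Student's t (equation 5.2)
# for the protein example: r = -0.591 with n = 21 data pairs.
r, n = -0.591, 21
t = r * math.sqrt((n - 2) / (1 - r ** 2))   # about -3.19, on n - 2 = 19 df
```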
5.4 Spearman's Rank Correlation Coefficient - rs
This coefficient, like other methods based on ranks, does not depend on assumptions about normal distributions. As with Pearson’s product-moment correlation,
a value of +1 corresponds to perfect correlation between x and y, a value of 0
corresponds to no correlation and a value of −1 corresponds to a perfect negative
correlation (y ↓ as x ↑). However, what is meant by “perfect correlation” is not the
same for different coefficients. For ranking coefficients, it means that the ranking
of individuals is the same for both criteria, i.e, concordance of ranking.
Spearman’s Rank Correlation Coefficient may be used for ordinal or interval
scale data.
5.4.1 Procedure

To compute rs, make a list of the n subjects. Next to each subject's entry, enter
his rank for the x variable and his rank for the y variable. Determine the
difference between the two ranks (di) for each subject. Square each di, then calculate
Σdi².

rs = 1 − (6 × Σdi²) / (n × (n² − 1))
Table 3 gives critical values of rs for n up to 12. For n > 12 use the following
approximation to the t-distribution with n − 2 degrees of freedom. Do not forget
to double the probability for a two-tailed test.

t = rs √( (n − 2)/(1 − rs²) )
Calculating Spearman's rs with MINITAB

READ C1 C2     Enter data in C1 and C2
RANK C1 C3
RANK C2 C4
CORR C3 C4
Rank Correlation Example
Rarity of doctors in ith district (yi ) and the number of days lost due to illness (xi )
in that area.
H0: These two factors are not associated
H1: The two factors are associated. Two-tailed test

Area   Score xi   Score yi   Rank xi   Rank yi    di    di²
 1       160        59          1         2       −1     1
 2       165        54          2         1       +1     1
 3       169        64          3         3        0     0
 4       175        66          4         4        0     0
 5       180        85          5         6       −1     1
 6       186        78          6         5       +1     1
                                              Σdi² = 4
rs = 1 − (6 × 4)/(6(36 − 1)) = 0.886
From Table 3 the probability of rs = 0.886 (n = 6) is exactly 0.05.
Conclude: reject H0 at p = 0.05. Lack of doctors and incidence of illness are
connected.
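The ranking and the coefficient can be reproduced in a short Python sketch (an illustration; there are no tied scores in this example, so a simple ranking function suffices):

```python
# Spearman's rank correlation for the doctors/illness example above.
x = [160, 165, 169, 175, 180, 186]   # days lost due to illness
y = [59, 54, 64, 66, 85, 78]         # rarity of doctors

def ranks(v):
    ordered = sorted(v)
    return [ordered.index(val) + 1 for val in v]   # no ties in these data

d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))   # sum of di**2 = 4
n = len(x)
rs = 1 - 6 * d2 / (n * (n ** 2 - 1))   # 0.886, as in the worked example
```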
5.5 Linear Regression Analysis
As discussed in the introduction, variables such as age, time, drug dose etc. are
independent variables. By convention these independent variables are plotted on
the horizontal, or x-axis, of a plot and are thus known as x variables. The dependent
variables, those which may be related to the independent variables, are plotted on
the vertical or y-axis, and are known as y variables.
In regression the x variable is not usually normally distributed. Regression
analysis does, however, assume a normal distribution for y for any given value of
x; moreover the variance of y should be independent of x. For example, the scatter
about the fitted line should not increase as x increases.
This assumption, that the variance of the distribution of y for any given x,
does not change over the range of x, is frequently invalid, especially when y is
a biochemical variable such as a hormone concentration. You should always plot
your data and look at it, before doing a regression. This allows you to assess visually whether there may indeed be an association between the variables, and if
so, whether it is in the form of a straight line or a curve. It also lets you check
that the scatter of y is not related to x. The computer will always give you an
answer, but it may be meaningless! If the relationship is non-linear or if the variance of y increases with x, a common situation, it is usually possible to transform
one or both variables before carrying out regression analysis. An example of data
transformation is given in Example 2.
After a model is fitted the observed points will not necessarily lie on the fitted
line. The scatter about the line will depend on how good the model is. If we draw
a vertical line from each observed point to the fitted line, square all these distances
and add them up we obtain the residual sum of squares. The expression “fitting
a line to the observed points” means the process of finding estimates of the slope
and intercept that result in a calculated line which fits the observations “best”. By
“best” we mean the estimates which minimise the residual sum of squares. Hence
this method is called the method of least squares.
• Select program SCATTER from the ANOVA and regression menu
• Press Enter to accept the default data set (Protein)
• A scatter plot of protein level against Gestation period is
shown.
• Press Enter to Display regression model. A
horizontal red line represents the currently fitted regression
model. The equation of this model is shown above the plot.
Initially this is simply y = 0.5, i.e. y = 0.5 for all values of
x.
• Press R. A vertical line is drawn from each data point to the
fitted line. The sum of all these distances squared (ss) is
shown next to the equation and this represents the mismatch between the data and the model, that is the residual
sum of squares. The best possible model will be such that
these residuals are as low as possible. By pressing the four
arrow keys you can move and rotate the fitted line. At the
same time the equation for the current line and the residual
sum of squares will be shown. Try and find the optimum
fit, that is the fit with the lowest possible residual sum of
squares.
• When you think you have done this make a note of your
fitted equation, then check your answer by pressing F for
the best fit model calculated mathematically as below.
5.5.1 Regression by hand
To perform a regression analysis on paper first calculate the quantities x̄, ȳ, ssx,
ssy and sxy (equations 5.1)
Calculate the slope (b) and the intercept (a):

b = sxy / ssx
a = ȳ − b x̄
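These two formulas can be sketched in Python; the (x, y) values below are invented purely for illustration:

```python
# Least-squares slope and intercept from the summary quantities
# ssx and sxy, using made-up data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
ssx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = sxy / ssx          # slope
a = ybar - b * xbar    # intercept, forcing the line through (xbar, ybar)
```

Note the second formula guarantees the fitted line passes through the point (x̄, ȳ).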
If you need to test whether your model explains a significant proportion of the
total variation in y the best approach is to partition the sum of squares of y into
two parts: the part which is explained by the model and the part which is not, i.e.
the residual variation about the line. These sums of squares can be laid out in
an analysis of variance table. An F -test between the residual and the regression
variances will tell you whether the fitted model helps to account for a significant
proportion of the variation in y. See the example below. In some cases you know
beforehand that there is a good relationship between x and y and all you need to
know is the value of the slope and intercept. In this case an analysis of variance
would not be appropriate.
Total sum of squares for y (ssy) has n − 1 degrees of freedom.

The regression sum of squares is given by

rss = sxy² / ssx

with 1 degree of freedom.

The residual sum of squares is best obtained by subtracting the regression sum
of squares from the total. This has n − 2 degrees of freedom.
Confidence limits for the parameters.

The standard error of b (seb) is:

seb = √( s²/ssx )

where s² is the residual variance.

95% confidence limits for b where n is large would be b ± 1.96 × seb.
Where n is less than, say, 30 use t with n − 2 degrees of freedom rather than
1.96.
The confidence limits for y at any value of x can be calculated as follows:

standard error of Y (seY) = s × √( 1/n + (X − x̄)²/Σ(x − x̄)² )

where Y is the value of y at a particular X and s² is the residual variance.

The 95% confidence limits for Y are then Y ± 1.96 × seY. Again where n < 30
use t with n − 2 degrees of freedom.

Note that these limits will be narrowest at the mean value of x and will get
wider the further X is from x̄. The confidence limits for the intercept (a) are the
confidence limits for y when x = 0.
5.5.2 Assumptions involved in linear regression by least squares
There are five assumptions involved in fitting a linear regression model by the
method of least squares. These are:
1. The relationship between x and y is linear.
2. The variability of the errors is constant for all values of x.
3. The errors are independent.
4. The errors are normally distributed.
5. The values of the explanatory variable (x) are measured without error.
These assumptions should all be considered before carrying out a regression
and the best way to assess the first four is by looking at a scatter plot of the data.
Notice that three of the assumptions involve the errors (or residuals). A plot of the
residuals against x (or often the fitted values for y) is known as a residual plot.
This is a very useful way to spot patterns in the residuals which indicate violation
of the above assumptions. Let’s consider each of these assumptions in turn.
The relationship between x and y is linear This is the most obvious assumption; note, however, that we are only assuming that the relationship is linear over
the range of our data. We have no information on the relationship outside this range
and it is therefore unwise to extrapolate to values outside the range of our data.
Enter MINITAB and retrieve the worksheet nonlin. Plot C1 against C2.
retr ’nonlin’
plot c1*c2
You will be able to see a certain amount of curvature in the scatter plot. This is
made clearer if the residuals are plotted against C2. The regr command carries
out a linear regression and stores the residuals (vertical deviations from the fitted
line) in a column called resid. The fitted values are stored in ’fits’. resid
is then plotted against C2.
65
name c3 ’resid’ c4 ’fits’
regr c1 1 c2 ’resid’ ’fits’
plot ’resid’*c2
The residuals are predominantly negative for extreme values of x. The residual plot
highlights this pattern by removing the trend due to the regression line. Clearly a
linear model is not the best in this case. The solution might be to fit a more complex
non-linear model or to transform one or both of the variables by taking logs, square
roots etc. which often has the effect of linearising the relationship. An example of
transformation is given in example 2.
The variability of the errors is constant for all values of x. Open the MINITAB
worksheet 'noncon' and plot C1 against C2, as above. Notice how the spread of the
data points increases as the value of x increases. This is a common phenomenon,
large measurements tend to have more error. A log transformation of y will usually
make the residuals more uniform with respect to x.
The errors are independent. This means that the value of a particular residual
has no effect on the value of the next residual, or on any other residual. This
assumption is frequently violated where a particular item is being monitored in
time. If the measurement at time t has a particularly large value then the value at
time t + 1 will also tend to be large, because the value at time t is the starting
point for the subsequent changes.
Plot C1 against C2 from the MINITAB worksheet ’nonind’. Notice how the
measurements seem to wander from one side of the fitted line to the other. The
errors here are not independent. It has been shown that correlated errors (autocorrelation,
as this is known) do not affect the estimates of the parameters of the linear
model, but they do lead to underestimates of the residual error.
This means that calculations of confidence limits or statistical comparisons of
lines will not be valid. See example 5.5.4 for more discussion of autocorrelation.
The errors are normally distributed. This assumption is not usually a problem.
The values of the explanatory variable (x) are measured without error. This
assumption is not one that we can check by examining the data.
5.5.3 Regression Example 1
The table below summarises data on the level of a protein in expectant mothers
throughout pregnancy.

 n    Time into pregnancy   Protein level   x²     y²       xy
      (weeks)               (mg ml−1)
 1    11                    0.38            121    0.1444   4.18
 2    13                    0.51            169    0.2601   6.63
 3    17                    0.58            289    0.3364   9.86
 4    19                    0.84            361    0.7056   15.96
 5    22                    0.78            484    0.6084   17.16
 6    27                    0.65            729    0.4225   17.55
 7    29                    0.83            841    0.6889   24.07
 8    31                    0.84            961    0.7056   26.04
 9    34                    0.92            1156   0.8464   31.28
 10   36                    0.92            1296   0.8464   33.12
 Σ    239                   7.25            6407   5.5647   185.85
[Character plot: protein level (mg ml−1, vertical axis, 0.40–0.80+) against time into pregnancy (weeks, horizontal axis, 10.0–35.0), showing a roughly linear increasing trend.]
Hand calculations

x̄ = 239/10 = 23.9
ȳ = 7.25/10 = 0.725

ssx = Σx² − (Σx)²/n = 6407 − 239²/10 = 694.9
ssy = Σy² − (Σy)²/n = 5.5647 − 7.25²/10 = 0.30845
sxy = Σxy − (Σx)(Σy)/n = 185.85 − (239 × 7.25)/10 = 12.575
rss = sxy²/ssx = 12.575²/694.9 = 0.227559
residss = ssy − rss = 0.30845 − 0.22756 = 0.080891

Here rss is the regression sum of squares and residss is the residual sum of
squares.
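The hand calculations can be checked programmatically. The following sketch, assuming Python with numpy rather than MINITAB, reproduces the sums of squares, slope and intercept from the data in the table:

```python
import numpy as np

# Time (weeks) and protein level (mg/ml) from the table in this example.
x = np.array([11, 13, 17, 19, 22, 27, 29, 31, 34, 36], dtype=float)
y = np.array([0.38, 0.51, 0.58, 0.84, 0.78, 0.65,
              0.83, 0.84, 0.92, 0.92])
n = len(x)

ssx = (x**2).sum() - x.sum()**2 / n        # corrected sum of squares of x
ssy = (y**2).sum() - y.sum()**2 / n        # corrected sum of squares of y
sxy = (x*y).sum() - x.sum()*y.sum() / n    # corrected sum of products
rss = sxy**2 / ssx                         # regression sum of squares
residss = ssy - rss                        # residual sum of squares

b = sxy / ssx                              # slope
a = y.mean() - b * x.mean()                # intercept

print(round(ssx, 1), round(sxy, 3), round(b, 4), round(a, 4))
# rss/ssy is the coefficient of determination, about 0.74 here (74%).
```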
Analysis of variance

              df   SS         variance   F       Prob > F
 Regression   1    0.227559   0.22756    22.51   0.0015
 Residual     8    0.080891   0.01011
 Total        9    0.30845
If time did nothing to explain protein levels we would expect the regression variance and the residual variance to be estimates of the same quantity and thus F to
equal 1. A higher value may be a chance event or it might indicate that time does
help to predict protein levels.
If we look up the critical value for F at the 1% significance level with 1 and
8 degrees of freedom in tables we get a value of 11.26, smaller than our value.
Using tables we can only say that the chance of getting an F -ratio as high as this is
less than 0.01. Using a Statistics package we are given the exact probability of the
observed F -ratio which is 0.0015.
We conclude that a significant proportion of the variation in protein level is
explained by time into pregnancy. In fact if we express the regression sum of
squares as a percentage of the total we get a value of 100 × 0.227559/0.30845 =
74%. In other words 74% of the sum of squares of protein level is accounted
for by fitting the model. This is the coefficient of determination. As it happens
the coefficient of determination can also be obtained by squaring the correlation
coefficient between x and y and for that reason it is usually abbreviated to r 2 .
The parameters of the model, the slope (b) and the intercept (a) are obtained as
follows.
b = sxy/ssx
= 12.575/694.9
= 0.0181
a = ȳ − bx̄
= 0.725 − 0.0181 × 23.9
= 0.2925
Thus the equation describing the relationship between time (t) and protein level
(P ) is:
P = 0.2925 + 0.0181 × t
This tells us that an increase of one week in pregnancy is associated with an
average increase in protein level of 0.0181 mg ml−1.
Confidence limits for b (the slope) are found as follows, where s² is the residual
variance from the Analysis of Variance table:

seb = √(s²/ssx) = √(0.01011/694.9) = 0.0038
n is less than 30, so we need to multiply seb by t with n −2 degrees of freedom
to get 95% confidence limits. From a table of “Student’s t-distribution” you will
see that the value of t in the 0.05 column corresponding to 8 degrees of freedom
is 2.306. If we wanted, let’s say, 99% confidence limits we would look in the 0.01
column and multiply by t = 3.355.
95% confidence limits = b ± (0.0038 × 2.306)
b = 0.0181 ± 0.00876
Confidence limits for a (the intercept):

sea = √(s² × (1/n + x̄²/ssx)) = √(0.01011 × (1/10 + 23.9²/694.9)) = 0.0966
95% confidence limits = a ± 2.306 × 0.0966
a = 0.2925 ± 0.223
Regression with MINITAB
Enter MINITAB and retrieve the worksheet protein.
retr ’protein’
This worksheet contains the time into pregnancy and protein data
discussed above in columns C1 and C2. Plot C2 against C1 and
consider the five regression assumptions. No violations are obvious from the plot.
plot c2*c1
Model C2 by fitting the one variate C1.
name c3 ’resid’ c4 ’fits’
regr c2 1 c1 ’resid’ ’fits’
The ANOVA table is shown, including Probability > F . Parameter estimates with their standard errors are also shown.
To produce a high resolution plot showing the data and the fitted
line, use the %regplot macro.
%regplot c2 c1 ’fits’
Type help regr and help plot for a full explanation of
these commands.
5.5.4 Regression Example 2
Various compounds were run through a high-performance liquid chromatography
(HPLC) system and the time taken to drain through the column (‘capacity factor’,
K′) was recorded. It is thought that K′ might be related to the ‘partition coefficient’
(P ) of the compound. The data are given below.
 Compound                   Capacity factor   Partition coefficient
                            K′ (mins)         P
 p-aminophenol              1.59              1.1
 Anisole                    10.00             128.8
 Benzamide                  2.40              4.4
 Benzonitrile               7.08              36.3
 o-cresol                   6.17              91.2
 p-cresol                   6.17              87.1
 2,4-dimethylphenol         11.20             199.5
 Methyl-4-hydroxybenzoate   8.13              91.2
 Methylsalicylate           18.20             288.4
 Phenol                     3.47              28.8
 p-phenylphenol             38.02             1585.0
• Enter MINITAB, open the worksheet ’hplc’
RETR ’hplc’
Column C1 contains the ‘capacity factor’ (K′). C2 contains
the ‘partition coefficient’ (P ).
• Plot K′ against P . This plot looks rather odd, with one observation (p-phenylphenol) near the top right and all the others crammed into the bottom left. Not only are the values
for p-phenylphenol much larger than the rest but it does not
seem to be consistent with the relationship between K′ and
P for the other compounds.
• Fit the line and plot it with
name c3 ’res’ c4 ’fits’
regr c1 1 c2 ’res’ ’fits’
%regplot c1 c2 ’fits’
You can see that the fitted line is very strongly influenced by
the position of the ‘outlier’.
• When data are skewed in this fashion the log transformation
often provides a more useful scale of measurement. Log
transform both variables and replot the data.
let c1=log(c1)
let C2=log(c2)
plot c1*c2
The resulting plot is a well-behaved linear relationship between log10 K′ and log10 P . On this scale of measurement
p-phenylphenol is quite consistent with the other observations.
• Calculate the parameters of the transformed regression
model and write down the model you have fitted in terms
of K′ and P .
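As an illustration outside MINITAB, the transformed fit can be sketched in Python with numpy (the K′ and P values are those from the table above):

```python
import numpy as np

# HPLC data from the table: capacity factor K (mins) and partition
# coefficient P for the eleven compounds.
K = np.array([1.59, 10.00, 2.40, 7.08, 6.17, 6.17,
              11.20, 8.13, 18.20, 3.47, 38.02])
P = np.array([1.1, 128.8, 4.4, 36.3, 91.2, 87.1,
              199.5, 91.2, 288.4, 28.8, 1585.0])

logK, logP = np.log10(K), np.log10(P)

# Straight line on the log-log scale: log10(K) = a + b*log10(P),
# equivalently K = 10**a * P**b on the original scale.
b, a = np.polyfit(logP, logK, 1)
r = np.corrcoef(logP, logK)[0, 1]   # correlation on the log scale
print(round(b, 2), round(r, 2))
```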
5.5.5 Regression Example 3
• Enter MINITAB and open the ’nonind’ worksheet.
• Fit a line and look at the plot.
name c3 ’res’ c4 ’fits’
regr c2 1 c1 ’res’ ’fits’
%regplot c2 c1 ’fits’
You have looked at this example before. It should be clear that the errors are not
independent. MINITAB has a useful command for checking non-independence of
residuals (autocorrelation). If the residuals are stored and the nth residual is plotted
against the (n + 1)th residual you would expect a random scatter of points unless a
particular error tends to be similar to the previous one, in which case most of the
points will be in the upper right and lower left quadrants.
• The residuals are stored in column ’res’. Create two columns in which the
residuals are out of phase by one. Plot these against each other.
lag ’res’ c5
plot ’res’*c5
Notice how most of the residuals are in the top right or bottom left quadrant. This is
because a high residual tends to be followed by another high residual and a low one
tends to be followed by a low one. The autocorrelation function (ACF) measures
the correlation between lagged residuals. In this case we have looked at the ACF
for lag one. The MINITAB command ACF calculates the ACF for lags going from
1 upwards and plots these against lag number.
• Make an autocorrelation plot (correlogram).
acf c3
In this case there are autocorrelations at lag 1 and negative autocorrelations at lag 8.
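The lag-one calculation itself is simple enough to do by hand. The following plain-Python function (an illustration, not a MINITAB command) computes the lag-1 autocorrelation of a series:

```python
# Lag-1 autocorrelation: correlate each value with its successor,
# dividing by the overall sum of squared deviations.
def acf1(x):
    n = len(x)
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t + 1] - mean) for t in range(n - 1))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

# A steadily trending series: successive values are similar, so the
# lag-1 autocorrelation is positive.
print(acf1([1.0, 2.0, 3.0, 4.0, 5.0]))
# An alternating series gives a negative lag-1 autocorrelation.
print(acf1([1.0, -1.0, 1.0, -1.0, 1.0]))
```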
As noted previously, the presence of autocorrelation does not mean that your
parameter estimates will be biased but it does mean that your estimates of the
residual variance will be wrong. Hence any calculations, such as confidence limits,
which depend on the residual variance can not be carried out. The analysis of data
sets with strong autocorrelation is outside the scope of this course but you should
at least be aware if your data shows this pattern.
5.5.6 Exercise
The following data were obtained for the ability of liver slices from guinea pigs of
different ages to conjugate phenolphthalein with glucuronic acid.
Check the regression assumptions and calculate the equation of the regression
line quoting 95% confidence limits for each parameter.
 age (days)   moles conjugated
 0            8.98
 12           8.14
 29           6.67
 43           6.08
 63           5.83
 76           4.68
 85           4.20
 93           3.72
Chapter 6
Analysis of Counts in Contingency Tables

6.1 Introduction
We are often interested in knowing whether there is any association between attributes
which are categorical rather than continuous. For example we might need to know
whether there is an association between eye colour and hair colour in a sample of
people. If we have four eye colours and four hair colours we can form a 4 × 4 table
(16 cells) and count the number of people who fall into each cell. Such a table is
known as a contingency table.
If there were an association between hair colour and eye colour subjects would
tend to occur in particular cells of the table. For example, more subjects than
expected might occur in the fair hair–blue eye category. If there were no association
we would expect the proportion of subjects of various eye colours to be the same
for each hair colour, apart from chance differences of course.
A statistic called Chi-Squared (χ 2 ), which measures the degree of association
between the two categorical variables, can be derived from the contingency table.
6.2 Example
In a study, 290 people were selected at random and asked about their preferences
amongst various types of tablet. Table 6.1 summarises part of the information and
relates the age of the interviewees to their preferred colour of tablets from the range
pink, orange and white. On the basis of these data, can it be concluded that there is
a link between age and colour preference in the population as a whole?
The first step is to form the column and row totals and to use these to calculate
the counts in the body of the table which we would expect if there were no association between age and colour preference. These expected values correspond to the
null hypothesis.
Table 6.1: Contingency table. Relationship between age and tablet colour preference. Expected values in brackets.

                        Colour
 Age group   pink        orange      white       Σ
 18–35       26 (16.6)   40 (42.2)   32 (39.2)   98
 36–60       14 (20.3)   57 (51.7)   49 (48.0)   120
 >60         9 (12.2)    28 (31.0)   35 (28.8)   72
 Σ           49          125         116         290
Consider the 49 subjects who preferred pink tablets (first column of table 6.1).
How do we expect these to be distributed between the age groups under the null
hypothesis? The totals for the three age groups are 98, 120, 72 so the proportion
in row one is 98/290. The total for column one is 49 so we would expect 49 ×
98/290 = 16.6 18–35 year olds to prefer pink tablets.
The rule for finding the expected value for any cell is to multiply its row total
by its column total and divide by the grand total.
We can calculate expected values for all cells and add them to the contingency
table.
Chi-squared (χ²) is then calculated using the following formula, where O is the
observed count and E is the expected count:

χ² = Σ (O − E)²/E
In the case of a contingency table the number of degrees of freedom is the
product of (n rows −1) and (n columns −1). Thus in table 6.1 there are 3 rows and
3 columns, therefore there are 4 degrees of freedom.
Another way to look at this is to consider the number of independent pieces of
information in the body of the contingency table. Given that the row and column
totals are fixed, how many counts can we fill in before the remainder are determined?
In a 3 × 3 table we can only fill in four counts before the rest of the table is
determined, hence four degrees of freedom.
The probability of a χ 2 value of 11.78 with 4 degrees of freedom is 0.019. So
we can reject the null hypothesis.
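The expected-value rule and the χ² sum for Table 6.1 can be reproduced with a short sketch in plain Python (an illustration; the observed counts are those of Table 6.1):

```python
# Observed counts from Table 6.1. Rows are age groups (18-35, 36-60, >60);
# columns are colours (pink, orange, white).
observed = [[26, 40, 32],
            [14, 57, 49],
            [9, 28, 35]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# Expected count = row total * column total / grand total.
expected = [[r * c / grand for c in col_totals] for r in row_totals]

# Chi-squared = sum over all cells of (O - E)^2 / E.
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(expected[0][0], 1), round(chi2, 2), df)
```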
In interpreting an analysis like this one, which gives a significant association,
it is helpful to examine the individual contributions to χ² from each cell of the
table to see which counts differ most from the expected.

 O    E      Contribution to χ²
 26   16.6   5.38
 40   42.2   0.12
 32   39.2   1.32
 14   20.3   1.94
 57   51.7   0.54
 49   48.0   0.02
 9    12.2   0.82
 28   31.0   0.30
 35   28.8   1.33
 Σ           11.78 = χ²

Values over 1 should be examined. In this case there are four cells with values over 1. Row 1, column 1
has a value of 5.38. This indicates that in the 18–35 age group more than expected
preferred pink tablets. Similarly, fewer of this age group than expected preferred
white (χ 2 =1.32). The 36–60 age group showed the opposite effect, fewer than
expected preferred pink (χ 2 =1.94); and amongst the over 60s more than expected
preferred white.
Note. χ²-analysis must only be carried out on actual counts, not on percentages,
mean values or other derived statistics.
6.3 Small expected values
Special care should be taken if one or more of the expected values is less than 5.
The following precautions are recommended.
See the next section for 2 × 2 tables. For tables with df larger than 1, χ 2 can
be used if fewer than 20% of the cells have expected frequencies less than 5 and if
no cell has an expected frequency of less than 1.
If these requirements are not met you must combine rows or columns to increase the expected frequencies.
6.4 2 × 2 Contingency Tables
It is common in pharmacy and pharmacology to investigate the effect of a drug by
recording whether it produces a particular side effect in two groups of subjects, a
drug treated group and a control group.
In this instance a 2 × 2 contingency table is employed.

                     Presence of   Absence of
                     side effect   side effect   Σ
 Presence of drug    a             b             a+b
 Absence of drug     c             d             c+d
 Σ                   a+c           b+d           n
χ² should be calculated using the following formula:

χ² = n(|ad − bc| − n/2)² / [(a + b)(c + d)(b + d)(a + c)]        (6.1)

Note: |ad − bc| means ‘the absolute value of ad − bc’.
This formula has the advantage that it contains a correction for continuity
(Yates correction) without which 2 × 2 tables can give values of χ² which are
too large. Check that none of the expected values is less than 5. If they are,
Fisher’s exact test must be used (see a suitable text book).
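Formula (6.1) is easy to compute directly. The following sketch, in plain Python, implements the Yates-corrected statistic and reproduces the aspirin example of the next section:

```python
# Yates-corrected chi-squared for a 2x2 table with cells
# a, b (top row) and c, d (bottom row), as in formula (6.1).
def yates_chi2(a, b, c, d):
    n = a + b + c + d
    num = n * (abs(a * d - b * c) - n / 2) ** 2
    den = (a + b) * (c + d) * (b + d) * (a + c)
    return num / den

# The aspirin example: a=5, b=30 (control), c=15, d=12 (aspirin).
print(round(yates_chi2(5, 30, 15, 12), 2))   # prints 10.07
```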
6.4.1 Example of 2 × 2 analysis
            Gastric      No Gastric
            Irritation   Irritation   Σ
 Control    5            30           35
 Aspirin    15           12           27
 Σ          20           42           62
Is aspirin associated with increased gastric irritation? The data suggest that this
might be the case but is this statistically significant?
χ² = 62(|60 − 450| − 31)² / (35 × 27 × 42 × 20) = 10.07,  df = 1,  p = 0.0015
Thus, aspirin is associated with increased gastric irritation.
6.5 Exercise 1
Calculate χ² for the following data on the effectiveness of inoculation against
cholera. Does inoculation give protection against cholera? Use MINITAB (see
box below).
                Attacked   Unattacked   Σ
 Inoculated     11         89           100
 Uninoculated   21         79           100
 Σ              32         168          200
Analysis of Contingency table with MINITAB
For a contingency table with three columns and two rows READ
data into C1. . . C3. C1 contains data for the first column of the
contingency table and so on.
READ C1 C2 C3
CHIS C1 C2 C3
MINITAB for Windows gives the calculated χ² and its probability. On the UNIX machines you can calculate the probability as
follows . . .
CDF n K1;
CHIS d.
LET K1=1-K1
PRINT K1
Where n is the value of χ 2 and d is the degrees of freedom.
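For one degree of freedom the tail probability can also be computed without a statistics package, since χ² with 1 df is the square of a standard normal variable, so P(X > x) = erfc(√(x/2)). A sketch in plain Python:

```python
import math

# Upper-tail probability of chi-squared with one degree of freedom.
# X = Z^2 for standard normal Z, so P(X > x) = P(|Z| > sqrt(x))
#                                           = erfc(sqrt(x/2)).
def chi2_pvalue_df1(x):
    return math.erfc(math.sqrt(x / 2.0))

# Reproduces the p-value quoted for the aspirin example (chi2 = 10.07).
print(round(chi2_pvalue_df1(10.07), 4))
```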
Chapter 7
Statistical Tables

7.1 Table 1. Critical values for the Wilcoxon signed-rank test
      Two-tailed test   One-tailed test
 N    5%    1%          5%    1%
 5    .     .           0     .
 6    0     .           2     .
 7    2     .           3     0
 8    3     0           5     1
 9    5     1           8     3
 10   8     3           10    5
 11   10    5           13    7
 12   13    7           17    9
 13   17    9           21    12
 14   21    12          25    15
 15   25    15          30    19
 16   29    19          35    23
 17   34    23          41    27
 18   40    27          47    32
 19   46    32          53    37
 20   52    37          60    43
7.2 Table 2a. Critical values for the Mann-Whitney U test. For two-tailed test. 5% significance level.

       N2
 N1    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
  2    .    .    .    0    0    0    0    1    1    1    1    1    2    2    2    2
  3    0    1    1    2    2    3    3    4    4    5    5    6    6    7    7    8
  4    0    1    2    3    4    4    5    6    7    9   10   11   11   12   13   14
  5    2    3    5    6    7    8    9   11   12   13   14   15   17   18   19   20
  6    .    5    6    8   10   11   13   14   16   17   19   21   22   24   25   27
  7    .    .    8   10   12   14   16   18   20   22   24   26   28   30   32   34
  8    .    .    .   13   15   17   19   22   24   26   29   31   34   36   38   41
  9    .    .    .    .   17   20   23   26   28   31   34   37   39   42   45   48
 10    .    .    .    .    .   23   26   29   33   36   39   42   45   48   52   55
 11    .    .    .    .    .    .   30   33   37   40   44   47   51   55   58   62
 12    .    .    .    .    .    .    .   37   41   45   49   53   57   61   65   69
 13    .    .    .    .    .    .    .    .   45   50   54   59   63   67   72   76
 14    .    .    .    .    .    .    .    .    .   55   59   64   69   74   78   83
 15    .    .    .    .    .    .    .    .    .    .   64   70   75   80   85   90
 16    .    .    .    .    .    .    .    .    .    .    .   75   81   86   92   98
 17    .    .    .    .    .    .    .    .    .    .    .    .   87   93   99  105
 18    .    .    .    .    .    .    .    .    .    .    .    .    .   99  106  112
 19    .    .    .    .    .    .    .    .    .    .    .    .    .    .  113  119
 20    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  127
7.3 Table 2b. Critical values for the Mann-Whitney U test. For two-tailed test. 1% significance level.

       N2
 N1    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
  2    .    .    .    .    .    .    .    .    .    .    .    .    .    .    0    0
  3    .    .    .    .    0    0    0    1    1    1    2    2    2    2    3    3
  4    .    0    0    1    1    2    2    3    3    4    5    5    6    6    7    8
  5    0    1    1    2    3    4    5    6    7    7    8    9   10   11   12   13
  6    .    2    3    4    5    6    7    9   10   11   12   13   15   16   17   18
  7    .    .    4    6    7    9   10   12   13   15   16   18   19   21   22   24
  8    .    .    .    7    9   11   13   15   17   18   20   22   24   26   28   30
  9    .    .    .    .   11   13   16   18   20   22   24   27   29   31   33   36
 10    .    .    .    .    .   16   18   21   24   26   29   31   34   37   39   42
 11    .    .    .    .    .    .   21   24   27   30   33   36   39   42   45   48
 12    .    .    .    .    .    .    .   27   31   34   37   41   44   47   51   54
 13    .    .    .    .    .    .    .    .   34   38   42   45   49   53   57   60
 14    .    .    .    .    .    .    .    .    .   42   46   50   54   58   63   67
 15    .    .    .    .    .    .    .    .    .    .   51   55   60   64   69   73
 16    .    .    .    .    .    .    .    .    .    .    .   60   65   70   74   79
 17    .    .    .    .    .    .    .    .    .    .    .    .   70   75   81   86
 18    .    .    .    .    .    .    .    .    .    .    .    .    .   81   87   92
 19    .    .    .    .    .    .    .    .    .    .    .    .    .    .   93   99
 20    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  105
7.4 Table 2c. Critical values for the Mann-Whitney U test. For one-tailed test. 5% significance level.

       N2
 N1    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
  2    .    .    0    0    0    1    1    1    1    2    2    3    3    3    3    4    4    4
  3    0    0    1    2    2    3    4    4    5    5    6    7    7    8    9    9   10   11
  4    .    1    2    3    4    5    6    7    8    9   10   11   12   14   15   16   17   18
  5    .    .    4    5    6    8    9   11   12   13   15   16   18   19   20   22   23   25
  6    .    .    .    7    8   10   12   14   16   17   19   21   23   25   26   28   30   32
  7    .    .    .    .   11   13   15   17   19   21   24   26   28   30   33   35   37   39
  8    .    .    .    .    .   15   18   20   23   26   28   31   33   36   39   41   44   47
  9    .    .    .    .    .    .   21   24   27   30   33   36   39   42   45   48   51   54
 10    .    .    .    .    .    .    .   27   31   34   37   41   44   48   51   55   58   62
 11    .    .    .    .    .    .    .    .   34   38   42   46   50   54   57   61   65   69
 12    .    .    .    .    .    .    .    .    .   42   47   51   55   60   64   68   72   77
 13    .    .    .    .    .    .    .    .    .    .   51   56   61   65   70   75   80   84
 14    .    .    .    .    .    .    .    .    .    .    .   61   66   71   77   82   87   92
 15    .    .    .    .    .    .    .    .    .    .    .    .   72   77   83   88   94  100
 16    .    .    .    .    .    .    .    .    .    .    .    .    .   83   89   95  101  107
 17    .    .    .    .    .    .    .    .    .    .    .    .    .    .   96  102  109  115
 18    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  109  116  123
 19    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  123  130
 20    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  138
7.5 Table 2d. Critical values for the Mann-Whitney U test. For one-tailed test. 1% significance level.

       N2
 N1    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
  2    .    .    .    .    .    .    .    .    0    0    0    0    0    0    1    1
  3    .    .    0    0    1    1    1    2    2    2    3    3    4    4    4    5
  4    0    1    1    2    3    3    4    5    5    6    7    7    8    9    9   10
  5    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
  6    .    3    4    6    7    8    9   11   12   13   15   16   18   19   20   22
  7    .    .    6    7    9   11   12   14   16   17   19   21   23   24   26   28
  8    .    .    .    9   11   13   15   17   20   22   24   26   28   30   32   34
  9    .    .    .    .   14   16   18   21   23   26   28   31   33   36   38   40
 10    .    .    .    .    .   19   22   24   27   30   33   36   38   41   44   47
 11    .    .    .    .    .    .   25   28   31   34   37   41   44   47   50   53
 12    .    .    .    .    .    .    .   31   35   38   42   46   49   53   56   60
 13    .    .    .    .    .    .    .    .   39   43   47   51   55   59   63   67
 14    .    .    .    .    .    .    .    .    .   47   51   56   60   65   69   73
 15    .    .    .    .    .    .    .    .    .    .   56   61   66   70   75   80
 16    .    .    .    .    .    .    .    .    .    .    .   66   71   76   82   87
 17    .    .    .    .    .    .    .    .    .    .    .    .   77   82   88   93
 18    .    .    .    .    .    .    .    .    .    .    .    .    .   88   94  100
 19    .    .    .    .    .    .    .    .    .    .    .    .    .    .  101  107
 20    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .  114
7.6 Table 3. Critical values for Spearman’s rank correlation coefficient

      Two-tailed test    One-tailed test
 N    5%      1%         5%      1%
 4    .       .          1.000   .
 5    1.000   .          0.900   1.000
 6    0.886   1.000      0.827   0.943
 7    0.786   0.929      0.714   0.893
 8    0.738   0.881      0.643   0.833
 9    0.700   0.833      0.600   0.783
 10   0.648   0.794      0.564   0.745
 11   0.618   0.754      0.536   0.709
 12   0.587   0.727      0.503   0.678