Chapter 5 – The normal distribution
5.1 Probability distributions of continuous random variables
A random variable X is called continuous if it can assume any of the possible values in some
interval, i.e. the number of possible values is infinite. In this case the definition of a discrete
random variable (a list of possible values with their corresponding probabilities) cannot be
used, since with an infinite number of possible values it is not possible to draw up such a list.
For this reason probabilities associated with individual values of a continuous random
variable X are taken as 0.
The clustering pattern of the values of X over the possible values in the interval is described
by a mathematical function f(x) called the probability density function. A high (low)
clustering of values will result in high (low) values of this function. For a continuous random
variable X, only probabilities associated with ranges of values (e.g. an interval of values from
a to b) will be calculated. The probability that the value of X will fall between the values a
and b is given by the area between a and b under the curve describing the probability density
function f(x). For any probability density function the total area under the graph of f(x) is 1.
5.2 Normal distribution
A continuous random variable X is normally distributed (follows a normal distribution) if the
probability density function of X is given by
(x  )2
exp[
] for -∞ ≤ x ≤ ∞ .
2 2
2 2
1
f(x) =
The constants µ and σ can be shown to be the mean and standard deviation respectively of
X. These constants completely specify the density function. A graph of the curve describing
the probability density function (known as the normal curve) for the case µ = 0 and σ = 1 is
shown below.
[Graph of the standard normal distribution: a bell-shaped curve of p(z) against z over the range -4 to 4, with a maximum value of about 0.4 at z = 0.]
5.2.1 Properties of the normal distribution
The graph of the function defined above has a symmetric, bell-shaped appearance. The mean
µ is located on the horizontal axis where the graph reaches its maximum value. At the two
ends of the scale the curve describing the function gets closer and closer to the horizontal axis
without actually touching it. Many quantities measured in everyday life have a distribution
which closely matches that of a normal random variable (e.g. marks in an exam, weights of
products, heights of a male population). The parameter µ shows where the distribution is
centrally located and σ the spread of the values around µ. A short hand way of referring to a
random variable X which follows a normal distribution with mean µ and variance σ2 is by
writing X ~ N(µ, σ2). The next diagram shows graphs of normal distributions for various
values of μ and σ2.
An increase (decrease) in the mean µ results in a shift of the graph to the right (left) e.g. the
curve of the distribution with a mean of -2 is moved 2 units to the left. An increase
(decrease) in the standard deviation σ results in the graph becoming more (less) spread out
e.g. compare the curves of the distributions with σ2 = 0.2, 0.5, 1 and 5.
5.2.2 Empirical example – The normal distribution and the histogram
Consider the scores obtained by 4 500 candidates in a matric mathematics examination. The
histogram of the marks has an appearance that can be described by a normal curve i.e. it has a
symmetric, bell-shaped appearance. The mean of the marks is 59.95 and the standard
deviation 10. The histogram below shows the distribution of the marks.
[Histogram of the marks: frequencies (up to about 900) plotted against mark classes 15, 25, 35, 45, 55, 65, 75, 90 and More.]
5.3 The Standard Normal Distribution
To find probabilities for a normally distributed random variable, we need to be able to
calculate the areas under the graph of the normal distribution. Such areas are obtained from a
table showing the cumulative distribution of the normal distribution (see appendix). Since the
normal distribution is specified by the mean (µ) and standard deviation (σ), there are many
possible normal distributions that can occur. It will be impossible to construct a table for each
possible mean and standard deviation. This problem is overcome by transforming X the
normal random variable of interest [X ~ N(µ, σ2) ] to a standardized normal random variable
Z = (X − µ)/σ .
It can be shown that the transformed random variable Z ~ N(0, 1). The random variable Z
can be transformed back to X by using the formula
X = µ + σZ .
The normal distribution with mean µ = 0 and standard deviation σ = 1 is called the standard
normal distribution. The symbol Z is reserved for a random variable with this distribution.
The graph of the standard normal distribution appears below.
Various areas under the above normal curve are shown. The standard normal table gives the
area under the curve to the left of the value z. Other types of areas can be found by
combining several of the areas as shown in the next examples.
5.4 Calculating probabilities using the standard normal table
The first few lines of the standard normal table are shown below.
z    | 0.00   | 0.01   | 0.02   | 0.03   | 0.04   | 0.05   | 0.06   | 0.07   | 0.08   | 0.09
-3.7 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001
-3.6 | 0.0002 | 0.0002 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001
 .   |        |        |        |        |        |        |        |        |        |
-0.0 | 0.5000 | 0.4960 | 0.4920 | 0.4880 | 0.4840 | 0.4801 | 0.4761 | 0.4721 | 0.4681 | 0.4641
 0.1 | 0.5398 | 0.5438 | 0.5478 | 0.5517 | 0.5557 | 0.5596 | 0.5636 | 0.5675 | 0.5714 | 0.5753
 .   |        |        |        |        |        |        |        |        |        |
 3.7 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999 | 0.9999
The areas shown in the table are those under the standard normal curve to the left of the value
of z looked up, i.e. P(Z ≤ z); e.g. P(Z ≤ 0.14) = 0.5557.
Note
1 For negative values of z less than the minimum value (-3.79) in the table, the probabilities
are taken as 0 i.e.
P(Z ≤ z) = 0 for z < -3.79.
2 For positive values of z greater than the maximum value (3.79) in the table, the
probabilities are taken as 1 i.e.
P(Z ≤ z) = 1 for z > 3.79.
Examples In all the examples that follow, Z ~ N(0, 1).
1 P(Z < 1.35) = 0.9115
2 P(Z > -0.47) = 1 - P(Z ≤ -0.47) = 1-0.3192 = 0.6808
3 P(-0.47 < Z < 1.35) = P(Z < 1.35) – P(Z < -0.47) = 0.9115-0.3192 = 0.5923
4 P(Z > 0.76) = 1 –P(Z < 0.76) = 1 – 0.7764 = 0.2236
5 P(0.95 ≤ Z ≤ 1.36) = P(Z ≤ 1.36) – P(Z ≤ 0.95) = 0.9131 – 0.8289 = 0.0842
6 P(-1.96 ≤ Z ≤ 1.96) = P(Z ≤ 1.96) – P(Z ≤ -1.96) = 0.9750 – 0.0250 = 0.95
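These table lookups can be reproduced with Python's standard library: `statistics.NormalDist()` is the standard normal distribution, and its `cdf` method gives P(Z ≤ z). A quick check of examples 1–3 and 6 (a sketch, not part of the original notes):

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

p1 = Z.cdf(1.35)                    # P(Z < 1.35)
p2 = 1 - Z.cdf(-0.47)               # P(Z > -0.47)
p3 = Z.cdf(1.35) - Z.cdf(-0.47)     # P(-0.47 < Z < 1.35)
p6 = Z.cdf(1.96) - Z.cdf(-1.96)     # P(-1.96 <= Z <= 1.96)
```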
In all the above examples an area was found for a given value of z. It is also possible to find a
value of z when the area to its left is given. This can be written as P(Z ≤ zα) = α (α is the
Greek letter for a and is pronounced “alpha”). In this case zα has to be found, where α is the
area to its left.
Examples
1 Find the value of z that has an area of 0.0344 to its left.
Search the body of the table for the required area (0.0344) and then read off the value of z
corresponding to this area. In this case z0.0344 = -1.82.
2 Find the value of z that has an area of 0.975 to its left.
Finding 0.975 in the body of the table and reading off the z value gives z0.975 = 1.96.
3 Find the values of z that have areas of 0.95 and 0.05 to their left.
When searching the body of the table for 0.95 this value is not found. The z value
corresponding to 0.95 can be estimated from the following information obtained from the
table.
z    | area to left
1.64 | 0.9495
?    | 0.95
1.65 | 0.9505
Since the required area (0.95) is halfway between the 2 areas obtained from the table, the
required z can be taken as the value halfway between the two z values that were obtained
from the table, i.e. z = (1.64 + 1.65)/2 = 1.645.
Exercise: Using the same approach as above, verify that the z value corresponding to an area
of 0.05 to its left is -1.645.
At the bottom of the standard normal table selected percentiles zα are given for different
values of α. This means that the area under the normal curve to the left of zα is α.
Examples: 1 α = 0.900, zα = 1.282 means P(Z < 1.282) = 0.900.
2 α = 0.995, zα = 2.576 means P(Z < 2.576) = 0.995.
3 α = 0.005, zα = -2.576 means P(Z < -2.576) = 0.005.
The standard normal distribution is symmetric with respect to the mean = 0. From this it
follows that the area under the normal curve to the right of a positive z entry in the standard
normal table is the same as the area to the left of the associated negative entry (-z) i.e.
P(Z ≥ z) = P(Z ≤ -z) .
E.g. P(Z ≥ 1.96) = 1 – 0.975 = 0.025 = P(Z ≤ -1.96).
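The reverse lookups in this section correspond to `NormalDist.inv_cdf`, which returns the value zα with a given area α to its left (no interpolation needed). A sketch, not part of the original notes:

```python
from statistics import NormalDist

Z = NormalDist()

z1 = Z.inv_cdf(0.0344)   # value with area 0.0344 to its left, about -1.82
z2 = Z.inv_cdf(0.975)    # about 1.96
z3 = Z.inv_cdf(0.95)     # about 1.645
z4 = Z.inv_cdf(0.05)     # about -1.645, by symmetry

# Symmetry check: P(Z >= 1.96) = P(Z <= -1.96)
tail = 1 - Z.cdf(1.96)
```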
5.5 Calculating probabilities for any normal random variable
Let X be a N(μ, σ2) random variable and Z a N(0, 1) random variable. Then
1 P(X ≤ x) = P( (X − µ)/σ ≤ (x − µ)/σ ) = P(Z ≤ (x − µ)/σ ).
2 P(a ≤ X ≤ b) = P( (a − µ)/σ ≤ (X − µ)/σ ≤ (b − µ)/σ ) = P( (a − µ)/σ ≤ Z ≤ (b − µ)/σ ).
Examples:
1 The height H (in inches) of a population of women is approximately normally distributed
with a mean of µ = 63.5 and a standard deviation of σ = 2.75 inches. To calculate the
probability that a woman is less than 63 inches tall, we first find the z-score for 63 inches
z = (63 − 63.5)/2.75 = −0.18
and then use P(H ≤ 63) = P(Z ≤ −0.18) = 0.4286.
This means that 42.86% (a proportion of 0.4286) of women are less than 63 inches tall.
2 The length X (inches) of sardines is a N(4.62, 0.0529) random variable, so σ = √0.0529 = 0.23. What proportion
of sardines is
(a) longer than 5 inches? (b) between 4.35 and 4.85 inches?
(a) P(X > 5) = P(Z > (5 − 4.62)/0.23) = P(Z > 1.65) = 1 − P(Z ≤ 1.65) = 1 − 0.9505 = 0.0495.
(b) P(4.35 ≤ X ≤ 4.85) = P( (4.35 − 4.62)/0.23 ≤ Z ≤ (4.85 − 4.62)/0.23 ) = P(−1.17 ≤ Z ≤ 1)
= P(Z ≤ 1) − P(Z ≤ −1.17)
= 0.8413 − 0.1210 = 0.7203.
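Both examples can be checked with `statistics.NormalDist`, which does the standardization internally; the small differences from the answers above come from rounding the z-scores to two decimals when using the table. A sketch, not part of the original notes:

```python
from statistics import NormalDist

# Women's heights: H ~ N(63.5, 2.75^2) (inches)
heights = NormalDist(mu=63.5, sigma=2.75)
p_below_63 = heights.cdf(63)            # close to the table-based 0.4286

# Sardine lengths: X ~ N(4.62, 0.0529), so sigma = sqrt(0.0529) = 0.23
lengths = NormalDist(mu=4.62, sigma=0.23)
p_longer_5 = 1 - lengths.cdf(5)                     # close to 0.0495
p_between = lengths.cdf(4.85) - lengths.cdf(4.35)   # close to 0.7203
```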
5.6 Finding percentiles by using the standard normal table
The standard normal table can be used to find percentiles for random variables which are
normally distributed.
Example
The scores S obtained in a mathematics entrance examination are normally distributed with
µ = 514 and σ = 113. Find the score which marks the 80th percentile. From the standard
normal table, the z-score which is closest to an entry of 0.80 in the body of the table is 0.84
(the actual area to its left is 0.7995). The score which corresponds to a z-score of 0.84 can be
found by solving 0.84 = (s − 514)/113 for s. This yields s = 608.92, i.e. a score of approximately
609 is better than 80% of all other exam scores.
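The same percentile can be computed directly with `NormalDist.inv_cdf`; the exact z-multiplier 0.8416 gives a slightly larger answer than the table-rounded 0.84 used above, but it still rounds to 609. A sketch, not part of the original notes:

```python
from statistics import NormalDist

# Exam scores: S ~ N(514, 113^2)
scores = NormalDist(mu=514, sigma=113)

# 80th percentile: the score with area 0.80 to its left
p80 = scores.inv_cdf(0.80)   # approximately 609
```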
Exercises: All these exercises refer to the normal distribution above.
1 Find P35 .
2 If a person scores in the top 5% of test scores, what is the minimum score they could have
received?
3 If a person scores in the bottom 10% of test scores, what is the maximum score they could
have received?
5.7 Computer output
Excel has built-in functions that can be used to find the area under the normal curve to the
left of a given z-score, or to calculate the z-score that has a given area under the normal curve
to its left.
1 The table below shows areas under the standard normal curve to the left of various z-scores.

z-score | -2.5   | -2     | -1.5   | -1     | -0.5   | 0   | 0.5    | 1      | 1.5    | 2      | 2.5
area    | 0.0062 | 0.0228 | 0.0668 | 0.1587 | 0.3085 | 0.5 | 0.6915 | 0.8413 | 0.9332 | 0.9772 | 0.9938
2 The table below shows z-scores for certain areas under the standard normal curve to its
left.
area    | 0.005   | 0.01    | 0.025 | 0.05    | 0.1     | 0.2     | 0.8    | 0.9    | 0.95   | 0.975 | 0.99   | 0.995
z-score | -2.5758 | -2.3263 | -1.96 | -1.6449 | -1.2816 | -0.8416 | 0.8416 | 1.2816 | 1.6449 | 1.96  | 2.3263 | 2.5758
Chapter 6 – Sampling distributions
6.1 Definitions
A sampling distribution arises when repeated samples are drawn from a particular
population (distribution) and a statistic (numerical measure of description of sample data) is
calculated for each sample. The interest is then focused on the probability distribution
(called the sampling distribution) of the statistic.
Sampling distributions arise in the context of statistical inference i.e. when statements are
made about a population on the basis of random samples drawn from it.
Example
Suppose all possible samples of size 2 are drawn with replacement from a population with
sample space S = {2, 4, 6, 8} and the mean calculated for each sample.
The different values that can be obtained and their corresponding means are shown in the
table below.
1st value \ 2nd value | 2 | 4 | 6 | 8
2                     | 2 | 3 | 4 | 5
4                     | 3 | 4 | 5 | 6
6                     | 4 | 5 | 6 | 7
8                     | 5 | 6 | 7 | 8
In the above table the row and column entries indicate the two values in the sample (16
possibilities when combining rows and columns). The mean is located in the cell
corresponding to these entries, e.g. 1st value = 4, 2nd value = 6 has a mean entry of
(4 + 6)/2 = 5.
Assuming that random sampling is used, all the mean values in the above table are equally
likely. Under this assumption the following distribution can be constructed for these mean
values.
x̄   | count | P(X̄ = x̄)
2   | 1     | 1/16
3   | 2     | 1/8
4   | 3     | 3/16
5   | 4     | 1/4
6   | 3     | 3/16
7   | 2     | 1/8
8   | 1     | 1/16
sum | 16    | 1
The above distribution is referred to as the sampling distribution of the mean for random
samples of size 2 drawn from this distribution.
The mean of the population from which these samples are drawn is µ = 5 and the variance is
σ² = [Σx² − (Σx)²/N]/N = (2² + 4² + 6² + 8² − 20²/4)/4 = 5.
The sampling distribution of the mean has mean µ_X̄ = 5 and variance
σ²_X̄ = Σ x̄² P(X̄ = x̄) − µ² = 440/16 − 5² = 2.5 (verify this result).
Note that µ_X̄ = 5 = µ and that σ²_X̄ = σ²/2 = 5/2 = 2.5.
Consider a population with mean µ and variance σ². It can be shown that the mean and
variance of the sampling distribution of the mean, based on a random sample of size n, are
given by
µ_X̄ = µ and σ²_X̄ = σ²/n.
σ_X̄ = σ/√n is known as the standard error.
In the preceding example n = 2.
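The sampling-distribution example above can be verified by brute force: enumerate all 16 equally likely ordered samples of size 2 from {2, 4, 6, 8} and compute the mean and variance of the 16 sample means. A sketch, not part of the original notes:

```python
from itertools import product
from statistics import mean

population = [2, 4, 6, 8]
mu = mean(population)                                          # 5
var = sum((x - mu) ** 2 for x in population) / len(population) # 5

# All 16 equally likely ordered samples of size 2 (with replacement)
sample_means = [mean(s) for s in product(population, repeat=2)]

mu_xbar = sum(sample_means) / 16
var_xbar = sum((m - mu_xbar) ** 2 for m in sample_means) / 16

# mu_xbar equals mu = 5, and var_xbar equals var/n = 5/2 = 2.5
```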
Sampling distributions can involve different statistics (e.g. sample mean, sample proportion,
sample variance) calculated from different sample sizes drawn from different distributions.
Some of the important results from statistical theory concerning sampling distributions are
summarized in the sections that follow.
6.2 The Central Limit Theorem
The following result is known as the Central Limit Theorem.
Let X1, X2, . . . , Xn be a random sample of size n drawn from a distribution with mean µ
and variance σ² (σ² should be finite). Then for sufficiently large n the mean X̄ = ΣXᵢ/n is
approximately normally distributed with mean µ_X̄ = µ and variance σ²_X̄ = σ²/n.
This result can be written as X̄ ~ N(µ, σ²/n).
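The theorem can be illustrated by simulation: draw many samples from a clearly non-normal (right-skewed) population and look at the mean and variance of the resulting sample means. The population, sample size and seed below are arbitrary choices for illustration, not taken from the notes:

```python
import random
from statistics import mean, pvariance

random.seed(42)  # reproducible illustration

n, reps = 50, 2000
# Exponential(1) population: mean mu = 1, variance sigma^2 = 1, strongly right-skewed
sample_means = [mean(random.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

m = mean(sample_means)       # should be close to mu = 1
v = pvariance(sample_means)  # should be close to sigma^2/n = 1/50 = 0.02
```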
Note:
1 The random variable Z = (X̄ − µ)/(σ/√n) ~ N(0, 1).
2 The value of n for which this theorem is valid depends on the distribution from which the
sample is drawn. If the sample is drawn from a normal population, the theorem is valid for all
n. If the distribution from which the sample is drawn is fairly close to being normal, a value
of n > 30 will suffice for the theorem to be valid. If the distribution from which the sample is
drawn is substantially different from a normal distribution e.g. positively or negatively
skewed, a value of n much larger than 30 will be needed for the theorem to be valid.
3 There are various versions of the central limit theorem. The only other central limit
theorem result that will be used here is the following one.
If the population from which the sample is drawn is a Bernoulli distribution (consists only of
the values 0 and 1, with probability p of drawing a 1 and probability q = 1 − p of drawing a 0),
then S = ΣXᵢ follows a binomial distribution with mean µS = np and variance σ²S = npq.
According to the central limit theorem, P̂ = S/n = ΣXᵢ/n follows a normal distribution
with mean µ(P̂) = µS/n = np/n = p and variance σ²(P̂) = σ²S/n² = npq/n² = pq/n when n is
sufficiently large. P̂ is the proportion of 1’s in the sample and can be seen as an estimate of
p, the proportion of 1’s in the population (the distribution from which the sample is drawn).
Using the central limit theorem, it follows that
Z = (P̂ − µ(P̂))/σ(P̂) = (P̂ − p)/√(pq/n) ~ N(0, 1).
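As a numeric illustration of this result (with hypothetical values p = 0.4 and n = 100, not taken from the notes), the normal approximation gives probabilities for the sample proportion directly:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical illustration: population proportion p = 0.4, sample size n = 100
p, n = 0.4, 100
phat = NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n))

# P(P-hat <= 0.45) under the normal approximation
prob = phat.cdf(0.45)
```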
Example:
An electric firm manufactures light bulbs whose lifetime (in hours) follows a normal
distribution with mean 800 and variance 1600. A random sample of 10 light bulbs is drawn
and the lifetime recorded for each light bulb. Calculate the probability that the mean of this
sample
(a) differs from the actual mean lifetime of 800 by not more than 16 hours.
(b) differs from the actual mean lifetime of 800 by more than 16 hours.
(c) is greater than 820 hours.
(d) is less than 785 hours.
(a) P(−16 ≤ X̄ − 800 ≤ 16) = P(|X̄ − 800| ≤ 16) = P(|Z| ≤ 16/√(1600/10)) = P(|Z| ≤ 1.265)
= P(Z ≤ 1.265) − P(Z ≤ −1.265)
= 0.8971 − 0.1029 = 0.7942
(b) P(|X̄ − 800| > 16) = 1 − P(|X̄ − 800| ≤ 16) = 1 − 0.7942 = 0.2058
(c) P(X̄ > 820) = P(Z > (820 − 800)/√(1600/10)) = P(Z > 1.58) = 1 − 0.9429 = 0.0571
(d) P(X̄ < 785) = P(Z < (785 − 800)/√(1600/10)) = P(Z < −1.19) = 0.117
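Parts (a)–(d) can be checked by working directly with the distribution of the sample mean, X̄ ~ N(800, 1600/10). A sketch, not part of the original notes; small differences from the table-based answers come from rounding z-scores:

```python
from math import sqrt
from statistics import NormalDist

# Sample mean of n = 10 lifetimes: X-bar ~ N(800, 1600/10)
xbar = NormalDist(mu=800, sigma=sqrt(1600 / 10))

p_within_16 = xbar.cdf(816) - xbar.cdf(784)  # (a)
p_beyond_16 = 1 - p_within_16                # (b)
p_above_820 = 1 - xbar.cdf(820)              # (c)
p_below_785 = xbar.cdf(785)                  # (d)
```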
6.3 The t-distribution (Student’s t-distribution)
The central limit theorem states that the statistic Z = (X̄ − µ)/(σ/√n) follows a standard normal
distribution. If σ is not known, it would be logical to replace σ (in the formula for Z) by its
sample estimate S. For small values of the sample size n, the statistic t = (X̄ − µ)/(S/√n) does not
follow a normal distribution. If it is assumed that sampling is done from a population that is
approximately normal, the statistic t follows a t-distribution. This distribution changes with
the degrees of freedom, df = n − 1, i.e. for each value of the degrees of freedom a different
distribution is defined.
The t-distribution was first proposed by William Gosset in a 1908 paper written under the
pseudonym “Student”. The t-distribution has the following properties.
1. The Student t-distribution is symmetric and bell-shaped, but for smaller sample sizes it
shows increased variability when compared to the standard normal distribution (its curve
has a flatter appearance than that of the standard normal distribution). In other words, the
distribution is less peaked than a standard normal distribution and with thicker tails. As
the sample size increases, the distribution approaches a standard normal distribution. For
n > 30, the differences are negligible.
2. The mean is zero (like the standard normal distribution).
3. The distribution is symmetrical about the mean.
4. The variance is greater than one, but approaches one from above as the sample size
increases (σ2=1 for the standard normal distribution).
The graph below shows how the t-distribution changes for different values of r (the degrees
of freedom).
Tables for the t-distribution
The layout of the t-tables is as follows.

ν = df \ α | 0.900 | 0.95  | ..  | 0.995
1          | 3.078 | 6.314 | ..  | 63.66
2          | 1.886 | 2.920 | ..  | 9.925
.          | .     | .     |     | .
∞          | 1.282 | 1.645 | ..  | 2.576

The row entry is the degrees of freedom (df) and the column entry (α) the area under the
t-curve to the left of the value that appears in the table at the intersection of the row and
column entry.
When a t-value that has an area less than 0.5 to its left is to be looked up, the fact that the
t-distribution is symmetrical around 0 is used, i.e.
P(t ≤ tα) = P(t ≤ −t1−α) = P(t ≥ t1−α) for α ≤ 0.5 (using symmetry).
This means that tα = −t1−α.
Examples
1 For df = 2 and α = 0.995 the entry is 9.925. This means that for the t-distribution with 2
degrees of freedom
P(t ≤ 9.925) = 0.995.
2 For df = ∞ and α = 0.95 the entry is 1.645. This means that for the t-distribution with ∞
degrees of freedom
P(t ≤ 1.645) = 0.95.
3 For df = ν = 10 and α = 0.10 the value of t0.10 such that P(t ≤ t0.10) = 0.10 is found from
t0.10 = −t1−0.10 = −t0.90 = −1.372.
Note that the percentile values in the last row of the t-distribution are identical to the
corresponding percentile entries in the standard normal table. Since the t-distribution for large
samples (degrees of freedom) is the same as the standard normal distribution, their percentiles
should be the same.
6.4 The chi-square (χ2) distribution
The chi-square distribution arises in a number of sampling situations. These include the ones
described below.
1 Drawing repeated samples of size n from an approximate normal distribution with
variance σ2 and calculating the variance (S2) for each sample. It can be shown that the
quantity
χ² = (n − 1)S²/σ²
follows a chi-square distribution with degrees of freedom = n − 1.
2 When comparing sequences of observed and expected frequencies as shown in the table
below. The observed frequencies (referring to the number of times values of some variable of
interest occur) are obtained from an experiment, while the expected ones arise from some
pattern believed to be true.
observed frequency | f1 | f2 | .. | fk
expected frequency | e1 | e2 | .. | ek

The quantity χ² = Σᵢ₌₁ᵏ (fᵢ − eᵢ)²/eᵢ can be shown to follow a chi-square distribution with k − 1
degrees of freedom. The purpose of calculating this χ² is to make an assessment as to how
well the observed and expected frequencies correspond.
The chi-square curve is different for each value of the degrees of freedom. The graph below
shows how the chi-square distribution changes for different values of ν (the degrees of
freedom).
Unlike the normal and t-distributions, the chi-square distribution is only defined for positive
values and is not a symmetrical distribution. As the degrees of freedom increase, the chi-square
distribution becomes more and more symmetrical. For a sufficiently large value of the
degrees of freedom the chi-square distribution approaches the normal distribution.
Tables for the chi-square distribution
The layout of the chi-square tables is as follows.

ν = df \ α | 0.005    | 0.01     | .. | 0.99  | 0.995
1          | 0.000039 | 0.000157 | .. | 6.63  | 7.88
2          | 0.010025 | 0.020101 | .. | 9.21  | 10.60
.          | .        | .        |    | .     | .
30         | 13.79    | 14.95    | .. | 50.89 | 53.67

The row entry is the degrees of freedom (df) and the column entry (α) the area under the
chi-square curve to the left of the value that appears in the table at the intersection of the row
and column entry.
Examples:
1 For df = 30 and α = 0.01 the entry is 14.95. This means that for the chi-square distribution
with 30 degrees of freedom
P(χ² ≤ 14.95) = 0.01.
2 For df = 30 and α = 0.995 the entry is 53.67. This means that for the chi-square distribution
with 30 degrees of freedom
P(χ² ≤ 53.67) = 0.995.
3 For df = 6 and α = 0.95 the entry is 12.59. This means that for the chi-square distribution
with 6 degrees of freedom
P(χ² ≤ 12.59) = 0.95 or P(χ² > 12.59) = 0.05.
This probability statement is illustrated in the next graph.
6.5 The F-distribution
Random samples of sizes n1 and n2 are drawn from normally distributed populations that are
labeled 1 and 2 respectively. Denote the variances calculated from these samples by S1² and
S2² respectively and their corresponding population variances by σ1² and σ2² respectively.
The ratio
F = (S1²/σ1²) / (S2²/σ2²)
is distributed according to an F-distribution (named after the famous statistician R.A. Fisher)
with degrees of freedom df1 = n1 − 1 (called the numerator degrees of freedom) and
df2 = n2 − 1 (called the denominator degrees of freedom). When σ1² = σ2² the F-ratio is
F = S1²/S2².
The F-distribution is positively skewed, and the F-values can only be positive. The graph
below shows plots for a number of F-distributions (F-curves) with σ1² = σ2². These plots are
referred to by F(df1, df2), e.g. F(33, 10) refers to an F-distribution with 33 degrees of
freedom associated with the numerator and 10 degrees of freedom associated with the
denominator. For each combination of df1 and df2 there is a different F-distribution. Three
other important distributions are closely related to special cases of the F-distribution: the
square of a standard normal random variable follows an F(1, ∞) distribution, the square of a
t random variable with n2 degrees of freedom an F(1, n2) distribution, and a chi-square
random variable with n1 degrees of freedom divided by n1 an F(n1, ∞) distribution.
Tables for the F-distribution
The layout of the F-distribution tables with σ1² = σ2² is as follows.

df2 \ df1 | 1     | 2     | ... | ∞
1         | 161.5 | 199.5 | ... | 254.3
2         | 18.51 | 19.0  | ... | 19.5
.         | .     | .     |     | .
∞         | 3.85  | 3.0   | ... | 1.01

The entry in the table corresponding to a pair of (df1, df2) values has an area of α under the
F(df1, df2) curve to its right.
Examples
1 F(3, 26) = 2.98 has an area (under the F(3, 26) curve) of α = 0.05 to its right (see graph
below).
2 F(4, 32) = 2.67 has an area (under the F(4, 32) curve) of α = 0.05 to its right (see graph
below).
For each different value of α a different F-table is used to read off a value that has an area of
α to its right, i.e. a percentage of 100(1 − α) to its left. The F-tables that are used and their α
and 100(1 − α) values are summarized in the table below.

α     | Percentage point = 100(1 − α)
0.05  | 95%
0.025 | 97.5%
0.01  | 99%

The first entry in the above table refers to the proportion of the area under the F-curve to the
right of the F-value read off and the second entry to the percentage of the area under the
F-curve to the left of this F-value.
Examples:
1 For df1 = 7, df2 = 5 the value read from the 95% F-distribution table is 4.88. This means
that for this F-distribution 95% of the area under the F-curve is to the left of 4.88 (a
proportion of 0.05 to the right of 4.88).
P(F ≤ 4.88) = 0.95
P(F > 4.88) = 0.05
2 For df1 = 7, df2 = 5 the value read from the 97.5% F-distribution table is 6.85. This means
that for this F-distribution 97.5% of the area under the F-curve is to the left of 6.85 (a
proportion of 0.025 to the right of 6.85).
P(F ≤ 6.85) = 0.975
P(F > 6.85) = 0.025
3 For df1 = 10, df2 = 17 the value read from the 99% F-distribution table is 3.59. This means
that for this F-distribution 99% of the area under the F-curve is to the left of 3.59 (a
proportion of 0.01 to the right of 3.59).
P(F ≤ 3.59) = 0.99
P(F > 3.59) = 0.01
Lower tail values from the F-distribution
Only upper tail values (areas of 5%, 2.5% and 1% above) can be read off from the F-tables.
Lower tail values can be calculated from the formula
F(df1, df2; α) = 1 / F(df2, df1; 1 − α), i.e.
F-value with an area α under the F-curve to its left
= 1 / (F-value with an area 1 − α under the F-curve to its left, with numerator and denominator
degrees of freedom interchanged).
Examples
1. Find the value such that 2.5% of the area under the F(7,5) curve is to the left of it.
In the above formula df1 = 7, df2 = 5 and α = 0.025. Then
F(7, 5; 0.025) = 1/F(5, 7; 0.975) = 1/5.29 = 0.189.
2 Find the value such that 1% of the area under the F(10,17) curve is to the left of it.
In the above formula df1 = 10, df2 = 17 and α = 0.01. Then
F(10, 17; 0.01) = 1/F(17, 10; 0.99) = 1/4.49 = 0.223.
6.6 Computer output
In Excel, values from the t, chi-square and F-distributions that have a given area under the
curve above them can be found by using the TINV(area, df), CHIINV(area, df) and
FINV(area, df1, df2) functions respectively.
Examples
1 TINV(0.05, 15) = 2.13145. The area under the t(15) curve to the right of 2.13145 is 0.025
and to the left of -2.13145 is 0.025. Thus the total tail area is 0.05.
2 CHIINV(0.01, 14) = 29.14124. The area under the chi-square (14) curve to the right of
29.14124 is 0.01.
3 FINV(0.05,10,8) = 3.347163. The area under the F (10, 8) curve to the right of 3.347163
is 0.05.
Chapter 7 – Statistical Inference: Estimation for one sample case
7.1 Statistical inference
Statistical inference (inferential statistics) refers to the methodology used to draw conclusions
(expressed in the language of probability) about population parameters on the basis of
samples drawn from the population.
Examples
1 The government of a country wants to estimate the proportion of voters ( p ) in the country
that approve of their economic policies.
2 A manufacturer of car batteries wishes to estimate the average lifetime (µ) of their
batteries.
3 A paint company is interested in estimating the variability (as measured by the variance,
σ²) in the drying time of their paints.
The quantities p, µ and σ² that are to be estimated are called population parameters.
A sample estimate of a population parameter is called a statistic. The table below gives
examples of some commonly used parameters together with their statistics.

Parameter | Statistic
p         | p̂
µ         | x̄
σ²        | S²
7.2 Point and interval estimation
A point estimate of a parameter is a single value (point) that estimates a parameter.
An interval estimate of a parameter is a range of values from L (lower value) to U (upper
value) that estimates a parameter. Associated with this range of values is a probability or
percentage chance that this range of values will contain the parameter that is being estimated.
Examples
Suppose the mean time it takes to serve customers at a supermarket checkout counter is to be
estimated.
1 The mean service time of 100 customers of (say) x  2.283 minutes is an example of a
point estimate of the parameter µ.
2 If it is stated that the probability is 0.95 (95% chance) that the mean service time will be
from 1.637 minutes to 4.009 minutes, the interval of values (1.637, 4.009) is an interval
estimate of the parameter µ.
The estimation approaches discussed will focus mainly on the interval estimate approach.
7.3 Confidence intervals terminology
A confidence interval is a range of values from L (lower value) to U (upper value) that
estimates a population parameter θ with 100(1 − α)% confidence.
θ - pronounced “theta”.
L is the lower confidence limit.
U is the upper confidence limit.
The interval (L, U) is called the confidence interval.
1 − α is called the confidence coefficient. It is the probability that the confidence interval will
contain θ, the parameter that is being estimated.
100(1 − α) is called the confidence percentage.
Example
Consider example 2 of the previous section.
θ, the parameter that is being estimated, is the population mean µ.
L = 1.637, U = 4.009
The confidence interval is the interval (1.637, 4.009).
α = 0.05
The confidence coefficient is 1 − α = 0.95.
The confidence percentage is 100(1 − α) = 95.
In the sections that follow the determination of L and U when estimating the parameters µ, p
and σ2 will be discussed.
7.4 Confidence interval for the population mean (population variance known)
The determination of the confidence limits is based on the central limit theorem (discussed in
the previous chapter). This theorem states that for sufficiently large samples
the sample mean X̄ ~ N(µ, σ²/n) and hence that Z = (X̄ − µ)/(σ/√n) ~ N(0, 1).
Formulae for the lower and upper confidence limits can be constructed in the following way.
Since Z ~ N(0, 1), it follows from the above graph that
P(−1.96 ≤ Z ≤ 1.96) = 0.95.
P(−1.96 ≤ (X̄ − µ)/(σ/√n) ≤ 1.96) = 0.95 (substitute Z = (X̄ − µ)/(σ/√n) in the line above).
By a few steps of mathematical manipulation (not shown here), the above part in brackets can
be changed to have only the parameter µ between the inequality signs. This will give
P(X̄ − 1.96 σ/√n ≤ µ ≤ X̄ + 1.96 σ/√n) = 0.95.
Let L = X̄ − 1.96 σ/√n and U = X̄ + 1.96 σ/√n. Then the above formula can be written as
P(L ≤ µ ≤ U) = 0.95.
This formula is interpreted in the following way.
Since both L and U are determined by the sample values (which determine X ), they (and the
confidence interval) will change for different samples. Since the parameter µ that is being
estimated remains constant, these intervals will either include or exclude µ. The central limit
theorem states that such intervals will include the parameter µ with probability 0.95 (95 out
of 100 times).
In a practical situation the confidence interval will not be determined by many samples, but
by only one sample. Therefore the confidence interval that is calculated in a practical
situation will involve replacing the random variable X̄ by the sample value x̄. Then the
above formula for a 95% confidence interval for the population mean µ becomes
(x̄ − 1.96 σ/√n , x̄ + 1.96 σ/√n) or x̄ ± 1.96 σ/√n.
The percentage of confidence associated with the interval is determined by the value (called
the z-multiplier) obtained from the standard normal distribution. In the above formula a
z-multiplier of 1.96 determines a 95% confidence interval.
If a different percentage of confidence is required, the z-multiplier needs to be changed. The
table below is a summary of z-multipliers needed for different percentages associated with
confidence intervals.

confidence percentage   99      95     90
z-multiplier            2.576   1.96   1.645
α                       0.01    0.05   0.10
Calculation of confidence interval for µ (σ² known)
Step 1 : Calculate x̄. Values of n, σ² and the confidence percentage are given.
Step 2 : Look up the z-multiplier for the given confidence percentage.
Step 3 : The confidence interval is x̄ ± z-multiplier × σ/√n.
Example
The actual content of cool drink in a 500 milliliter bottle is known to vary. The standard
deviation is known to be 5 milliliters. Thirty (30) of these 500 milliliter bottles were selected
at random and their mean content found to be 498.5 milliliters. Calculate 95% and 99%
confidence intervals for the population mean content of these bottles.
Solution
95% confidence interval
Substituting x̄ = 498.5, n = 30, σ = 5, z = 1.96 into the above formula gives
498.5 ± 1.96 × 5/√30 = (496.71, 500.29).
99% confidence interval
Substituting x̄ = 498.5, n = 30, σ = 5, z = 2.576 into the above formula gives
498.5 ± 2.576 × 5/√30 = (496.15, 500.85).
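The calculation above can be sketched in a few lines of Python (an illustrative helper, not part of the original notes; the function name is ours):

```python
import math

def ci_mean_known_sigma(xbar, sigma, n, z):
    """Confidence interval for µ when the population standard deviation is known."""
    half = z * sigma / math.sqrt(n)
    return (xbar - half, xbar + half)

ci95 = ci_mean_known_sigma(498.5, 5, 30, 1.96)    # (496.71, 500.29) after rounding
ci99 = ci_mean_known_sigma(498.5, 5, 30, 2.576)   # (496.15, 500.85) after rounding
```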
7.5 Confidence interval for the population mean (population variance not known)
When the population variance (σ²) is not known, it is replaced by the sample variance (S²) in
the formula for Z mentioned in the previous section. In such a case the quantity

t = (X̄ - µ)/(S/√n)

follows a t-distribution with degrees of freedom = df = n - 1.
The confidence interval formula used in the previous section is modified by replacing the
z-multiplier by the t-multiplier, which is looked up from the t-distribution.
Calculation of confidence interval for µ (σ² not known)
Step 1 : Calculate x̄ and S. Values of n and the confidence percentage are given.
Step 2 : Look up the t-multiplier for the given confidence percentage and degrees of freedom = df = n - 1.
Step 3 : The confidence interval is x̄ ± t-multiplier × S/√n.
Example
The time (in seconds) taken to complete a simple task was recorded for each of 15 randomly
selected employees at a certain company. The values are given below.
38.2  43.9  38.4  26.2  41.3  42.3  37.5  37.2  41.2  42.3  31  50.1  37.3  36.7  31.8
Calculate 95% and 99% confidence intervals for the population mean time it takes to
complete this task.
Solution
n = 15 (given), x̄ = 38.36, S = 5.78 (calculated from the data)
95% confidence interval
Looking up the t-multiplier involves a row and a column entry in the t-table.
Row entry: df = ν = 15 - 1 = 14.
Column entry: determined from the confidence percentage required. For 95% confidence, the
column entry is 1 - α/2 = 1 - 0.025 = 0.975.
From the t-tables with df = ν = 14 and column entry 0.975, t-multiplier = 2.145.
Substituting x̄ = 38.36, n = 15, S = 5.78, t = 2.145 into the above formula gives
38.36 ± 2.145 × 5.78/√15 = (35.16, 41.56).
99% confidence interval
Looking up the t-multiplier
Row entry: df = ν = 15 - 1 = 14.
Column entry: for 99% confidence, the column entry is 1 - α/2 = 1 - 0.005 = 0.995.
From the t-tables with df = ν = 14 and column entry 0.995, t-multiplier = 2.977.
Substituting x̄ = 38.36, n = 15, S = 5.78, t = 2.977 into the above formula gives
38.36 ± 2.977 × 5.78/√15 = (33.92, 42.80).
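The same calculation can be sketched in Python (illustrative; the t-multipliers are taken from the tables rather than computed):

```python
import math
import statistics

data = [38.2, 43.9, 38.4, 26.2, 41.3, 42.3, 37.5, 37.2,
        41.2, 42.3, 31, 50.1, 37.3, 36.7, 31.8]

def ci_mean_unknown_sigma(sample, t_multiplier):
    """Confidence interval for µ using the sample standard deviation S."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)            # divides by n - 1
    half = t_multiplier * s / math.sqrt(n)
    return (xbar - half, xbar + half)

ci95 = ci_mean_unknown_sigma(data, 2.145)   # t(0.975), df = 14
ci99 = ci_mean_unknown_sigma(data, 2.977)   # t(0.995), df = 14
```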
7.6 Confidence interval for population variance
The formulae for the confidence interval of the population variance σ² are based on the fact
that

(n - 1)S²/σ²

follows a chi-square distribution with (n - 1) degrees of freedom. Let χ²(1 - α/2) and χ²(α/2)
denote the 100(1 - α/2) and 100(α/2) percentile points of the chi-square distribution with
(n - 1) degrees of freedom.
It then follows that

P[χ²(α/2) ≤ (n - 1)S²/σ² ≤ χ²(1 - α/2)] = 1 - α.

By a few steps of mathematical manipulation (not shown here), the above part in brackets can
be changed to have only the parameter σ² between the inequality signs. This will give

P[(n - 1)S²/upper ≤ σ² ≤ (n - 1)S²/lower] = 1 - α,

where upper = χ²(1 - α/2), the larger of the 2 percentile points, and
lower = χ²(α/2), the smaller of the 2 percentile points.
The values of α and α/2 are calculated from confidence percentage = 100(1 - α), e.g. if the
confidence percentage is 95, α = 0.05 and α/2 = 0.025.
Calculation of confidence interval for σ²
Step 1 : Calculate S². Values of n and the confidence percentage are given.
Step 2 : Look up the upper and lower chi-square values for the given confidence percentage
and degrees of freedom = df = n - 1.
Step 3 : The confidence interval is [(n - 1)S²/upper , (n - 1)S²/lower].
Example
Calculate 90% and 95% confidence intervals for the population variance of the time taken to
complete the simple task (see previous example).
Solution
n = 15, S² = 33.3811 (calculated from the data)
90% confidence interval
Look up the upper and lower chi-square values by using df = ν = 14 and α = 0.10.
upper = χ²(1 - α/2) = χ²(0.95) = 23.68 for ν = 14.
lower = χ²(α/2) = χ²(0.05) = 6.57 for ν = 14.
(n - 1)S² = 14 × 33.3811 = 467.34
The confidence interval is (467.34/23.68 , 467.34/6.57) = (19.74, 71.13).
95% confidence interval
Look up the upper and lower chi-square values by using df = ν = 14 and α = 0.05.
upper = χ²(1 - α/2) = χ²(0.975) = 26.12 for ν = 14.
lower = χ²(α/2) = χ²(0.025) = 5.63 for ν = 14.
(n - 1)S² = 14 × 33.3811 = 467.34
The confidence interval is (467.34/26.12 , 467.34/5.63) = (17.89, 83.01).
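A quick check of this calculation in Python (a sketch; the chi-square percentile points are hardcoded from the tables):

```python
import statistics

data = [38.2, 43.9, 38.4, 26.2, 41.3, 42.3, 37.5, 37.2,
        41.2, 42.3, 31, 50.1, 37.3, 36.7, 31.8]
n = len(data)
s2 = statistics.variance(data)               # sample variance, divides by n - 1
upper, lower = 26.12, 5.63                   # χ²(0.975) and χ²(0.025) for df = 14
ci_var = ((n - 1) * s2 / upper, (n - 1) * s2 / lower)
```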
7.7 Confidence interval for population proportion
In some experiments the interest is in whether or not items possess a certain characteristic of
interest (e.g. whether a patient improves or not after treatment, whether a manufactured item
is acceptable or not, whether an answer to a question is correct or incorrect). The population
proportion of items labeled “success” in such an experiment (e.g. patient improves, item is
acceptable, answer is correct) is estimated by calculating the sample proportion of “success”
items.
The determination of the confidence limits for the population proportion of items labeled
“success” is based on the central limit theorem for the sample proportion P̂ = X/n, where X is
the number of items in the sample labeled “success”.
This theorem states that for sufficiently large samples the sample proportion of “success”
items P̂ ~ N(p, pq/n) and hence that

Z = (P̂ - µ(P̂))/σ(P̂) = (P̂ - p)/√(pq/n) ~ N(0, 1).
Formulae for the lower and upper confidence limits can be constructed in the following way.
Since Z ~ N(0,1),

P(-1.96 ≤ Z ≤ 1.96) = 0.95

P(-1.96 ≤ (P̂ - p)/√(pq/n) ≤ 1.96) = 0.95.

By a few steps of mathematical manipulation (not shown here), the above part in brackets can
be changed to have the parameter p (in the numerator) between the inequality signs. This will
give

P(P̂ - 1.96√(pq/n) ≤ p ≤ P̂ + 1.96√(pq/n)) = 0.95.
Since the confidence interval formula is based on a single sample, the random variable
P̂ = X/n is replaced by its sample value p̂ = x/n, and the parameters p and q = 1 - p by their
respective sample estimates p̂ = x/n and q̂ = 1 - p̂.
This gives the following 95% confidence interval for p: (p̂ - 1.96√(p̂q̂/n) , p̂ + 1.96√(p̂q̂/n)).
If the percentage of confidence is to be changed, the z-multiplier is changed according to the
values given in the table below.

confidence percentage   99      95     90
z-multiplier            2.576   1.96   1.645
α                       0.01    0.05   0.10
Calculation of confidence interval for p
Step 1 : Calculate p̂ = x/n and q̂ = 1 - p̂. Values of x, n and the confidence percentage are given.
Step 2 : Look up the z-multiplier for the given confidence percentage.
Step 3 : The confidence interval is p̂ ± z-multiplier × √(p̂q̂/n).
Example
During a marketing campaign for a new product 176 out of the 200 potential users of this
product that were contacted indicated that they would use it. Calculate a 90% confidence
interval for the proportion of potential users who would use this product.
Solution
x = 176, n = 200, confidence percentage = 90 (given)
p̂ = 176/200 = 0.88, q̂ = 1 - p̂ = 0.12.
z-multiplier = 1.645 (from the above table)
The confidence interval is 0.88 ± 1.645√(0.88 × 0.12/200) = 0.88 ± 0.0378 = (0.842, 0.918).
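The same calculation in Python (illustrative):

```python
import math

x, n = 176, 200
phat = x / n
qhat = 1 - phat
half = 1.645 * math.sqrt(phat * qhat / n)   # z-multiplier for 90% confidence
ci = (phat - half, phat + half)
```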
7.8 Sample size when estimating the population mean
Consider the formula for the confidence interval of the mean (µ) when σ² is known:

x̄ ± z-multiplier × σ/√n.

The quantity z-multiplier × σ/√n is known as the error (denoted by E).
The smaller the error, the more accurately the parameter μ is estimated.
Suppose the size of the error is specified in advance and the sample size n is determined to
achieve this accuracy. This can be done by solving for n from the equation

E = z-multiplier × σ/√n, which gives

n = (z-multiplier × σ/E)².
The z-multiplier is determined by the percentage confidence required in the estimation.
Example
Consider the example on the interval estimation of the mean content of 500 milliliter cool
drink bottles. The standard deviation σ is known to be 5. Suppose it is desired to estimate the
mean with 95% confidence and an error that is not greater than 0.8. What sample size is
needed to achieve this accuracy?
Solution
σ = 5, E = 0.8 (given), z-multiplier = 1.96 (from the 95% confidence requirement).
n = (1.96 × 5/0.8)² = 150.0625, so n = 151 (n is always rounded up).
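This sample-size rule is easy to code (a sketch; the function name is ours, and math.ceil does the rounding up):

```python
import math

def sample_size_mean(sigma, error, z):
    """Smallest n for which z * sigma / sqrt(n) does not exceed `error`."""
    return math.ceil((z * sigma / error) ** 2)

n = sample_size_mean(5, 0.8, 1.96)   # 151, as in the example
```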
7.9 Sample size for estimation of population proportion
The approach used in determining the sample size for the estimation of the population
proportion is much the same as that used when estimating the population mean.
The equation to be solved for n is

E = z-multiplier × √(pq/n).

Solving for n, the formula becomes

n = pq (z-multiplier/E)².
A practical problem encountered when using this formula is that values for the parameters p
and q=1-p are needed. Since the purpose of this technique is to estimate p, these values of p
and q are obviously not known.
If no information on p is available, the value of p that will give the maximum value of
p(1 - p) = pq will be taken. It can be shown that p = ½ maximizes this expression. This gives
max pq = ¼.
Substituting this maximum value in the above formula gives

max n = ¼ (z-multiplier/E)².
If more accurate information on the value of p is known (e.g. some range of values), it should
be used in the above formula.
As explained before, the z-multiplier is determined by the percentage confidence required in
the estimation.
Example
Consider the problem (discussed earlier) of estimating the proportion of potential users who
would use a new product. Suppose this proportion is to be estimated with 99% confidence
and an error not exceeding 2% (proportion of 0.02) is required. What sample size is needed to
achieve this?
Solution
E = 0.02 (given), z-multiplier = 2.576 (99% confidence required)
n = ¼ (2.576/0.02)² = 4147.36, so n = 4148 (rounded up).
Suppose it is known that the value of p is between 0.8 and 0.9. In such a case

max p(1 - p) = pq over 0.8 ≤ p ≤ 0.9 is 0.8 × 0.2 = 0.16 (why is p = 0.8 used?).

By using this information the value of n can be calculated as

n = 0.16 (2.576/0.02)² = 2654.31, so n = 2655 (rounded up).

The additional information on possible values for p reduces the sample size by 36%.
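Both cases can be sketched with one helper (illustrative; the default p = 0.5 is the worst case):

```python
import math

def sample_size_proportion(error, z, p=0.5):
    """Smallest n for estimating a proportion to within `error`."""
    return math.ceil(p * (1 - p) * (z / error) ** 2)

n_no_info = sample_size_proportion(0.02, 2.576)          # no prior information on p
n_with_info = sample_size_proportion(0.02, 2.576, 0.8)   # p known to lie in [0.8, 0.9]
```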
7.10 Computer output
1 Confidence interval for the mean (σ² known). For the data in the example in section 7.4,
the information can be typed on an Excel sheet and the confidence interval calculated as
follows.

mean    sigma   n    z multiplier   interval lower   interval upper
498.5   5       30   1.959964       496.71           500.29
2 Confidence interval for the mean (σ² not known). For the data in the example in section
7.5, the information can be typed on an Excel sheet and the confidence interval calculated as
follows.

mean    stand.dev   n    t multiplier   interval lower   interval upper
38.36   5.777642    15   2.144787       35.16            41.56
3 Confidence interval for the variance. For the data in the example in section 7.6, the
information can be typed on an Excel sheet and the confidence interval calculated as follows.

variance   n    degrees of freedom   lower chisq.   upper chisq.   interval lower   interval upper
33.38114   15   14                   5.628726       26.11895       17.89            83.03
4 Confidence interval for the proportion of successes. For the data in the example in section
7.7, the information can be typed on an Excel sheet and the confidence interval calculated as
follows.

n     x     z multiplier   st.error   interval lower   interval upper
200   176   1.644854       0.022978   0.842            0.918
Chapter 8 – Statistical Inference: Testing of hypotheses for one sample
8.1 Formulation of hypotheses and related terminology
A statistical hypothesis is an assertion (claim) made about a value(s) of a population
parameter.
The purpose of testing of hypotheses is to determine whether a claim that is made could be
true. The conclusion about the truth of such a claim is not stated with absolute certainty, but
rather in terms of the language of probability.
Examples of claims to be tested
1 A supermarket receives complaints that the mean content of “1 kilogram” sugar bags that
are sold by them is less than 1 kilogram.
2 The variability in the drying time of a certain paint (as measured by the variance) has until
recently been 65 minutes. It is suspected that the variability has now increased.
3 A construction company suspects that the proportion of jobs they complete behind
schedule is 0.20 (20%). They want to test whether this is indeed the case.
Null and alternative hypotheses
The null hypothesis (H0) is a statement concerning the value of the parameter of interest
(θ) in a claim that is made. This is formulated as

H0: θ = θ0 (the statement that the parameter θ is equal to the hypothetical value θ0).

The alternative hypothesis (H1) is a statement about the possible values of the parameter θ
that are believed to be true if H0 is not true. One of the alternative hypotheses shown below
will apply.

H1a: θ < θ0 or H1b: θ > θ0 or H1c: θ ≠ θ0.
Examples
1 In the first example (above) the parameter of interest is the population mean µ and the
hypotheses to be tested are
H0: µ = 1 (Population mean is 1 kilogram)
versus
H1a: µ < 1 (Population mean is less than 1 kilogram)
In terms of the general notation stated above, θ = µ and θ0 = 1.
2 In the second example (above) the parameter of interest is the population variance σ² and
the hypotheses to be tested are
H0: σ² = 65 (Population variance is 65)
versus
H1b: σ² > 65 (Population variance is greater than 65)
In terms of the general notation stated above, θ = σ² and θ0 = 65.
3 In the third example (above) the parameter of interest is the population proportion, p, of
job completions behind schedule and the hypotheses to be tested are
H0: p = 0.20 (Population proportion is 0.20)
versus
H1c: p ≠ 0.20 (Population proportion is not equal to 0.20)
In terms of the general notation stated above, θ = p and θ0 = 0.20.
One and two-sided alternatives
A one-sided alternative hypothesis is one that specifies the alternative values (to the null
hypothesis) in a direction that is either below or above that specified by the null hypothesis.
Example
The alternative hypothesis H1a (see example 1 above) is the alternative that the value of the
parameter is less than that stated under the null hypothesis and the alternative H1b (see
example 2 above) is the alternative that the value of the parameter is greater than that stated
under the null hypothesis.
A two-sided alternative hypothesis is one that specifies the alternative values (to the null
hypothesis) in directions that can be either below or above that specified by the null
hypothesis.
Example
The alternative hypothesis H1c (see example 3 above) is the alternative that the value of the
parameter is either greater than that stated under the null hypothesis or less than that stated
under the null hypothesis.
8.2 Testing of hypotheses for one sample: Terminology and summary of procedure
The testing procedure and terminology will be explained for the test for the population mean
μ with population variance σ² known.
The hypotheses to be tested are
H0 : µ = µ0
versus
H1a: µ < µ0 or H1b: µ > µ0 or H1c: µ ≠ µ0.
The data set that is needed to perform the test is x1, x2, . . . , xn, a random sample of size n
drawn from the population for which the mean is tested. The test is performed to see whether
or not the sample data are consistent with what is stated by the null hypothesis. The
instrument that is used to perform the test is called a test statistic.
A test statistic is a quantity calculated from the sample data.
When testing for the population mean, the test statistic used is

z0 = (x̄ - μ0)/(σ/√n).
If the difference between x̄ and µ0 (and therefore the value of z0) is reasonably small, H0 will
not be rejected. In this case the sample mean is consistent with the value of the population
mean that is being tested. If this difference (and therefore the value of z0) is sufficiently large,
H0 will be rejected. In this case the sample mean is not consistent with the value of the
population mean that is being tested. In order to decide how large this difference between
x̄ and μ0 (and therefore the value of z0) should be before H0 is rejected, the following should
be considered.
Type I error
A type I error is committed when the null hypothesis is rejected when, in fact it is true i.e.
H0 is wrongly rejected.
In this test, a type I error is committed when it is decided that the statement H0: µ = μ0 should
be rejected when, in fact, it is true.
A type II error is committed when the null hypothesis is not rejected when, in fact, it is
false i.e. a decision not to reject H0 is wrong.
In this test, a type II error is committed when it is decided that the statement H0: µ = μ0
should not be rejected when, in fact, it is false.
The following table gives a summary of possible conclusions and their correctness when
performing a test of hypotheses.

Actually true \ Conclusion   Reject H0            Do not reject H0
H0 is true                   Type I error         Correct conclusion
H0 is false                  Correct conclusion   Type II error
A type I error is often considered to be more serious, and therefore more important to avoid,
than a type II error. The hypothesis testing procedure is therefore designed so that there is a
guaranteed small probability of rejecting the null hypothesis wrongly. This probability is
never 0 (why?). Mathematically the probability of a type I error can be stated as
P(type I error) = P(Reject H0 | H0 is true) = α.
When testing for the population mean,
P(type I error) = P(reject μ = μ0 | μ = μ0 is true) = α and
P(type II error) = P(do not reject µ = µ0 | µ = µ0 is false) = β.
Probabilities of type I and type II errors work in opposite directions. The more reluctant you
are to reject H0, the higher the risk of accepting it when, in fact, it is false. The easier you
make it to reject H0, the lower the risk of accepting it when, in fact, it is false.
Critical value(s) and critical region
The critical (cut-off) value(s) for a test of hypotheses is a value(s) with which the test
statistic is compared in order to determine whether or not the null hypothesis should be
rejected.
The critical value is determined according to the specified value of α, the probability of a type
I error.
For the test of the population mean the critical value is determined in the following way.
Assuming that H0 is true, the test statistic

Z0 = (X̄ - μ0)/(σ/√n) ~ N(0, 1).
(i) When testing H0 versus the alternative hypothesis H1a (µ < µ0), the critical value is the
value Zα which is such that the area under the standard normal curve to the left of Zα is
α i.e. P(Z0 < Zα) = α.
The graph below illustrates the case α = 0.05 i.e. P(Z0 < -1.645) = 0.05.
(ii) When testing H0 versus the alternative hypothesis H1b (µ > µ0) , the critical value is the
value Z1-α which is such that the area under the standard normal curve to the right of Z1-α is
α i.e. P(Z0 > Z1-α) = α.
The graph below illustrates the case α = 0.05 i.e. P(Z0 > 1.645) = 0.05.
(iii) When testing H0 versus the alternative hypothesis H1c (µ ≠ µ0), the critical values are
the values Z1-α/2 and Zα/2 which are such that the area under the standard normal curve to the
right of Z1-α/2 is α/2 and the area under the standard normal curve to the left of Zα/2 is α/2. i.e.
P(Z0 > Z1-α/2) = α/2 and P(Z0 < Zα/2) = α/2.
The area under the normal curve between these two critical values is 1-α. The graph below
illustrates the case α = 0.05 i.e. P(Z0 <-1.96 or Z0> 1.96) = 0.05.
The critical region CR, or rejection region R, is the set of values of the test statistic for
which the null hypothesis is rejected.
(i) When testing H0 versus the alternative hypothesis H1a , the rejection region is
{ z0 | z0 < Zα }.
(ii) When testing H0 versus the alternative hypothesis H1b , the rejection region is
{ z0 | z0 > Z1-α }.
(iii) When testing H0 versus the alternative hypothesis H1c , the rejection region is
{ z0 | z0 > Z 1-α/2 or z0 < Zα/2 }.
H0 is rejected when there is a sufficiently large difference between the sample mean x and
the mean (μ0 ) under H0 . Such a large difference is called a significant difference (result of
the test is significant). The value of α is called the level of significance. It specifies the level
beyond which this difference (between x and μ0) is sufficiently large for H0 to be rejected.
The value of α is specified prior to performing the test and is usually taken as either 0.05 (5%
level of significance) or 0.01 (1% level of significance).
When H0 is rejected, it does not necessarily mean that it is not true. It means that according to
the sample evidence available it appears not to be true. Similarly when H0 is not rejected, it
does not necessarily mean that it is true. It means that there is not sufficient sample evidence
to disprove H0.
Critical values for tests based on the standard normal distribution can be found from the
selected percentiles listed at the bottom of the pages of the standard normal table.
8.3 Test for the population mean (population variance known)
A summary of the steps to be followed in the testing procedure is shown below.
Test for  when  2 is known
1 State null and alternative hypotheses.
H0:    0 versus H1a:    0 or H1b:  >  0 or H1c:    0
2 Calculate the test statistic z 0 
x  0
.
/ n
3 State the level of significance α and determine the critical value(s) and critical region.
(i) For alternative H1a the critical region is R = { z0 | z0 < Zα }.
(ii) For alternative H1b the critical region is R = { z0 | z0 > Z1-α }.
(iii) For alternative H1c the critical region is R = { z0 | z0 > Z1-α/2 or z0 < Zα/2 }.
4 If z0 lies in the critical region, reject H0, otherwise do not reject H0.
5 State conclusion in terms of the original problem.
Examples
1 A supermarket receives complaints that the mean content of “1 kilogram” sugar bags that
are sold by them is less than 1 kilogram. A random sample of 40 sugar bags is selected from
the shelves and the mean found to be 0.987 kilograms. From past experience the standard
deviation of the contents of these bags is known to be 0.025 kilograms. Test, at the 5% level
of significance, whether this complaint is justified.
Solution
H0 : μ = 1
(The complaint is not justified)
H1 : μ < 1
(The complaint is justified)
n = 40, x̄ = 0.987, σ = 0.025, μ0 = 1 (given)
Test statistic: z0 = (0.987 - 1)/(0.025/√40) = -3.289.
α = 0.05. Critical region R = { z0 < Z0.05 = -1.645 }.
Since z0 = -3.289 < -1.645, H0 is rejected.
Conclusion: The complaint is justified.
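The test statistic and decision can be sketched as follows (illustrative; the critical value -1.645 comes from the standard normal table):

```python
import math

def z_stat(xbar, mu0, sigma, n):
    """One-sample z statistic for H0: µ = µ0 when σ is known."""
    return (xbar - mu0) / (sigma / math.sqrt(n))

z0 = z_stat(0.987, 1, 0.025, 40)
reject = z0 < -1.645            # one-sided test H1: µ < µ0 at α = 0.05
```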
2 A supermarket manager suspects that the machine filling “500 gram” containers of coffee
is overfilling them i.e. the actual contents of these containers is more than 500 grams. A
random sample of 30 of these containers is selected from the shelves and the mean found to
be 501.8 grams. From past experience the variance of the contents of these containers is
known to be 60 grams². Test at the 5% level of significance whether the manager’s suspicion
is justified.
Solution
H0 : μ = 500
(Suspicion is not justified)
H1 : μ > 500
(Suspicion is justified)
n = 30, x̄ = 501.8, σ² = 60, μ0 = 500 (given)
Test statistic: z0 = (501.8 - 500)/√(60/30) = 1.273.
α = 0.05. Critical region R = { z0 > Z0.95 = 1.645 }.
Since z0 = 1.273 < 1.645, H0 is not rejected.
Conclusion: The suspicion is not justified.
3 During a quality control exercise the manager of a factory that fills cans of frozen shrimp
wants to check whether the mean weights of the cans conform to specifications i.e. the mean
of these cans should be 600 grams as stated on the label of the can. He/she wants to guard
against either over- or under-filling the cans. A random sample of 50 of these cans is selected
and the mean found to be 595 grams. From past experience the standard deviation of the
contents of these cans is known to be 20 grams. Test, at the 5% level of significance, whether
the weights conform to specifications. Repeat the test at the 10% level of significance.
Solution
H0 : μ = 600
(Weights conform to specifications)
H1 : μ ≠ 600
(Weights do not conform to specifications)
n = 50, x̄ = 595, σ = 20, μ0 = 600 (given)
Test statistic: z0 = (595 - 600)/(20/√50) = -1.768.
α = 0.05. Critical region R = { z0 < Z0.025 = -1.96 or z0 > Z0.975 = 1.96 }.
Since -1.96 < z0 = -1.768 < 1.96, H0 is not rejected.
Conclusion: The weights appear to conform to specifications.
Suppose the test is performed at the 10% level of significance. In such a case
α = 0.10. Critical region R = { z0 < Z0.05 = -1.645 or z0 > Z0.95 = 1.645 }.
Since z0 = -1.768 < -1.645, H0 is rejected.
Conclusion: The weights appear not to conform to specifications.
Thus, being less strict about controlling a type I error (changing  from 0.05 to 0.10) results
in a different conclusion about H0 (reject instead of do not reject).
Note
1 In example 1 the alternative hypothesis H1a was used, in example 2 the alternative H1b and
in example 3 the alternative H1c.
2 Alternatives H1a and H1b [one-sided (tailed) alternatives ] are used when there is a
particular direction attached to the range of mean values that could be true if H0 is not true.
3 Alternative H1c [two-sided (tailed) alternative] is used when there is no particular direction
attached to the range of mean values that could be true if H0 is not true.
4 If, in the above examples, the level of significance had been changed to 1%, the critical
values used would have been Z0.01 = -2.326 (in example 1), Z0.99 = 2.326 (in example 2),
and Z0.005 = -2.576, Z0.995 = 2.576 (in example 3).
8.4 Test for the population mean (population variance not known): t-test
When performing the test for the population mean for the case where the population variance
is not known, the following modifications are made to the procedure.
1 In the test statistic formula the population standard deviation σ is replaced by the sample
standard deviation S.
2 Since the test statistic t0 = (x̄ - μ0)/(S/√n) that is used to perform the test follows a
t-distribution with n - 1 degrees of freedom, critical values are looked up in the t-tables.
Test for  when  2 is not known (t-test)
1 State null and alternative hypotheses.
H0:    0 versus H1a:    0 or H1b:  >  0 or H1c:    0 .
2 Calculate the test statistic t 0 
x  0
.
S/ n
3 State the level of significance α and determine the critical value(s) and critical region.
Degrees of freedom = ν = n - 1.
(i) For alternative H1a the critical region is R = { t0 | t0 < tα }.
(ii) For alternative H1b the critical region is R = { t0 | t0 > t1-α }.
(iii) For alternative H1c the critical region is R = { t0 | t0 > t1-α/2 or t0 < tα/2 }.
4 If t0 lies in the critical region, reject H0 , otherwise do not reject H0.
5 State conclusion in terms of the original problem.
Examples
A paint manufacturer claims that the average drying time for a new paint is 2 hours (120
minutes). The drying times for 20 randomly selected cans of paint were obtained. The results
are shown below.
123 127 131 122 109 106 128 133 115 120
139 119 121 116 130 135 130 136 133 109
Assuming that the sample was drawn from a normal distribution,
(a) test whether the population mean drying time is greater than 2 hours (120 minutes)
(i) at the 5% level of significance.
(ii) at the 1% level of significance.
(b) test, at the 5% level of significance, whether the population mean drying time could be 2
hours (120 minutes).
Solution
(a) H0 : μ = 120 (mean is 2 hours)
H1 : μ > 120 (mean is greater than 2 hours)
n = 20, μ0 = 120 (given), x̄ = 124.1, S = 9.65674 (calculated from the data).
Test statistic: t0 = (124.1 - 120)/(9.65674/√20) = 1.899.
(i) If α = 0.05, 1 - α = 0.95. From the t-distribution table with
degrees of freedom = ν = n - 1 = 19, t0.95 = 1.729.
Critical region R = { t0 > t0.95 = 1.729 }.
Since 1.899 > 1.729, H0 is rejected.
Conclusion: The mean drying time appears to be greater than 2 hours.
(ii) If α = 0.01, 1 - α = 0.99. From the t-distribution table with
degrees of freedom = ν = n - 1 = 19, t0.99 = 2.539.
Critical region R = { t0 > t0.99 = 2.539 }.
Since 1.899 < 2.539, H0 is not rejected.
Conclusion: The mean drying time appears to be 2 hours.
Thus, being more strict about controlling a type I error (changing  from 0.05 to 0.01)
results in a different conclusion about H0 (Do not reject instead of reject).
(b) H0 : μ = 120 (mean is 2 hours)
H1 : μ ≠ 120 (mean is not equal to 2 hours)
n = 20, μ0 = 120 (given), x̄ = 124.1, S = 9.65674 (calculated from the data).
Test statistic: t0 = (124.1 - 120)/(9.65674/√20) = 1.899 (as calculated in part (a)).
If α = 0.05, α/2 = 0.025, 1 - α/2 = 0.975. From the t-distribution table with
degrees of freedom = ν = n - 1 = 19, t0.025 = -2.093, t0.975 = 2.093.
Critical region R = { t0 < t0.025 = -2.093 or t0 > t0.975 = 2.093 }.
Since -2.093 < 1.899 < 2.093, H0 is not rejected.
Conclusion: The mean drying time appears to be 2 hours.
Note: Despite the fact that the same data were used in the above examples, the conclusions
were different. In the first test H0 was rejected, but in the next 2 tests H0 was not rejected.
1 In the first test the probability of a type I error was set at 5%, while in the second test this
was changed to 1%. To achieve this, the critical value was moved from 1.729 to 2.539,
resulting in the test statistic value (1.899) being less than (instead of greater than) the critical
value.
2 In the third test (which has a two-sided alternative hypothesis), the upper critical value
was increased to 2.093 (to have an area of 0.025 under the t-curve to its right). Again this
resulted in the test statistic value (1.899) being less than (instead of greater than) the critical
value.
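The t statistic in part (a) can be reproduced as follows (a sketch; the critical values still come from the t-table):

```python
import math
import statistics

times = [123, 127, 131, 122, 109, 106, 128, 133, 115, 120,
         139, 119, 121, 116, 130, 135, 130, 136, 133, 109]
n = len(times)
xbar = statistics.mean(times)               # 124.1
s = statistics.stdev(times)                 # about 9.65674, divides by n - 1
t0 = (xbar - 120) / (s / math.sqrt(n))      # about 1.899
```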
8.5 Test for population variance
The test for the population variance is based on χ² = (n - 1)S²/σ² following a chi-square
distribution with n - 1 degrees of freedom. The critical values are therefore obtained from the
chi-square tables.
Test for the population variance σ²
1. State the null and alternative hypotheses.
H0: σ² = σ0² versus H1a: σ² < σ0² or H1b: σ² > σ0² or H1c: σ² ≠ σ0²
2. Calculate the test statistic χ0² = (n - 1)S²/σ0².
3. State the level of significance α and determine the critical value(s) and critical region.
Degrees of freedom = ν = n - 1.
(i) For alternative H1a the critical region is R = { χ0² | χ0² < χ²(α) }.
(ii) For alternative H1b the critical region is R = { χ0² | χ0² > χ²(1 - α) }.
(iii) For alternative H1c the critical region is R = { χ0² | χ0² > χ²(1 - α/2) or χ0² < χ²(α/2) }.
4. If χ0² lies in the critical region, reject H0, otherwise do not reject H0.
5. State conclusion in terms of the original problem.
For a one-sided test with alternative hypothesis H1b the rejection region is the area in the
upper tail of the chi-square distribution. For a two-sided test with alternative hypothesis H1c
the rejection region consists of the areas in both tails.
Example
1 Consider the example on the drying time of the paint discussed in the previous section.
Until recently it was believed that the variance in the drying time is 65 minutes. Suppose it is
suspected that this variance has increased. Test this assertion at the 5% level of significance.
Solution
H0 : σ2 = 65
(Variance has not increased)
H1 : σ2 > 65
(Variance has increased)
n = 20,  02 = 65 (given), S = 9.65674 (calculated from the data).
44
Test statistic:  02 =
19 * 9.65674 2
= 27.258.
65
α = 0.05, 1-α = 0.95. From the chi-square distribution table with
degrees of freedom = = n-1 =19,  02.95 = 30.14.
Critical region R = {  02 >  02.95 = 30.14 }.
Since 27.258 < 30.14, H0 is not rejected.
Conclusion: There is not sufficient evidence that the variance has increased.
2 A manufacturer of car batteries guarantees that their batteries will last, on average, 3 years
with a standard deviation of 1 year. Ten of the batteries have lifetimes of
1.2, 2.5, 3, 3.5, 2.8, 4, 4.3, 1.9, 0.7 and 4.3 years.
Test at the 5% level of significance whether the variability guarantee is still valid.
Solution
H0 : σ² = 1 (Guarantee is valid)
H1 : σ² ≠ 1 (Guarantee is not valid)
n = 10, σ0² = 1 (given), S = 1.26209702, S² = 1.592889 (calculated from the data).
Test statistic: χ0² = 9 × 1.592889/1 = 14.336.
α = 0.05, α/2 = 0.025, 1-α/2 = 0.975. From the chi-square distribution table with degrees of freedom ν = n-1 = 9, χ²0.025 = 2.70, χ²0.975 = 19.02.
Critical region R = { χ0² < χ²0.025 = 2.70 or χ0² > χ²0.975 = 19.02 }.
Since 2.70 < 14.336 < 19.02, H0 is not rejected.
Conclusion: Variability guarantee appears to still be valid.
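The two examples above follow the same recipe and can be checked numerically. A minimal sketch in Python (assuming NumPy and SciPy are available), using the battery data of example 2:

```python
import numpy as np
from scipy import stats

# battery lifetimes (years) from example 2
lifetimes = np.array([1.2, 2.5, 3, 3.5, 2.8, 4, 4.3, 1.9, 0.7, 4.3])
n = len(lifetimes)
sigma0_sq = 1.0                                        # hypothesized variance

chi2_0 = (n - 1) * lifetimes.var(ddof=1) / sigma0_sq   # test statistic, about 14.336

alpha = 0.05
lower = stats.chi2.ppf(alpha / 2, df=n - 1)            # about 2.70
upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)        # about 19.02
reject = (chi2_0 < lower) or (chi2_0 > upper)          # H0 not rejected here
```

The `ddof=1` argument makes `var` compute the sample variance S² with divisor n-1, matching the formula in the text.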
8.6 Test for population proportion
The test for the population proportion (p) is based on the fact that the sample proportion P̂ = X/n ~ N(p, pq/n), where n is the sample size and X the number of items labeled “success” in the sample. From this result it follows that Z = (P̂ - p)/√(pq/n) ~ N(0, 1).
For this reason the critical value(s) and critical region are the same as that for the test for the
population mean (both based on the standard normal distribution).
Test for the population proportion p
1 State the null and alternative hypotheses.
H0: p  p0 versus H1a: p  p0 or H1b: p > p0 or H1c: p  p0
2 Calculate the test statistic z0 = (p̂ - p0)/√(p0 q0/n).
3 State the level of significance α and determine the critical value(s) and critical region.
(i) For alternative H1a the critical region is R = { z0 | z0 < Zα }.
(ii) For alternative H1b the critical region is R = { z0 | z0 > Z1-α }.
(iii) For alternative H1c the critical region is R = { z0 | z0 > Z1-α/2 or z0 < Zα/2 }.
4 If z0 lies in the critical region, reject H0, otherwise do not reject H0.
5 State conclusion in terms of the original problem.
Examples
1 A construction company suspects that the proportion of jobs they complete behind
schedule is 0.20 (20%). Of their 80 most recent jobs 22 were completed behind schedule.
Test at the 5% level of significance whether this information confirms their suspicion.
Solution
H0 : p = 0.20 (Suspicion is confirmed)
H1 : p ≠ 0.20 (Suspicion is not confirmed)
n = 80, x = 22 (given), p̂ = 22/80 = 0.275, p0 = 0.20.
Test statistic z0 = (0.275 - 0.20)/√(0.20 × 0.80/80) = 1.677.
α = 0.05. Critical region R = { z0 < Z0.025 = -1.96 or z0 > Z0.975 = 1.96 }.
Since -1.96 < z0 = 1.677 < 1.96, H0 is not rejected.
Conclusion: The data are consistent with the suspicion that 20% of jobs are completed behind schedule.
2 During a marketing campaign for a new product 176 out of the 200 potential users of this
product that were contacted indicated that they would use it. Is this evidence that more than
85% of all the potential users will actually use the product? Use α = 0.01.
Solution
H0 : p = 0.85 (85% of all potential users will use the product)
H1 : p > 0.85 (More than 85% of all potential users will use the product)
n = 200, x = 176, p0 = 0.85 (given), p̂ = 176/200 = 0.88.
Test statistic z0 = (0.88 - 0.85)/√(0.85 × 0.15/200) = 1.188.
α = 0.01. Critical region R = { z0 > Z0.99 = 2.326 }.
Since z0 = 1.188 < 2.326, H0 is not rejected.
Conclusion: There is not sufficient evidence that more than 85% of all potential users will use the product.
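Example 2 can be reproduced in a few lines; a sketch in Python, assuming SciPy is available for the normal quantile:

```python
from math import sqrt
from scipy.stats import norm

n, x, p0 = 200, 176, 0.85
p_hat = x / n                                   # 0.88
z0 = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)     # about 1.188

alpha = 0.01
z_crit = norm.ppf(1 - alpha)                    # Z0.99, about 2.326
reject = z0 > z_crit                            # H0 not rejected here
```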
8.7 Computer output
1 The output shown below results when the test for the population mean, for the data in example 1 in section 8.4, is performed using Excel.
t-Test: Mean

Mean                    129.1
Variance                93.25263158
Observations            20
Hypothesized Mean       120
df                      19
t Stat                  1.898752271
P(T<=t) one-tail        0.036445557
t Critical one-tail     1.729132792
The value of the test statistic is t0 = 1.90 (2 decimal places). From the table, the one-tail probability P(T ≥ 1.90) = 0.036. This probability is known as the p-value (the probability of getting a t value more extreme than the test statistic). When testing at the 5% level of significance, a p-value below 0.05 will cause the null hypothesis to be rejected. For the two-sided test performed in section 8.4 the relevant p-value is the two-tail probability 2 × 0.036 = 0.073, which exceeds 0.05, so H0 was not rejected there.
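The one-tail and two-tail p-values can be recovered from the t statistic in the output; a sketch using SciPy:

```python
from scipy import stats

t0, df = 1.898752271, 19           # t Stat and df from the Excel output above
p_one_tail = stats.t.sf(t0, df)    # P(T >= t0), about 0.0364
p_two_tail = 2 * p_one_tail        # about 0.0729, for a two-sided test
```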
2 The output shown below results when the test for the population variance in example 1 in section 8.5 (the data in example 1 in section 8.4) is performed using Excel.
Chi-square test: Variance

Variance                           93.25263
Observations                       20
Hypothesized variance              65
df                                 19
Chi-square stat                    27.25846
P(Chi-square<=27.25846) one-tail   0.098775
Chi-square critical one-tail       30.14353
The values of the test statistic and critical value are the same as in the example in section 8.5.
The p-value is 0.098775 (second to last entry in the second column in the table above). Since 0.098775 > 0.05 the null hypothesis cannot be rejected at the 5% level of significance.
Chapter 9 – Statistical Inference: Testing of hypotheses for two samples
9.1 Formulation of hypotheses, notation and additional results
The tests discussed in the previous chapter involve hypotheses concerning parameters of a
single population and were based on a random sample drawn from a single population of
interest. Often the interest is in tests concerning parameters of two different populations
(labeled populations 1 and 2) where two random samples (one from each population) are
drawn.
Examples
1 Are the mean salaries the same for males and females with the same educational
qualifications and work experience?
2 Do smokers and non-smokers have the same mortality rate?
3 Are the variances in drying times for two different types of paints different?
4 Is a particular diet successful in reducing people’s weights?
Null and alternative hypotheses
The following hypotheses involving two samples will be tested.
1 The test for equality of two variances. As an example see example 3 above.
2 The test for equality of two means (independent samples). As an example see example 1
above.
3 The test for equality of two means (paired samples). As an example see example 4 above.
4 The test for equality of two proportions. As an example see example 2 above.
The parameters to be used, when testing the hypotheses, are summarized in the table below.
Parameter    population 1   population 2
mean         μ1             μ2
variance     σ1²            σ2²
proportion   p1             p2
The following null and alternative hypotheses (as defined in section 8.1) also apply in the two
sample case.
H0: θ = θ0 (The statement that the parameter θ is equal to the hypothetical value θ0).
H1a: θ < θ0 or H1b: θ > θ0 or H1c: θ ≠ θ0.
Examples
1 When testing for equality of variances from 2 different populations labeled 1 and 2 the
hypotheses are
H0: σ1² = σ2²
H1a: σ1² < σ2² or H1b: σ1² > σ2² or H1c: σ1² ≠ σ2².
These hypotheses can also be written as
H0: σ1²/σ2² = 1
H1a: σ1²/σ2² < 1 or H1b: σ1²/σ2² > 1 or H1c: σ1²/σ2² ≠ 1.
In terms of the general notation stated above θ = σ1²/σ2² and θ0 = 1.
2 When testing for equality of means from 2 different populations labeled 1 and 2 the
hypotheses are
H0: 1   2
H1a: 1   2 or H1b: 1  2 or H1c: 1   2 .
These hypotheses can also be written as
H0: 1   2  0
H1a: 1   2  0 or H1b: 1  2  0 or H1c: 1   2  0
In terms of the general notation stated above   1   2 and  0  0 .
3 When testing for equality of proportions from 2 different populations labeled 1 and 2 the
hypotheses are
H0: p1 = p2
H1a: p1 < p2 or H1b: p1 > p2 or H1c: p1 ≠ p2.
These hypotheses can also be written as
H0: p1 - p2 = 0
H1a: p1 - p2 < 0 or H1b: p1 - p2 > 0 or H1c: p1 - p2 ≠ 0.
In terms of the general notation stated above θ = p1 - p2 and θ0 = 0.
Notation
The following notation will be used in the description of the two sample tests.

Measure                                notation (population 1)   notation (population 2)
sample size                            n                         m
sample                                 x1, x2, ..., xn           x1, x2, ..., xm
sample mean                            x̄1                        x̄2
sample variance (standard deviation)   S1² (S1)                  S2² (S2)
sample proportion                      p̂1 = xn*/n                p̂2 = xm*/m

xn* and xm* are the numbers of “success” items in the samples from populations 1 and 2 respectively.
Standard error and variance formulae
When testing hypotheses for the difference between two means (μ1 - μ2) or the difference between two proportions (p1 - p2), formulae for the standard errors of the corresponding sample differences (X̄1 - X̄2 when testing for the means, P̂1 - P̂2 when testing for the proportions) will be needed. These formulae are summarized in the table that follows.

Sample difference (θ̂)   condition                                      standard error SE(θ̂)
X̄1 - X̄2                 population variances not equal                 √(σ1²/n + σ2²/m)
X̄1 - X̄2                 population variances equal, σ1² = σ2² = σ²     σ√(1/n + 1/m)
P̂1 - P̂2                 population proportions not equal               √(p1(1-p1)/n + p2(1-p2)/m)
P̂1 - P̂2                 population proportions equal, p1 = p2 = p      √(p(1-p)(1/n + 1/m))

In each of the formulae the variance is the square of the standard error.
The general form of the test statistic used in most of these tests is Z = (θ̂ - θ0)/SE(θ̂), where Z follows a N(0,1) distribution. In some small sample cases the test statistic has the general form t = (θ̂ - θ0)/ŜE(θ̂), where t follows a t-distribution and ŜE(θ̂) is the estimated standard error.
Two sample sampling distribution results
1 For sufficiently large random samples (both n, m ≥ 30) drawn from populations (with known variances) that are not too different from a normal population, the statistic
Z = [X̄1 - X̄2 - (μ1 - μ2)] / √(σ1²/n + σ2²/m)
follows a N(0,1) distribution.
2 When σ1² = σ2² = σ² the above mentioned result still holds, but with σ√(1/n + 1/m) in the denominator.
3 When the population variances σ1², σ2² and σ², referred to in the two above mentioned results, are not known, they may be replaced by their sample estimates S1², S2² and
S² = [(n-1)S1² + (m-1)S2²] / (n + m - 2)
respectively. In such a case the resulting statistic follows
(i) a N(0,1) distribution when the sample sizes are large (both n and m ≥ 30).
(ii) a t-distribution when the sample sizes are small (at least one of n or m < 30). The degrees of freedom will depend on whether σ1² = σ2² = σ² is true or not. If σ1² = σ2² = σ² is true, the degrees of freedom is n + m - 2.
4 For sufficiently large random samples the statistic
Z = [P̂1 - P̂2 - (p1 - p2)] / √(p1(1-p1)/n + p2(1-p2)/m)
follows a N(0,1) distribution.
5 When p1 = p2 = p the above mentioned result still holds but with √(p(1-p)(1/n + 1/m)) in the denominator.
6 Provided the sample sizes are sufficiently large, the two above mentioned results will still be valid with p1, p2 and p in the denominator replaced by p̂1 = xn*/n, p̂2 = xm*/m and p̂ = (xn* + xm*)/(n + m) respectively.
9.2 Test for equality of population variances (F-test) and confidence interval for σ1²/σ2²
A summary of the steps to be followed in the testing procedure is shown below.
Test for σ1² = σ2²
Step 1: State null and alternative hypotheses
H0: σ1² = σ2² versus H1a: σ1² < σ2² or H1b: σ1² > σ2² or H1c: σ1² ≠ σ2²
Step 2: Calculate the test statistic F0 = max(S1², S2²) / min(S1², S2²).
Step 3: State the level of significance α and determine the critical value(s) and critical region.
Degrees of freedom are df1 = sample size (numerator sample variance) - 1 and df2 = sample size (denominator sample variance) - 1.
(i) For alternatives H1a and H1b the critical region is R = { F0 | F0 > F1-α }.
(ii) For alternative H1c the critical region is R = { F0 | F0 > F1-α/2 }.
Step 4: If F0 lies in the critical region, reject H0, otherwise do not reject H0.
Step 5: State the conclusion in terms of the original problem.
Confidence interval for σ1²/σ2²
Step 1: Calculate S1² and S2². Values of n, m and the confidence percentage are given.
Step 2: Determine the upper and lower F-distribution values for the given confidence percentage, df1 and df2.
Step 3: The confidence interval is ((S1²/S2²) × lower, (S1²/S2²) × upper).
Examples
1 The following sample information about the daily travel expenses of the sales (population
1) and audit (population 2) staff at a certain company was collected.
sales 1048 1080 1168 1320 1088 1136
audit 1040 816 1032 1142 1192 960 1112
(a) Test at the 10% level of significance whether the population variances could be the
same.
(b) Calculate a 95% confidence interval for σ1²/σ2².
(a) H0: σ1² = σ2²
H1: σ1² ≠ σ2²
From the above information n = 6, m = 7, S1² = 9593.6 and S2² = 15884.
Test statistic: F0 = max(9593.6, 15884)/min(9593.6, 15884) = 15884/9593.6 = 1.656
df1 = 7 - 1 = 6, df2 = 6 - 1 = 5, α = 0.10, α/2 = 0.05.
For df1 = 6, df2 = 5, F0.95 = 4.95.
Critical region R = { F0 > 4.95 }
Since F0 = 1.656 < 4.95, H0 is not rejected.
Conclusion: The population variances could be the same.
(b) P(F0.025 ≤ (S2²/σ2²)/(S1²/σ1²) ≤ F0.975) = 0.95, or equivalently P((S1²/S2²)F0.025 ≤ σ1²/σ2² ≤ (S1²/S2²)F0.975) = 0.95.
In the above expression S2² is in the numerator and S1² in the denominator. Hence, with df1 = 6 and df2 = 5, upper = F0.975 = 6.98, and lower = F0.025 is found from 1/F0.975 with df1 = 5, df2 = 6, i.e. lower = 1/5.99 = 0.1669.
Substituting S1²/S2² = 0.604, F0.025 = 0.1669 and F0.975 = 6.98 into the above gives a confidence interval of (0.604 × 0.1669, 0.604 × 6.98) = (0.101, 4.216).
2 The waiting times (minutes) for minor treatments were recorded at two different medical
centres. Below is a summary of the calculations made from the samples.
centre   sample size   mean    variance
1        12            25.69   7.200
2        10            27.66   22.017
Test at the 5% level of significance whether the centre 1 population variance is less than that
for population 2.
H0: σ1² = σ2²
H1: σ1² < σ2²
From the above table n = 12, m = 10, S1² = 7.200 and S2² = 22.017.
F0 = max(22.017, 7.200)/min(22.017, 7.200) = 22.017/7.200 = 3.058.
df1 = 10 - 1 = 9, df2 = 12 - 1 = 11, α = 0.05.
For df1 = 9, df2 = 11, F0.95 = 2.90.
Critical region R = { F0 > 2.90 }
Since F0 = 3.058 > 2.90, H0 is rejected.
Conclusion: The variance for population 1 is probably less than that for population 2.
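Both F-test examples follow the same recipe; a sketch of example 2 in Python, assuming SciPy is available:

```python
from scipy import stats

s1_sq, n = 7.200, 12    # centre 1: sample variance, sample size
s2_sq, m = 22.017, 10   # centre 2

# the larger sample variance goes in the numerator
f0 = max(s1_sq, s2_sq) / min(s1_sq, s2_sq)     # about 3.058
df_num, df_den = m - 1, n - 1                  # df of the numerator variance first
f_crit = stats.f.ppf(0.95, df_num, df_den)     # F0.95(9, 11), about 2.90
reject = f0 > f_crit                           # H0 rejected here
```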
9.3 Test for difference between means for independent samples
(i) For large samples (both sample sizes n, m ≥ 30)

Test for μ1 - μ2 = 0 (large samples, population variances known)
Step 1: State null and alternative hypotheses
H0: μ1 - μ2 = 0
H1a: μ1 - μ2 < 0 or H1b: μ1 - μ2 > 0 or H1c: μ1 - μ2 ≠ 0
Step 2: Calculate the test statistic z0 = (x̄1 - x̄2) / √(σ1²/n + σ2²/m).
Step 3: State the level of significance  and determine the critical value(s) and critical
region.
(i) For alternative H1a the critical region is R = { z0 | z0 < Zα }.
(ii) For alternative H1b the critical region is R = { z0 | z0 > Z1-α }.
(iii) For alternative H1c the critical region is R = { z0 | z0 > Z1-α/2 or z0 < Zα/2 }.
Step 4: If z0 lies in the critical region, reject H0, otherwise do not reject H0.
Step 5: State the conclusion in terms of the original problem.
A 100(1-α)% confidence interval for μ1 - μ2 is given by x̄1 - x̄2 ± Z1-α/2 √(σ1²/n + σ2²/m).
If the population variances σ1² and σ2² are not known, they can be replaced in the above formulae by their sample estimates S1² and S2² respectively with the testing procedure
unchanged.
Examples:
1 Data were collected on the length of short term stay of patients at hospitals. Independent
random samples of n = 40 male patients (population 1) and m = 35 female patients (population 2) were selected. The sample mean stays for male and female patients were x̄1 = 9 days and x̄2 = 7 days respectively. The population variances are known from past experience to be σ1² = 55 and σ2² = 47.
(a) Test at the 5% level of significance whether male patients stay longer on average than
female patients.
(b) Calculate a 95% confidence interval for the mean difference (in staying time) between
males and females.
(a) H0: μ1 - μ2 = 0 (mean staying times for males and females the same)
H1: μ1 - μ2 > 0 (mean staying time for males greater than for females)
Test statistic: z0 = (x̄1 - x̄2) / √(σ1²/n + σ2²/m) = 2/√(55/40 + 47/35) = 2/1.6486 = 1.213
α = 0.05. Critical region R = { z0 > Z0.95 = 1.645 }.
Since z0 = 1.213 < 1.645, H0 cannot be rejected.
Conclusion: The mean staying times for males and females are probably the same.
(b) x̄1 - x̄2 = 2, √(σ1²/n + σ2²/m) = 1.6486 (denominator value when calculating the test statistic), 1-α = 0.95, α = 0.05, α/2 = 0.025, Z1-α/2 = Z0.975 = 1.96.
Lower limit: x̄1 - x̄2 - Z1-α/2 √(σ1²/n + σ2²/m) = 2 - 1.96 × 1.6486 = -1.231
Upper limit: x̄1 - x̄2 + Z1-α/2 √(σ1²/n + σ2²/m) = 2 + 1.96 × 1.6486 = 5.231
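Parts (a) and (b) can be sketched in Python using the summary values from the worked computation (x̄2 = 7, so the difference equals the 2 used above; the 1.96 quantile is taken as given):

```python
from math import sqrt

xbar1, n, var1 = 9.0, 40, 55.0    # male patients
xbar2, m, var2 = 7.0, 35, 47.0    # female patients

se = sqrt(var1 / n + var2 / m)    # standard error, about 1.6486
z0 = (xbar1 - xbar2) / se         # about 1.213

# 95% confidence interval for mu1 - mu2
ci = (xbar1 - xbar2 - 1.96 * se, xbar1 - xbar2 + 1.96 * se)
```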
2 Researchers in obesity want to test the effectiveness of dieting with exercise against
dieting without exercise. Seventy three patients who were on the same diet were randomly
divided into an “exercise” group (n = 37 patients) and a “no exercise” group (m = 36 patients). The
results of the weight losses (in kilograms) of the patients after 2 months are summarized in
the table below.
                  Diet with exercise group   Diet without exercise group
sample mean       x̄1 = 7.6                   x̄2 = 6.7
sample variance   S1² = 2.53                 S2² = 5.59
Test at the 5% level of significance whether there is a difference in weight loss between the 2
groups.
H0: 1   2  0 (No difference in weight loss)
H1: 1  2  0 (There is a difference in weight loss)
Test statistic: z0 = (x̄1 - x̄2) / √(S1²/n + S2²/m) = (7.6 - 6.7) / √(2.53/37 + 5.59/36) = 0.9/0.473 = 1.903.
α = 0.05. Critical region R = { z0 < Z0.025 = -1.96 or z0 > Z0.975 = 1.96 }.
Since -1.96 < z0 = 1.903 < 1.96, H0 cannot be rejected.
Conclusion: There is not sufficient evidence to suggest a difference in weight loss between
the 2 groups.
(ii) For small samples (at least one of n or m < 30) from normal populations with
variances unknown
The test to be performed in this case will be preceded by a test for equality of population variances (σ1² = σ2² = σ²) i.e. the F-test discussed in section 9.2. If the hypothesis of equal variances cannot be rejected, the test described below should be performed. If this hypothesis is rejected, the Welch-Aspin test (not to be discussed here) should be performed. If, in this
case, the assumption of samples from normal populations does not hold, a nonparametric test
like the Mann-Whitney test (not to be discussed here) should be used.
Test for μ1 - μ2 = 0 (small sample sizes, population variances unknown but equal)
Step 1: State null and alternative hypotheses
H0: μ1 - μ2 = 0
H1a: μ1 - μ2 < 0 or H1b: μ1 - μ2 > 0 or H1c: μ1 - μ2 ≠ 0
Step 2: Calculate the test statistic t0 = (x̄1 - x̄2) / [S√(1/n + 1/m)] with S² = [(n-1)S1² + (m-1)S2²]/(n + m - 2)
Step 3: State the level of significance α and determine the critical value(s) and critical region.
Degrees of freedom = ν = n + m - 2.
(i) For alternative H1a the critical region is R = { t0 | t0 < tα }.
(ii) For alternative H1b the critical region is R = { t0 | t0 > t1-α }.
(iii) For alternative H1c the critical region is R = { t0 | t0 > t1-α/2 or t0 < tα/2 }.
Step 4: If t0 lies in the critical region, reject H0, otherwise do not reject H0.
Step 5: State the conclusion in terms of the original problem.

A 100(1-α)% confidence interval for μ1 - μ2 is given by x̄1 - x̄2 ± t(n+m-2; 1-α/2) S√(1/n + 1/m).
Examples
1 Consider the above example on the comparison of the travel expenses for the sales and audit staff (see section 9.2, example 1 for the F-test).
(a) Test, at the 5% level of significance, whether the mean expenses for the two types of
staff could be the same.
(b) Calculate a 95% confidence interval for the mean difference between the mean expenses
for the two types of staff.
(a) Since the hypothesis of equal population variances was not rejected, the test described above can be performed. From the data given x̄1 = 1140, x̄2 = 1042, S1² = 9593.6 and S2² = 15884.
H0: μ1 - μ2 = 0 (Mean travel expenses for sales and audit staff the same)
H1: μ1 - μ2 ≠ 0 (Mean travel expenses for sales and audit staff not the same)
α = 0.05, α/2 = 0.025, 1-α/2 = 0.975. From the t-distribution with ν = n + m - 2 = 6 + 7 - 2 = 11 degrees of freedom, t0.975 = 2.201.
S² = [(n-1)S1² + (m-1)S2²]/(n + m - 2) = (5 × 9593.6 + 6 × 15884)/11 = 13024.727, S = 114.126
Test statistic: t0 = (1140 - 1042) / [114.126 √(1/6 + 1/7)] = 1.543.
Critical region R = { t0 > t0.975 = 2.201 or t0 < -2.201 }.
Since 1.543 < t0.975 = 2.201, H0 cannot be rejected.
Conclusion: Mean travel expenses for sales and audit staff are probably the same.
(b) A 95% confidence interval for the difference between the sales and audit staff means is 1140 - 1042 ± 2.201 × 114.126 × √(1/6 + 1/7) i.e. (-41.75, 237.75).
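The pooled t-test of example 1 can also be verified directly from the raw data; a sketch assuming SciPy is available (its `ttest_ind` with `equal_var=True` performs exactly this pooled-variance test):

```python
import numpy as np
from scipy import stats

sales = np.array([1048, 1080, 1168, 1320, 1088, 1136])
audit = np.array([1040, 816, 1032, 1142, 1192, 960, 1112])

# pooled two-sample t-test
t0, p_two_tail = stats.ttest_ind(sales, audit, equal_var=True)
# t0 about 1.543 and p_two_tail about 0.151, matching the Excel output in section 9.6
```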
2 A certain hospital has been getting complaints that the response to calls from senior
citizens is slower (takes longer time on average) than that to calls from other patients. In
order to test this claim, a pilot study was carried out. The results are shown below.
Patient type      sample mean response time   sample standard deviation   sample size
Senior citizens   5.60 minutes                0.25 minutes                18
Others            5.30 minutes                0.21 minutes                13
Test, at the 1% level of significance, whether the complaint is justified.
Label the “senior citizens” and “others” populations as 1 and 2 and their population mean response times as μ1 and μ2 respectively.
H0: 1   2  0 (Mean response times the same)
H1: 1   2  0 (Mean response time for senior citizens longer than for others)
The hypothesis that the population variances are equal cannot be rejected (perform the F-test
to check this). Hence equal variances for the 2 populations can be assumed.
S² = (17 × 0.25² + 12 × 0.21²)/29 = 0.0549, S = 0.2343
Test statistic: t0 = (5.6 - 5.3) / [0.2343 √(1/18 + 1/13)] = 3.518
α = 0.01, 1-α = 0.99. From the t-distribution table with ν = n + m - 2 = 18 + 13 - 2 = 29 degrees of freedom, t0.99 = 2.462.
Critical region R = { t0 > t0.99 = 2.462 }.
Since t0 = 3.518 > 2.462, H0 is rejected.
Conclusion: The claim is justified i.e. the mean response time for senior citizens is longer than that for others.
9.4 Test for difference between means for paired (matched) samples
The tests for the difference between means in the previous section assumed independent
samples. In certain situations this assumption is not met.
Examples
1 A group of patients going on a diet is weighed before going on the diet and again after
having been on the diet for one month. A test to determine whether the diet has reduced their
weight is to be performed.
2 The aptitudes of boys and girls for mathematics are to be compared. In order to eliminate
the effect of social factors, pairs of brothers and sisters are used in the comparison. Each
(brother, sister) pair is given the same test and the mean marks of boys and girls compared.
In each of these situations the two samples cannot be regarded as independent. In the first
example two readings (before and after readings) are made on the same subject. In the second
example the two samples are matched via a common factor (family connection).
The data layout for the experiments described above is shown below.

sample 1   sample 2   difference
x1         y1         d1 = x1 - y1
x2         y2         d2 = x2 - y2
...        ...        ...
xn         yn         dn = xn - yn
The mean of the paired differences of the (x, y) values of the two populations is defined as μd = μ1 - μ2. Under the assumption that the differences are sampled from a normal population, hypotheses concerning the mean of the differences μd can be tested by
performing a one sample t-test (described in the previous chapter) with the observed differences d1, d2, ..., dn as the sample. The mean and standard deviation of these sample differences will be denoted by d̄ and Sd respectively.
Test for μd = 0 (paired samples)
Step 1: State null and alternative hypotheses
H0: μd = 0
H1a: μd < 0 or H1b: μd > 0 or H1c: μd ≠ 0
Step 2: Calculate the test statistic t0 = d̄ / (Sd/√n).
Step 3: State the level of significance α and determine the critical value(s) and critical region.
Degrees of freedom = ν = n - 1.
(i) For alternative H1a the critical region is R = { t0 | t0 < tα }.
(ii) For alternative H1b the critical region is R = { t0 | t0 > t1-α }.
(iii) For alternative H1c the critical region is R = { t0 | t0 > t1-α/2 or t0 < tα/2 }.
Step 4: If t0 lies in the critical region, reject H0, otherwise do not reject H0.
Step 5: State the conclusion in terms of the original problem.

A 100(1-α)% confidence interval for μd is given by d̄ ± t-multiplier × Sd/√n, where the t-multiplier is obtained from the t-tables with n - 1 degrees of freedom with an area 1 - α/2 under the t-curve below it.
Examples
1 A bank is considering loan applications for buying each of 10 homes. Two different
companies (company 1 and company 2) are asked to do an evaluation of each of these 10
homes. The evaluations (thousands of Rand) for these homes are shown in the table below.
Home   company 1   company 2   difference
1      750         810         60
2      990         1000        10
3      1025        1020        -5
4      1285        1320        35
5      1300        1290        -10
6      875         915         40
7      1240        1250        10
8      880         910         30
9      700         650         -50
10     1315        1290        -25
(a) At the 5% level of significance, is there a difference in the mean evaluations for the 2
companies?
(b) Calculate a 95% confidence interval for the difference between the mean evaluations for
companies 1 and 2.
(a) H0: μd = 0 (No difference in mean evaluations)
H1: μd ≠ 0 (There is a difference in mean evaluations)
From the above table d̄ = 9.5, Sd = 33.12015, n = 10.
Test statistic: t0 = 9.5 / (33.12015/√10) = 0.907.
α = 0.05, α/2 = 0.025, 1-α/2 = 0.975. From the t-tables with ν = n - 1 = 9 degrees of freedom, t0.975 = 2.262.
Critical region R = { t0 > 2.262 or t0 < -2.262 }.
Since t0 = 0.907 < 2.262, H0 is not rejected.
(b) A 95% confidence interval is given by 9.5 ± 2.262 × 33.12015/√10 = (-14.19, 33.19).
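The paired test in part (a) is just a one-sample t-test on the differences; a sketch with SciPy:

```python
import numpy as np
from scipy import stats

company1 = np.array([750, 990, 1025, 1285, 1300, 875, 1240, 880, 700, 1315])
company2 = np.array([810, 1000, 1020, 1320, 1290, 915, 1250, 910, 650, 1290])

# paired t-test on the differences company2 - company1
t0, p_two_tail = stats.ttest_rel(company2, company1)       # t0 about 0.907

# the equivalent one-sample computation on the differences
d = company2 - company1
t0_check = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))    # same value
```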
2 Each of 15 people going on a diet was weighed before going on the diet and again after having been on the diet for one month. The weights (in kilograms) are shown in the table below.
Person   before   after   difference
1        90       85      -5
2        110      105     -5
3        124      126     2
4        116      118     2
5        105      94      -11
6        88       84      -4
7        86       87      1
8        92       87      -5
9        101      99      -2
10       112      105     -7
11       138      130     -8
12       96       93      -3
13       102      95      -7
14       111      102     -9
15       82       83      1
Test, at the 1% level of significance, whether the mean weight after one month on the diet is
less than that before going on the diet.
Let μd denote the mean difference between the weight after having been on the diet for one month and the weight before going on the diet.
H0: μd = 0 (No difference in mean weights)
H1: μd < 0 (Mean weight after one month on diet less than before going on diet)
From the above table d̄ = -4, Sd = 4.1231, n = 15.
Test statistic: t0 = -4 / (4.1231/√15) = -3.757.
α = 0.01. From the t-tables with ν = n - 1 = 14 degrees of freedom, tα = t0.01 = -t0.99 = -2.624.
Critical region R = { t0 < -2.624 }.
Since t0 = -3.757 < -2.624, H0 is rejected.
Conclusion: The mean weight after one month on the diet is less than before going on the diet.
9.5 Test for the difference between proportions for independent samples
When testing for the difference between the proportions of two different populations, the test
is based on the sampling distribution results 4-6 described in the first section of this chapter.
Test for p1 - p2 = 0
Step 1: State null and alternative hypotheses
H0: p1 - p2 = 0
H1a: p1 - p2 < 0 or H1b: p1 - p2 > 0 or H1c: p1 - p2 ≠ 0
Step 2: Calculate the test statistic z0 = (p̂1 - p̂2) / √(p̂(1 - p̂)(1/n + 1/m)) with p̂ = (xn* + xm*)/(n + m).
Step 3: State the level of significance α and determine the critical value(s) and critical region.
(i) For alternative H1a the critical region is R = { z0 | z0 < Zα }.
(ii) For alternative H1b the critical region is R = { z0 | z0 > Z1-α }.
(iii) For alternative H1c the critical region is R = { z0 | z0 > Z1-α/2 or z0 < Zα/2 }.
Step 4: If z0 lies in the critical region, reject H0, otherwise do not reject H0.
Step 5: State the conclusion in terms of the original problem.

A 100(1-α)% confidence interval for p1 - p2 is given by p̂1 - p̂2 ± Z1-α/2 √(p̂(1 - p̂)(1/n + 1/m)).
Example
A perfume company is planning to market a new fragrance. In order to test the popularity of
the fragrance, 120 young women and 150 older women were selected at random and asked
whether they liked the new fragrance. The results of the survey are shown below.
women   like   did not like   sample size
young   48     72             120
older   72     78             150
(a) Test, at the 5% level of significance, whether older women like the new fragrance more
than young women.
(b) Calculate a 95% confidence interval for the difference between the proportions of older
and young women who like the fragrance.
(a) Let the older and younger women populations be labeled 1 and 2 respectively and p1 and p2 the respective population proportions that like the fragrance.
H0: p1 - p2 = 0
H1: p1 - p2 > 0
From the above table n = 150, m = 120, xn* = 72, xm* = 48.
p̂ = (72 + 48)/(150 + 120) = 120/270 = 4/9
p̂1 - p̂2 = 72/150 - 48/120 = 0.48 - 0.40 = 0.08
Test statistic: z0 = 0.08 / √((4/9)(5/9)(1/150 + 1/120)) = 0.08/0.060858 = 1.3145.
α = 0.05. Critical region R = { z0 > Z0.95 = 1.645 }.
Since z0 = 1.3145 < 1.645, H0 cannot be rejected.
Conclusion: There is not sufficient evidence to suggest that older women like the new fragrance more than young women.
(b) √(p̂(1 - p̂)(1/n + 1/m)) = 0.060858 [denominator of z0 in part (a)].
p̂1 - p̂2 = 0.08 [numerator of z0 in part (a)], Z0.975 = 1.96.
p̂1 - p̂2 ± Z1-α/2 √(p̂(1 - p̂)(1/n + 1/m)) = 0.08 ± 1.96 × 0.060858 = (-0.039, 0.199).
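The two-proportion test of part (a) in Python (a sketch; the pooled estimate p̂ is used in the standard error, as in the procedure above):

```python
from math import sqrt
from scipy.stats import norm

x1, n = 72, 150     # older women who like the fragrance
x2, m = 48, 120     # young women who like the fragrance

p1, p2 = x1 / n, x2 / m
p_pool = (x1 + x2) / (n + m)                        # 4/9
se = sqrt(p_pool * (1 - p_pool) * (1 / n + 1 / m))  # about 0.060858
z0 = (p1 - p2) / se                                 # about 1.3145
p_value = norm.sf(z0)                               # one-sided p-value
```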
9.6 Computer output
1 The test for the difference between population means in example 1 in section 9.3(ii) (the data in example 1 in section 9.2) can be performed using Excel. The output follows.
t-Test: Two-Sample Assuming Equal Variances

                               Variable 1   Variable 2
Mean                           1140         1042
Variance                       9593.6       15884
Observations                   6            7
Pooled Variance                13024.73
Hypothesized Mean Difference   0
df                             11
t Stat                         1.543458
P(T<=t) two-tail               0.150984
t Critical two-tail            2.200985
The p-value is 0.150984 > 0.05. At the 5% level of significance the null hypothesis cannot be
rejected.
2 The output shown below results when the test for equality of population variances for the data in example 1 in section 9.2 is performed using Excel.
F-Test Two-Sample for Variances

                      Variable 1   Variable 2
Mean                  1140         1042
Variance              9593.6       15884
Observations          6            7
df                    5            6
f                     0.603979
P(F<=f)               0.701718
F Critical two-tail   0.143266
The value of the test statistic shown in the above table is s1²/s2² = 9593.6/15884 = 0.603979. The critical value (last entry under Variable 1 in the above table) is F5,6;0.025 = 1/F6,5;0.975 = 1/6.98 = 0.143266 and the p-value (second to last entry under Variable 1 in the above table) is 0.701718. Since 0.701718 > 0.025, the null hypothesis cannot be rejected.
Chapter 10 – Linear Correlation and regression
10.1 Bivariate data and scatter diagrams
Often two variables are measured simultaneously and relationships between these variables
explored. Data sets involving two variables are known as bivariate data sets.
The first step in the exploration of bivariate data is to plot the variables on a graph. From
such a graph, which is known as a scatter diagram (scatter plot, scatter graph), an idea can
be formed about the nature of the relationship.
Examples
1 The number of copies sold (y) of a new book is dependent on the advertising budget (x)
the publisher commits in a pre-publication campaign. The values of x and y for 12 recently
published books are shown below.
x (thousands of rands)   8      9.5    7.2    6.5    10     12     11.5   14.8   17.3   27     30     25
y (thousands)            12.5   18.6   25.3   24.8   35.7   45.4   44.4   45.8   65.3   75.7   72.3   79.2
Scatter diagram: “Advertising budget and copies sold” (copies sold on the vertical axis, advertising budget on the horizontal axis).
2 In a study of the relationship between the amount of daily rainfall (x) and the quantity of
air pollution removed (y), the following data were collected.
Rainfall (centimeters)                          4.3   4.5   5.9   5.6   6.1   5.2   3.8   2.1   7.5
Quantity removed (micrograms per cubic meter)   126   121   116   118   114   118   132   141   108
Scatter diagram: “Rainfall and quantity removed” (quantity removed on the vertical axis, rainfall on the horizontal axis).
1 In both cases the relationship can be fairly well described by means of a straight line i.e.
both these relationships are linear relationships.
2 In the first example y tends to increase as x increases (positive linear relationship).
3 In the second example y tends to decrease as x increases (negative linear relationship).
4 In both the examples changes in the values of y are affected by changes in the values of x
(not the other way round). The variable x is known as the explanatory (independent)
variable and the variable y the response (dependent) variable.
In this section only linear relationships between 2 variables will be explored. The issues to be
explored are
1 Measuring the strength of the linear relationship between the 2 variables (the linear
correlation problem).
2 Finding the equation of the straight line that will best describe the relationship between
the 2 variables (the linear regression problem). Once this line is determined, it can be used
to estimate a value of y for given value of x (linear estimation).
10.2 Linear Correlation
The calculation of the coefficient of correlation (r) is based on the closeness of the plotted
points (in the scatter diagram) to the line fitted through them. It can be shown that
-1 ≤ r ≤ 1.
If the plotted points are closely clustered around this line, r will lie close to either 1 or -1
(depending on whether the linear relationship is positive or negative). The further the plotted
points are away from the line, the closer the value of r will be to 0. Consider the scatter
diagrams below.
[Three scatter diagrams: strong positive correlation (r close to 1); strong negative correlation (r close to -1); no pattern (r close to 0).]
For a sample of n pairs of values (x1, y1) , (x2, y2), . . . , (xn, yn) , the coefficient of
correlation can be calculated from the formula
r = (n∑xy − ∑x∑y) / √{[n∑x² − (∑x)²][n∑y² − (∑y)²]} .
Example
Consider the data on the advertising budget (x) and the number of copies sold (y) considered
earlier. For this data r can be calculated in the following way.
x      y      xy        x²        y²
8      12.5   100       64        156.25
9.5    18.6   176.7     90.25     345.96
7.2    25.3   182.16    51.84     640.09
6.5    24.8   161.2     42.25     615.04
10     35.7   357       100       1274.49
12     45.4   544.8     144       2061.16
11.5   44.4   510.6     132.25    1971.36
14.8   45.8   677.84    219.04    2097.64
17.3   65.3   1129.69   299.29    4264.09
27     75.7   2043.9    729       5730.49
30     72.3   2169      900       5227.29
25     79.2   1980      625       6272.64
sum: 178.8    545       10032.89  3396.92   30656.5
Substituting n=12, ∑ x = 178.8, ∑ y = 545, ∑ xy = 10032.89, ∑ x2 = 3396.92 and
∑ y2 = 30656.5 into the equation for r gives
r = (12 × 10032.89 − 178.8 × 545) / √{[12 × 3396.92 − (178.8)²][12 × 30656.5 − (545)²]}
  = 22948.68 / √(8793.6 × 70853)
  = 0.9194.
Comment: Strong positive correlation i.e. the increase in the number of copies sold is closely
linked with an increase in advertising budget.
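As a cross-check, the calculation above can be reproduced in a few lines of Python. This is a minimal sketch; the helper function `correlation` is our own name, not a library routine.

```python
import math

# Advertising budget (x, thousands of rands) and copies sold (y, thousands)
# from the worked example above.
x = [8, 9.5, 7.2, 6.5, 10, 12, 11.5, 14.8, 17.3, 27, 30, 25]
y = [12.5, 18.6, 25.3, 24.8, 35.7, 45.4, 44.4, 45.8, 65.3, 75.7, 72.3, 79.2]

def correlation(x, y):
    """Pearson correlation coefficient via the computational formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    syy = sum(yi * yi for yi in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

r = correlation(x, y)
print(round(r, 4))  # 0.9194
```

Squaring the result gives the coefficient of determination discussed in the next subsection.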
Coefficient of determination
The strength of the linear relationship between 2 variables can also be measured by the
square of the correlation coefficient (r²). This quantity, called the coefficient of
determination, is the proportion of the variability in the y variable that is accounted for by
its linear relationship with the x variable.
Example
In the above example on copies sold (y) and advertising budget (x), the
coefficient of determination = r2 = 0.91942 = 0.8453.
This means that 84.53% of the variability in copies sold is explained by its linear
relationship with advertising budget.
10.3 Linear Regression
Finding the equation of the line that best fits the (x, y) points is based on the least squares
principle. This principle can best be explained by considering the scatter diagram below.
The scatter diagram is a plot of the DBH (diameter at breast height) versus the age for 12 oak
trees. The data are shown in the table below.
Age x (years):  97    93    88   81   75    57   52    45  28  15   12  11
DBH y (inch):   12.5  12.5  8    9.5  16.5  11   10.5  9   6   1.5  1   1
According to the least squares principle, the line that “best” fits the plotted points is the one
that minimizes the sum of the squares of the vertical deviations (see vertical lines in the
above graph) between the plotted y and estimated y (values on the line). For this reason the
line fitted according to this principle is called the least squares line.
Calculation of least squares linear regression line
The equation for the line to be fitted to the (x, y) points is
ŷ = a + bx,
where ŷ is the fitted y value (the y value on the line, which differs from the observed y value),
a is the y-intercept and b the slope of the line.
It can be shown that the coefficients that define the least squares line can be calculated from
b = (n∑xy − ∑x∑y) / [n∑x² − (∑x)²]  and  a = ȳ − b x̄ .
Example
For the above data on age (x) and DBH (y) the least squares line can be calculated as shown
below.
x     y     xy      x²
97    12.5  1212.5  9409
93    12.5  1162.5  8649
88    8     704     7744
81    9.5   769.5   6561
75    16.5  1237.5  5625
57    11    627     3249
52    10.5  546     2704
45    9     405     2025
28    6     168     784
15    1.5   22.5    225
12    1     12      144
11    1     11      121
sum: 654    99      6877.5  47240
Substituting n = 12, ∑x = 654, ∑y = 99, ∑xy = 6877.5 and ∑x² = 47240 into the above
equations gives
b = (12 × 6877.5 − 654 × 99) / (12 × 47240 − (654)²) = 17784/139164 = 0.12779 and
a = 99/12 − 0.12779 × 654/12 = 1.285.
Therefore the equation of the y on x least squares line that can be used to estimate values of y
(DBH) based on x (age) is
ŷ = 1.285 + 0.12779 x.
Suppose the DBH of a tree aged 90 years is to be estimated. This can be done by substituting
the value of x = 90 into the above equation. Then
ŷ = 1.285 + 0.12779*90 = 12.786.
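The fitted line and the estimate above can be reproduced with a short Python sketch (the helper name `least_squares` is our own, not a library function):

```python
# Age (x, years) and DBH (y, inches) for the 12 oak trees above.
x = [97, 93, 88, 81, 75, 57, 52, 45, 28, 15, 12, 11]
y = [12.5, 12.5, 8, 9.5, 16.5, 11, 10.5, 9, 6, 1.5, 1, 1]

def least_squares(x, y):
    """Intercept a and slope b of the least squares line y-hat = a + b*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = sy / n - b * sx / n
    return a, b

a, b = least_squares(x, y)
print(round(a, 3), round(b, 5))  # 1.285 0.12779
print(round(a + b * 90, 3))      # estimated DBH at age 90 (about 12.79)
```

The small difference from the hand calculation (12.786) comes from rounding the coefficients before substituting.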
A word of caution
1 The linear relationship between y and x is often only valid for values of x within a certain
range e.g. when estimating the DBH using age as explanatory variable, it should be taken into
account that at some age the tree will stop growing. Assuming a linear relationship between
age and DBH for values beyond the age where the tree stops growing would be incorrect.
2 Only relationships between variables that could be related in a practical sense should be
explored, e.g. it would be pointless to explore the relationship between the number of vehicles
in New York and the number of divorces in South Africa. Even if data collected on such
variables might suggest a relationship, it cannot be of any practical value.
3 If variables are not linearly related, it does not mean that they are not related. There are
many situations where the relationships between variables are non-linear.
Example
A plot of the banana consumption (y) versus the price (x) is shown in the graph below. A
straight line will not describe this relationship very well, but the non-linear curve shown
below will describe it well.
[Figure: banana consumption (y) plotted against price (x), with a fitted non-linear curve.]
10.4 Computer output
Consider the data on age (x variable) and DBH (y variable). The output when performing a
straight line regression on this data in Excel is shown below.
SUMMARY OUTPUT

Regression Statistics
R Square     0.689307572

ANOVA
             df    SS            MS          F         Significance F
Regression   1     189.3872553   189.3873    22.1862   0.000828626
Residual     10    85.36274468   8.536274
Total        11    274.75

             Coefficients   Standard Error   t Stat     P-value
Intercept    1.285353971    1.702259153      0.755087   0.46761
X Variable   0.12779167     0.027130722      4.71022    0.00083
1 The coefficient of determination in the above table is R square = 0.689307572.
2 The ANOVA (Analysis of Variance) table is constructed to test whether there is a
significant linear relationship between X and Y. The p-value for this test is the entry under
the Significance F heading in the ANOVA table. Since this p-value < 0.05 (or 0.01), the
hypothesis of “no linear relationship between X and Y” can be rejected and it can be
concluded that there is a significant linear relationship between X and Y.
3 The third of the tables in the summary output shows the intercept and slope values of the
line. These are the first two entries under Coefficients. The remaining columns to the right of
the Coefficients column concern the performance of tests for zero intercept and slope. From
the intercept and slope p-values (0.46761 and 0.00083 respectively) it can be seen that the
intercept is not significantly different from zero at the 5% level of significance
(0.46761>0.05) but that the slope is significantly different from zero at the 5% or 1% levels
of significance (0.00083 < 0.01 < 0.05).
When the correlation coefficient is calculated for the above-mentioned data using Excel, the
output is as shown below.

          Column 1   Column 2
Column 1  1
Column 2  0.83025    1
The above table shows that the correlation between x and y is 0.83025.
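The entries in the Excel output can be checked from the ANOVA identity SS total = SS regression + SS residual. The sketch below recomputes the main quantities in Python using only the formulas already given in this chapter (variable names are our own):

```python
import math

# Age (x) and DBH (y) data from section 10.3, used for the Excel output above.
x = [97, 93, 88, 81, 75, 57, 52, 45, 28, 15, 12, 11]
y = [12.5, 12.5, 8, 9.5, 16.5, 11, 10.5, 9, 6, 1.5, 1, 1]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Slope and intercept (same least squares formulas as in section 10.3).
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = ybar - slope * xbar

# ANOVA decomposition: SS total = SS regression + SS residual.
ss_total = sum((yi - ybar) ** 2 for yi in y)
ss_reg = slope ** 2 * sxx
ss_res = ss_total - ss_reg

r_square = ss_reg / ss_total                   # Excel's "R Square"
f_stat = (ss_reg / 1) / (ss_res / (n - 2))     # Excel's "F" (df = 1 and n - 2)
r = math.copysign(math.sqrt(r_square), slope)  # matches Excel's correlation value

print(round(r_square, 6), round(f_stat, 4), round(r, 5))
```

The square root of R Square, taken with the sign of the slope, reproduces the correlation of 0.83025 shown above.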
TUTORIAL QUESTIONS
CHAPTERS 5 TO 10