MATH 2441
Probability and Statistics for Biological Sciences
Interval Estimates of the Difference of Two Population Means
Independent Samples -- Small Sample Case
This document continues the discussion of methods for estimating the difference between two population
mean values. The preceding document presented a method that is valid when
(i.) the two samples are independent
(ii.) each sample has a size of 30 or larger
If the second condition doesn't hold, that is, the two samples available are not both of size 30 or larger, the
estimation of the difference of two population means becomes somewhat more dicey. Unfortunately, this is
quite a common situation in technical work.
When condition (ii) above does not hold, the consensus seems to be that the most favorable situation is one
in which two additional conditions hold:
(iia) the populations are approximately normally distributed
(iib) the population variances are equal: $\sigma_1^2 = \sigma_2^2 = \sigma^2$
When these two conditions hold, along with condition (i) above, we have the small sample estimation of
the difference of two population means: equal variances case. When these three conditions hold, then
the random variable
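$$
t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma^2}{n_1} + \dfrac{\sigma^2}{n_2}}}
  = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}
\tag{DMS-1}
$$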
is approximately t-distributed with $n_1 + n_2 - 2$ degrees of freedom. In the usual fashion, this leads to a
confidence interval formula for $\mu_1 - \mu_2$:
$$
\mu_1 - \mu_2 = \bar{x}_1 - \bar{x}_2 \pm t_{\alpha/2,\,\nu}\, \sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}
\qquad @\;100(1-\alpha)\%
\tag{DMS-2}
$$
If you happen to know what this common value of $\sigma$ is, then you can use formula (DMS-2) directly. In the
much more common situation that $\sigma$ is not known, we need to use the observed sample variances to
estimate $\sigma$.
Now, $s_1^2$ is an unbiased estimator of $\sigma_1^2 = \sigma^2$ and $s_2^2$ is an unbiased estimator of
$\sigma_2^2 = \sigma^2$. Thus, both $s_1^2$ and $s_2^2$ are estimating the same parameter $\sigma^2$. An even
better estimate of this common $\sigma^2$ would be to take an average of $s_1^2$ and $s_2^2$ weighted by the
respective degrees of freedom, $n_1 - 1$ and $n_2 - 1$ (that is, essentially by the sample sizes). The formula
for this so-called pooled sample variance is
$$
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}
\tag{DMS-3}
$$
$s_p$ is then substituted for $\sigma$ in formula (DMS-2) to get the formula
$$
\mu_1 - \mu_2 = \bar{x}_1 - \bar{x}_2 \pm t_{\alpha/2,\,\nu}\, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}
\qquad @\;100(1-\alpha)\%
\tag{DMS-4}
$$
To repeat, in both (DMS-2) and (DMS-4), the number of degrees of freedom for the t-statistic is
$\nu = n_1 + n_2 - 2$.
© David W. Sabo (1999)
Estimation of Difference of Two Means: Small Samples
Page 1 of 6
Example 1: (PAH)
One of the standard data sets we're using involves comparison of levels of polycyclic aromatic hydrocarbons
in river sediments at different times of the year. A sample of 8 specimens of sediments collected in April
gave a sample mean of 194.63 µg/g with a sample standard deviation of 64.66 µg/g. A second sample of 12
specimens was collected in July, giving a sample mean of 134.69 µg/g and a sample standard deviation of
66.98 µg/g. Estimate the difference of the mean PAH levels in these sediments for April and July.
Solution
In summary and with standard notation, we are being asked to estimate $\mu_{\text{April}} - \mu_{\text{July}}$
given the following information:
$n_{\text{April}} = 8$, $\bar{x}_{\text{April}} = 194.63$ µg/g, $s_{\text{April}} = 64.66$ µg/g
and
$n_{\text{July}} = 12$, $\bar{x}_{\text{July}} = 134.69$ µg/g, $s_{\text{July}} = 66.98$ µg/g
Clearly, with sample sizes of 8 and 12 respectively, we are dealing with a small sample situation. We will
assume that the two samples are independent because it is unlikely that exactly the same locations of the
riverbed were sampled on the two occasions. This leaves two additional conditions to be met before we
should be comfortable with using (DMS-4): are the samples consistent with approximately normally
distributed populations, and are the population standard deviations equal?
It is relatively common practice to simply consider the issue of normality essentially unanswerable for such
small samples, and so assume the condition (iia) is met. However, we can take a brief look at the normal
probability plots for both sets of data:
[Figure: normal probability plots of the PAH levels, one panel for the April sample and one for the July sample.]
One might think there is cause for concern in the April data, with an apparent curvature in the pattern of
points. However, that curvature is caused by just the two most extreme observations (granted out of a total
of only eight), and so we are probably justified in not giving the apparent non-normality much weight. The
same thing happens in the July data, with the two most extreme values giving the sense of an upward
curvature to the probability plot. Other than those two points, the July plot seems to be quite a good straight
line. So, it appears at least that there is no strong reason to doubt the normality of the population
distributions.
We can dispense with condition (iib) quite quickly here. Since $s_{\text{April}} = 64.66$ and $s_{\text{July}} = 66.98$ are so close in
value, it would be perverse to assert a strong suspicion that the two populations have vastly different
standard deviations or variances.
The number of degrees of freedom is 8 + 12 - 2 = 18. Thus, to write down a 95% confidence interval
estimate, we need $t_{0.025,\,18} = 2.101$. The pooled variance is
$$
s_p^2 = \frac{(8 - 1)(64.66)^2 + (12 - 1)(66.98)^2}{8 + 12 - 2} \approx 4367.55
$$
or
$s_p \approx 66.09$ µg/g
(Notice that since $s_p^2$ is a weighted average of the two original sample variances, its value must be between
the values of those two variances, and so $s_p$ itself must have a value between the two sample standard
deviations. If you find this is not so, you've made an arithmetic mistake!)
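So, finally,
$$
\mu_{\text{April}} - \mu_{\text{July}} = 194.63 - 134.69 \pm (2.101)(66.09)\sqrt{\frac{1}{8} + \frac{1}{12}}
\qquad @\;95\%
$$
$$
= 59.94 \pm 63.38 \qquad @\;95\%
$$
or, in interval form
$$
-3.44\ \text{µg/g} \le \mu_{\text{April}} - \mu_{\text{July}} \le 123.32\ \text{µg/g} \qquad @\;95\%
$$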
Unfortunately, this confidence interval estimate just catches the value zero, reducing its meaningfulness
rather drastically. At a 95% confidence level, we cannot rule out the possibility that the two mean values are
identical.
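The arithmetic of Example 1 can be checked with a short script. This is only a sketch of formulas (DMS-3) and (DMS-4): the function name `pooled_ci` is made up for illustration, and the t-factor is passed in as a number read from a t-table (as in the text) rather than computed.

```python
import math

def pooled_ci(n1, mean1, s1, n2, mean2, s2, t_crit):
    """Return (difference, margin, lower, upper) for mu1 - mu2.

    t_crit is the table value t_{alpha/2, n1 + n2 - 2}; it is supplied by
    the caller, mirroring the table lookup used in the text.
    """
    # (DMS-3): pooled variance, each sample variance weighted by n - 1
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    sp = math.sqrt(sp2)
    # (DMS-4): half-width of the confidence interval
    margin = t_crit * sp * math.sqrt(1 / n1 + 1 / n2)
    diff = mean1 - mean2
    return diff, margin, diff - margin, diff + margin

# Example 1 (PAH): April sample first, July sample second, t_{0.025,18} = 2.101
diff, margin, lower, upper = pooled_ci(8, 194.63, 64.66, 12, 134.69, 66.98, t_crit=2.101)
print(f"{diff:.2f} +/- {margin:.2f}")     # 59.94 +/- 63.38
print(f"({lower:.2f}, {upper:.2f})")      # (-3.44, 123.32)
```

Because the lower limit is negative and the upper limit is positive, the script reproduces the observation that the interval just catches zero.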

The two new conditions, (iia) and (iib), in this small sample case can be a bit problematic. We've illustrated
a bit how to deal with them in the rather detailed example above. The following general comments have
some support in statistical theory.
First, the method above does not seem to be unduly sensitive to small departures from normality in the
population distributions (this seems to be a general characteristic of methods based on the t-distribution).
Rough checks of normality should be adequate, either by constructing normal probability plots as was
done above, or by just looking at frequency histograms or stemplots for the data. Results seem to be acceptable as
long as the population distribution has a single major peak and is not too asymmetric.
The requirement of equal variances in the two populations is much more problematic, both from the point of
view of confirmation that it has been met, and also in devising an interval estimator which is valid when it
appears that the populations do not come close to sharing common variances.
First, how can you tell if it is reasonable to assume that $\sigma_1^2 = \sigma_2^2 = \sigma^2$? Almost always, this question will
have to be answered by looking at the values of $s_1^2$ and $s_2^2$. Several suggestions have been made in
standard textbooks:
(i.) most authors simply say something along the lines "as long as $s_1^2$ and $s_2^2$ are not too
different, you're probably all right in assuming $\sigma_1^2 = \sigma_2^2$." Since they don't specify what
they mean by "too different", this advice is rather useless.
(ii.) some authors (for example, Jarrell, p 468) actually give a rule of thumb, along the lines "if
the larger of the two variances is not more than double the smaller one" as meaning "not
too different." This is a good kind of rule of thumb, because it's easy to use, and many
teachers and practitioners in statistics seem to have a vague impression that it is a
reasonable rule of thumb. It would be comforting to know that at some time in the past
someone has done some research to demonstrate the practicality of this rule.
(iii.) some authors suggest simply looking at a frequency histogram or stemplot of the two sets
of data to see if they have approximately the same degree of spread. If so, it is
reasonable to assume $\sigma_1^2 = \sigma_2^2$.
(iv.) you may find occasionally that an author suggests performing a hypothesis test procedure
to determine if the evidence allows you to reject the claim that $\sigma_1^2 = \sigma_2^2$. (This is the
so-called F-test for two population variances; you can find the details in most statistics
textbooks if we don't get to it in this course.) However, almost everyone agrees that this is
a rather dicey approach, since the F-test is known to be rather sensitive to departures
from normality in the populations. In fact, to quote Neil Weiss (Introductory Statistics, 4th
edition, p. 588), "As the noted statistician George E. P. Box remarked: 'To make a
preliminary test on variances is rather like putting to sea in a rowing boat to find out
whether conditions are sufficiently calm for an ocean liner to leave port!'" [This is
statistics humor at its best, folks, so enjoy it while it lasts!] The point here is that the
conclusion you might get from the F-test is even more questionable than the validity of the
result from (DMS-4) when the assumption that $\sigma_1^2 = \sigma_2^2$ is not valid.
(v.) it appears that the error introduced by erroneously assuming $\sigma_1^2 = \sigma_2^2$ is least when the
two sample sizes are approximately equal. Thus, by using samples of approximately
equal size, the validity of the assumption that $\sigma_1^2 = \sigma_2^2$ is a less pressing issue.
(vi.) finally, a number of authors advise that if there is any doubt about the validity of assuming
$\sigma_1^2 = \sigma_2^2$, one should resort to a procedure which does not rely on this assumption. As
you'll see in the next section, there is no general consensus about which method is best
under those circumstances, but there is a sense that the most commonly used
approaches (we'll describe three similar ones) give about the same quality of results, and
results which are not too different from (DMS-4) when $\sigma_1^2 = \sigma_2^2$ is approximately correct.
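The rule of thumb in suggestion (ii.) is easy to mechanize. The sketch below is only an illustration: the function names and the `max_ratio` parameter are hypothetical, and the default of 2.0 simply encodes "the larger variance is not more than double the smaller one". It is a rough screening check, not a formal test.

```python
def variance_ratio(s1, s2):
    """Ratio of the larger sample variance to the smaller one (always >= 1)."""
    v1, v2 = s1**2, s2**2
    return max(v1, v2) / min(v1, v2)

def roughly_equal_variances(s1, s2, max_ratio=2.0):
    """Rule of thumb from suggestion (ii.): larger variance at most double the smaller."""
    return variance_ratio(s1, s2) <= max_ratio

# Sample standard deviations from Example 1 (PAH): April 64.66, July 66.98
print(roughly_equal_variances(64.66, 66.98))   # True
```

Taking the standard deviations (rather than the variances) as inputs matches how the sample summaries are usually reported in the examples.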
Sample Variances Unequal
For the construction of confidence interval estimates of $\mu_1 - \mu_2$ when one or both sample sizes are less than
30 and there is good reason to doubt that $\sigma_1^2 = \sigma_2^2 = \sigma^2$, there are really three very similar modifications of
(DMS-4) in common use. All involve modification of the probability factor, $t_{\alpha/2,\,\nu}$. (For testing hypotheses
about $\mu_1 - \mu_2$, there is also an additional non-parametric approach that seems to be recommended highly in
situations such as this: the Mann-Whitney test.)
The basic formula in this case is
$$
\mu_1 - \mu_2 = \bar{x}_1 - \bar{x}_2 \pm t_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
\qquad @\;100(1-\alpha)\%
\tag{DMS-5}
$$
where we've deliberately left the subscript denoting degrees of freedom off of the t-factor. Then, Weiss and
others suggest calculating the effective degrees of freedom for this t-factor using the formula
$$
\nu = \frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}
           {\dfrac{\left( s_1^2 / n_1 \right)^2}{n_1 - 1} + \dfrac{\left( s_2^2 / n_2 \right)^2}{n_2 - 1}}
\tag{DMS-6}
$$
This looks a bit frightening, but notice that most of the subexpressions are repetitions of $s^2/n$ for each
sample. If (DMS-6) doesn't give a whole number result, then round down. Formula (DMS-6) is apparently
an approximation developed by Satterthwaite, to replace more complex approaches that required
specialized tables.
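Since (DMS-6) is mostly repetition of $s^2/n$, it reduces to a few lines of code. The following is a minimal sketch; the function names are invented for illustration, and the inputs are sample variances (not standard deviations). The usage lines apply it to the PAH summary statistics from Example 1 purely to exercise the function.

```python
import math

def satterthwaite_df(s1_sq, n1, s2_sq, n2):
    """Effective degrees of freedom from (DMS-6), before rounding."""
    a = s1_sq / n1          # s^2 / n for sample 1
    b = s2_sq / n2          # s^2 / n for sample 2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

def effective_df(s1_sq, n1, s2_sq, n2):
    """(DMS-6) rounded down to a whole number, as the text recommends."""
    return math.floor(satterthwaite_df(s1_sq, n1, s2_sq, n2))

# PAH data from Example 1: variances are the squared standard deviations
nu = satterthwaite_df(64.66**2, 8, 66.98**2, 12)
print(nu, effective_df(64.66**2, 8, 66.98**2, 12))
```

The unrounded value always lies between the smaller of $n_1 - 1$ and $n_2 - 1$ and the pooled value $n_1 + n_2 - 2$, which is a handy sanity check on the arithmetic.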
A second suggestion, which seems to be favored by authors of statistics textbooks oriented towards
business applications, is to simply use as the effective degrees of freedom in (DMS-5) the smaller of $n_1 - 1$ or
$n_2 - 1$. This suggestion has some plausibility: it really amounts to restating what we've mentioned above,
that you can't expect greater precision in the estimate of a difference between two means than you would be able
to get in the estimation of either of the two means separately.
A final suggestion, attributed by Wayne Daniel (in his textbook, Biostatistics, 6th edition, p. 168) to Cochran
is to compute the t-factor as a weighted average of two values from the t-table. That is, use
$$
t' = \frac{w_1\, t_{\alpha/2,\, n_1 - 1} + w_2\, t_{\alpha/2,\, n_2 - 1}}{w_1 + w_2}
\tag{DMS-7a}
$$
where the weights, $w_1$ and $w_2$, are given by
$$
w_1 = \frac{s_1^2}{n_1}
\qquad \text{and} \qquad
w_2 = \frac{s_2^2}{n_2}
\tag{DMS-7b}
$$
There is little more to be said about these formulas. We'll illustrate them with a quick example and then
leave this topic.
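Formulas (DMS-7a) and (DMS-7b) can be sketched in a few lines. The function name `cochran_t` is hypothetical, and the two t-table values are supplied by the caller rather than computed. The usage lines reuse the PAH summary statistics from Example 1 with the standard table values $t_{0.025,\,7} = 2.365$ and $t_{0.025,\,11} = 2.201$, which are an assumption of this sketch and not part of that example.

```python
def cochran_t(s1_sq, n1, t1, s2_sq, n2, t2):
    """Weighted t-factor from (DMS-7a) with weights from (DMS-7b).

    t1 and t2 are the table values t_{alpha/2, n1 - 1} and t_{alpha/2, n2 - 1}.
    """
    w1 = s1_sq / n1   # (DMS-7b)
    w2 = s2_sq / n2
    return (w1 * t1 + w2 * t2) / (w1 + w2)

# PAH data from Example 1, with table values for 7 and 11 degrees of freedom
t_prime = cochran_t(64.66**2, 8, 2.365, 66.98**2, 12, 2.201)
print(t_prime)
```

Because the result is a weighted average, it must fall between the two table values; a result outside that range signals an arithmetic slip.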
Example 2: (Peas)
In the standard data sets is a description of an experiment performed to compare amounts of vitamin C in
peas.
For the seven specimens of frozen peas that a technologist analyzed, the amounts of vitamin C were
25.9   23.4   21.2   12.3   18.4   18.0   19.8        (data set CpeasFrozen)
and, for the twelve specimens of canned peas that she analyzed, the amounts of vitamin C were:
9.7   7.0   8.2   9.5   6.6   5.0   6.5   8.2   6.5   7.3   6.8   10.6        (data set CpeasCanned)
These numbers are in units of mg of vitamin C per 100 g of peas. Compute 95%
confidence interval estimates of the difference in mean vitamin C content of frozen and canned peas, based
on this data, and using each of the three variations on the basic estimation method described above.
Solution:
We have two independent samples of peas here. In the notation of the subject, the relevant sample
characteristics are:
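$n_{\text{frozen}} = 7$, $\bar{x}_{\text{frozen}} = 19.86$, $s_{\text{frozen}} = 4.350$, $s^2_{\text{frozen}} = 18.926$
$n_{\text{canned}} = 12$, $\bar{x}_{\text{canned}} = 7.66$, $s_{\text{canned}} = 1.623$, $s^2_{\text{canned}} = 2.634$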
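These sample characteristics can be reproduced from the raw data with Python's standard `statistics` module. This is just a sketch; the helper name `summarize` is made up for illustration. Note that `statistics.stdev` and `statistics.variance` use the $n - 1$ divisor, which is the sample definition used throughout this document.

```python
import statistics

# Vitamin C content, mg per 100 g of peas
frozen = [25.9, 23.4, 21.2, 12.3, 18.4, 18.0, 19.8]
canned = [9.7, 7.0, 8.2, 9.5, 6.6, 5.0, 6.5, 8.2, 6.5, 7.3, 6.8, 10.6]

def summarize(data):
    """Sample size, mean, standard deviation and variance (n - 1 divisor)."""
    return {
        "n": len(data),
        "mean": statistics.mean(data),
        "s": statistics.stdev(data),
        "s2": statistics.variance(data),
    }

for name, data in (("frozen", frozen), ("canned", canned)):
    stats = summarize(data)
    print(f"{name}: n={stats['n']}, mean={stats['mean']:.2f}, "
          f"s={stats['s']:.3f}, s2={stats['s2']:.3f}")
```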
Here, the larger variance is over seven times as large as the smaller variance -- clear evidence even for
such small samples that the population variances are unequal.
From formula (DMS-6), we get
$$
\nu = \frac{\left( \dfrac{18.926}{7} + \dfrac{2.634}{12} \right)^2}
           {\dfrac{\left( 18.926 / 7 \right)^2}{6} + \dfrac{\left( 2.634 / 12 \right)^2}{11}}
\approx 6.988
$$
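Rounding down gives $\nu = 6$. Thus, in formula (DMS-5), we need $t_{0.025,\,6} = 2.447$. The resultant confidence interval estimate is:
$$
\mu_{\text{frozen}} - \mu_{\text{canned}} = 19.86 - 7.66 \pm 2.447 \sqrt{\frac{18.926}{7} + \frac{2.634}{12}}
\qquad @\;95\%
$$
$$
= 12.20 \pm 4.18\ \text{mg/100 g} \qquad @\;95\%
$$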
Using the second suggestion above, we note that $n_{\text{frozen}} - 1 = 7 - 1 = 6$, and $n_{\text{canned}} - 1 = 12 - 1 = 11$, so we
would again use $t_{0.025,\,6}$ in formula (DMS-5). This would give exactly the same result.
Finally, for the third approach, we need the two weight factors:
$$
w_1 = \frac{s_1^2}{n_1} = \frac{18.926}{7} \approx 2.704
\qquad \text{and} \qquad
w_2 = \frac{s_2^2}{n_2} = \frac{2.634}{12} \approx 0.219
$$
Thus, since $t_{0.025,\,6} = 2.447$ and $t_{0.025,\,11} = 2.201$, we get from formula (DMS-7a)
$$
t' = \frac{(2.704)(2.447) + (0.219)(2.201)}{2.704 + 0.219} \approx 2.429
$$
Using this in formula (DMS-5) then gives the interval estimate
$$
\mu_{\text{frozen}} - \mu_{\text{canned}} = 19.86 - 7.66 \pm 2.429 \sqrt{\frac{18.926}{7} + \frac{2.634}{12}}
\qquad @\;95\%
$$
$$
= 12.20 \pm 4.15\ \text{mg/100 g} \qquad @\;95\%
$$
This is just slightly narrower than the interval estimate given by the other two approaches. For all practical
purposes, the three approaches have yielded the same results in this example.
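The two interval half-widths quoted in this example can be verified with a short sketch of formula (DMS-5). The function name `unpooled_margin` is invented for illustration, and the t-factors are the values already obtained above (2.447 from the first two approaches, 2.429 from the weighted average).

```python
import math

def unpooled_margin(s1_sq, n1, s2_sq, n2, t_factor):
    """Half-width of the (DMS-5) interval for a given t-factor."""
    return t_factor * math.sqrt(s1_sq / n1 + s2_sq / n2)

diff = 19.86 - 7.66   # difference of the sample means, mg/100 g
for label, t_factor in (("t = 2.447", 2.447), ("t' = 2.429", 2.429)):
    margin = unpooled_margin(18.926, 7, 2.634, 12, t_factor)
    print(f"{label}: {diff:.2f} +/- {margin:.2f} mg/100 g")
# t = 2.447: 12.20 +/- 4.18 mg/100 g
# t' = 2.429: 12.20 +/- 4.15 mg/100 g
```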
(Note that if we had ignored the obvious signs of unequal population variances and employed the procedure
described in the first part of this document, we would have obtained $s_p \approx 2.895$. Then, using $t_{0.025,\,17} = 2.110$,
the confidence interval estimate would have turned out to be
$$
\mu_{\text{frozen}} - \mu_{\text{canned}} = 12.20 \pm 2.91\ \text{mg/100 g} \qquad @\;95\%
$$
a considerably narrower estimate. The considerable apparent difference in population variances results in a
much less precise estimate of the difference between the two population means.)
