```MATH 2441
Probability and Statistics for Biological Sciences
Confidence Interval Estimates of the Mean
(Small Sample Case)
In the previous document, we looked at the construction of confidence interval estimates of the population
mean when either of the following conditions applied:
• a large sample was available (taken to mean that the sample size was around 30 or
greater), or,
• the sample size was less than 30, but we had evidence that the population was
approximately normally distributed and we knew the value of the population standard
deviation, $\sigma$.


Neither of these sets of conditions covers a very common situation: the sample size is less than 30 (and we
aren't able to meet all the conditions of the second case above). These would be called small sample
cases.
There are two possible situations that can arise here as well, as it turns out.
(i.) The data or any other information we have about the population does not allow us
to declare or even assume that the population is approximately normally
distributed. In this case, we're sunk! Some references suggest one might use
a so-called "non-parametric" or "distribution-free" technique (meaning a statistical
method that doesn't take any account of the distribution of a population).
Methods of this type which do exist tend to give relatively poor results when small
sample sizes are involved. Another alternative is to collect more data to push the
sample size up so that the large sample formulas can be used.
(ii.) We have evidence that the population is approximately normally distributed (or
we're prepared to assume that this is true). We will look briefly at methods for
assessing the consistency of data with the assumption of a normally-distributed
population a bit later in the course.
In this second case (that the population is approximately normally distributed), it can be shown
mathematically that the quantity
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}} \tag{SS-1}$$
has the t-distribution with $\nu = n - 1$ degrees of freedom. Using this, we can obtain the following $100(1 - \alpha)\%$
confidence interval estimate of the population mean, $\mu$:
$$\bar{x} - t_{\alpha/2,\nu} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\nu} \frac{s}{\sqrt{n}} \tag{SS-2a}$$
or
$$\mu = \bar{x} \pm t_{\alpha/2,\nu} \frac{s}{\sqrt{n}} \tag{SS-2b}$$
Notice that this latter formula has the general pattern:
population parameter = point estimator ± probability factor × standard error
as we encountered in the large sample formula for confidence interval estimates of the population mean. All
that has really changed here is that the probability factor now is taken from the t-distribution rather than the
standard normal distribution.
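As a sketch of how (SS-2a) follows from (SS-1): write $\nu = n - 1$, and let $t_{\alpha/2,\nu}$ be the value that cuts off an upper-tail area of $\alpha/2$ under the t-distribution (the convention the t-table values in the examples below appear to follow). The central $1 - \alpha$ of the distribution then gives
$$P\left( -t_{\alpha/2,\nu} \le \frac{\bar{x} - \mu}{s/\sqrt{n}} \le t_{\alpha/2,\nu} \right) = 1 - \alpha .$$
Multiplying through by the positive quantity $s/\sqrt{n}$, subtracting $\bar{x}$, and then multiplying by $-1$ (which reverses both inequalities) isolates $\mu$:
$$P\left( \bar{x} - t_{\alpha/2,\nu} \frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\nu} \frac{s}{\sqrt{n}} \right) = 1 - \alpha .$$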
There's not much more to be said about the method. We will illustrate it with a couple of examples.
Example Cpeas:
Construct a 95% confidence interval estimate of the mean vitamin C content (in mg/100g) of canned peas,
based on the standard data set Cpeas.
Solution:
Consulting the document entitled "Example Data Sets", we find that twelve specimens of canned peas were
analyzed for vitamin C content, yielding the results:
9.7   7.0   8.2   9.5   6.6   5.0
6.5   8.2   6.5   7.3   6.8  10.6
CpeasCanned
[Figure: normal probability plot for the Cpeas data]
These 12 numbers have a mean, $\bar{x}$, of 7.66 mg/100g and a
standard deviation of 1.623 mg/100g. The normal probability plot
for this data is shown in the figure, indicating reasonable grounds
for regarding this data as consistent with a normally distributed
population. (The figure shows a normal probability plot prepared
using the method suggested by Snedecor and Cochran; for
perfect consistency with a normally distributed population, the
points should fall along a straight line. The deviations of the points
from a straight line appear to be random and uniform, and there are no
trends near the ends of the pattern that might contradict an
assumption of normality.) The conditions under which formula
(SS-2) is valid are thus met. Since n = 12, we have $\nu$ = 12 - 1 =
11. For a 95% confidence interval, we need $\alpha$ = 0.05, so that $\alpha/2$
= 0.025. From the t-table, we get $t_{0.025,11} = 2.201$. Thus, the requested confidence interval estimate is:
$$\mu = 7.66 \pm (2.201) \frac{1.623}{\sqrt{12}} \approx 7.66 \pm 1.03 \ \text{mg/100 g @ 95\%}$$
That is, there is a 95% chance that the interval 7.66 ± 1.03 mg/100 g captures the actual value of $\mu$ in this
case. You could also write this interval as 6.63 ≤ $\mu$ ≤ 8.69 mg/100 g.
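The arithmetic behind this interval can be retraced in two steps, using only the values quoted above (intermediate results are rounded here, so the last digit may differ slightly from a full-precision calculation):
$$\frac{s}{\sqrt{n}} = \frac{1.623}{\sqrt{12}} \approx 0.4685, \qquad t_{0.025,11} \, \frac{s}{\sqrt{n}} \approx (2.201)(0.4685) \approx 1.03$$
$$7.66 - 1.03 = 6.63, \qquad 7.66 + 1.03 = 8.69 \quad \text{(mg/100 g)}$$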

Example Cheddar
Construct a 99% confidence interval estimate for the mean percentage fat in cheddar cheese based on the
data set CheddarFat.
Solution
The data consists of 21 values:
27.2  26.3  27.9  28.1  28.7  32.6  24.6
23.3  30.8  35.0  21.5  22.6  28.0  24.8
24.8  27.9  22.9  28.3  35.0  32.6  29.3
CheddarFat
The normal probability plot for this data is shown in the figure just below. It's not an ideal plot: the tendency
of the pattern of points to level out near the extremes is consistent with a population that clusters somewhat
more tightly than a normal distribution. However, the sample size here is so small that this is not a decisive
feature. We will consider the normal probability plot to be an adequate basis to consider formula (SS-2) to
be valid for this data.
[Figure: normal probability plot for the CheddarFat data]
For a 99% confidence interval estimate, we need $\alpha$ = 0.01, so $\alpha/2$
= 0.005. Since n = 21, we have $\nu$ = 20. The mean value, $\bar{x}$, for
this data is 27.72 % and the standard deviation, s, is 3.898 %.
The probability factor is $t_{0.005,20} = 2.845$, so that the requested
confidence interval estimate becomes
$$\mu = 27.72 \pm (2.845) \frac{3.898}{\sqrt{21}} \approx 27.72 \pm 2.420 \ \% \ \text{@ 99\%}$$
(By the way, even though the values here are percents, we are really working with population and sample
mean values: the mean values of the percent fat in the cheddar cheese. We know that we are not dealing
with proportions, which are also often written as percents, because a proportion denotes a percentage of a
total number of items which fall into a certain category. That is not the case in this example.)
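For comparison with the first example, the same interval can be written in the two-sided form of (SS-2a). This is a sketch using the rounded values quoted above:
$$\frac{s}{\sqrt{n}} = \frac{3.898}{\sqrt{21}} \approx 0.8506, \qquad t_{0.005,20} \, \frac{s}{\sqrt{n}} \approx (2.845)(0.8506) \approx 2.42$$
$$27.72 - 2.42 = 25.30, \qquad 27.72 + 2.42 = 30.14, \qquad \text{so} \quad 25.30 \le \mu \le 30.14 \ \% \ \text{@ 99\%}$$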

Deciding on Sample Sizes
As in the large sample case, the width of confidence interval estimates can be reduced by selecting larger
samples here. Thus, if we wish the '±' part of formula (SS-2b) to be smaller than some value $\delta$, we
could write
$$t_{\alpha/2,\nu} \frac{s}{\sqrt{n}} \le \delta \quad \Longrightarrow \quad n \ge \left( \frac{t_{\alpha/2,\nu} \, s}{\delta} \right)^2 \tag{SS-3}$$
The only difficulty here is that the constant $t_{\alpha/2,\nu}$ appearing on the right-hand side not only depends on $\alpha$, but
also on the value of n, through $\nu$. As a result, we cannot solve for n as cleanly as was possible in the large
sample case where the probability factor, $z_{\alpha/2}$, did not depend on n.
The recommended strategy here is to seek a self-consistent solution to (SS-3) by starting off with $z_{\alpha/2}$
substituted for $t_{\alpha/2,\nu}$ in (SS-3). If this gives a value of n which is larger than 30, then that is the appropriate
sample size to use, and the confidence interval formulas for the large sample case can be used. If this first
estimate of n gives a value smaller than 30, then you need to use trial and error to find an appropriate value
of n that will satisfy (SS-3) as written.
This is not a big issue, because usually when a specific precision is targeted in statistical estimation, it is a
high enough precision that the smallest acceptable sample size will still correspond to a large sample case.
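One way to organize the trial and error, offered as a sketch rather than as the procedure prescribed in this course, is to recompute the right-hand side of (SS-3) from the current guess until the guess stops changing. Here $\delta$ denotes the target value of the '±' term; the index k and the ceiling brackets (round up to the next whole number) are notation introduced only for this illustration:
$$n_0 = \left\lceil \left( \frac{z_{\alpha/2} \, s}{\delta} \right)^2 \right\rceil, \qquad n_{k+1} = \left\lceil \left( \frac{t_{\alpha/2,\, n_k - 1} \, s}{\delta} \right)^2 \right\rceil$$
If $n_0$ is larger than 30, it is already the answer, as described above. Otherwise, stop when $n_{k+1} = n_k$. Should the sequence alternate between two neighbouring values instead of settling, the larger of the two satisfies (SS-3) and is the safe choice.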
Example Cpeas
Just to illustrate, return to the first example above, involving the estimation of the vitamin C content of
canned green peas. There we obtained a 95% confidence interval estimate which contained the uncertainty
term ±1.03 mg/100 g. Suppose it was necessary to try to reduce this uncertainty to just ±0.5 mg/100 g
(meaning to achieve $\delta$ = 0.5 mg/100g in formula (SS-3)). To estimate the sample size required, we use
$z_{0.025} = 1.96$ in formula (SS-3) to get
$$n \ge \left( \frac{(1.96)(1.623)}{0.5} \right)^2 \approx 40.47$$
Since this result is larger than 30, we conclude that to achieve an estimate with $\delta$ = 0.5 mg/100g, we would
need to collect a sample of size 41 or larger. With this sample size, the confidence interval estimate can
then be calculated using the large sample formulas.
On the other hand, suppose the goal was an estimate with $\delta$ = 0.8 mg/100g. Now, the calculation above
becomes
$$n \ge \left( \frac{(1.96)(1.623)}{0.8} \right)^2 \approx 15.81$$
This result is smaller than 30, and so is inconsistent with the formula used to calculate it (we've assumed a
large sample model in the bracketed expression but obtained a solution indicating a small sample). The
large sample formulas tend to underestimate the required value of n, so it is reasonable to expect that the
required value of n here will be some number slightly larger than 16. Using n = 17, 18, 19 and 20, for
example, in the expression
$$\left( \frac{t_{0.025,\nu} \, (1.623)}{0.8} \right)^2, \qquad \nu = n - 1,$$
we get
n      $(t_{0.025,\nu} \cdot 1.623 / 0.8)^2$
17     18.5
18     18.3
19     18.2
20     18.0
What this table says is:
• if you assume n = 17, formula (SS-3) says use n = 19 (an inconsistency),
• if you assume n = 18, formula (SS-3) says use n = 19 (an inconsistency),
• if you assume n = 19, formula (SS-3) says use n = 19 (consistent!),
• if you assume n = 20, formula (SS-3) says use n = 18 (an inconsistency).
Thus, the appropriate sample size would appear to be 19. Of course, these estimates of n are correct only if,
for the expanded sample, you get s very close to 1.623 mg/100 g of peas. Thus, there may not be much
point in quibbling over one or two items in the sample unless each element of the sample represents an
extremely costly procedure.
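The tabulated values can be retraced from a standard t-table. The four critical values below are upper 2.5% points taken from such a table, not from the original handout, and the ratio of s to the target '±' value is rounded to 1.623/0.8 ≈ 2.029:
$$\begin{aligned}
n = 17,\ \nu = 16: &\quad t_{0.025,16} = 2.120, \quad (2.120 \times 2.029)^2 \approx 18.5 \\
n = 18,\ \nu = 17: &\quad t_{0.025,17} = 2.110, \quad (2.110 \times 2.029)^2 \approx 18.3 \\
n = 19,\ \nu = 18: &\quad t_{0.025,18} = 2.101, \quad (2.101 \times 2.029)^2 \approx 18.2 \\
n = 20,\ \nu = 19: &\quad t_{0.025,19} = 2.093, \quad (2.093 \times 2.029)^2 \approx 18.0
\end{aligned}$$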
