MATH 2441
Probability and Statistics for Biological Sciences
Confidence Interval Estimates of the Population Variance
(Large Sample Case)
The population variance (and its square root, the population standard deviation) is a measure of the
variability (or, if you like, the uniformity) of the population. It is not as common to require an estimate of $\sigma$ or
$\sigma^2$ in statistical work as it is to require an estimate of the population mean or population proportion. In fact, if
anything, the comparison of two population variances is a more common requirement than the estimation of
the variance of a single population (and we will deal with that comparison of variances later as part of our
study of hypothesis testing). Still, of the population parameters that require a sampling distribution which is
non-symmetric, the variance is the most important. So, we will look briefly at the estimation of the
population variance so you know how to compute interval estimates of the population variance and
population standard deviation, and so you know how to work with interval estimates when the sampling
distribution involved is not symmetric.
We know from the brief comment in the document on sampling distributions that when the population is
approximately normally distributed, the variable

$$\chi^2 = \frac{(n-1)s^2}{\sigma^2} \tag{OV-1}$$

has the so-called $\chi^2$-distribution with $\nu = n - 1$ degrees of freedom. The $\chi^2$-distribution is not
symmetric about its mean value of $\nu$, and so the sort of '$\pm$' approach we used to construct confidence
interval estimates of the population mean and proportion can't really be used here. However, retaining the
notion of $\alpha$ as the probability that the interval estimate fails to capture the true value of $\sigma^2$,
we can still come up with a useful interval estimate formula.
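[Figure: density curve of the $\chi^2$-distribution, with an area of $\alpha/2$ in each tail and an area of
$1 - \alpha$ between $\chi^2_{1-\alpha/2}$ and $\chi^2_{\alpha/2}$.]

Refer to the figure. We have a probability $\alpha$ that the true value of $\sigma^2$ is outside the interval
estimate, and we'll set up the interval so that there is a probability of $\alpha/2$ that the interval misses the
true value of $\sigma^2$ on the left and a probability of $\alpha/2$ that it misses on the right. The starting
probability equation becomes

$$\Pr\left(\chi^2_{1-\alpha/2} \le \chi^2 \le \chi^2_{\alpha/2}\right) = \Pr\left(\chi^2_{1-\alpha/2} \le \frac{(n-1)s^2}{\sigma^2} \le \chi^2_{\alpha/2}\right) = 1 - \alpha \tag{OV-2}$$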
Thus, at the left edge, we have that

$$\frac{(n-1)s^2}{\sigma^2} \ge \chi^2_{1-\alpha/2}$$

which gives

$$\sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}$$

Similarly, at the right edge, we have
© David w. Sabo (1999)
Interval Estimates of the Population Variance
Page 1 of 4
n  1 s 2

2
  2 / 2
which gives
2 
n  1 s 2
 2 / 2
Combining these two inequalities for $\sigma^2$, we get the $100(1-\alpha)\%$ confidence interval estimate for the
population variance:

$$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}} \qquad @\ 100(1-\alpha)\% \tag{OV-3}$$
Note that the numerators of both fractions here are identical. In the denominator on the left side, the value
$\chi^2_{\alpha/2}$ is larger than the value $\chi^2_{1-\alpha/2}$ on the right side, and so the quotient on the
left of the inequality gives a smaller value than the quotient on the right, which is what we need here, of
course. Although we haven't shown the symbol $\nu$ explicitly in this formula, you need to use $\chi^2$ values
from the row $\nu = n - 1$ of the $\chi^2$-distribution tables.
Formula (OV-3) gives a confidence interval estimate of $\sigma^2$. To get a corresponding interval estimate of
$\sigma$, you simply take square roots.
Example 1: The cholesterol concentrations in the yolks of a sample of 18 randomly selected eggs laid by
genetically engineered chickens were found to have a mean value, $\bar{x}$, of 9.38 mg/g of yolk and a
standard deviation, s, of 1.62 mg/g. Use this information to construct a confidence interval estimate of the
true variance and standard deviation of the cholesterol concentration in these egg yolks.
Solution:
This requires a straightforward application of formula (OV-3). We have n = 18, so $\nu = 18 - 1 = 17$. We also
have s = 1.62 mg/g, so $s^2 = (1.62\ \text{mg/g})^2 = 2.6244\ (\text{mg/g})^2$. No confidence level is specified,
so we use the conventional default of 95%, meaning that $\alpha = 0.05$. Thus $\alpha/2 = 0.025$ and
$1 - \alpha/2 = 0.975$. Referring to the row $\nu = 17$ of the $\chi^2$-distribution table, we find that
$\chi^2_{0.025} = 30.191$ and $\chi^2_{0.975} = 7.564$. Thus, we have

$$\frac{17 \times 2.6244}{30.191} \le \sigma^2 \le \frac{17 \times 2.6244}{7.564}$$

or

$$1.478 \le \sigma^2 \le 5.898 \qquad @\ 95\%$$

The units here are still $(\text{mg/g})^2$. This last line is the required confidence interval estimate of the
population variance, $\sigma^2$. To get the 95% confidence interval estimate of the population standard
deviation, $\sigma$, we just take square roots:

$$\sqrt{1.478} \le \sqrt{\sigma^2} \le \sqrt{5.898}$$

or

$$1.216\ \text{mg/g} \le \sigma \le 2.429\ \text{mg/g} \qquad @\ 95\%.$$
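The arithmetic in Example 1 can be checked with a few lines of Python. This is only a minimal sketch: the function name `variance_interval` and its parameter names are illustrative choices, not part of the course material, and the $\chi^2$ critical values are typed in by hand from the table, exactly as in the solution.

```python
import math

def variance_interval(n, s, chi2_upper, chi2_lower):
    """Interval estimate of the population variance from formula (OV-3).

    chi2_upper is the table value chi^2_{alpha/2} (the larger one) and
    chi2_lower is chi^2_{1-alpha/2} (the smaller one), both read from
    the row nu = n - 1 of a chi-square table.
    """
    numerator = (n - 1) * s ** 2
    # Dividing by the larger table value gives the lower limit.
    return numerator / chi2_upper, numerator / chi2_lower

# Example 1: n = 18, s = 1.62 mg/g, table values for nu = 17 at 95%.
var_low, var_high = variance_interval(18, 1.62, 30.191, 7.564)
sd_low, sd_high = math.sqrt(var_low), math.sqrt(var_high)

print(f"{var_low:.3f} <= sigma^2 <= {var_high:.3f}  (mg/g)^2")
print(f"{sd_low:.3f} <= sigma   <= {sd_high:.3f}  mg/g")
```

Passing the table values as arguments keeps the sketch free of any statistics library; a library routine for $\chi^2$ quantiles could be substituted if one is available.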

(Before moving on, perhaps a remark about the implications of this result for the way we calculate
confidence interval estimates of the population mean is in order. Recall that we used the value of s as a
point estimate of $\sigma$ in that formula. The rationale was that this was the best estimate available for
$\sigma$. Now, for this example, we see that at a 95% confidence level, the true value of $\sigma$ might be as
much as 25% less than s or as much as 50% more than s. This means that the true width of the confidence
interval estimate for the population mean from this data might be of the order of 25% less to 50% more than
the width of the interval we actually compute when the value of s is used as an estimate of the required value
of $\sigma$. This is quite a bit of difference! Of course, we are working with a relatively small sample here,
which tends to accentuate the effect.)
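The percentages quoted in this remark come from comparing the limits of the interval for $\sigma$ with s. A short Python sketch of that arithmetic, using the rounded limits from Example 1 (the variable names are illustrative only):

```python
s = 1.62                         # sample standard deviation from Example 1, mg/g
sd_low, sd_high = 1.216, 2.429   # 95% limits for sigma from Example 1, mg/g

# Express each limit as a percentage change relative to s.
pct_below = 100 * (s - sd_low) / s
pct_above = 100 * (sd_high - s) / s

print(f"sigma may be as much as {pct_below:.0f}% less than s")
print(f"sigma may be as much as {pct_above:.0f}% more than s")
```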
Large Sample Approximations
Without trying to prolong this section too much, we note the transition to a large sample approximation,
which results in somewhat simpler formulas, and allows you to use the standard normal probability table
instead of the $\chi^2$ table. Formula (OV-3) is valid for all values of $\nu$ greater than or equal to 1 (or
none of them, for that matter, when you cannot support the assumption that the population is approximately
normally distributed). For sample sizes larger than n = 30 or so, we also know that the distribution of the
$\chi^2$ random variable becomes more and more like the distribution of a normal random variable with a mean
of $\nu = n - 1$ and a variance of $2(n - 1) = 2\nu$. This means that formula (OV-2) can be replaced by the formula
$$\Pr\left((n-1) - z_{\alpha/2}\sqrt{2(n-1)} \le \frac{(n-1)s^2}{\sigma^2} \le (n-1) + z_{\alpha/2}\sqrt{2(n-1)}\right) \approx 1 - \alpha$$

from which we get either

$$\sigma^2 = \frac{(n-1)s^2}{(n-1) \pm z_{\alpha/2}\sqrt{2(n-1)}} \qquad @\ 100(1-\alpha)\% \tag{OV-4a}$$

or, in the form of an interval,

$$\frac{(n-1)s^2}{(n-1) + z_{\alpha/2}\sqrt{2(n-1)}} \le \sigma^2 \le \frac{(n-1)s^2}{(n-1) - z_{\alpha/2}\sqrt{2(n-1)}} \qquad @\ 100(1-\alpha)\% \tag{OV-4b}$$
This might look pretty bad, but all the numbers in the formula are quite simple, and so it is quite easy to
implement this formula when appropriate.
Example 2: A technologist is developing a new method for processing a food material. It is known that for
best quality, it is important to control moisture content in the final product. So, as one part of determining the
practicality of the new method, the technologist must estimate the variability of water content in the resulting
product. He collects 50 specimens of product from the new process, and determines the percent water in
each. These 50 specimens give a sample mean water content of 43.24% and a sample standard deviation
of 7.93%. Compute a 95% confidence interval estimate of the true standard deviation of the percentage
water for this new process.
Solution:
Here n = 50, $s^2 = 7.93^2$ and $\alpha = 0.05$, so $z_{\alpha/2} = z_{0.025} = 1.96$. Substituting these numbers
into (OV-4b) gives

$$\frac{49 \times 7.93^2}{49 + 1.96\sqrt{2 \times 49}} \le \sigma^2 \le \frac{49 \times 7.93^2}{49 - 1.96\sqrt{2 \times 49}}$$

or

$$45.047 \le \sigma^2 \le 104.111 \qquad @\ 95\%$$

as the 95% confidence interval estimate for $\sigma^2$. Taking square roots, we get the following confidence
interval for the standard deviation $\sigma$:

$$6.712\% \le \sigma \le 10.203\% \qquad @\ 95\%.$$
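Formula (OV-4b) is easy to put into a small function. The sketch below is one possible way to do it in Python, not a prescribed method: the function name `large_sample_variance_interval` is made up for illustration, and $z_{\alpha/2}$ is computed with the standard library's `statistics.NormalDist` rather than read from a table. Because the computed value is about 1.95996 rather than the rounded 1.96, the last digit or two can differ slightly from the hand calculation in Example 2.

```python
import math
from statistics import NormalDist

def large_sample_variance_interval(n, s, alpha=0.05):
    """Approximate interval for sigma^2 from formula (OV-4b)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}, about 1.96 when alpha = 0.05
    nu = n - 1
    spread = z * math.sqrt(2 * nu)            # z_{alpha/2} * sqrt(2(n - 1))
    if spread >= nu:
        # The right-hand denominator of (OV-4b) would be zero or negative.
        raise ValueError("sample too small for the large-sample formula")
    numerator = nu * s ** 2
    return numerator / (nu + spread), numerator / (nu - spread)

# Example 2: n = 50 specimens, s = 7.93 (percent water), 95% confidence.
var_low, var_high = large_sample_variance_interval(50, 7.93)
sd_low, sd_high = math.sqrt(var_low), math.sqrt(var_high)

print(f"{var_low:.3f} <= sigma^2 <= {var_high:.3f}")
print(f"{sd_low:.3f}% <= sigma <= {sd_high:.3f}%")
```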
(As a matter of interest, for $\nu = 49$, $\chi^2_{0.975} = 31.555$ and $\chi^2_{0.025} = 70.222$. Thus, the
"exact" 95% confidence interval estimate for the variance here is $43.880 \le \sigma^2 \le 97.650$, and so for the
standard deviation is $6.624\% \le \sigma \le 9.882\%$ @95%. Thus, the relative error in the large sample
approximation for $\sigma$ is about 1% at the lower limit and about 3% at the upper limit.)
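To see where the difference between the two intervals comes from, it helps to put the normal-approximation critical values $\nu \pm z_{\alpha/2}\sqrt{2\nu}$ next to the tabulated $\chi^2$ values. The following Python sketch does this for the data of Example 2; the variable names are illustrative, and the table values are the ones quoted in the note above.

```python
import math

n, s = 50, 7.93
nu = n - 1
z = 1.96                                  # z_{0.025}
chi2_lower, chi2_upper = 31.555, 70.222   # table values for nu = 49

# Normal approximation to the chi-square critical values.
approx_lower = nu - z * math.sqrt(2 * nu)
approx_upper = nu + z * math.sqrt(2 * nu)
print(f"lower critical value: table {chi2_lower}, approx {approx_lower:.3f}")
print(f"upper critical value: table {chi2_upper}, approx {approx_upper:.3f}")

# Limits for sigma: the larger critical value gives the lower limit.
numerator = nu * s ** 2
exact_sd = (math.sqrt(numerator / chi2_upper), math.sqrt(numerator / chi2_lower))
approx_sd = (math.sqrt(numerator / approx_upper), math.sqrt(numerator / approx_lower))

for label, e, a in zip(("lower", "upper"), exact_sd, approx_sd):
    rel_err = 100 * (a - e) / e
    print(f"{label} limit: exact {e:.3f}, approx {a:.3f}, relative error {rel_err:.1f}%")
```

Both approximate critical values fall below the tabulated ones, which is why both approximate limits for $\sigma$ come out larger than the exact limits.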

Devore, Freund, and other elementary statistics textbooks give one more large-sample approximation,
based on the observation that for large samples from an approximately normally distributed population, the
sample standard deviation, s, itself is approximately normally distributed with a mean of $\sigma$ and a
variance of $\sigma^2/2n$. This means that
$$z = \frac{s - \sigma}{\sigma/\sqrt{2n}} \tag{OV-5}$$

has an approximately standard normal distribution, and so

$$\Pr\left(-z_{\alpha/2} \le \frac{s - \sigma}{\sigma/\sqrt{2n}} \le z_{\alpha/2}\right) \approx 1 - \alpha$$
This leads directly to a $100(1-\alpha)\%$ confidence interval for $\sigma$:

$$\frac{s}{1 + \dfrac{z_{\alpha/2}}{\sqrt{2n}}} \le \sigma \le \frac{s}{1 - \dfrac{z_{\alpha/2}}{\sqrt{2n}}} \tag{OV-6}$$
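The step from the probability statement to (OV-6) is the same kind of rearrangement that produced (OV-3). Written out, and assuming that n is large enough that $1 - z_{\alpha/2}/\sqrt{2n} > 0$ (which is needed for the final division), it runs as follows:

```latex
\begin{align*}
-z_{\alpha/2} \le \frac{s-\sigma}{\sigma/\sqrt{2n}} \le z_{\alpha/2}
&\iff -\frac{z_{\alpha/2}}{\sqrt{2n}}\,\sigma \le s - \sigma \le \frac{z_{\alpha/2}}{\sqrt{2n}}\,\sigma \\
&\iff \left(1 - \frac{z_{\alpha/2}}{\sqrt{2n}}\right)\sigma \le s \le \left(1 + \frac{z_{\alpha/2}}{\sqrt{2n}}\right)\sigma \\
&\iff \frac{s}{1 + z_{\alpha/2}/\sqrt{2n}} \le \sigma \le \frac{s}{1 - z_{\alpha/2}/\sqrt{2n}}
\end{align*}
```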
(You can actually get this formula from (OV-4b) by a bit of rearrangement and approximation: first, replace
n - 1 by n, then take the square roots, and finally, in the denominator, use the approximation
$\sqrt{1 \pm x} \approx 1 \pm \tfrac{1}{2}x$.)
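The three steps just described can be written out for either limit of (OV-4b). The shorthand $x = z_{\alpha/2}\sqrt{2/n}$ below is introduced only for this sketch:

```latex
\begin{align*}
\frac{(n-1)s^2}{(n-1) \pm z_{\alpha/2}\sqrt{2(n-1)}}
&\approx \frac{n s^2}{n \pm z_{\alpha/2}\sqrt{2n}}
 = \frac{s^2}{1 \pm z_{\alpha/2}\sqrt{2/n}}
 && \text{(replace $n-1$ by $n$)} \\
\sqrt{\frac{s^2}{1 \pm z_{\alpha/2}\sqrt{2/n}}}
&= \frac{s}{\sqrt{1 \pm x}}, \qquad x = z_{\alpha/2}\sqrt{2/n}
 && \text{(take square roots)} \\
\frac{s}{\sqrt{1 \pm x}}
&\approx \frac{s}{1 \pm \tfrac{1}{2}x}
 = \frac{s}{1 \pm z_{\alpha/2}/\sqrt{2n}}
 && \text{(since $\tfrac{1}{2}\sqrt{2/n} = 1/\sqrt{2n}$)}
\end{align*}
```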
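Example 3: Repeating Example 2, using (OV-6) gives

$$\frac{7.93}{1 + \dfrac{1.96}{\sqrt{2 \times 50}}} \le \sigma \le \frac{7.93}{1 - \dfrac{1.96}{\sqrt{2 \times 50}}} \qquad @\ 95\%$$

or

$$6.630\% \le \sigma \le 9.863\% \qquad @\ 95\%.$$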
You can see that this approximation agrees very well with the exact interval, 6.624% to 9.882%, and in this
example it is closer to the exact interval than the result given by the earlier approximate formula (OV-4b).
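As a check on Example 3, formula (OV-6) can be evaluated in a couple of lines of Python. Again this is only a sketch; the function name `sd_interval_large_sample` is an illustrative choice, and z is supplied by hand as the table value 1.96.

```python
import math

def sd_interval_large_sample(n, s, z):
    """Approximate interval for sigma from formula (OV-6).

    z is the standard normal value z_{alpha/2}, e.g. 1.96 for 95%.
    """
    r = z / math.sqrt(2 * n)       # z_{alpha/2} / sqrt(2n)
    if r >= 1:
        # The upper limit s / (1 - r) would be undefined or negative.
        raise ValueError("sample too small for the large-sample formula")
    return s / (1 + r), s / (1 - r)

# Example 3: n = 50, s = 7.93 (percent water), 95% confidence.
sd_low, sd_high = sd_interval_large_sample(50, 7.93, 1.96)
print(f"{sd_low:.3f}% <= sigma <= {sd_high:.3f}%")
```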

With this, we end our discussion of the estimation of variances and standard deviations of a single
population.