MATH 2441
Probability and Statistics for Biological Sciences
Confidence Interval Estimates of the Mean
(Large Sample Case)
We now have most of the tools we need in order to begin deriving formulas for interval estimates of various
population parameters. The population mean is one of the most important parameters to be able to
estimate. We will look briefly at how interval estimate formulas arise under a relatively simple set of
circumstances (though our derivations will be rather informal -- rather than aiming to give the most general
and mathematically rigorous derivation, we aim instead to give you a picture of how these formulas result
from our recent discussion of sampling distributions), how the formulas are used, and most importantly, how
the results are to be interpreted.
Recall from our discussion of the sampling distribution of the mean that for a large enough sample (usually
taken to be a sample with n ≥ 30 elements) then, irrespective of the nature of the population distribution, the
sample mean, x̄, is an approximately normally distributed random variable with

mean: μ_x̄ = μ, the population mean

and

standard deviation: σ_x̄ = σ/√n,

where σ is the population standard deviation.
(This same result holds for samples of any size if the population is approximately normally distributed.
However, for the formulas that result to be of practical use, it is also necessary to know the value of σ if the
sample size is less than about 30.)
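These two facts can be checked numerically. Here is a minimal simulation sketch (mine, not part of the original notes; it assumes the numpy package is available) that draws many samples of size n = 30 from a markedly non-normal population and compares the mean and standard deviation of the resulting sample means with μ and σ/√n:

    import numpy as np

    rng = np.random.default_rng(0)
    population = rng.exponential(scale=2.0, size=100_000)   # a skewed, non-normal "population"
    mu, sigma = population.mean(), population.std()

    n = 30
    xbars = np.array([rng.choice(population, size=n).mean() for _ in range(5000)])

    print(xbars.mean(), mu)               # the sample means centre on mu
    print(xbars.std(), sigma / n**0.5)    # ...with standard deviation close to sigma / sqrt(n)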
Under these conditions, then, we can evaluate the following rather strange looking probability:
$$\Pr\!\left(\mu - z_{\alpha/2}\,\sigma_{\bar{x}} \;\le\; \bar{x} \;\le\; \mu + z_{\alpha/2}\,\sigma_{\bar{x}}\right)$$
$$=\; \Pr\!\left(\frac{\mu - z_{\alpha/2}\,\sigma_{\bar{x}} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} \;\le\; z \;\le\; \frac{\mu + z_{\alpha/2}\,\sigma_{\bar{x}} - \mu_{\bar{x}}}{\sigma_{\bar{x}}}\right)$$
$$=\; \Pr\!\left(-z_{\alpha/2} \;\le\; z \;\le\; z_{\alpha/2}\right)$$
$$=\; 1 - \tfrac{\alpha}{2} - \tfrac{\alpha}{2} \;=\; 1 - \alpha \qquad \text{(LS - 1)}$$
We get the second line here by converting the x̄ event in the first line into its equivalent in terms of z. This is
justified because x̄ is an approximately normally-distributed random variable. The third line results from
algebraic simplification after substituting μ_x̄ = μ. Finally, the last line comes from the meaning of z_α/2,
which cuts off a right-hand tail region of area α/2 (and similarly, its symmetric partner, -z_α/2, cuts off a
left-hand tail of area α/2 -- see the diagram below).

[Figure: standard normal curve with a right-hand tail of area α/2 cut off at z_α/2 and a left-hand tail of area α/2 cut off at -z_α/2.]

So, it appears that we have just demonstrated that under the conditions described at the beginning, the event

$$\mu - z_{\alpha/2}\,\sigma_{\bar{x}} \;\le\; \bar{x} \;\le\; \mu + z_{\alpha/2}\,\sigma_{\bar{x}} \qquad \text{(LS - 2)}$$
has a probability of (1 - α). For instance, if we were to choose α = 0.05, then the probability of this event
would be 0.95.
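As a quick numerical check (mine, not part of the original notes; it assumes the scipy package is available), the cutoff z_0.025 and the corresponding central area can be computed directly:

    from scipy.stats import norm

    alpha = 0.05
    z = norm.ppf(1 - alpha / 2)                    # z_{alpha/2}: cuts off a right tail of area alpha/2
    central_area = norm.cdf(z) - norm.cdf(-z)      # Pr(-z_{alpha/2} <= Z <= z_{alpha/2})
    print(round(z, 2), round(central_area, 2))     # 1.96 0.95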
The whole trouble is that this is a rather useless event as it is written, for at least two reasons:
• in real life, we always measure the value of x̄ directly; the value of x̄ is always obtained
  experimentally -- and so we don't really need any probability statements about it.

• the expressions μ - z_α/2 σ_x̄ and μ + z_α/2 σ_x̄ in the event above contain μ and σ_x̄ = σ/√n,
  but we don't know the values of these, and so we don't know what precise event
  the above expression is really describing.
But, we can rearrange the expression (LS - 2). The left-hand part gives:

$$\mu - z_{\alpha/2}\,\sigma_{\bar{x}} \;\le\; \bar{x} \qquad\Longrightarrow\qquad \mu \;\le\; \bar{x} + z_{\alpha/2}\,\sigma_{\bar{x}}$$

and the right-hand part gives:

$$\bar{x} \;\le\; \mu + z_{\alpha/2}\,\sigma_{\bar{x}} \qquad\Longrightarrow\qquad \bar{x} - z_{\alpha/2}\,\sigma_{\bar{x}} \;\le\; \mu$$

Putting these two back together, and reinserting them into the earlier probability expression, we get

$$\Pr\!\left(\bar{x} - z_{\alpha/2}\,\sigma_{\bar{x}} \;\le\; \mu \;\le\; \bar{x} + z_{\alpha/2}\,\sigma_{\bar{x}}\right) \;=\; 1 - \alpha \qquad \text{(LS - 3)}$$
This is a rather unusual statement. The event, x̄ - z_α/2 σ_x̄ ≤ μ ≤ x̄ + z_α/2 σ_x̄, seems to be an event about
the value of μ, which is not a random variable. In fact, the randomness in this event is not in the value of the
symbol, μ, at its center, but in the boundaries of the interval, which are stated in terms of x̄, which is a
random variable. (σ_x̄ is not a random variable, since it is the standard deviation of the population of
sample means, and z_α/2 is just a number determined by our choice of the value of α.)
So, what (LS - 3) is really saying is that there is a probability of (1 - α)
that the random interval constructed as shown will really contain or
capture the actual value of μ.
This interpretation of equation (LS-3) is absolutely central to understanding what the results we get in these
calculations really mean. We will explore the implications of it in some detail before we're done here.
To restate: the interval estimate
$$\bar{x} - z_{\alpha/2}\,\sigma_{\bar{x}} \;\le\; \mu \;\le\; \bar{x} + z_{\alpha/2}\,\sigma_{\bar{x}} \qquad \text{(LS - 4a)}$$

sometimes written in the form

$$\mu \;=\; \bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}} \qquad \text{(LS - 4b)}$$

is called the 100(1 - α)% confidence interval estimate of μ (large sample case). There is a probability of
(1 - α) or 100(1 - α)% that the interval that results when actual values are substituted for x̄, z_α/2, and σ_x̄
will actually capture or contain the single, true, but unknown value of μ. Incidentally, the difference

$$\left|\,\bar{x} - \mu\,\right| \;\le\; z_{\alpha/2}\,\sigma_{\bar{x}} \qquad \text{(LS - 5)}$$

is called the sampling error. It tells you by how much estimating μ using the observed value of x̄ could be
in error at this level of confidence.
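In code, (LS - 4a,b) and (LS - 5) amount to a very small calculation. The sketch below is mine, not the author's; the function name is arbitrary and scipy is assumed to be available:

    import math
    from scipy.stats import norm

    def ci_mean_large_sample(xbar, sigma, n, alpha=0.05):
        """100(1 - alpha)% confidence interval for mu (large-sample case, sigma known)."""
        sigma_xbar = sigma / math.sqrt(n)            # standard deviation of the sample mean
        e = norm.ppf(1 - alpha / 2) * sigma_xbar     # sampling error, as in (LS - 5)
        return xbar - e, xbar + e                    # the (LS - 4a) interval endpoints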
What Is the Value of α?
The symbol α has recurred often so far in this document. The quantity (1 - α) is the probability that the
interval estimate we construct for μ really does capture the value of μ. This means that α is the probability
that our interval estimate is incorrect -- that it does not capture the value of μ. The value of (1 - α) is called
the level of confidence -- it is a measure of how confident we are that the interval estimate has captured
the desired value μ. (Remember, we have no way of determining μ exactly and for certain, because that
would require sampling the entire population. The only information we have regarding what the value of μ
might be is the information used in the formula (LS - 4a, b) giving the interval estimate. This interval may or
may not contain the actual value of μ. We have no way of knowing for sure if it does or if it doesn't -- but we
do have a probability, which is better than no clue at all!)
So, α is the probability of the interval estimate being wrong. We may choose the value of α as we please
(though, of course, being a probability, α can't be less than zero or greater than 1). While you are free to
choose α as you please, your choice has some implications. If you pick a rather large value for α, then
there will be a rather large likelihood of the resulting interval estimate being wrong. On the other hand, if you
pick an extremely small value for α (thinking that the less chance of being wrong, the better), you'll find that
the sampling error, (LS - 5), becomes extremely large. The smaller the value of α, the larger the value of
z_α/2. In the illustration above, you see that in order to get smaller tail areas, we need to move the cutoff
value of z further away from zero. You can see from this that in order to be absolutely certain of our interval
estimate of μ, we would have to reduce α, the combined tail areas, to zero. This can only be achieved by
having z_α/2 become infinite, and so obtaining the interval estimate -∞ ≤ μ ≤ +∞, an absolutely certain result
which is absolutely useless. If you want greater certainty, you must pay for it by reducing the precision of
the estimate, in that a wider interval results. This is not just a feature of statistical inference (though it shows
up very blatantly in all aspects of statistical inference), but of life in general.
For most routine statistical work, people use α = 0.05 as a reasonable compromise between certainty and
precision, and so using α = 0.05 has the support of convention or tradition. Your decision to use a different
value of α would have to be based on your assessment of the severity of consequences that may result from
an estimation error. If an incorrect estimate would have very serious consequences, you may be able to
justify using a smaller value of α at the expense of ending up with a broader, less-precise estimate. On the
other hand, few practitioners would feel comfortable with an interval estimate based on a value of α much
larger than 0.10 in any application. (If it was necessary to use a value of α much less than 0.05, it would be
good to consider other modifications to the experiment (see below) to control the width of the interval
estimate.)

Unless instructed otherwise, or unless you can defend a decision to
do otherwise, use α = 0.05.
Where Do I Get a Value for σ?
Formulas (LS - 4a,b) require a value for σ_x̄ = σ/√n, and hence, a value for σ, the standard deviation of the
population for which we are trying to estimate the mean, μ.
Occasionally, situations arise in which σ is known fairly precisely even though the mean, μ, is not known.
Remember that σ is a measure of variability in a population, whereas μ is an indication of where on the
number line that population is. In technical applications, the value of μ can be the result of calibration,
whereas the value of σ is the result of some inherent characteristic, design feature, or quality of the situation
under study. This means that the value of σ may be known quite precisely from long-term records, but the
value of μ must be estimated whenever the process is recalibrated, to confirm the calibration. In such
instances, all the information you need to implement formulas (LS - 4a,b) is available.
More usually when μ is being estimated, there is no prior direct information available about the precise value
of σ. In this case, the original probability equation, (LS - 1), on which our confidence interval formulas
(LS - 4) are based, contains two unknowns, μ and σ, and we seem to be stymied. However, the focus
is on estimating μ, and under the circumstances in which (LS - 4) is valid, s is usually an adequate point
estimator of σ; so the common practice is to substitute the value of s obtained from the sample for the required
value of σ. Thus, (LS - 4b) becomes

$$\mu \;=\; \bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}} \qquad @\ 100(1-\alpha)\% \qquad \text{(LS - 6)}$$
Example SalmonCa0:
Just to give a sense of what sort of results the formulas above give, consider the standard example
SalmonCa0. Recall the "story." A technologist is studying the suitability of chlorine dioxide (ClO2) for
sanitizing salmon fillets. One concern is that the ClO2 treatment might adversely affect the mineral content
of the fillets. First, she selects 40 salmon fillets which have not been treated at all, and obtains the following
concentrations of calcium (in parts per million, ppm):
 75  107   72   61   56   90   52   61
 52   53   76   59   73   68  103   72
 63   68   78   88   94   69   67   68
 47  120   96   43   54   91   63  107
 72  101   83   29   54  101   56  129
(data set SalmonCa0)
From these 40 values obtained for the random sample of 40 untreated salmon fillets, we can calculate:
x̄ = 74.28 ppm
s = 22.022 ppm
For a 95% confidence interval estimate of μ, we need α = 0.05, and so we require z_0.05/2 = z_0.025 = 1.96, to
two decimal places. Substituting all this into formula (LS - 6) then gives:

$$\mu \;=\; \bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}} \quad @\ 100(1-\alpha)\%$$
$$=\; 74.28\ \text{ppm} \pm (1.96)\,\frac{22.022\ \text{ppm}}{\sqrt{40}} \quad @\ 95\%$$

We can state this finally in one of two ways, paralleling (LS - 4a) and (LS - 4b):

$$67.46\ \text{ppm} \;\le\; \mu \;\le\; 81.10\ \text{ppm} \quad @\ 95\%$$

or

$$\mu \;=\; 74.28\ \text{ppm} \pm 6.82\ \text{ppm} \quad @\ 95\%$$

Either form is acceptable, though probably the second form is more usual because it displays both the
values of the sample mean and the estimation error explicitly. What this result tells us is that there is a 95%
probability that the interval between 67.46 ppm and 81.10 ppm contains the true value of μ.
∎
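The arithmetic in this example is easy to reproduce. The following sketch (mine, not part of the original notes; it assumes numpy and scipy are available) recomputes x̄, s, and the 95% interval from the 40 listed values:

    import numpy as np
    from scipy.stats import norm

    # the 40 calcium concentrations (ppm) from the untreated fillets, data set SalmonCa0
    ca = np.array([75, 107, 72, 61, 56, 90, 52, 61, 52, 53,
                   76, 59, 73, 68, 103, 72, 63, 68, 78, 88,
                   94, 69, 67, 68, 47, 120, 96, 43, 54, 91,
                   63, 107, 72, 101, 83, 29, 54, 101, 56, 129])

    xbar = ca.mean()                            # about 74.28 ppm
    s = ca.std(ddof=1)                          # sample standard deviation, about 22.02 ppm
    e = norm.ppf(0.975) * s / np.sqrt(ca.size)  # sampling error, about 6.8 ppm
    print(round(xbar - e, 2), round(xbar + e, 2))   # roughly 67.46 to 81.10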
A Deeper Example
Because it is so important that you have a good
understanding of exactly what these formulas are telling
you about the population in question, we need to look at a
somewhat more detailed example of the interval
estimation process. For this, we return to the simple
classroom population, which consists of equal numbers of
just six values: {1, 2, 3, 5, 7, 12}. This is the same
population we worked with in our early experiments with
sampling, described in the documents "The Real
Problem" and "The Arithmetic Mean: Sampling Issues."
[Figure: Population Distribution -- relative frequency (1/6 for each of the values 1, 2, 3, 5, 7, 12) plotted against value, 1 through 12.]
Though infinite in size, this population is far from normally distributed. It does have the advantage that we
know everything that can be known about it, so we are able to compute μ and σ exactly, and so we will
be able to see directly how accurate the results of estimating μ via sampling and the formulas given
earlier in this document turn out to be.
First, we calculate μ and σ, for later reference. Using formulas given in an earlier document ("Calculating
Probabilities III"),

$$\mu \;=\; \sum_{k=1}^{6} x_k \Pr(x_k) \;=\; 1\cdot\tfrac{1}{6} + 2\cdot\tfrac{1}{6} + 3\cdot\tfrac{1}{6} + 5\cdot\tfrac{1}{6} + 7\cdot\tfrac{1}{6} + 12\cdot\tfrac{1}{6} \;=\; 5$$

using the fact that each of the six distinct values that occur in the population has the same probability, 1/6.
Then

$$\sigma^2 \;=\; \sum_{k=1}^{6} (x_k - \mu)^2 \Pr(x_k) \;=\; (1-5)^2\tfrac{1}{6} + (2-5)^2\tfrac{1}{6} + (3-5)^2\tfrac{1}{6} + (5-5)^2\tfrac{1}{6} + (7-5)^2\tfrac{1}{6} + (12-5)^2\tfrac{1}{6} \;=\; \frac{82}{6} \;=\; 13.667$$

Thus

$$\sigma \;=\; \sqrt{\sigma^2} \;=\; \sqrt{13.667} \;=\; 3.6969$$
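These two values are easy to confirm with a few lines of code (a small sketch of my own; plain Python, no packages needed):

    xs = [1, 2, 3, 5, 7, 12]                      # the six equally likely population values
    mu = sum(x / 6 for x in xs)                   # population mean: 5.0
    var = sum((x - mu) ** 2 / 6 for x in xs)      # population variance: 82/6 = 13.667
    print(mu, var, var ** 0.5)                    # 5.0, 13.666..., 3.6968...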
Now, the experiment will work like this. We will select a succession of random samples of size n = 30 and
another succession of random samples of size n = 65 from this population. For each of these samples, we
will calculate x̄ and s, and then use equation (LS - 6) to compute a 90% confidence interval estimate of μ.
This will give us an idea of how well the "theory" above works in practice for this particular population at
least.
For example, the first sample of size n = 30 yielded the following data:
 1  12   5   5   7   1   5   2   7  12
 7   1   7  12   5  12   7   2   1  12
 5   7   1   3   1   3   2   3   2   1
These numbers have a sum of 151 and the sum of their squares is 1189. Given that n = 30, this means that
$$\bar{x} \;=\; \frac{151}{30} \;=\; 5.033$$

and

$$s^2 \;=\; \frac{1189 - (30)(5.033^2)}{29} \;=\; 14.79195$$

giving

$$s \;=\; \sqrt{14.79195} \;=\; 3.846$$

To construct a 90% confidence interval estimate, we use α = 0.10, or α/2 = 0.05, and so we need z_0.05 ≈ 1.645.
Thus, based on the data in this one sample, (LS - 6) gives

$$\mu \;=\; 5.033 \pm (1.645)\,\frac{3.846}{\sqrt{30}} \;=\; 5.033 \pm 1.155 \quad @\ 90\%$$

or

$$3.878 \;\le\; \mu \;\le\; 6.188 \quad @\ 90\%$$

Note that this interval estimate does contain the known exact value of μ, which is 5.
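Again, the calculation is easy to reproduce in code (a sketch of mine, assuming numpy and scipy are available):

    import numpy as np
    from scipy.stats import norm

    sample = np.array([1, 12, 5, 5, 7, 1, 5, 2, 7, 12,
                       7, 1, 7, 12, 5, 12, 7, 2, 1, 12,
                       5, 7, 1, 3, 1, 3, 2, 3, 2, 1])

    xbar = sample.mean()                        # 151/30 = 5.033
    s = sample.std(ddof=1)                      # about 3.846
    e = norm.ppf(0.95) * s / np.sqrt(30)        # z_{0.05} * s / sqrt(n), about 1.155
    print(round(xbar - e, 3), round(xbar + e, 3))   # roughly 3.878 to 6.188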
We now repeat this process 39 more times, to get a total of 40 replications of the random
sampling/confidence interval estimate construction process. We graph these intervals relative to the same
horizontal scale of x-values to facilitate comparison and insight into the interpretation of the process.
The 40 random samples of size 30 and the 40 random samples of size 65 each gave a set of 40 values of x̄
and 40 values of s. The following figures are scatter line plots of these:

[Figure: two scatter plots showing, for each sample size (n = 30 and n = 65), the 40 values of the sample mean and the 40 values of the sample standard deviation.]
Two features of these plots are noteworthy. First, in all instances, there is considerable scatter in the values
of the statistics about the value of the population parameter of which they are point estimators. Secondly,
the amount of scatter is quite dramatically smaller for the larger sample size than for the smaller sample
size. Clearly, for this population and these sample sizes, use of x̄ as a point estimator of μ and use of s as
a point estimator of σ is quite a risky business.
The following two figures compare the forty 90% confidence interval estimates of μ obtained for each sample
size:

[Figure: two panels, "90% Confidence Intervals (Sample Size: n = 30)" and "90% Confidence Intervals (Sample Size: n = 65)", each showing the forty interval estimates plotted against x on a common scale from 2 to 8.]
The horizontal scales here have the same size, so direct comparison of widths of the intervals is valid. You
see immediately that the intervals resulting from the n = 65 samples are visibly narrower than those resulting
from the n = 30 samples. Since wider interval estimates mean less precise estimates of the population
parameter, we are clearly getting more precise estimates with the larger sample sizes.
Notice that with the n = 30 samples, six of the forty interval estimates miss the value 5 entirely. We would
have expected that about four of the forty (or 10%) would have this defect. However, n = 30 is right on the
borderline of validity as far as the rule-of-thumb is concerned, and we're working with a population that has a
rather less than usual degree of "normality" in its distribution, so the envelope is being pushed pretty hard
here. Only three of the forty samples of size n = 65 gave interval estimates that missed the actual value of
the population parameter, which is more in line with the 10% rate that should be expected.
Notice that even for those intervals which do capture the true value of μ, there is quite a bit of variation in
horizontal position. While a few of the intervals are more or less centered on 5, the true value of μ (which
would happen in instances where x̄ had a value near 5), many have the value 5 rather close to one end or
another. Thus, when you construct a confidence interval estimate, you aren't really justified in thinking of the
middle of the interval as being where the true value of the population parameter probably is. All that you can
say is that there is a certain probability that the value of the parameter being estimated is somewhere inside
the stated interval.
In fact, this is the catch. In real life, you would not take forty different random samples from a population in
order to compare the results as was done above. You take just one random sample, and base your
calculations and conclusions on that one sample. If we had been studying this classroom population the
way statisticians study real populations, we would have taken just one sample of the type illustrated above.
There would have been a 10% probability that our interval estimate of μ missed its true value entirely.
However, even if our interval was among those which really did capture the true value of μ, the true value of
μ might be anywhere in the interval from the extreme left to the extreme right.
∎
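The replication experiment itself can be simulated. The sketch below (my own illustration, not the author's code; it assumes numpy and scipy) repeats the sample-then-estimate process many times and reports the fraction of 90% intervals that capture μ = 5, for n = 30 and n = 65:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    values = np.array([1, 2, 3, 5, 7, 12])
    mu = values.mean()                     # the true population mean, 5
    z = norm.ppf(0.95)                     # z_{0.05}, for 90% confidence

    def coverage(n, reps=10_000):
        """Fraction of (LS - 6) intervals that capture mu for samples of size n."""
        hits = 0
        for _ in range(reps):
            samp = rng.choice(values, size=n)        # with replacement: an "infinite" population
            e = z * samp.std(ddof=1) / np.sqrt(n)    # sampling error for this sample
            hits += abs(samp.mean() - mu) <= e
        return hits / reps

    print(coverage(30), coverage(65))      # both should come out reasonably close to 0.90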
The Major Pieces of a Confidence Interval Estimate
The right-hand side of formula (LS - 6),

$$\mu \;=\; \bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}} \qquad @\ 100(1-\alpha)\%,$$

contains five symbols, all of which play a role which shows up in confidence interval formulas for many
different types of population parameters. We note their meaning and implications briefly here.
x̄ is the point estimator of the population parameter of interest here. It is typical to formulate
   interval estimates by appending a standard error term to a point estimator.

α is a value determining the confidence level of the estimate. In general, this symbol stands for a
   probability of being wrong in the result that is asserted. So, all things considered, smaller
   values of α correspond to greater reliability, less likelihood of being wrong or making a
   mistake.

z_α/2 is a factor that represents the characteristics of the distribution of the point estimator being
   used. In this large-sample estimation of the population mean case, the point estimator x̄
   is approximately normally distributed and so we use values of z. As you'll see when we
   deal with the small sample case, we need to use values obtained from a different, more
   spread-out distribution (called the t-distribution). What happens there is that there is so
   much less information available with a small sample that the precision possible for a given
   value of α is poorer than indicated by the normal distribution. This critical value is a factor
   that represents how well a sampling scheme is able to make use of the available
   information.

   Note as well that the value of z_α/2 is affected very strongly by the value of α. If you choose
   a very small value of α, you'll end up with a large value for z_α/2, and so a very broad
   interval estimate. Thus, there is a trade-off between reliability (how likely is your estimate
   to be wrong) and precision (how narrow is the interval). All other things being equal,
   increasing reliability always results in reduced precision and vice versa. For a given
   sample size, you cannot have arbitrarily high reliability and high precision at the same
   time.
s measures the variability of the population being sampled. Samples selected from a very uniform
   population (for which σ, and hence s, would be small numbers) would generally be quite
   representative of that population, and so interval estimates of population parameters
   should be quite precise and/or reliable. On the other hand, it will be much more difficult to
   get precise estimates by sampling highly variable populations. The larger the value of s,
   the poorer the precision of the interval estimate for a specific sample size.

n is the sample size. The width of the interval estimate is inversely proportional to the square root
   of the sample size, n. Thus, by increasing the sample size, we can decrease this width,
   and hence increase the precision of the estimate. Larger sample sizes lead to narrower
   interval estimates because larger samples contain more information, and therefore more
   precise estimates should result.

Of these five quantities, only two are under our control: α and n. Once α is chosen, the value of z_α/2 is fixed.
The values of x̄ and s result from experimental observation.
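To make these trade-offs concrete, the short sketch below (my own illustration, assuming scipy is available; the value of s is borrowed from the SalmonCa0 example) tabulates the half-width z_α/2·s/√n for a few choices of α and n:

    from scipy.stats import norm

    s = 22.022   # sample standard deviation from the SalmonCa0 example, used only as an illustration
    for alpha in (0.10, 0.05, 0.01):
        for n in (30, 65, 120):
            half_width = norm.ppf(1 - alpha / 2) * s / n ** 0.5
            print(f"alpha = {alpha:<5}  n = {n:<4}  half-width = {half_width:5.2f} ppm")

Running this shows the pattern described above: for fixed s, smaller α gives a wider interval, while larger n gives a narrower one.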
Selecting an Appropriate Sample Size
In planning a statistical study to obtain an interval estimate of some population parameter, we usually have
just two numbers at our discretion: the value of α (the probability of being wrong due to random sampling
error) and the sample size n. A prudent strategy then is to start by selecting the largest value of α that is
acceptable. Except in special circumstances, this will usually be 0.05.
Secondly, a choice of sample size must be made. This is usually done indirectly by deciding on the largest
value of the sampling error that is acceptable; that is, how precise do you need the interval estimate to
be? The sample size is then chosen to meet this requirement.
In the present application, denote the maximum acceptable value of the sampling error by the symbol ε.
What we want to do then is select a sample size that will ensure

$$z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \;\le\; \varepsilon$$

Note that all dependence on n is shown explicitly here; none of the symbols z_α/2, σ, or ε depend on n. It is
easy to solve this inequality for n, to get

$$n \;\ge\; \left(\frac{z_{\alpha/2}\,\sigma}{\varepsilon}\right)^{2} \qquad \text{(LS - 7a)}$$

You can use this formula directly if you have a value for σ. If not, you need to use your value of s as a point
estimate of the required σ:

$$n \;\ge\; \left(\frac{z_{\alpha/2}\,s}{\varepsilon}\right)^{2} \qquad \text{(LS - 7b)}$$

Since s is only an approximation to σ, the value of n you get from here may not guarantee that the eventual
interval estimate will have a standard error term which is strictly less than ε.
In fact, when you don't have a prior value of σ, there is a bit of a chicken-and-egg problem here. We need a
value of s to estimate the size of sample to be collected. However, we can't get a value of s until we have a
sample (and so we need to know n, it looks like, to get a value for s). In actual practice, one would handle
this situation by carrying out a small-scale sampling (sometimes called a pilot study) to get an estimate of
s. That estimate is then used to compute a rough value of n to use for the main study. Sometimes people
compute n in this way and then increase its value by some amount or some factor for the main study, just to
be on the safe side. Another approach is to carry out the main study with the estimated value of n. If the
results are still of unacceptable precision, further sampling can be done to improve the precision of the final
result.
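Formula (LS - 7b), together with the round-up rule, can be wrapped in a small helper. This is a sketch of my own (the function name is arbitrary; scipy is assumed), not something from the original notes:

    import math
    from scipy.stats import norm

    def required_n(s, epsilon, alpha=0.05):
        """Smallest integer n satisfying (LS - 7b): n >= (z_{alpha/2} * s / epsilon)**2."""
        z = norm.ppf(1 - alpha / 2)
        return math.ceil((z * s / epsilon) ** 2)   # round up because of the inequality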
According to formula (LS - 7b), the value of n that is required is influenced by four quantities. The effect of
each of these by itself is seen to be:

α: choosing α smaller results in z_α/2 being larger and so the required value of n gets larger. All
   other things being kept the same, requiring increased reliability of the estimate results in a
   larger value of n, hence the requirement for a larger sample. On the other hand,
   increasing the value of α (that is, being willing to accept a lesser degree of reliability)
   reduces the size of the sample required.

z_α/2: this value is determined by our choice of the value of α. n increases as z_α/2 increases, and n
   decreases as z_α/2 decreases.

s: the larger the value of s, the larger the value of n needed to achieve the same precision of estimation.
   The greater the variability in a population, the larger will be the value of σ and so also the
   value of its estimate, s. This in turn leads to the requirement of larger sample sizes to
   achieve a specific precision of estimation.

ε: as ε decreases (indicating a more precise estimate is desired), the value of n increases. If you
   want a more precise estimate, or a smaller estimation error, you will need to collect a larger
   sample.
Increasing sample size inevitably increases the time and cost of the experiment. Requiring higher precision
for a given value of α inevitably results in the requirement of a larger sample. Thus, decisions about the
appropriate target precision and the appropriate value of α must be made with some care: selecting
values which are more rigorous than necessary can result in adding unnecessary cost to the study. On the
other hand, even very reliable results may be useless if the precision is inadequate. After all, I can be 100%
confident that it will either rain today or not rain today, but while highly reliable, this weather forecast is
probably too imprecise to be of much utility.
Example SalmonCa0
Earlier, we obtained a 95% confidence interval estimate of the mean calcium content of untreated salmon
fillets which involved an uncertainty of ±6.82 ppm. Suppose we were required to determine this mean
calcium content to the nearest 1 ppm. How many salmon fillets would have to be sampled to achieve this?
Solution
We are being asked to compute the value of n required by ε = 1 ppm. We don't have a value of σ, but from
the sample of size 40 detailed above, we have the estimate s = 22.022 ppm. For the 95% confidence
interval estimate, z_α/2 = 1.96. Thus, substituting into formula (LS - 7b), we get

$$n \;\ge\; \left(\frac{(1.96)(22.022\ \text{ppm})}{1\ \text{ppm}}\right)^{2} \;=\; 1863.05$$
Because of the inequality, we need to round up to the nearest integer, 1864. Thus, it appears that we would
need a sample size of at least 1864 to achieve the requested 1 ppm precision.
Note that this value, 1864, is just an estimate. Whether it achieves the goal or not depends on whether, with
the larger sample, we get a value of s which is greater or less than our present value of 22.022 ppm. What
you might do if this precision is mandatory is just round this value up to, say, 2000, to allow a small safety
factor. Or, collect a sample of size 1864, and if the precision is still not adequate, collect another few
hundred specimens to add to your previous sample, until the desired precision is achieved.
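As a quick check of the arithmetic (a one-off sketch of mine, using z_0.025 rounded to 1.96 exactly as the text does):

    import math
    print(math.ceil((1.96 * 22.022 / 1) ** 2))   # 1864

Note that with the unrounded z_0.025 from a normal table or software, the result comes out one lower (1863), which is a reminder that these sample-size figures are only planning estimates in any case.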
If the original study of 40 specimens was fairly typical, then a proposal to expand the study to include
more than 1800 additional specimens might be considered prohibitively expensive. What this work points
out is that the only alternative is to reduce the target precision; that is, if trying for ε = 1 ppm is too
expensive, then you must be willing to accept a lesser precision. The really important point is that the
values of ε, α, and n you eventually end up using are the result of deliberate decisions based
on what you need to accomplish and what resources you have available; they should not just be wild
guesses.
∎