MATH 2441 Probability and Statistics for Biological Sciences

Confidence Interval Estimates of the Mean (Large Sample Case)

We now have most of the tools we need in order to begin deriving formulas for interval estimates of various population parameters. The population mean is one of the most important parameters to be able to estimate. We will look briefly at how interval estimate formulas arise under a relatively simple set of circumstances (our derivations will be rather informal -- rather than aiming for the most general and mathematically rigorous derivation, we aim instead to give you a picture of how these formulas result from our recent discussion of sampling distributions), how the formulas are used, and, most importantly, how the results are to be interpreted.

Recall from our discussion of the sampling distribution of the mean that for a large enough sample (usually taken to be a sample with n ≥ 30 elements), irrespective of the nature of the population distribution, the sample mean, x̄, is an approximately normally distributed random variable with

    mean:  μx̄ = μ, the population mean,  and
    standard deviation:  σx̄ = σ/√n

where σ is the population standard deviation. (This same result holds for samples of any size if the population is approximately normally distributed. However, for the resulting formulas to be of practical use, it is also necessary to know the value of σ if the sample size is less than about 30.)

Under these conditions, then, we can evaluate the following rather strange-looking probability:

    Pr( μ − z_{α/2}·σx̄ ≤ x̄ ≤ μ + z_{α/2}·σx̄ )
      = Pr( (μ − z_{α/2}·σx̄ − μ)/σx̄ ≤ z ≤ (μ + z_{α/2}·σx̄ − μ)/σx̄ )
      = Pr( −z_{α/2} ≤ z ≤ z_{α/2} )
      = 1 − α/2 − α/2
      = 1 − α                                                        (LS-1)

We get the second line here by converting the event about x̄ in the first line into its equivalent in terms of z. This is justified because x̄ is an approximately normally distributed random variable. The third line results from algebraic simplification after substituting z = (x̄ − μ)/σx̄.
Finally, the last line comes from the meaning of z_{α/2}, the value of z which cuts off a right-hand tail region of area α/2 (and similarly, its symmetric partner, −z_{α/2}, cuts off a left-hand tail of area α/2 -- see the diagram).

[Figure: the standard normal curve, with a tail of area α/2 cut off to the left of −z_{α/2} and another tail of area α/2 cut off to the right of z_{α/2}.]

So, it appears that we have just demonstrated that under the conditions described at the beginning, the event

    μ − z_{α/2}·σx̄ ≤ x̄ ≤ μ + z_{α/2}·σx̄                            (LS-2)

has a probability of (1 − α). For instance, if we were to choose α = 0.05, then the probability of this event would be 0.95. The whole trouble is that this is a rather useless event as it is written, for at least two reasons:

  - in real life, we always measure the value of x̄ directly; the value of x̄ is always obtained experimentally -- and so we don't really need any probability statements about it.
  - the expressions μ − z_{α/2}·σx̄ and μ + z_{α/2}·σx̄ in the event above contain μ and σx̄ = σ/√n, but we don't know the values of these, and so we don't know what precise event the above expression is really describing.

But we can rearrange the expression (LS-2). The left-hand part gives

    μ − z_{α/2}·σx̄ ≤ x̄    or equivalently    μ ≤ x̄ + z_{α/2}·σx̄

and the right-hand part gives

    x̄ ≤ μ + z_{α/2}·σx̄    or equivalently    x̄ − z_{α/2}·σx̄ ≤ μ

Putting these two back together, and reinserting them into the earlier probability expression, we get

    Pr( x̄ − z_{α/2}·σx̄ ≤ μ ≤ x̄ + z_{α/2}·σx̄ ) = 1 − α              (LS-3)

This is a rather unusual statement. The event, x̄ − z_{α/2}·σx̄ ≤ μ ≤ x̄ + z_{α/2}·σx̄, seems to be an event about the value of μ, which is not a random variable. In fact, the randomness in this event is not in the value of the symbol, μ, at its center, but in the boundaries of the interval, which are stated in terms of x̄, which is a random variable. (σx̄ is not a random variable, since it is the standard deviation of the population of sample means, and z_{α/2} is just a number determined by our choice of the value of α.) So, what (LS-3) is really saying is that there is a probability of (1 − α) that the random interval constructed as shown will really contain or capture the actual value of μ.

©David W. Sabo (1999)    Large Sample Estimates of the Mean    Page 1 of 10
This interpretation of equation (LS-3) is absolutely central to understanding what the results we get in these calculations really mean. We will explore its implications in some detail before we're done here. To restate: the interval estimate

    x̄ − z_{α/2}·σx̄ ≤ μ ≤ x̄ + z_{α/2}·σx̄                            (LS-4a)

sometimes written in the form

    x̄ ± z_{α/2}·σx̄                                                  (LS-4b)

is called the 100(1 − α)% confidence interval estimate of μ (large sample case). There is a probability of (1 − α), or 100(1 − α)%, that the interval that results when actual values are substituted for x̄, z_{α/2}, and σx̄ will actually capture or contain the single, true, but unknown value of μ. Incidentally, the term

    z_{α/2}·σx̄                                                      (LS-5)

is called the sampling error. It tells you by how much estimating μ using the observed value of x̄ could be in error at this level of confidence.

What Is the Value of α?

The symbol α has recurred often so far in this document. The quantity (1 − α) is the probability that the interval estimate we construct for μ really does capture the value of μ. This means that α is the probability that our interval estimate is incorrect -- that it does not capture the value of μ. The value of (1 − α) is called the level of confidence -- it is a measure of how confident we are that the interval estimate has captured the desired value μ. (Remember, we have no way of determining μ exactly and for certain, because that would require sampling the entire population. The only information we have regarding what the value of μ might be is the information used in the formula (LS-4a, b) giving the interval estimate. This interval may or may not contain the actual value of μ. We have no way of knowing for sure if it does or if it doesn't -- but we do have a probability, which is better than no clue at all!) So, α is the probability of the interval estimate being wrong. We may choose the value of α as we please (though, of course, being a probability, α can't be less than zero or greater than 1).
While you are free to choose α as you please, your choice has some implications. If you pick a rather large value for α, then there will be a rather large likelihood of the resulting interval estimate being wrong. On the other hand, if you pick an extremely small value for α (thinking that the less chance of being wrong, the better), you'll find that the sampling error, (LS-5), becomes extremely large. The smaller the value of α, the larger the value of z_{α/2}: in the illustration above, you can see that in order to get smaller tail areas, we need to move the cutoff value of z further away from zero. It follows that in order to be absolutely certain of our interval estimate of μ, we would have to reduce α, the combined tail area, to zero. This can only be achieved by having z_{α/2} become infinite, giving the interval estimate −∞ ≤ μ ≤ +∞, an absolutely certain result which is absolutely useless. If you want greater certainty, you must pay for it with reduced precision, in that a wider interval results. This is not just a feature of statistical inference (though it shows up very blatantly in all aspects of statistical inference), but of life in general.

For most routine statistical work, people use α = 0.05 as a reasonable compromise between certainty and precision, and so using α = 0.05 has the support of convention or tradition. A decision to use a different value of α would have to be based on your assessment of the severity of the consequences that may result from an estimation error. If an incorrect estimate would have very serious consequences, you may be able to justify using a smaller value of α at the expense of ending up with a broader, less precise estimate. On the other hand, few practitioners would feel comfortable with an interval estimate based on a value of α much larger than 0.10 in any application.
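The trade-off just described can be made concrete with a short sketch (not part of the original handout) using Python's standard library: as α shrinks, the critical value z_{α/2}, and with it the interval width, grows without bound.

```python
from statistics import NormalDist

# As alpha shrinks, z_{alpha/2} = inv_cdf(1 - alpha/2) grows without bound,
# so the interval x-bar +/- z_{alpha/2}*sigma_x-bar gets wider and wider.
z_values = {}
for alpha in (0.10, 0.05, 0.01, 0.001):
    z_values[alpha] = NormalDist().inv_cdf(1 - alpha / 2)
    print(f"alpha = {alpha}: z_(alpha/2) = {z_values[alpha]:.3f}")
```

Running this prints the familiar critical values 1.645, 1.960, 2.576, and 3.291, each larger than the last.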
(If it were necessary to use a value of α much less than 0.05, it would be good to consider other modifications to the experiment (see below) to control the width of the interval estimate.) Unless instructed otherwise, or unless you can defend a decision to do otherwise, use α = 0.05.

Where Do I Get a Value for σ?

Formulas (LS-4a, b) require a value for σx̄ = σ/√n, and hence a value for σ, the standard deviation of the population for which we are trying to estimate the mean, μ. Occasionally, situations arise in which σ is known fairly precisely even though the mean, μ, is not. Remember that σ is a measure of variability in a population, whereas μ is an indication of where on the number line that population is located. In technical applications, the value of μ can be the result of calibration, whereas the value of σ is the result of some inherent characteristic, design feature, or quality of the situation under study. This means that the value of σ may be known quite precisely from long-term records, while the value of μ must be estimated whenever the process is recalibrated, to confirm the calibration. In such instances, all the information you need to implement formulas (LS-4a, b) is available.

More usually, when μ is being estimated, there is no prior direct information available about the precise value of σ. In this case, the original probability equation, (LS-1), on which our confidence interval formulas (LS-4) are based, contains two unknowns, μ and σ, and we seem to be stymied. However, since the focus is on estimating μ, and since, under the circumstances in which (LS-4) is valid, s is usually an adequate point estimator of σ, the common practice is to substitute the value of s obtained from the sample for the required value of σ. Thus, (LS-4b) becomes

    x̄ ± z_{α/2}·s/√n    @ 100(1 − α)%                               (LS-6)

Example SalmonCa0: Just to give a sense of what sort of results the formulas above give, consider the standard example SalmonCa0.
Recall the "story." A technologist is studying the suitability of chlorine dioxide (ClO2) for sanitizing salmon fillets. One concern is that the ClO2 treatment might adversely affect the mineral content of the fillets. First, she selects 40 salmon fillets which have not been treated at all, and obtains the following concentrations of calcium (in parts per million, ppm):

     75  107   72   61   56   90   52   61   52   53
     76   59   73   68  103   72   63   68   78   88
     94   69   67   68   47  120   96   43   54   91
     63  107   72  101   83   29   54  101   56  129

From these 40 values obtained for the random sample of 40 untreated salmon fillets, we can calculate:

    x̄ = 74.28 ppm        s = 22.022 ppm

For a 95% confidence interval estimate of μ, we need α = 0.05, and so we require z_{0.05/2} = z_{0.025} = 1.96, to two decimal places. Substituting all this into formula (LS-6) then gives:

    x̄ ± z_{α/2}·s/√n @ 100(1 − α)%  =  74.28 ppm ± (1.96)(22.022 ppm)/√40  @ 95%

We can state this finally in one of two ways, paralleling (LS-4a) and (LS-4b):

    67.46 ppm ≤ μ ≤ 81.10 ppm  @ 95%      or      μ = 74.28 ppm ± 6.82 ppm  @ 95%

Either form is acceptable, though the second form is probably more usual because it displays both the value of the sample mean and the estimation error explicitly. What this result tells us is that there is a 95% probability that the interval between 67.46 ppm and 81.10 ppm contains the true value of μ.

A Deeper Example

Because it is so important that you have a good understanding of exactly what these formulas are telling you about the population in question, we need to look at a somewhat more detailed example of the interval estimation process. For this, we return to the simple classroom population, which consists of equal numbers of just six values: {1, 2, 3, 5, 7, 12}. This is the same population we worked with in our early experiments with sampling, described in the documents "The Real Problem" and "The Arithmetic Mean: Sampling Issues."
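Before going further, the SalmonCa0 interval above is easy to check numerically. This is a verification sketch (not part of the original handout) using only Python's standard library:

```python
import math
import statistics

# Reproducing example SalmonCa0: 95% CI for the mean calcium concentration.
calcium = [75, 107, 72, 61, 56, 90, 52, 61, 52, 53,
           76, 59, 73, 68, 103, 72, 63, 68, 78, 88,
           94, 69, 67, 68, 47, 120, 96, 43, 54, 91,
           63, 107, 72, 101, 83, 29, 54, 101, 56, 129]

xbar = statistics.mean(calcium)             # 74.275 ppm (74.28 rounded)
s = statistics.stdev(calcium)               # about 22.022 ppm
e = 1.96 * s / math.sqrt(len(calcium))      # sampling error, about 6.82 ppm
print(round(xbar - e, 2), round(xbar + e, 2))   # about 67.45 and 81.10
```

The lower limit prints as 67.45 rather than 67.46 only because the handout rounds x̄ to 74.28 ppm before subtracting the sampling error.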
[Figure: relative frequency histogram of the population distribution; each of the six values 1, 2, 3, 5, 7 and 12 has relative frequency 1/6.]

Though infinite in size, this population is far from normally distributed. It does have the advantage that we know everything that can be known about it, and so we are able to compute μ and σ exactly, and so we will be able to see directly how accurate the results of estimating μ via sampling and the formulas given earlier in this document are.

First, we calculate μ and σ, for later reference. Using formulas given in an earlier document ("Calculating Probabilities III"),

    μ = Σ x_k·Pr(x = x_k)
      = (1)(1/6) + (2)(1/6) + (3)(1/6) + (5)(1/6) + (7)(1/6) + (12)(1/6) = 5

using the fact that each of the six distinct values that occur in the population has the same probability, 1/6. Then

    σ² = Σ (x_k − μ)²·Pr(x = x_k)
       = (1 − 5)²(1/6) + (2 − 5)²(1/6) + (3 − 5)²(1/6) + (5 − 5)²(1/6) + (7 − 5)²(1/6) + (12 − 5)²(1/6)
       = 82/6 = 13.667

Thus

    σ = √13.667 = 3.6969

Now, the experiment will work like this. We will select a succession of random samples of size n = 30 and another succession of random samples of size n = 65 from this population. For each of these samples, we will calculate x̄ and s, and then use equation (LS-6) to compute a 90% confidence interval estimate of μ. This will give us an idea of how well the "theory" above works in practice, for this particular population at least.

For example, the first sample of size n = 30 yielded the following data:

     1  12   5   5   7   1   5   2   7  12
     7   1   7  12   5  12   7   2   1  12
     5   7   1   3   1   3   2   3   2   1

These numbers have a sum of 151, and the sum of their squares is 1189. Given that n = 30, this means that

    x̄ = 151/30 = 5.033

and

    s² = [1189 − (30)(5.033²)]/29 = 14.79195    giving    s = √14.79195 = 3.846

To construct a 90% confidence interval estimate, we use α = 0.10, or α/2 = 0.05, and so we need z_{0.05} ≅ 1.645. Thus, based on the data in this one sample, (LS-6) gives

    5.033 ± (1.645)(3.846)/√30  =  5.033 ± 1.155  @ 90%

or

    3.878 ≤ μ ≤ 6.188  @ 90%

Note that this interval estimate does contain the known exact value of μ, which is 5.

We now repeat this process 39 more times, to get a total of 40 replications of the random sampling/confidence interval estimate construction process. We graph these intervals relative to the same horizontal scale of x-values to facilitate comparison and insight into the interpretation of the process. The 40 random samples of size 30 and the 40 random samples of size 65 each gave a set of 40 values of x̄ and 40 values of s. The following figures are scatter line plots of these:

[Figure: scatter line plots of the 40 values of the sample mean and the 40 values of the sample standard deviation, for each of the two sample sizes, n = 30 and n = 65.]

Two features of these plots are noteworthy. First, in all instances, there is considerable scatter in the values of the statistics about the value of the population parameter of which they are point estimators. Secondly, the amount of scatter is quite dramatically smaller for the larger sample size than for the smaller sample size. Clearly, for this population and these sample sizes, use of x̄ as a point estimator of μ, and use of s as a point estimator of σ, is quite a risky business. The following two figures compare the forty 90% confidence interval estimates of μ obtained for each sample size:

[Figure: the forty 90% confidence intervals for n = 30 and the forty 90% confidence intervals for n = 65, plotted against a common horizontal scale of x.]

The horizontal scales here have the same size, so direct comparison of the widths of the intervals is valid. You see immediately that the intervals resulting from the n = 65 samples are visibly narrower than those resulting from the n = 30 samples. Since wider interval estimates mean less precise estimates of the population parameter, we are clearly getting more precise estimates with the larger sample size. Notice that with the n = 30 samples, six of the forty interval estimates miss the value 5 entirely. We would have expected about four of the forty (or 10%) to have this defect.
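The single-sample calculation above can be reproduced in a few lines. This is a check sketch (not from the original handout) using only the standard library:

```python
import math
import statistics

# Reproducing the first sample of size n = 30 and its 90% confidence interval.
sample = [1, 12, 5, 5, 7, 1, 5, 2, 7, 12,
          7, 1, 7, 12, 5, 12, 7, 2, 1, 12,
          5, 7, 1, 3, 1, 3, 2, 3, 2, 1]

assert sum(sample) == 151 and sum(x * x for x in sample) == 1189

xbar = statistics.mean(sample)            # 5.033...
s = statistics.stdev(sample)              # 3.846...
e = 1.645 * s / math.sqrt(len(sample))    # sampling error, 1.155...
print(round(xbar - e, 3), round(xbar + e, 3))   # 3.878 and 6.188
```

The printed limits, 3.878 and 6.188, match the interval worked out by hand above.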
However, n = 30 is right on the borderline of validity as far as the rule of thumb is concerned, and we're working with a population that has a rather less than usual degree of "normality" in its distribution, so the envelope is being pushed pretty hard here. Only three of the forty samples of size n = 65 gave interval estimates that missed the actual value of the population parameter, which is more in line with the 10% miss rate that should be expected.

Notice that even among those intervals which do capture the true value of μ, there is quite a bit of variation in horizontal position. While a few of the intervals are more or less centered on 5, the true value of μ (which would happen in instances where x̄ had a value near 5), many have the value 5 rather close to one end or the other. Thus, when you construct a confidence interval estimate, you aren't really justified in thinking of the middle of the interval as being where the true value of the population parameter probably is. All that you can say is that there is a certain probability that the value of the parameter being estimated is somewhere inside the stated interval.

In fact, this is the catch. In real life, you would not take forty different random samples from a population in order to compare the results as was done above. You take just one random sample, and base your calculations and conclusions on that one sample. If we had been studying this classroom population the way statisticians study real populations, we would have taken just one sample of the type illustrated above. There would have been a 10% probability that our interval estimate of μ missed its true value entirely. And even if our interval was among those which really did capture the true value of μ, that true value might be anywhere in the interval, from the extreme left end to the extreme right end.
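The replication experiment described above can be sketched with many more repetitions than forty. This simulation (my own sketch, not part of the handout) draws samples of size 30 from the classroom population, builds the 90% interval (LS-6) from each, and counts how often the true mean μ = 5 is captured; the capture rate should land near the nominal 90%.

```python
import math
import random
import statistics

random.seed(1)
population = [1, 2, 3, 5, 7, 12]   # the classroom population
mu, n, z = 5, 30, 1.645            # true mean, sample size, z_{0.05}

reps = 5000
hits = 0
for _ in range(reps):
    sample = random.choices(population, k=n)        # random sample of size 30
    xbar = statistics.mean(sample)
    e = z * statistics.stdev(sample) / math.sqrt(n) # sampling error (LS-6)
    if xbar - e <= mu <= xbar + e:                  # did the interval capture mu?
        hits += 1

print(hits / reps)   # close to the nominal 0.90
```

As the handout notes, with n = 30 and this decidedly non-normal population the observed capture rate tends to run slightly below the nominal 90%.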
The Major Pieces of a Confidence Interval Estimate

The right-hand side of formula (LS-6),

    x̄ ± z_{α/2}·s/√n    @ 100(1 − α)%

contains five symbols, each of which plays a role that shows up in confidence interval formulas for many different types of population parameters. We note their meanings and implications briefly here.

x̄ is the point estimator of the population parameter of interest, here μ. It is typical to formulate interval estimates by appending a sampling error term to a point estimator.

α is a value determining the confidence level of the estimate. In general, this symbol stands for the probability of being wrong in the result that is asserted. So, all things considered, smaller values of α correspond to greater reliability -- less likelihood of being wrong or making a mistake.

z_{α/2} is a factor that represents the characteristics of the distribution of the point estimator being used. In this large-sample estimation of the population mean, the point estimator x̄ is approximately normally distributed, and so we use values of z. As you'll see when we deal with the small sample case, there we need to use values obtained from a different, more spread-out distribution (called the t-distribution). What happens there is that there is so much less information available with a small sample that the precision possible for a given value of α is poorer than indicated by the normal distribution. This critical value is a factor that represents how well a sampling scheme is able to make use of the available information. Note as well that the value of z_{α/2} is affected very strongly by the value of α. If you choose a very small value of α, you'll end up with a large value for z_{α/2}, and so a very broad interval estimate. Thus, there is a trade-off between reliability (how likely is your estimate to be wrong?) and precision (how narrow is the interval?). All other things being equal, increasing reliability always results in reduced precision, and vice versa.
For a given sample size, you cannot have arbitrarily high reliability and high precision at the same time.

s measures the variability of the population being sampled. Samples selected from a very uniform population (for which σ, and hence s, would be small numbers) will generally be quite representative of that population, and so interval estimates of population parameters should be quite precise and/or reliable. On the other hand, it will be much more difficult to get precise estimates by sampling highly variable populations. For this reason, the larger the value of s, the poorer the precision of the interval estimate for a specific sample size.

n is the sample size. The width of the interval estimate is inversely proportional to the square root of the sample size, n. Thus, by increasing the sample size, we can decrease this width, and hence increase the precision of the estimate. Larger samples contain more information, and so lead to narrower, more precise interval estimates.

Of these five quantities, only two are under our control: α and n. Once α is chosen, the value of z_{α/2} is fixed. The values of x̄ and s result from experimental observation.

Selecting an Appropriate Sample Size

In planning a statistical study to obtain an interval estimate of some population parameter, we usually have just two numbers at our discretion: the value of α (the probability of being wrong due to random sampling error) and the sample size n. A prudent strategy, then, is to start by selecting the largest value of α that is acceptable. Except in special circumstances, this will usually be 0.05. Secondly, a choice of sample size must be made. This is usually done indirectly, by deciding on the largest acceptable value of the sampling error -- that is, how precise do you need the interval estimate to be? The sample size is then chosen to meet this requirement.
In the present application, denote the maximum acceptable value of the sampling error by the symbol ε. What we want to do, then, is select a sample size n that will ensure

    z_{α/2}·σ/√n ≤ ε

Note that all dependence on n is shown explicitly here: none of the symbols z_{α/2}, σ, or ε depends on n. It is easy to solve this inequality for n, to get

    n ≥ (z_{α/2}·σ/ε)²                                               (LS-7a)

You can use this formula directly if you have a value for σ. If not, you need to use your value of s as a point estimate of the required σ:

    n ≥ (z_{α/2}·s/ε)²                                               (LS-7b)

Since s is only an approximation to σ, the value of n you get from (LS-7b) may not guarantee that the eventual interval estimate will have a sampling error which is strictly less than ε. In fact, when you don't have a prior value of σ, there is a bit of a chicken-and-egg problem here. We need a value of s to estimate the size of sample to be collected; however, we can't get a value of s until we have a sample (and so, it seems, we need to know n in order to get a value for s). In actual practice, one would handle this situation by carrying out a small-scale sampling (sometimes called a pilot study) to get an estimate of s. That estimate is then used to compute a rough value of n to use for the main study. Sometimes people compute n in this way and then increase its value by some amount or some factor for the main study, just to be on the safe side. Another approach is to carry out the main study with the estimated value of n; if the results are still of unacceptable precision, further sampling can be done to improve the precision of the final result.

According to formula (LS-7b), the value of n that is required is influenced by four quantities. The effect of each of these by itself is as follows:

α: choosing a smaller α results in z_{α/2} being larger, and so the required value of n gets larger.
All other things being kept the same, requiring increased reliability of the estimate results in a larger value of n, and hence the requirement of a larger sample. On the other hand, increasing the value of α (that is, being willing to accept a lesser degree of reliability) reduces the size of the sample required.

z_{α/2}: this value is determined by our choice of the value of α. n increases as z_{α/2} increases, and n decreases as z_{α/2} decreases.

s: the larger the value of s, the larger the value of n needed to achieve the same precision of estimation. The greater the variability in a population, the larger will be the value of σ, and so also the value of its estimate, s. This in turn leads to the requirement of larger sample sizes to achieve a specific precision of estimation.

ε: as ε decreases (indicating that a more precise estimate is desired), the value of n increases. If you want a more precise estimate -- a smaller estimation error -- you will need to collect a larger sample.

Increasing sample size inevitably increases the time and cost of the experiment, and requiring higher precision for a given value of α inevitably requires a larger sample. Thus, decisions about the appropriate target precision and the appropriate value of α must be made with some care: selecting values which are more rigorous than necessary can add unnecessary cost to the study. On the other hand, even very reliable results may be useless if the precision is inadequate. After all, I can be 100% confident that it will either rain today or not rain today, but while highly reliable, this weather forecast is probably too imprecise to be of much utility.

Example SalmonCa0

Earlier, we obtained a 95% confidence interval estimate of the mean calcium content of untreated salmon fillets which involved an uncertainty of ±6.82 ppm. Suppose we were required to determine this mean calcium content to the nearest 1 ppm. How many salmon fillets would have to be sampled to achieve this?
Solution: We are being asked to compute the value of n required by ε = 1 ppm. We don't have a value of σ, but from the sample of size 40 detailed above, we have the estimate s = 22.022 ppm. For the 95% confidence interval estimate, z_{α/2} = 1.96. Thus, substituting into formula (LS-7b), we get

    n ≥ [ (1.96)(22.022 ppm) / (1 ppm) ]² = 1863.05

Because of the inequality, we need to round up to the nearest integer. Thus, it appears that we would need a sample size of at least 1864 to achieve the requested ±1 ppm precision.

Note that this value, 1864, is just an estimate. Whether it achieves the goal or not depends on whether, with the larger sample, we get a value of s which is greater or less than our present value of 22.022 ppm. What you might do if this precision is mandatory is round this value up to, say, 2000, to allow a small safety factor. Or, collect a sample of size 1864, and if the precision is still not adequate, collect another few hundred specimens to add to your previous sample, continuing until the desired precision is achieved.

If the original study of 40 specimens was fairly typical, then a proposal to expand the study to include more than 1800 additional specimens might be considered prohibitively expensive. What this work points out is that if trying for ε = 1 ppm is too expensive, then the only alternative is to be willing to accept a lesser precision. The really important point is that the values of α, ε, and n you eventually end up using should be the result of deliberate decisions based on what you need to accomplish and what resources you have available -- they should not just be wild guesses.
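The sample-size calculation in this solution can be sketched as a small helper function. This is my own sketch of formula (LS-7b); the function and argument names are not from the handout.

```python
import math

def required_n(z_half_alpha, s, eps):
    """Smallest sample size n satisfying z_{alpha/2} * s / sqrt(n) <= eps."""
    # Solve the inequality for n and round up, since n must be an integer.
    return math.ceil((z_half_alpha * s / eps) ** 2)

# The SalmonCa0 requirement: 95% confidence (z_{0.025} = 1.96),
# s = 22.022 ppm, maximum sampling error eps = 1 ppm.
print(required_n(1.96, 22.022, 1.0))   # 1864
```

Relaxing the target precision shrinks the requirement quickly: for example, accepting ε = 2 ppm cuts the required sample size by a factor of four.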