MATH 2441 Probability and Statistics for Biological Sciences

Confidence Interval Estimates of the Mean (Small Sample Case)

In the previous document, we looked at the construction of confidence interval estimates of the population mean when either of the following conditions applied: a large sample was available (taken to mean that the sample size was around 30 or greater); or the sample size was less than 30, but we had evidence that the population was approximately normally distributed and we knew the value of the population standard deviation, $\sigma$.

Neither of these sets of conditions covers a very common situation: the sample size is less than 30 and we aren't able to meet all the conditions of the second case above. These would be called small sample cases. As it turns out, there are two possible situations that can arise here as well.

(i.) The data or any other information we have about the population does not allow us to declare or even assume that the population is approximately normally distributed. In this case, we're sunk! Some references suggest one might use a so-called "non-parametric" or "distribution-free" technique (meaning a statistical method that doesn't take any account of the distribution of a population). Methods of this type do exist, but they tend to give relatively poor results when small sample sizes are involved. Another alternative is to collect more data to push the sample size up so that the large sample formulas can be used.

(ii.) We have evidence that the population is approximately normally distributed (or we're prepared to assume that this is true). We will look briefly at methods for assessing the consistency of data with the assumption of a normally distributed population a bit later in the course.

In this second case (that the population is approximately normally distributed), it can be shown mathematically that the quantity

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} \qquad \text{(SS-1)}$$

has the t-distribution with $\nu = n - 1$ degrees of freedom.
Using this, we can obtain the following $100(1 - \alpha)\%$ confidence interval estimate of the population mean, $\mu$:

$$\bar{x} - t_{\alpha/2,\nu}\,\frac{s}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{x} + t_{\alpha/2,\nu}\,\frac{s}{\sqrt{n}} \qquad \text{(SS-2a)}$$

or

$$\mu = \bar{x} \pm t_{\alpha/2,\nu}\,\frac{s}{\sqrt{n}} \qquad \text{(SS-2b)}$$

Notice that this latter formula has the general pattern:

population parameter = point estimator $\pm$ probability factor $\times$ standard error

as we encountered in the large sample formula for confidence interval estimates of the population mean. All that has really changed here is that the probability factor now is taken from the t-distribution rather than the standard normal distribution.

There's not much more to be said about the method. We will illustrate it with a couple of examples.

©David W. Sabo (1999) Small Sample Estimates of the Mean

Example Cpeas:

Construct a 95% confidence interval estimate of the mean vitamin C content (in mg/100g) of canned peas, based on the standard data set Cpeas.

Solution:

Consulting the document entitled "Example Data Sets", we find that twelve specimens of canned peas were analyzed for vitamin C content, yielding the results:

9.7 7.0 8.2 9.5 6.6 5.0 6.5 8.2 6.5 7.3 6.8 10.6 (CpeasCanned)

These 12 numbers have a mean, $\bar{x}$, of 7.66 mg/100g and a standard deviation of 1.623 mg/100g. The normal probability plot for this data is shown in the figure, indicating reasonable grounds for regarding this data as consistent with a normally distributed population. (The figure shows a normal probability plot prepared using the method suggested by Snedecor and Cochran; for perfect consistency with a normally distributed population, the points should fall along a straight line. The deviations of the points from a straight line appear to be random and uniform, and there are no trends near the ends of the pattern that might contradict an assumption of normality.) The conditions under which formula (SS-2) is valid are thus met.

[Figure: normal probability plot of the Cpeas data]

Since $n = 12$, we have $\nu = 12 - 1 = 11$. For a 95% confidence interval, we need $\alpha = 0.05$, so that $\alpha/2 = 0.025$.
From the t-table, we get $t_{0.025,11} = 2.201$. Thus, the requested confidence interval estimate is:

$$\mu = 7.66 \pm (2.201)\,\frac{1.623}{\sqrt{12}} = 7.66 \pm 1.03 \text{ mg/100 g @ 95\%}$$

That is, there is a 95% chance that the interval $7.66 \pm 1.03$ mg/100 g captures the actual value of $\mu$ in this case. You could also write this interval as $6.63 \le \mu \le 8.69$ mg/100 g.

Example Cheddar

Construct a 99% confidence interval estimate for the mean percentage fat in cheddar cheese based on the data set CheddarFat.

Solution

The data consists of 21 values:

27.2 26.3 27.9 28.1 28.7 32.6 24.6 23.3 30.8 35.0 21.5 22.6 28.0 24.8 24.8 27.9 22.9 28.3 35.0 32.6 29.3 (CheddarFat)

The normal probability plot for this data is shown in the figure just below. It's not an ideal plot: the tendency of the pattern of points to level out near the extremes is consistent with a somewhat tighter clustering pattern than the normal distribution in the population, but the sample size here is so small that this is not a decisive feature. We will consider the normal probability plot to be an adequate basis to consider formula (SS-2) to be valid for this data.

[Figure: normal probability plot of the CheddarFat data]

For a 99% confidence interval estimate, we need $\alpha = 0.01$, so $\alpha/2 = 0.005$. Since $n = 21$, we have $\nu = 20$. The mean value, $\bar{x}$, for this data is 27.72 % and the standard deviation, $s$, is 3.898 %. The probability factor is $t_{.005,20} = 2.845$, so that the requested confidence interval estimate becomes

$$\mu = 27.72 \pm (2.845)\,\frac{3.898}{\sqrt{21}} = 27.72 \pm 2.420 \text{ \% @ 99\%}$$

(By the way, even though the values here are percents, we are really working with population and sample mean values: the mean values of the percent fat in the cheddar cheese. We know that we are not dealing with proportions, which are also often written as percents, because a proportion denotes a percentage of a total number of items which fall into a certain category. That is not the case in this example.)
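The two worked examples above can be checked with a few lines of Python. This is only a sketch using the standard library: the function name `t_interval` is hypothetical, and the t-table value is passed in by hand (2.201 and 2.845, as read from the t-table in the examples) rather than computed.

```python
import math
import statistics


def t_interval(data, t_crit):
    """Return (mean, half_width) for a small-sample interval of form (SS-2b).

    t_crit is the tabulated t value for alpha/2 with n - 1 degrees of
    freedom. It is supplied by the caller (read from a t-table), which is
    an assumption of this sketch, not something the function works out.
    """
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    half_width = t_crit * s / math.sqrt(n)
    return mean, half_width


cpeas = [9.7, 7.0, 8.2, 9.5, 6.6, 5.0, 6.5, 8.2, 6.5, 7.3, 6.8, 10.6]
cheddar = [27.2, 26.3, 27.9, 28.1, 28.7, 32.6, 24.6, 23.3, 30.8, 35.0, 21.5,
           22.6, 28.0, 24.8, 24.8, 27.9, 22.9, 28.3, 35.0, 32.6, 29.3]

cpeas_mean, cpeas_hw = t_interval(cpeas, 2.201)        # 95%, 11 d.f.
cheddar_mean, cheddar_hw = t_interval(cheddar, 2.845)  # 99%, 20 d.f.

print(f"Cpeas:   {cpeas_mean:.2f} +/- {cpeas_hw:.2f} mg/100 g @ 95%")
print(f"Cheddar: {cheddar_mean:.2f} +/- {cheddar_hw:.2f} % @ 99%")
```

Passing the table value in explicitly mirrors the hand calculation; a statistics library could supply the t value instead, but that would be a different workflow from the one described here.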
Deciding on Sample Sizes

As in the large sample case, the width of confidence interval estimates can be reduced by selecting larger samples. Thus, if we wish the '$\pm$' part of formula (SS-2b) to be smaller than some value $\delta$, we could write

$$t_{\alpha/2,\nu}\,\frac{s}{\sqrt{n}} \le \delta \quad\Longrightarrow\quad n \ge \left(\frac{t_{\alpha/2,\nu}\, s}{\delta}\right)^2 \qquad \text{(SS-3)}$$

The only difficulty here is that the constant $t_{\alpha/2,\nu}$ appearing on the right-hand side depends not only on $\alpha$, but also on the value of $n$, through $\nu$. As a result, we cannot solve for $n$ as cleanly as was possible in the large sample case, where the probability factor, $z_{\alpha/2}$, did not depend on $n$.

The recommended strategy here is to seek a self-consistent solution to (SS-3) by starting off with $z_{\alpha/2}$ substituted for $t_{\alpha/2,\nu}$ in (SS-3). If this gives a value of $n$ which is larger than 30, then that is the appropriate sample size to use, and the confidence interval formulas for the large sample case can be used. If this first estimate of $n$ gives a value smaller than 30, then you need to use trial and error to find an appropriate value of $n$ that will satisfy (SS-3) as written. This is not a big issue, because usually when a specific precision is targeted in statistical estimation, it is a high enough precision that the smallest acceptable sample size will still correspond to a large sample case.

Example Cpeas

Just to illustrate, return to the first example above, involving the estimation of the vitamin C content of canned green peas. There we obtained a 95% confidence interval estimate which contained the uncertainty term $\pm 1.03$ mg/100 g. Suppose it was necessary to try to reduce this uncertainty to just $\pm 0.5$ mg/100 g (meaning to achieve $\delta = 0.5$ mg/100g in formula (SS-3)). To estimate the sample size required, we use $z_{.025} = 1.96$ in formula (SS-3) to get

$$n \ge \left(\frac{(1.96)(1.623)}{0.5}\right)^2 = 40.47$$

Since this result is larger than 30, we conclude that to achieve an estimate with $\delta = 0.5$ mg/100g, we would need to collect a sample of size 41 or larger.
With this sample size, the confidence interval estimate can then be calculated using the large sample formulas.

On the other hand, suppose the goal was an estimate with an uncertainty term of $\delta = 0.8$ mg/100g. Now, the calculation above becomes

$$n \ge \left(\frac{(1.96)(1.623)}{0.8}\right)^2 = 15.81$$

This result is smaller than 30, and so is inconsistent with the formula used to calculate it (we've assumed a large sample model in the bracketed expression but obtained a solution indicating a small sample). The large sample formulas tend to underestimate the required value of $n$, so it is reasonable to expect that the required value of $n$ here will be some number slightly larger than 16. Using $n$ = 17, 18, 19 and 20, for example, in the expression

$$\left(\frac{t_{.025,\nu}\,(1.623)}{0.8}\right)^2$$

we get

assumed n              17     18     19     20
value of expression    18.5   18.3   18.2   18.0

What this table says is:

if you assume n = 17, formula (SS-3) says use n = 19 (an inconsistency);
if you assume n = 18, formula (SS-3) says use n = 19 (an inconsistency);
if you assume n = 19, formula (SS-3) says use n = 19 (consistent!);
if you assume n = 20, formula (SS-3) calls for an n of only about 18, fewer than assumed (an inconsistency).

Thus, the appropriate sample size would appear to be 19. Of course, these estimates of $n$ are correct only if, for the expanded sample, you get $s$ very close to 1.623 mg/100 g of peas. Thus, there may not be much point in quibbling over one or two items in the sample unless each element of the sample represents an extremely costly procedure.
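The trial-and-error search for a self-consistent n can be sketched in Python as follows. The names `required_n` and `smallest_consistent_n` are hypothetical, the parameter `target` stands for the desired size of the uncertainty term, and the small dictionary holds only the standard two-tailed 95% t-table entries needed for n = 17 to 20, so the sketch is valid only for this example.

```python
import math

# Standard t-table values for alpha/2 = 0.025, keyed by degrees of freedom.
# Only the rows needed for n = 17..20 are included (an assumption of this
# sketch; a wider search would need more of the table).
T_025 = {16: 2.120, 17: 2.110, 18: 2.101, 19: 2.093}
Z_025 = 1.96


def required_n(crit, s, target):
    """Right-hand side of (SS-3): (crit * s / target) squared."""
    return (crit * s / target) ** 2


def smallest_consistent_n(s, target, t_table):
    """Start with z in place of t; fall back to the t-table if n <= 30."""
    first_guess = required_n(Z_025, s, target)
    if first_guess > 30:
        return math.ceil(first_guess)  # large sample case applies
    # Small sample case: try each n covered by the table, smallest first,
    # and accept the first n that satisfies (SS-3) with its own t value.
    for n in sorted(df + 1 for df in t_table):
        if n >= required_n(t_table[n - 1], s, target):
            return n
    return None  # no tabulated n was large enough


print(smallest_consistent_n(1.623, 0.5, T_025))
print(smallest_consistent_n(1.623, 0.8, T_025))
```

With s = 1.623, the first call falls in the large sample case and the second walks through n = 17, 18, 19 as in the table above, stopping at the first n that meets its own requirement.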