MATH 2441 Probability and Statistics for Biological Sciences

Analysis of Variance (ANOVA) (Comparison of Many Population Means)

© David W. Sabo (2000)

Analysis of Variance, or ANOVA, is the name given to a rather extensive set of statistical techniques used to infer differences in the means of three or more populations. The reason this somewhat unexpected name is applied to methods for comparing population means will become clear as we go along. In this document we present and illustrate the simplest situation addressed with ANOVA methods: the case in which the populations in question are distinguished by different instances of a single feature. Further study of the methods is deferred until MATH 3441 for students in the Food Technology Program.

An Example

These days, people are quite interested in reducing the amount of fat in their diet. One of the foods often seen as a source of avoidable fat is eggs, particularly the egg yolk. Suppose that a technologist wishes to determine whether the amount of fat in the egg yolk can be influenced by the diet of the hens. She prepares four different blends of food: blend 1, blend 2, blend 3, and blend 4. Each of a group of hens is randomly assigned one of the food blends for a period of several weeks. Then five eggs are selected at random from the hens fed each blend of food, and the percent fat in the yolk of each selected egg is measured. The data obtained is:

                          blend 1   blend 2   blend 3   blend 4
egg #1                       36.3      34.5      28.3      31.6
egg #2                       34.1      39.1      28.0      36.3
egg #3                       33.4      34.3      30.9      32.5
egg #4                       33.9      30.3      29.6      41.5
egg #5                       39.4      31.8      28.6      33.1
sample size (n)                 5         5         5         5
sample mean                 35.42      34.0     29.08      35.0
sample standard deviation   2.487     3.350     1.182     4.042

For convenience, we've also included some statistics in this table: the sample sizes (all equal to 5), the sample means, and the sample standard deviations.

Now, this is a problem involving four populations of egg-laying hens, the populations being distinguished by the identity of the blend of food eaten by their members. The different food blends here are examples of different levels of the factor "food" in the experiment. Because just one factor distinguishes the populations in the problem, our analysis below is an example of single-factor ANOVA or one-way ANOVA. (If we had also distinguished hens on the basis of, say, variety of hen, as well as on the basis of which diet they were fed, then we would have a two-factor experiment. It is not uncommon for experimental studies to involve two or three different factors, but the calculations and interpretation of results become much more difficult as the number of factors increases. In this document we consider only single-factor situations.)

The question to be answered here is whether or not the data above is evidence that the true mean percentage fat in egg yolks is different for hens eating different ones of the four blends of food used in the experiment. We see that some of the sample means are quite similar, whereas others are quite different. The basic issue is: are any of the differences in the sample means obtained from this data big enough to infer a statistically significant difference in the corresponding population means?
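As a quick check on the summary statistics in the table, the following minimal sketch (written in Python with NumPy, which the original notes do not use) reproduces the four sample means and sample standard deviations from the raw yolk-fat percentages.

```python
import numpy as np

# Illustrative sketch (not from the original notes): summary statistics for the egg-yolk data
blends = {
    "blend 1": [36.3, 34.1, 33.4, 33.9, 39.4],
    "blend 2": [34.5, 39.1, 34.3, 30.3, 31.8],
    "blend 3": [28.3, 28.0, 30.9, 29.6, 28.6],
    "blend 4": [31.6, 36.3, 32.5, 41.5, 33.1],
}

for name, values in blends.items():
    x = np.array(values)
    # ddof=1 gives the sample standard deviation (divide by n - 1)
    print(f"{name}: n = {x.size}, mean = {x.mean():.2f}, s = {x.std(ddof=1):.3f}")
```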
First Try: Looking at the Populations One-at-a-Time

To compare the true means of the four populations in this example, you might think of starting by constructing, say, 95% confidence interval estimates of each mean. If we use the symbols $\mu_1$, $\mu_2$, $\mu_3$, and $\mu_4$ to denote the true mean fat percentages of egg yolks from hens fed, respectively, blends 1, 2, 3, or 4, and similar subscripts on the symbols $\bar{x}$ and $s$ to denote the corresponding sample means and standard deviations, then the formula for these interval estimates is

\[ \mu_k:\quad \bar{x}_k \pm t_{\alpha/2,\,n_k-1}\,\frac{s_k}{\sqrt{n_k}} \qquad @\ 100(1-\alpha)\% \tag{1} \]

This formula is valid only if the populations from which the samples were selected are approximately normally distributed. We have no evidence supporting or opposing that assumption because the samples are so small. However, this condition is a requirement of all methods considered in this document, so we will simply assume it is met.

Using $\alpha = 0.05$, and noting that all four sample sizes $n_k$ are equal to 5, the t-table gives the critical value $t_{0.025,4} = 2.776$. Then, for instance, for $k = 1$ (the first population/sample), we get

\[ \mu_1:\quad \bar{x}_1 \pm t_{0.025,4}\,\frac{s_1}{\sqrt{n_1}} = 35.42 \pm 2.776\,\frac{2.487}{\sqrt{5}} = 35.42 \pm 3.09 \qquad @\ 95\% \]

It is probably more useful to write this in the form of an actual interval:

\[ 32.33 \leq \mu_1 \leq 38.51 \qquad @\ 95\% \]

In a similar manner, we can also determine that

\[ 29.84 \leq \mu_2 \leq 38.16 \qquad @\ 95\% \]
\[ 27.61 \leq \mu_3 \leq 30.55 \qquad @\ 95\% \]
\[ 29.98 \leq \mu_4 \leq 40.02 \qquad @\ 95\% \]

[Figure: the four confidence intervals sketched side by side on a percent-fat axis running from about 26 to 41.]

You can see that the only non-overlapping intervals are those for $\mu_1$ and for $\mu_3$; all of the others overlap to a greater or lesser degree. Based on these calculations, we could thus say that it appears as if eggs from hens fed blend 3 have less fat in the yolks than do the yolks of eggs from hens fed blend 1, but we aren't able to make any other definite statements.

However, you need to realize that the interval estimates computed above are rather crude. They are each based on quite small samples, since each estimate results from just five observations. While there is a 95% probability that each of these intervals does capture the true mean value being estimated, the precision of the estimates is quite poor. With this approach the best we can say is that if two intervals do not overlap, then very likely the corresponding population means are unequal. However, if two intervals do overlap somewhat, we are not able to draw any conclusion about the relative values of the corresponding population means. Because of these defects, the single-population approach illustrated above is not a recommended method.
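The four interval estimates can be reproduced by applying formula (1) to each sample. The sketch below is an illustration in Python with SciPy (not part of the original notes, which use t-tables instead); the data values are those in the table above.

```python
import numpy as np
from scipy import stats

# Illustrative sketch (not from the original notes): 95% t intervals from formula (1)
blends = {
    "blend 1": [36.3, 34.1, 33.4, 33.9, 39.4],
    "blend 2": [34.5, 39.1, 34.3, 30.3, 31.8],
    "blend 3": [28.3, 28.0, 30.9, 29.6, 28.6],
    "blend 4": [31.6, 36.3, 32.5, 41.5, 33.1],
}

alpha = 0.05
for name, values in blends.items():
    x = np.array(values)
    n = x.size
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)      # t_{0.025, 4} = 2.776 here
    half_width = t_crit * x.std(ddof=1) / np.sqrt(n)
    print(f"{name}: {x.mean() - half_width:.2f} <= mu <= {x.mean() + half_width:.2f}  @95%")
```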
A Second Try: Comparing the Population Means Pair-wise

We can get somewhat better results if we apply the procedures developed earlier for comparing the means of two populations to each pair of means that arises from the four population means in this example. There are six pairs to consider in all: $\mu_1 - \mu_2$, $\mu_1 - \mu_3$, $\mu_1 - \mu_4$, $\mu_2 - \mu_3$, $\mu_2 - \mu_4$, and $\mu_3 - \mu_4$. The samples are small and independent, so we must assume that the populations are all approximately normally distributed, and that their variances are equal. This latter assumption is dubious in a couple of pairings involving population 3, but to avoid undue complexity, we will proceed on that second assumption as well.

The procedure should be familiar by now, but we will review it briefly for one pair: $\mu_1 - \mu_2$. We must first form the pooled standard deviation, $s_p$:

\[ s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2} = \frac{(5-1)(2.487)^2 + (5-1)(3.350)^2}{5+5-2} = 8.7038 \tag{2} \]

Thus $s_p = \sqrt{s_p^2} = \sqrt{8.7038} = 2.950$.

Now, one way to determine whether the data will support a conclusion that $\mu_1 - \mu_2 \neq 0$ (that is, that the two population means are different) would be to test the hypotheses

\[ H_0:\ \mu_1 - \mu_2 = 0 \qquad \text{vs.} \qquad H_A:\ \mu_1 - \mu_2 \neq 0 \]

The appropriate standardized test statistic is

\[ t = \frac{\bar{x}_1 - \bar{x}_2 - 0}{s_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}} = \frac{35.42 - 34.00 - 0}{2.950\sqrt{\dfrac{1}{5}+\dfrac{1}{5}}} = 0.761 \]

Now, this is a two-tailed hypothesis test, and so $H_0$ can be rejected at a level of significance of 0.05 if $|t| > t_{0.025,8} = 2.306$ (remember, the degrees of freedom are now $n_1 + n_2 - 2 = 5 + 5 - 2 = 8$). Clearly that condition is not met here, since $|0.761| = 0.761$ is not greater than 2.306. Thus we conclude that the data is not adequate evidence, at a level of significance of 0.05, that $\mu_1$ is different from $\mu_2$.

The same conclusion could have been obtained by constructing a 95% confidence interval estimate of $\mu_1 - \mu_2$:

\[ \mu_1 - \mu_2:\quad \bar{x}_1 - \bar{x}_2 \pm t_{\alpha/2,\,n_1+n_2-2}\; s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}} = 35.42 - 34.0 \pm 2.306\,(2.950)\sqrt{\frac{1}{5}+\frac{1}{5}} = 1.42 \pm 4.30 \qquad @\ 95\% \tag{3} \]

or, in the form of an interval,

\[ -2.88 \leq \mu_1 - \mu_2 \leq 5.72 \qquad @\ 95\% \]

Clearly, since this interval straddles the value zero, we are unable to eliminate the possibility that the two population means are equal, a result echoing the one obtained using the hypothesis testing approach just above.

The results of these calculations for all six pairs of populations are shown in the following table:

comparison       t        (p-value)     95% confidence interval estimate
μ1 - μ2         0.761     (0.47)         -2.883 <= μ1 - μ2 <=  5.723
μ1 - μ3         5.148     (0.00088)*      3.500 <= μ1 - μ3 <=  9.180
μ1 - μ4         0.198     (0.85)         -4.475 <= μ1 - μ4 <=  5.315
μ2 - μ3         3.097     (0.015)*        1.257 <= μ2 - μ3 <=  8.583
μ2 - μ4        -0.426     (0.68)         -6.414 <= μ2 - μ4 <=  4.414
μ3 - μ4        -3.143     (0.014)*      -10.263 <= μ3 - μ4 <= -1.577

For any pair of means for which $|t| > 2.306$, we can reject the null hypothesis that the population means are equal at a level of significance of 0.05 (these rows are marked with an asterisk).

[Figure: the six confidence intervals for the pairwise differences sketched on a common axis running from about -15 to 15, to give a visual impression of their relative sizes and positions. Intervals which do not straddle zero correspond to pairs of populations for which the difference of the means is non-zero at this confidence level.]

Now, from both the hypothesis test results and (equivalently) the interval estimates of the differences of the various population means, we see that three pairs of population means are different to a statistically significant degree here: $\mu_1$ and $\mu_3$, $\mu_2$ and $\mu_3$, and $\mu_3$ and $\mu_4$. The confidence intervals of the corresponding differences are the ones which do not straddle zero in the figure just above. It looks like these results allow us to say that feed blend 3 results in egg yolks with a statistically significantly lower fat content than does any of the other three diets, whereas there is no statistically significant difference in fat content among the yolks of eggs from hens fed blends 1, 2, or 4. This is probably a useful result as far as this specific problem goes.
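The six rows of this table can be reproduced with a short script. The following sketch (Python with SciPy, again not part of the original notes) computes the pooled-variance t statistic, the two-tailed p-value, and the 95% confidence interval of formulas (2) and (3) for each pair.

```python
import numpy as np
from itertools import combinations
from scipy import stats

# Illustrative sketch (not from the original notes): all six pairwise pooled-variance comparisons
samples = {
    1: np.array([36.3, 34.1, 33.4, 33.9, 39.4]),
    2: np.array([34.5, 39.1, 34.3, 30.3, 31.8]),
    3: np.array([28.3, 28.0, 30.9, 29.6, 28.6]),
    4: np.array([31.6, 36.3, 32.5, 41.5, 33.1]),
}

alpha = 0.05
for a, b in combinations(samples, 2):
    x, y = samples[a], samples[b]
    n1, n2 = x.size, y.size
    df = n1 + n2 - 2
    # pooled variance, formula (2)
    sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / df
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = x.mean() - y.mean()
    t = diff / se
    p = 2 * stats.t.sf(abs(t), df)               # two-tailed p-value
    t_crit = stats.t.ppf(1 - alpha / 2, df)      # 2.306 for df = 8
    print(f"mu{a} - mu{b}: t = {t:6.3f}, p = {p:.4f}, "
          f"95% CI = ({diff - t_crit * se:.3f}, {diff + t_crit * se:.3f})")
```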
However, there is a fairly serious defect with this approach. We tested six separate hypotheses and were able to reject three of them at a level of significance of 0.05 (in this example). We can refer to the 0.05 as the pairwise level of significance, since it represents the probability that the rejection of the null hypothesis for any one pair of means will be mistaken. A more useful measure of the reliability of our analysis is the experiment-wise level of significance, $\alpha_{EW}$: the probability that at least one of the rejections of $H_0$ is in error. In the specific problem above, where we declared three null hypotheses rejected at a pairwise level of significance of 0.05, this probability is

\[ \Pr(\text{at least one error}) = 1 - \Pr(\text{no errors were made}) = 1 - (0.95)^3 \approx 0.1426 \]

Here it was easier to calculate the probability of no errors being made than of "at least one error," since the event "at least one error" corresponds to the compound event "one error or two errors or three errors." If the probability of error for one pair is 0.05, then the probability of not making an error on that pair is 0.95. Anyway, the point here is that the probability that we made at least one error in rejecting three of the null hypotheses is, in principle, quite a bit larger than the value of $\alpha$ used for each pairwise comparison. (For this particular example with the specific data given, the situation is not as bad as indicated: from the p-values calculated for the three rejected null hypotheses, we get what amounts to an experiment-wise p-value of 0.029. However, you cannot assume that such favourable numbers will always occur in applications, and so the "worst case" must be considered.)

When people speak of $\alpha_{EW}$, they normally have in mind the value that would result if all possible pairwise null hypotheses were rejected, each at a level of significance $\alpha$, and this gives surprisingly large values of $\alpha_{EW}$ even when a relatively small number of populations is considered. To see this, note first that if we are dealing with k populations, then the number of distinct pairs of populations to be compared, C, is given by

\[ C = \frac{k(k-1)}{2} \]

For instance, in the example above we had k = 4, which results in C = 4(3)/2 = 6 pairs of populations. Then, if each pairwise comparison is done via a hypothesis test with a level of significance $\alpha$, we get

\[ \alpha_{EW} = 1 - (1-\alpha)^C \tag{4} \]

For $\alpha$ = 0.05, we get the following sort of results:

  k      C      alpha_EW
  3      3      0.143
  4      6      0.265
  5     10      0.401
  6     15      0.537
  7     21      0.659
  8     28      0.762
  9     36      0.842
 10     45      0.901
 15    105      0.995

Again, the numbers in the third column of this table are an upper limit to the true probability of making at least one error in comparing the means of all possible pairs of populations. However, because these values of $\alpha_{EW}$ potentially get so much larger than $\alpha$ so quickly as the number of populations increases, this strict pairwise approach to comparing the populations in this sort of problem is not considered sound.

The thing is, these two approaches pretty well exhaust the ways of using the sample means directly to detect differences between the corresponding population means. It is by looking in more detail at the various contributions to the variance of the data that we get a method which more reliably detects the presence of differences between the means of several populations; hence the name analysis of variance, or ANOVA.
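A minimal sketch of formula (4) in Python (not part of the original notes) reproduces the table of worst-case experiment-wise error rates above.

```python
# Illustrative sketch (not from the original notes): worst-case experiment-wise error rate, formula (4)
alpha = 0.05
print(" k    C    alpha_EW")
for k in [3, 4, 5, 6, 7, 8, 9, 10, 15]:
    C = k * (k - 1) // 2                 # number of distinct pairs of populations
    alpha_ew = 1 - (1 - alpha) ** C      # probability of at least one mistaken rejection
    print(f"{k:2d}  {C:3d}    {alpha_ew:.3f}")
```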
Single-Factor Analysis of Variance

Because the eventual test statistic calculated here depends on all of the data in the experiment, we need to develop notation that can keep track of the details adequately. Some of the following notation has already been established.

k = the number of populations being considered

$n_j$ = the size of the random sample selected from population #j. There are k of these: $n_1$, $n_2$, and so on up to $n_k$.

$n_T$ = the total number of observations in all samples together (the sum of the $n_j$ values)

$x_{ij}$ = observation #i in sample #j. For sample #j there are $n_j$ of these: $x_{1j}, x_{2j}, x_{3j}, \ldots, x_{n_j j}$.

$\bar{x}_j$ = the mean of sample #j:

\[ \bar{x}_j = \frac{1}{n_j}\sum_{m=1}^{n_j} x_{mj} \tag{5} \]

$\bar{\bar{x}}$ = the mean of all of the observations in all of the samples. Note that this is not equal to the mean of the sample means unless all samples are the same size.

$s_j^2$ = the variance of sample #j:

\[ s_j^2 = \frac{\sum_{m=1}^{n_j} x_{mj}^2 \;-\; n_j\,\bar{x}_j^2}{n_j-1} \tag{6} \]

We'll introduce other symbols as we go along. We note three assumptions that underlie ANOVA methods:

(i) the k populations are assumed to be approximately normally distributed
(ii) the variances of all k populations are assumed to be equal
(iii) the samples from the k populations are independent

It is also a good idea to try to have all sample sizes equal, but there are ways to deal with unequal sample sizes. From these conditions, you can see that ANOVA methods are designed to distinguish between normally distributed populations which differ simply in their mean values.

The principle that ANOVA exploits is based on the following notion. If all k populations have essentially identical means, then the k samples will essentially be sampling identical populations. If the k samples are pooled to form one large set of $n_T$ observations, the mean and variance of that larger set will be very similar to the means and variances of the individual samples. On the other hand, if the k populations have very different mean values, then the variance of the pooled data will be much larger than the variances of the individual samples, because most of the observations will be quite different from the pooled mean, $\bar{\bar{x}}$. We will come up with a single number that "measures" the difference between these two extreme situations.
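To make the notation concrete, here is a small sketch (Python, not part of the original notes) that computes $n_T$ and the grand mean $\bar{\bar{x}}$ for the egg-yolk data. It also illustrates the remark above: the grand mean is the mean of the pooled observations, and it coincides with the mean of the four sample means only because all the $n_j$ are equal here.

```python
import numpy as np

# Illustrative sketch (not from the original notes): n_T and the grand mean for the egg-yolk data
samples = [
    np.array([36.3, 34.1, 33.4, 33.9, 39.4]),
    np.array([34.5, 39.1, 34.3, 30.3, 31.8]),
    np.array([28.3, 28.0, 30.9, 29.6, 28.6]),
    np.array([31.6, 36.3, 32.5, 41.5, 33.1]),
]

n_T = sum(x.size for x in samples)                    # total number of observations
grand_mean = np.concatenate(samples).mean()           # mean of all n_T observations pooled together
mean_of_means = np.mean([x.mean() for x in samples])  # mean of the k sample means

print(f"n_T = {n_T}, grand mean = {grand_mean:.3f}")
print(f"mean of the sample means = {mean_of_means:.3f}  (same here only because all n_j are equal)")
```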
The effect described in the preceding discussion shows up if we look at just the numerator in the formula for the variance, which simplifies the algebra considerably. If we put all of the data together into one big list, it will have the mean value $\bar{\bar{x}}$, and its variance, $s_{TOT}^2$, is given by the formula

\[ s_{TOT}^2 = \frac{\displaystyle\sum_{j=1}^{k}\sum_{m=1}^{n_j}\left(x_{mj}-\bar{\bar{x}}\right)^2}{n_T-1} \tag{7} \]

We will work with just the numerator of the rather forbidding expression on the right-hand side, which will be identified by the symbol SST, for "Sum of Squares Total," since it consists of a sum of squares of terms taken over the total collection of data:

\[ SST = \sum_{j=1}^{k}\sum_{m=1}^{n_j}\left(x_{mj}-\bar{\bar{x}}\right)^2 \tag{8} \]

Here the inner sum, over m, adds up the values of $\left(x_{mj}-\bar{\bar{x}}\right)^2$ for all of the observations $x_{mj}$ in sample #j, and the outer sum, over j, makes sure that the sum includes all k samples. The double summation notation may be a bit intimidating, but it simply says that to form the sum in SST over all observations in the experiment, you can sum over all observations in each sample, and then combine the sums from the individual samples into one grand total.

Now, do the following to the expression inside the brackets in formula (8):

\[ SST = \sum_{j=1}^{k}\sum_{m=1}^{n_j}\left(x_{mj}-\bar{x}_j+\bar{x}_j-\bar{\bar{x}}\right)^2 \tag{9} \]

All we've done is subtract $\bar{x}_j$, the mean of the jth sample, and then add it in again. There is no net effect on the value of SST, since we are subtracting and adding the same quantity, but this form allows us to group things in a new, useful way. We can insert brackets to indicate how this grouping will be done:

\[ SST = \sum_{j=1}^{k}\sum_{m=1}^{n_j}\Big[\left(x_{mj}-\bar{x}_j\right)+\left(\bar{x}_j-\bar{\bar{x}}\right)\Big]^2 \tag{10} \]

Now we can carry out the squaring of the expression inside the square brackets, keeping the terms inside the round brackets intact. This gives

\[ SST = \sum_{j=1}^{k}\sum_{m=1}^{n_j}\Big[\left(x_{mj}-\bar{x}_j\right)^2 + 2\left(x_{mj}-\bar{x}_j\right)\left(\bar{x}_j-\bar{\bar{x}}\right) + \left(\bar{x}_j-\bar{\bar{x}}\right)^2\Big] \]

Now, it doesn't matter whether we sum the three terms inside the square brackets for each value of m and then add those results over all values of m, or sum each term separately over m, getting three intermediate results, and then add those three intermediate values to get the overall inner sum. Thus the formula above is equivalent to

\[ SST = \sum_{j=1}^{k}\left[\sum_{m=1}^{n_j}\left(x_{mj}-\bar{x}_j\right)^2 + 2\sum_{m=1}^{n_j}\left(x_{mj}-\bar{x}_j\right)\left(\bar{x}_j-\bar{\bar{x}}\right) + \sum_{m=1}^{n_j}\left(\bar{x}_j-\bar{\bar{x}}\right)^2\right] \]

Further, all of the terms in the last two of these inner sums contain common factors which do not depend on m, and these can be factored out of the sums to get

\[ SST = \sum_{j=1}^{k}\left[\sum_{m=1}^{n_j}\left(x_{mj}-\bar{x}_j\right)^2 + 2\left(\bar{x}_j-\bar{\bar{x}}\right)\sum_{m=1}^{n_j}\left(x_{mj}-\bar{x}_j\right) + \left(\bar{x}_j-\bar{\bar{x}}\right)^2\sum_{m=1}^{n_j}1\right] \tag{11} \]

Now we can deal with each of the terms in the square brackets quite easily. The first term is nothing but the numerator of the formula for the variance of the sample from the jth population. Since

\[ s_j^2 = \frac{\sum_{m=1}^{n_j}\left(x_{mj}-\bar{x}_j\right)^2}{n_j-1}, \qquad \text{then} \qquad \sum_{m=1}^{n_j}\left(x_{mj}-\bar{x}_j\right)^2 = \left(n_j-1\right)s_j^2 \tag{12} \]

The second term takes a bit more work, but gives an even simpler result. First note that

\[ \sum_{m=1}^{n_j} 1 = n_j \tag{13} \]

since we are just adding 1 to itself for every value of m between 1 and $n_j$. Furthermore, by definition,

\[ \bar{x}_j = \frac{1}{n_j}\sum_{m=1}^{n_j} x_{mj} \]

Thus

\[ \sum_{m=1}^{n_j}\left(x_{mj}-\bar{x}_j\right) = \sum_{m=1}^{n_j}x_{mj} - \sum_{m=1}^{n_j}\bar{x}_j = \sum_{m=1}^{n_j}x_{mj} - n_j\,\bar{x}_j = \sum_{m=1}^{n_j}x_{mj} - \sum_{m=1}^{n_j}x_{mj} = 0 \tag{14} \]

and so the second term in the square brackets of equation (11) is just equal to zero. Finally, the third term in the square brackets of (11) is easily handled using the result (13) just above:

\[ \left(\bar{x}_j-\bar{\bar{x}}\right)^2\sum_{m=1}^{n_j}1 = n_j\left(\bar{x}_j-\bar{\bar{x}}\right)^2 \tag{15} \]

So, putting results (12), (14), and (15) into (11), we end up with

\[ SST = \sum_{j=1}^{k}\left(n_j-1\right)s_j^2 + \sum_{j=1}^{k}n_j\left(\bar{x}_j-\bar{\bar{x}}\right)^2 \tag{16} \]

This may have seemed like an awful lot of complex algebra which hasn't really produced much of a simplification of the formula for SST. We displayed the steps in the simplification of (11) in some detail so that you could see how this sort of thing is done, and so you could see that, despite the somewhat sophisticated notation that is necessary, the step-by-step algebra is quite routine. Further, you should see that once you've calculated the individual sample means and standard deviations, as well as the grand mean of all observations in all samples, it is not too difficult to calculate SST from formula (16). However, that by itself is not really the goal of deriving formula (16). The immense value of formula (16) is that its two terms reflect the two extremes that may occur in an experiment of this type.
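Before turning to the two extreme situations, here is a quick numerical check of the decomposition (16) on the egg-yolk data (a Python sketch, not part of the original notes): the two terms on the right-hand side of (16) are computed separately, and their sum is compared with SST computed directly from definition (8).

```python
import numpy as np

# Illustrative sketch (not from the original notes): verify SST decomposition (16) on the egg-yolk data
samples = [
    np.array([36.3, 34.1, 33.4, 33.9, 39.4]),
    np.array([34.5, 39.1, 34.3, 30.3, 31.8]),
    np.array([28.3, 28.0, 30.9, 29.6, 28.6]),
    np.array([31.6, 36.3, 32.5, 41.5, 33.1]),
]

pooled = np.concatenate(samples)
grand_mean = pooled.mean()

# SST computed directly from definition (8): squared deviations of every observation from the grand mean
sst = ((pooled - grand_mean) ** 2).sum()

# The two terms on the right-hand side of (16)
first_term = sum((x.size - 1) * x.var(ddof=1) for x in samples)            # within-sample part
second_term = sum(x.size * (x.mean() - grand_mean) ** 2 for x in samples)  # between-sample part

print(f"SST (direct)        = {sst:.4f}")
print(f"first + second term = {first_term + second_term:.4f}  ({first_term:.3f} + {second_term:.4f})")
```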
Extreme Situation #1: This is the situation in which all of the k populations have more or less equal means (so that the differences between their means are small compared to their variances). Then we would expect all of the individual sample means $\bar{x}_j$ to be very nearly equal, and also nearly equal to the grand mean $\bar{\bar{x}}$. If this is the case, then the second sum in (16) is a sum of very small values, and so will itself be small compared to the sum of variances in the first term. Thus, in the extreme situation that the means of the k populations are essentially identical, we expect the first term on the right-hand side of (16) to be much, much larger than the second term.

Extreme Situation #2: This is the situation in which the means of many of the k populations are very different: separated by intervals which are large compared to their standard deviations (another way of saying that the bell curves representing the distributions of the k populations do not come close to overlapping). Then the individual sample means will be very different from each other, and very different from the grand mean. This means that the sum in the second term on the right-hand side of (16) will be a sum of large numbers (since it involves the squares of large numbers, which are even larger). In this situation, the second term on the right-hand side of (16) will be dominant.

We're looking for a measure that will distinguish the situation in which all k populations have very similar means from the situation in which at least one of the populations has a mean value quite different from all the others. We see now that these two sorts of situations are distinguished by the relative values of the two terms on the right-hand side of (16). To make the method precise, people introduce some additional notation, starting with symbols for each of the terms in (16):

\[ SST = SSW + SSB \]

where

\[ SSW = \sum_{j=1}^{k}\left(n_j-1\right)s_j^2 \tag{17} \]

measures the amount of scatter or spread within the individual populations themselves (hence the notation SSW, from "Sum of Squares Within"). If the individual populations are themselves very scattered, it becomes more difficult to detect a difference in their means, and so SSW is a relevant quantity here. Similarly,

\[ SSB = \sum_{j=1}^{k}n_j\left(\bar{x}_j-\bar{\bar{x}}\right)^2 \tag{18} \]

measures the amount of separation between the individual population means (hence the notation SSB, from "Sum of Squares Between"). SSB will tend to have larger values if at least some of the populations have mean values which are quite different from the mean values of other populations.

There is one problem with working with SSB and SSW directly. Since every observation in an experiment contributes a non-negative amount to each of SSB and SSW, simply increasing the sample sizes will tend to increase the values of both quantities, and so when we state that one or the other is "big," it is unclear whether the bigness is due to the properties of the populations under study or simply due to our taking larger samples. To solve this problem, we don't use SSB and SSW directly, but rather the corresponding quantities

\[ MSW = \frac{SSW}{n_T-k} \qquad \text{and} \qquad MSB = \frac{SSB}{k-1} \tag{19} \]

(where the M stands for "Mean"). By dividing by these denominators, which reflect the sample sizes and the number of populations, we compensate for the effect of simply including more observations in our calculations.

Finally, we can now state that the ratio

\[ F = \frac{MSB}{MSW} \tag{20} \]

is a random variable which (when the k population means are all equal and the three ANOVA assumptions hold) has the F-distribution with numerator degrees of freedom equal to $k-1$ and denominator degrees of freedom equal to $n_T-k$. From the discussion so far, we know that when one or more of the populations has a mean value much different from the others, MSB will be large relative to MSW, and so F will have a large value; whereas if all k populations have more or less equal means, MSB and MSW will tend to be of comparable size, and F will have a small value, close to 1.
Now we can state the solution of the problem of detecting the presence of differences in the means of many populations as a simple hypothesis test:

\[ H_0:\ \mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k \qquad \text{vs.} \qquad H_A:\ \text{at least one of these means is different from the others} \]

Then, subject to the three conditions underlying all ANOVA methods listed earlier in this document, the test statistic for these hypotheses is the ratio F = MSB/MSW of formula (20), and $H_0$ can be rejected at a level of significance $\alpha$ if

\[ F > F_{\alpha} \tag{21} \]

where $F_{\alpha}$ is the value cutting off a right-hand tail of area $\alpha$ in the F-distribution with $k-1$ numerator degrees of freedom and $n_T-k$ denominator degrees of freedom. It's just that simple. The p-value for this test is computed as the right-hand tail area to the right of the value of F calculated using formula (20).

Normally, the calculations leading up to this hypothesis test are organized into a standard form called an ANOVA table:

Source of Variation    Sum of Squares      Degrees of Freedom    Mean Square    F
Between                SSB                 k - 1                 MSB            MSB/MSW
Within                 SSW                 n_T - k               MSW
Total                  SST = SSB + SSW     n_T - 1

Formulas or symbols in the table are replaced by the numerical values they generate when an actual problem is being solved.

Example

To illustrate the use of the formulas above, we apply this analysis to the example introduced at the beginning of this document. Since the example includes data for samples of five items from each of four populations, we have k = 4 (the number of populations being compared) and $n_T$ = 20 (the total number of observations available). This allows us to fill in the 'degrees of freedom' column of the ANOVA table with the values 3, 16, and 19, in order from top to bottom. The sample means and standard deviations are given in the table at the beginning of this document. Thus

\[ SSW = (n_1-1)s_1^2 + (n_2-1)s_2^2 + (n_3-1)s_3^2 + (n_4-1)s_4^2 \]
\[ = (5-1)(2.487)^2 + (5-1)(3.350)^2 + (5-1)(1.182)^2 + (5-1)(4.042)^2 = 140.576 \]

Similarly, since the mean of all 20 observations works out to be 33.375, we get for SSB:

\[ SSB = n_1\left(\bar{x}_1-\bar{\bar{x}}\right)^2 + n_2\left(\bar{x}_2-\bar{\bar{x}}\right)^2 + n_3\left(\bar{x}_3-\bar{\bar{x}}\right)^2 + n_4\left(\bar{x}_4-\bar{\bar{x}}\right)^2 \]
\[ = (5)(35.42-33.375)^2 + (5)(34.0-33.375)^2 + (5)(29.08-33.375)^2 + (5)(35.0-33.375)^2 = 128.3015 \]

For what it's worth, then, SST = SSB + SSW = 268.8775. Finally,

\[ MSB = \frac{SSB}{k-1} = \frac{128.3015}{3} = 42.7672 \qquad \text{and} \qquad MSW = \frac{SSW}{n_T-k} = \frac{140.576}{20-4} = 8.786 \]

and so

\[ F = \frac{MSB}{MSW} = \frac{42.7672}{8.786} = 4.868 \]

Putting all of these results into the ANOVA table form gives the summary:

Source of Variation    Sum of Squares    Degrees of Freedom    Mean Square    F
Between                128.3015          3                     42.7672        4.868
Within                 140.576           16                    8.786
Total                  268.8775          19

From the F-distribution tables, we have $F_{0.05,3,16}$ = 3.24. Thus, since 4.868 > 3.24, the rejection criterion is met. We can reject $H_0$ at a level of significance of 0.05, concluding that at least one of the population means in this problem is significantly different from the others. In fact, the function call FDIST(4.868, 3, 16) in Excel gives the p-value for this test as 0.0136. This concludes the analysis.

Having established that at least one of the population means differs from the others, the next logical step is to try to determine which one(s) are the different one(s). This is a fairly difficult problem in itself, but many methods have been developed for attempting to solve it. Such multiple comparison methods are beyond the scope of this course, however; we do look at three or four approaches in MATH 3441.
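Finally, the whole calculation can be cross-checked in a few lines. The sketch below (Python with SciPy; the original notes use F-tables and Excel's FDIST instead) rebuilds the ANOVA table quantities and obtains the critical value and p-value from the F-distribution, with SciPy's built-in one-way ANOVA as a cross-check.

```python
import numpy as np
from scipy import stats

# Illustrative sketch (not from the original notes): one-way ANOVA for the egg-yolk data
samples = [
    np.array([36.3, 34.1, 33.4, 33.9, 39.4]),   # blend 1
    np.array([34.5, 39.1, 34.3, 30.3, 31.8]),   # blend 2
    np.array([28.3, 28.0, 30.9, 29.6, 28.6]),   # blend 3
    np.array([31.6, 36.3, 32.5, 41.5, 33.1]),   # blend 4
]

k = len(samples)
n_T = sum(x.size for x in samples)
grand_mean = np.concatenate(samples).mean()

ssw = sum((x.size - 1) * x.var(ddof=1) for x in samples)            # formula (17)
ssb = sum(x.size * (x.mean() - grand_mean) ** 2 for x in samples)   # formula (18)
msw = ssw / (n_T - k)                                               # formula (19)
msb = ssb / (k - 1)
F = msb / msw                                                       # formula (20)

F_crit = stats.f.ppf(0.95, k - 1, n_T - k)      # about 3.24, the tabulated F_{0.05,3,16}
p_value = stats.f.sf(F, k - 1, n_T - k)         # right-tail area beyond the observed F

print(f"SSB = {ssb:.4f}, SSW = {ssw:.3f}, MSB = {msb:.4f}, MSW = {msw:.3f}")
print(f"F = {F:.3f}, F_crit = {F_crit:.2f}, p-value = {p_value:.4f}")

# One-line cross-check with SciPy's built-in one-way ANOVA: same F statistic and p-value
print(stats.f_oneway(*samples))
```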