* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download One-Way Analysis of Variance: Comparing Several Means
Psychometrics wikipedia , lookup
History of statistics wikipedia , lookup
Bootstrapping (statistics) wikipedia , lookup
Taylor's law wikipedia , lookup
Degrees of freedom (statistics) wikipedia , lookup
Omnibus test wikipedia , lookup
Misuse of statistics wikipedia , lookup
Resampling (statistics) wikipedia , lookup
P1: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO April 17, 2003 T1: IXK 15:59 CHAPTER 22 (Alan Schein Photography/CORBIS) PB286I-22 P2: IXK/GDO One-Way Analysis of Variance: Comparing Several Means In this chapter we cover. . . Comparing several means The analysis of variance F test Using technology The idea of analysis of variance The ANOVA model Some details of ANOVA* The two-sample t procedures of Chapter 17 compare the means of two populations or the mean responses to two treatments in an experiment. Of course, studies don’t always compare just two groups. We need a method for comparing any number of means. EXAMPLE 22.1 Cars, SUVs, and pickup trucks Sport-utility vehicles (SUVs) and pickup trucks have replaced cars in many American driveways. SUVs and trucks are often larger and heavier than cars and so are under attack for reasons of safety and fuel economy. Do SUVs and trucks have lower gas mileage than cars? Table 22.1 contains data on the highway gas mileage (in miles per gallon) for 31 midsize cars, 31 SUVs, and 14 standard-size pickup trucks. The gas mileages and vehicle classifications are compiled by the Environmental Protection Agency.1 Because the cars are two-wheel-drive, the table describes only two-wheeldrive SUVs and trucks. The popular four-wheel-drive versions get poorer mileage. Figure 22.1 shows side-by-side stemplots of the gas mileages for the three types of vehicle. We used the same stems in all three for easier comparison. It does appear that gas mileage decreases as we move from cars to SUVs to pickups. 597 P1: IXK/GDO PB286I-22 598 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO T1: IXK April 17, 2003 15:59 CHAPTER 22 r One-Way Analysis of Variance: Comparing Several Means Here are the means, standard deviations, and five-number summaries for the three types of vehicle, from software: N Mean Median Midsize 31 27.903 27.000 SUV 31 22.677 22.000 Pickup 14 21.286 20.000 StDev Minimum Maximum Q1 Q3 2.561 23.000 33.000 26.000 30.000 3.673 17.000 29.000 19.000 26.000 2.758 17.000 26.000 19.750 23.500 We will use the mean to describe the center of the gas mileage distributions. Mean gas mileage goes down as we move from midsize cars to SUVs to pickup trucks. The differences among the means are not large. Are they statistically significant? TABLE 22.1 Highway gas mileage for 2003 model vehicles Midsize Cars Model Acura 3.5TL Audi A6 Audi A8 Buick Century Buick Regal Cadillac CTS Cadillac Seville Chevrolet Monte Carlo Chrysler Sebring Honda Accord Hyundai Sonata Hyundai XG350 Infiniti I35 Jaguar S-Type Jaguar Vanden Plas Kia Optima Lexus ES300 Lexus GS300 Lincoln LS Mercedes-Benz E320 Mercedes-Benz E500 Mitsubishi Diamante Mitsubishi Galant Nissan Altima Nissan Maxima Saab 9-5 Saturn L200 Saturn L300 Toyota Camry Volkswagen Passat Volvo S80 Sport-Utility Vehicles MPG 29 27 25 29 27 26 27 32 30 33 30 26 26 26 24 30 29 25 26 27 23 25 27 29 26 29 32 29 32 31 28 Model Cadillac Escalade Chevrolet Avalanche Chevrolet Blazer Chevrolet Suburban Chevrolet Tahoe Chevrolet Tracker Chevrolet Trailblazer Chrysler PT Cruiser Dodge Durango Ford Escape Ford Expedition Ford Explorer Honda CR-V Hyundai Santa Fe Infiniti QX4 Isuzu Axiom Isuzu Rodeo Jeep Grand Cherokee Jeep Liberty Kia Sorento Lincoln Navigator Mazda Tribute Mitsubishi Montero Nissan Pathfinder Nissan Xterra Saturn Vue Suzuki Vitara Toyota 4Runner Toyota Highlander Toyota Rav4 Volvo XC 90 Standard Pickup Trucks MPG 18 18 23 18 18 26 22 25 19 28 19 19 28 27 21 21 23 21 24 20 17 28 22 21 24 28 25 20 27 29 24 Model Chevrolet Silverado Dodge Dakota Dodge Ram Ford Explorer Ford F150 Ford Ranger GMC Sierra Lincoln Blackwood Mazda B2300 Mazda B3000 Mazda B4000 Nissan Frontier Toyota Tacoma Toyota Tundra MPG 20 19 20 20 20 26 20 17 26 22 21 23 25 19 P1: IXK/GDO PB286I-22 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO April 17, 2003 T1: IXK 15:59 Comparing several means 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 0 0 000 000000 00000 0 000000 000 0 000 0 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 Midsize 0 0000 000 00 0000 00 00 000 00 0 00 0000 0 SUV 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 0 00 00000 0 0 0 0 00 Pickup Figure 22.1 Side-by-side stemplots comparing the highway gas mileages of midsize cars, sport-utility vehicles, and standard-size pickup trucks, from Table 22.1. Comparing several means Call the mean highway gas mileages for the three populations of vehicles µ1 for midsize cars, µ2 for SUVs, and µ3 for pickups. The subscript reminds us which group a parameter or statistic describes. To compare these three population means, we might use the two-sample t test several times: r Test H0: µ1 = µ2 to see if the mean miles per gallon for midsize cars differs from the mean for SUVs. r Test H0: µ1 = µ3 to see if midsize cars differ from pickup trucks. r Test H0: µ2 = µ3 to see if SUVs differ from pickups. The weakness of doing three tests is that we get three P -values, one for each test alone. That doesn’t tell us how likely it is that three sample means are spread apart as far as these are. It may be that x 1 = 27.903 and x 3 = 21.286 are significantly different if we look at just two groups but not significantly different if we know that they are the smallest and the largest means in three groups. As we look at more groups, we expect the gap between the smallest and the largest sample mean to get larger. (Think of comparing the tallest and shortest person in larger and larger groups of people.) We can’t safely compare many parameters by doing tests or confidence intervals for two parameters at a time. The problem of how to do many comparisons at once with some overall measure of confidence in all our conclusions is common in statistics. This is the problem of multiple comparisons. Statistical methods for dealing with multiple comparisons usually have two steps: 1. An overall test to see if there is good evidence of any differences among the parameters that we want to compare. 2. A detailed follow-up analysis to decide which of the parameters differ and to estimate how large the differences are. multiple comparisons 599 P1: IXK/GDO PB286I-22 600 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO T1: IXK April 17, 2003 15:59 CHAPTER 22 r One-Way Analysis of Variance: Comparing Several Means The overall test, though more complex than the tests we met earlier, is often reasonably straightforward. The follow-up analysis can be quite elaborate. In our basic introduction to statistical practice, we will concentrate on the overall test, along with data analysis that points to the nature of the differences. The analysis of variance F test We want to test the null hypothesis that there are no differences among the mean highway gas mileages for the three vehicle types: H0: µ1 = µ2 = µ3 The alternative hypothesis is that there is some difference, that not all three population means are equal: Ha : not all of µ1 , µ2 , and µ3 are equal analysis of variance F test The alternative hypothesis is no longer one-sided or two-sided. It is “manysided,” because it allows any relationship other than “all three equal.” For example, Ha includes the case in which µ2 = µ3 but µ1 has a different value. The test of H0 against Ha is called the analysis of variance F test. Analysis of variance is usually abbreviated as ANOVA. The ANOVA F test is almost always carried out by software that reports the test statistic and its P-value. EXAMPLE 22.2 Cars, SUVs, and pickup trucks: ANOVA As we will soon see, software tells us that for the gas mileage data in Table 22.1, F = 31.61 with P-value P < 0.001. There is very strong evidence that the three types of vehicle do not all have the same mean gas mileage. The F test does not say which of the three means are significantly different. It appears from our preliminary data analysis that SUVs and pickups have similar fuel economy and that both get distinctly poorer mileage than midsize cars. The gap of more than 5 miles per gallon between midsize cars and the other two types of vehicle is large enough to be of practical interest. Our conclusion: There is strong evidence (P < 0.001) that the means are not all equal. The most important difference among the means is that midsize cars have better gas mileage than SUVs and pickups. Example 22.2 illustrates our approach to comparing means. The ANOVA F test (done by software) assesses the evidence for some difference among the population means. In most cases, we expect the F test to be significant. We would not undertake a study if we did not expect to find some effect. The formal test is nonetheless important to guard against being misled by chance variation. We will not do the formal follow-up analysis that is often the most useful part of an ANOVA study. Follow-up analysis would allow us to say which means differ and by how much, with (say) 95% confidence that all our conclusions are PB286-Moore-V5.cls QC: IXK/GDO April 17, 2003 T1: IXK 15:59 The analysis of variance F test correct. We rely instead on examination of the data to show what differences are present and whether they are large enough to be interesting. APPLY YOUR KNOWLEDGE 400 600 22.1 Do fruit flies sleep? Mammals and birds sleep. Insects such as fruit flies rest, but is this rest sleep? Biologists now think that insects do sleep. One experiment gave caffeine to fruit flies to see if it affected their rest. We know that caffeine reduces sleep in mammals, so if it reduces rest in fruit flies that’s another hint that the rest is really sleep. The paper reporting the study contains a graph similar to Figure 22.2 and states, “Flies given caffeine obtained less rest during the dark period in a dose-dependent fashion (n = 36 per group, P < 0.0001).”2 (a) The explanatory variable is amount of caffeine, in milligrams per milliliter of blood. The response variable is minutes of rest (measured by an infrared motion sensor) during a 12-hour dark period. Outline the design of this experiment. (b) The P-value in the report comes from the ANOVA F test. What means does this test compare? State in words the null and alternative hypotheses for the test in this setting. What do the graph and the statistical test together lead you to conclude? 200 0 PB286I-22 P2: IXK/GDO Rest (min) P1: IXK/GDO Control 1 mg/ml 2.5 mg/ml 5 mg/ml Caffeine Figure 22.2 Bar graph comparing the mean rest of fruit flies given different amounts of caffeine, for Exercise 22.1. 601 P1: IXK/GDO PB286I-22 602 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO April 17, 2003 T1: IXK 15:59 CHAPTER 22 r One-Way Analysis of Variance: Comparing Several Means 22.2 Road rage. A random-digit-dialing survey asked a random sample of 1382 drivers questions about their driving. Responses to a number of questions were then combined to yield a measure of “angry/threatening driving,” commonly called “road rage.” What driver characteristics go with road rage? There were no significant differences among races or levels of education. What about the effect of the driver’s age? Here are the mean responses for three age-groups:3 <30 yr 30–55 yr >55 yr 2.22 1.33 0.66 The report says that F = 34.96, with P < 0.01. (a) What are the null and alternative hypotheses for the ANOVA F test? Be sure to explain what means the test compares. (b) Based on the sample means and the F test, what do you conclude? Using technology Any technology used for statistics should perform analysis of variance. Figure 22.3 displays ANOVA output for the data of Table 22.1 from the Minitab statistical software, the Excel spreadsheet program, and the TI-83 graphing calculator. Minitab and Excel give the sizes of the three samples and their means. These agree with those in Example 22.1. The most important part of all three outputs reports the F test statistic, F = 31.61, and its P-value. Minitab sensibly reports P as 0.000, that is, 0 to three decimal places. Excel and the TI-83 show that P is about 0.0000000001. (The E-10 in these displays means move the decimal point 10 digits to the left.) There is very strong evidence that the three types of vehicle do not all have the same mean gas mileage. All three outputs also report sums of squares (SS) and mean squares (MS). We don’t need this information now. The optional section “Some Details of ANOVA” at the end of this chapter explains these terms. Minitab also gives confidence intervals for all three means that help us see which means differ and by how much. The SUV and pickup intervals overlap, and the midsize interval lies well above them on the mileage scale. These are 95% confidence intervals for each mean separately. We are not 95% confident that all three intervals cover the three means. This is another example of the peril of multiple comparisons. APPLY YOUR KNOWLEDGE 22.3 Logging in the rainforest. How does logging in a tropical rainforest affect the forest several years later? Researchers compared forest plots in Borneo that had never been logged (Group 1) with similar plots nearby P1: IXK/GDO PB286I-22 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO May 6, 2003 T1: IXK 21:23 Using technology Minitab Excel TI-83 Figure 22.3 Analysis of variance output for the gas mileage data from Minitab, Excel, and the TI-83. 603 P1: IXK/GDO PB286I-22 604 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO April 17, 2003 T1: IXK 15:59 CHAPTER 22 r One-Way Analysis of Variance: Comparing Several Means that had been logged 1 year earlier (Group 2) and 8 years earlier (Group 3). Although the study was not an experiment, the authors explain why we can consider the plots to be randomly selected. The data appear in Table 22.2. The variable Trees is the count of trees in a plot; Species is the count of tree species in a plot. The variable Richness is the ratio of number of species to number of individual trees, calculated as Species/Trees.4 TABLE 22.2 Data from a study of logging in Borneo Observation Group Trees Species Richness 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 27 22 29 21 19 33 16 20 24 27 28 19 12 12 15 9 20 18 17 14 14 2 17 19 18 4 22 15 18 19 22 12 12 22 18 22 20 15 21 13 13 19 13 19 15 11 11 14 7 18 15 15 12 13 2 15 8 17 4 18 14 18 15 15 10 12 0.81481 0.81818 0.75862 0.95238 0.78947 0.63636 0.81250 0.65000 0.79167 0.48148 0.67857 0.78947 0.91667 0.91667 0.93333 0.77778 0.90000 0.83333 0.88235 0.85714 0.92857 1.00000 0.88235 0.42105 0.94444 1.00000 0.81818 0.93333 1.00000 0.78947 0.68182 0.83333 1.00000 P1: IXK/GDO PB286I-22 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO May 6, 2003 T1: IXK 20:10 Using technology Figure 22.4 Excel output for analysis of variance on the number of trees in forest plots, for Exercise 22.3. (a) Make side-by-side stemplots of Trees for the three groups. Use stems 0, 1, 2, and 3 and split the stems (see page 17). What effects of logging are visible? (b) Figure 22.4 shows Excel ANOVA output for Trees. What do the group means show about the effects of logging? (c) What are the values of the ANOVA F statistic and its P-value? What hypotheses does F test? What conclusions about the effects of logging on number of trees do the data lead to? 22.4 Dogs, friends, and stress. If you are a dog lover, perhaps having your dog along reduces the effect of stress. To examine the effect of pets in stressful situations, researchers recruited 45 women who said they were dog lovers. The EESEE story “Stress among Pets and Friends” describes the results. Fifteen of the subjects were randomly assigned to each of three groups to do a stressful task alone (the control group), with a good friend present, or with their dog present. The subject’s mean heart rate during the task is one measure of the effect of stress. Table 22.3 contains the data. (a) Make stemplots of the heart rates for the three groups (round to the nearest whole number of beats). Do any of the groups show outliers or extreme skewness? (b) Figure 22.5 gives the Minitab ANOVA output for these data. Do the mean heart rates for the groups appear to show that the presence of a pet or a friend reduces heart rate during a stressful task? (c) What are the values of the ANOVA F statistic and its P-value? What hypotheses does F test? Briefly describe the conclusions you draw from these data. Did you find anything surprising? EESEE 605 P1: IXK/GDO PB286I-22 606 P2: IXK/GDO QC: IXK/GDO PB286-Moore-V5.cls T1: IXK April 17, 2003 15:59 CHAPTER 22 r One-Way Analysis of Variance: Comparing Several Means TABLE 22.3 Mean heart rates during stress with a pet (P), with a friend (F), and for the control group (C) Group Rate Group Rate Group Rate P F P C C P F F P F C C C F F 69.169 99.692 70.169 80.369 87.446 75.985 83.400 102.154 86.446 80.277 90.015 99.046 75.477 88.015 92.492 P C P C C F F C P P F F C F P 68.862 87.231 64.169 91.754 87.785 91.354 100.877 77.800 97.538 85.000 101.062 97.046 62.646 81.600 72.262 C C P P P C C C F F F P P F P 84.738 84.877 58.692 79.662 69.231 73.277 84.523 70.877 89.815 98.200 76.908 69.538 70.077 86.985 65.446 F 14.08 P 0.000 Analysis of Variance for Beats Source DF SS MS Group 2 2387.7 1193.8 Error 42 3561.3 84.8 Total 44 5949.0 Individual 95% CIs For Mean Based on Pooled StDev Level Control Friend Pet N 15 15 15 Mean 82.524 91.325 73.483 Pooled StDev = 9.208 StDev 9.242 8.341 9.970 72.0 80.0 88.0 96.0 Figure 22.5 Minitab output for the data in Table 22.3 on heart rates (beats per minute) during stress, for Exercise 22.4. The “Control” group worked alone, the “Friend” group had a friend present, and the “Pet” group had a pet dog present. P1: IXK/GDO PB286I-22 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO April 17, 2003 T1: IXK 15:59 The idea of analysis of variance (a ) (b) Figure 22.6 Boxplots for two sets of three samples each. The sample means are the same in (a) and (b). Analysis of variance will find a more significant difference among the means in (b) because there is less variation among the individuals within those samples. The idea of analysis of variance Here is the main idea for comparing means: what matters is not how far apart the sample means are but how far apart they are relative to the variability of individual observations. Look at the two sets of boxplots in Figure 22.6. For simplicity, these distributions are all symmetric, so that the mean and median are the same. The centerline in each boxplot is therefore the sample mean. Both sets of boxplots compare three samples with the same three means. Like the three vehicle types in Example 22.1, the means are different but not very different. Could differences this large easily arise just due to chance, or are they statistically significant? r The boxplots in Figure 22.6(a) have tall boxes, which show lots of variation among the individuals in each group. With this much variation among individuals, we would not be surprised if another set of samples gave quite different sample means. The observed differences among the sample means could easily happen just by chance. r The boxplots in Figure 22.6(b) have the same centers as those in Figure 22.6(a), but the boxes are much shorter. That is, there is much less variation among the individuals in each group. It is unlikely that any sample from the first group would have a mean as small as the mean of the second group. Because means as far apart as those observed would rarely arise just by chance in repeated sampling, they are good evidence of real differences among the means of the three populations we are sampling from. You can use the One-Way ANOVA applet to demonstrate the analysis of variance idea for yourself. The applet allows you to change both the group means and the spread within groups. You can watch the ANOVA F statistic and its P-value change as you work. This comparison of the two parts of Figure 22.6 is too simple in one way. It ignores the effect of the sample sizes, an effect that boxplots do not show. Small differences among sample means can be significant if the samples are very large. Large differences among sample means may fail to be significant if APPLET 607 P1: IXK/GDO PB286I-22 608 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO T1: IXK April 17, 2003 15:59 CHAPTER 22 r One-Way Analysis of Variance: Comparing Several Means the samples are very small. All we can be sure of is that for the same sample size, Figure 22.6(b) will give a much smaller P-value than Figure 22.6(a). Despite this qualification, the big idea remains: if sample means are far apart relative to the variation among individuals in the same group, that’s evidence that something other than chance is at work. THE ANALYSIS OF VARIANCE IDEA Analysis of variance compares the variation due to specific sources with the variation among individuals who should be similar. In particular, ANOVA tests whether several populations have the same mean by comparing how far apart the sample means are with how much variation there is within the samples. one-way ANOVA It is one of the oddities of statistical language that methods for comparing means are named after the variance. The reason is that the test works by comparing two kinds of variation. Analysis of variance is a general method for studying sources of variation in responses. Comparing several means is the simplest form of ANOVA, called one-way ANOVA. One-way ANOVA is the only form of ANOVA that we will study. THE ANOVA F STATISTIC The analysis of variance F statistic for testing the equality of several means has this form: F = F distribution variation among the sample means variation among individuals in the same sample We give more detail later. Because ANOVA is in practice done by software, the idea is more important than the detail. The F statistic can only take values that are zero or positive. It is zero only when all the sample means are identical and gets larger as they move farther apart. Large values of F are evidence against the null hypothesis H0 that all population means are the same. Although the alternative hypothesis Ha is many-sided, the ANOVA F test is one-sided because any violation of H0 tends to produce a large value of F . How large must F be to provide significant evidence against H0 ? To answer questions of statistical significance, compare the F statistic with critical values from an F distribution. The F distributions are described on page 456 in Chapter 17. A specific F distribution is specified by two parameters: a numerator degrees of freedom and a denominator degrees of freedom. Table D in the back of the book contains critical values for F distributions with various degrees of freedom. P1: IXK/GDO PB286I-22 P2: IXK/GDO QC: IXK/GDO PB286-Moore-V5.cls T1: IXK April 17, 2003 15:59 The idea of analysis of variance EXAMPLE 22.3 Using the F table Look again at the software output for the gas mileage data in Figure 22.3. All three outputs give the degrees of freedom for the F test, labeled “df ” or “DF.” There are 2 degrees of freedom in the numerator and 73 in the denominator. In Table D, find the numerator degrees of freedom 2 at the top of the table. Then look for the denominator degrees of freedom 73 at the left of the table. There is no entry for 73, so we use the next smaller entry, 50 degrees of freedom. The upper critical values for 2 and 50 degrees of freedom are p Critical value 0.100 0.050 0.025 0.010 0.001 2.41 3.18 3.97 5.06 7.96 Figure 22.7 shows the F density curve with 2 and 50 degrees of freedom and the upper 5% critical value 3.18. The observed F = 31.61 lies far to the right on this curve. We see from the table that F = 31.61 is larger than the 0.001 critical value, so P < 0.001. Software, of course, uses the F distribution with 2 and 73 degrees of freedom to calculate P-values. Excel also tells us how large F must be to be significant at the α = 0.05 level. This is the 5% critical value for F with 2 and 73 degrees of freedom. It is 3.122. Approximating the P-value from Table D is usually adequate in practice, as it is in this example. F density curve, df = 2 and 50 Area = 0.05 0 1 3.18 10.00 Figure 22.7 The density curve of the F distribution with 2 degrees of freedom in the numerator and 50 degrees of freedom in the denominator, for Example 22.3. The value F ∗ = 3.18 is the 0.05 critical value. 609 P1: IXK/GDO PB286I-22 610 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO T1: IXK April 17, 2003 15:59 CHAPTER 22 r One-Way Analysis of Variance: Comparing Several Means The degrees of freedom of the F statistic depend on the number of means we are comparing and the number of observations in each sample. That is, the F test takes into account the number of observations. Here are the details. DEGREES OF FREEDOM FOR THE F TEST We want to compare the means of I populations. We have an SRS of size n i from the ith population, so that the total number of observations in all samples combined is N = n1 + n2 + · · · + n I If the null hypothesis that all population means are equal is true, the ANOVA F statistic has the F distribution with I − 1 degrees of freedom in the numerator and N − I degrees of freedom in the denominator. EXAMPLE 22.4 Degrees of freedom for F In Examples 22.1 and 22.2, we compared the mean highway gas mileage for three types of vehicle, so I = 3. The three sample sizes are n 1 = 31 n 2 = 31 n 3 = 14 The total number of observations is therefore N = 31 + 31 + 14 = 76 The ANOVA F test has numerator degrees of freedom I −1=3−1=2 and denominator degrees of freedom N − I = 76 − 3 = 73 These degrees of freedom are given in the outputs in Figure 22.3. APPLY YOUR KNOWLEDGE 22.5 Logging in the rainforest, continued. Exercise 22.3 compares the number of tree species in rainforest plots that had never been logged (Group 1) with similar plots nearby that had been logged 1 year earlier (Group 2) and 8 years earlier (Group 3). (a) What are I , the n i , and N for these data? Identify these quantities in words and give their numerical values. (b) Find the degrees of freedom for the ANOVA F statistic. Check your work against the Excel output in Figure 22.4. (c) For these data, F = 11.43. What does Table D tell you about the P-value of this statistic? 22.6 Dogs, friends, and stress, continued. Exercise 22.4 compares the mean heart rates for women performing a stressful task under three conditions. P1: IXK/GDO PB286I-22 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO T1: IXK April 17, 2003 15:59 The ANOVA model 611 (a) What are I , the n i , and N for these data? State in words what each of these numbers means. (b) Find both degrees of freedom for the ANOVA F statistic. Check your results against the Minitab output in Figure 22.5. (c) The output shows that F = 14.08. What does Table D tell you about the significance of this result? 22.7 Smoking among French men. A study of smoking categorized a sample of French men aged 20 to 60 years as nonsmokers (146 men), former smokers (125 men), moderate smokers (104 men), or heavy smokers (84 men). One set of response variables was the amount of several types of food consumed, in grams.5 (a) What are the degrees of freedom for an ANOVA F statistic that compares the mean weights of a food consumed in the 4 populations? Use Table D with 3 and 200 degrees of freedom to give an approximate P-value for each of the following foods: (b) For milk, F = 3.58. (c) For eggs, F = 1.08. (d) For vegetables, F = 6.02. The ANOVA model Like all inference procedures, ANOVA is valid only in some circumstances. Here is the model that describes when we can use ANOVA to compare population means. Where are the details? THE ANOVA MODEL r We have I independent SRSs, one from each of I populations. r The i th population has a Normal distribution with unknown mean µi . The means may be different in the different populations. The ANOVA F statistic tests the null hypothesis that all of the populations have the same mean: H0: µ1 = µ2 = · · · = µ I Ha : not all of the µi are equal r All of the populations have the same standard deviation σ , whose value is unknown. The ANOVA model has I + 1 parameters, the I population means and the standard deviation σ . The first two requirements are familiar from our study of the two-sample t procedures for comparing two means. As usual, the design of the data production is the most important foundation for inference. Biased sampling or confounding can make any inference meaningless. If we do not actually draw Papers reporting scientific research must be short. Brevity allows researchers to hide details about their data. Did they choose their subjects in a biased way? Did they report data on only some of their subjects? Did they try several statistical analyses and report the one that looked best? The statistician John Bailar screened more than 4000 papers for the New England Journal of Medicine. He says, “When it came to the statistical review, it was often clear that critical information was lacking, and the gaps nearly always had the practical effect of making the authors’ conclusions look stronger than they should have.” P1: IXK/GDO PB286I-22 612 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO T1: IXK April 17, 2003 15:59 CHAPTER 22 r One-Way Analysis of Variance: Comparing Several Means robustness separate SRSs from each population or carry out a randomized comparative experiment, it is often unclear to what population the conclusions of inference apply. This is the case in Example 22.1, for example. ANOVA, like other inference procedures, is often used when random samples are not available. You must judge each use on its merits, a judgment that usually requires some knowledge of the subject of the study in addition to some knowledge of statistics. We might consider the 2003 model year vehicles in Example 22.1 to be samples from vehicles of their types produced in recent years. Because no real population has exactly a Normal distribution, the usefulness of inference procedures that assume Normality depends on how sensitive they are to departures from Normality. Fortunately, procedures for comparing means are not very sensitive to lack of Normality. The ANOVA F test, like the t procedures, is robust. What matters is Normality of the sample means, so ANOVA becomes safer as the sample sizes get larger, because of the central limit theorem effect. Remember to check for outliers that change the value of sample means and for extreme skewness. When there are no outliers and the distributions are roughly symmetric, you can safely use ANOVA for sample sizes as small as 4 or 5. (Don’t confuse the ANOVA F , which compares several means, with the F statistic discussed in Chapter 17, which compares two standard deviations and is not robust against non-Normality.) The third assumption is annoying: ANOVA assumes that the variability of observations, measured by the standard deviation, is the same in all populations. You may recall from Chapter 17 (page 454) that there is a special version of the two-sample t test that assumes equal standard deviations in both populations. The ANOVA F for comparing two means is exactly the square of this special t statistic. We prefer the t test that does not assume equal standard deviations, but for comparing more than two means there is no general alternative to the ANOVA F . It is not easy to check the assumption that the populations have equal standard deviations. Statistical tests for equality of standard deviations are very sensitive to lack of Normality, so much so that they are of little practical value. You must either seek expert advice or rely on the robustness of ANOVA. How serious are unequal standard deviations? ANOVA is not too sensitive to violations of the assumption, especially when all samples have the same or similar sizes and no sample is very small. When designing a study, try to take samples of the same size from all the groups you want to compare. The sample standard deviations estimate the population standard deviations, so check before doing ANOVA that the sample standard deviations are similar to each other. We expect some variation among them due to chance. Here is a rule of thumb that is safe in almost all situations. CHECKING STANDARD DEVIATIONS IN ANOVA The results of the ANOVA F test are approximately correct when the largest sample standard deviation is no more than twice as large as the smallest sample standard deviation. P1: IXK/GDO PB286I-22 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO T1: IXK April 17, 2003 15:59 The ANOVA model 613 EXAMPLE 22.5 Do the standard deviations allow ANOVA? In the gas mileage study, the sample standard deviations for midsize cars, SUVs, and pickups are s 1 = 2.561 s 2 = 3.673 s 3 = 2.758 These standard deviations easily satisfy our rule of thumb. We can safely use ANOVA to compare the mean gas mileage for the three vehicle types. The report from which Table 22.1 was taken also contained data on 22 subcompact cars. Can we use ANOVA to compare the means for all four types of vehicle? The standard deviation for subcompact cars is s 4 = 5.262 miles per gallon. This is slightly more than twice the smallest standard deviation: largest s 5.262 = = 2.05 smallest s 2.561 Our rule of thumb is conservative, so many statisticians would go ahead with ANOVA in this circumstance. A large standard deviation is often due to skewness or outliers. It turns out that the government’s “subcompact” category includes three rather exotic cars, a Bentley, a Ferrari, and a Maserati. All three get poor gas mileage. The standard deviation for the remaining 19 subcompact cars is s = 3.304, well within the guidelines for ANOVA. EXAMPLE 22.6 Which color attracts beetles best? To detect the presence of harmful insects in farm fields, we can put up boards covered with a sticky material and examine the insects trapped on the boards. Which colors attract insects best? Experimenters placed six boards of each of four colors at random locations in a field of oats and measured the number of cereal leaf beetles trapped. Here are the data:6 Board color Blue Green White Yellow Insects trapped 16 37 21 45 11 32 12 59 20 20 14 48 21 29 17 46 14 37 13 38 7 32 20 47 We would like to use ANOVA to compare the mean numbers of beetles that would be trapped by all boards of each color. Because the samples are small, we plot the data in side-by-side stemplots in Figure 22.8. Minitab output for ANOVA appears Blue Green White 0 7 0 0 0 1 146 1 1 2347 1 2 01 2 09 2 01 2 Yellow 3 3 2277 3 3 8 4 4 4 4 5678 5 5 5 5 9 Figure 22.8 Side-by-side stemplots comparing the counts of insects attracted by six boards for each of four board colors, for Example 22.6. P1: IXK/GDO PB286I-22 614 P2: IXK/GDO QC: IXK/GDO PB286-Moore-V5.cls T1: IXK April 17, 2003 15:59 CHAPTER 22 r One-Way Analysis of Variance: Comparing Several Means in Figure 22.9. The yellow boards attract by far the most insects (x 4 = 47.167), with green next (x 2 = 31.167) and blue and white far behind. Check that we can safely use ANOVA to test equality of the four means. The largest of the four sample standard deviations is 6.795 and the smallest is 3.764. The ratio largest s 6.795 = = 1.8 smallest s 3.764 is less than 2, so these data satisfy our rule of thumb. The shapes of the four distributions are irregular, as we expect with only 6 observations in each group, but there are no outliers. The ANOVA results will be approximately correct. There are I = 4 groups and N = 24 observations overall, so the degrees of freedom for F are numerator: denominator: I −1= 4−1 = 3 N − I = 24 − 4 = 20 This agrees with Minitab’s results. The F statistic is F = 42.84, a large F with P < 0.001. Despite the small samples, the experiment gives very strong evidence of differences among the colors. Yellow boards appear best at attracting leaf beetles. Analysis of Variance for Beetles Source DF SS MS Color 3 4134.0 1378.0 Error 20 643.3 32.2 Total 23 4777.3 F 42.84 P 0.000 Individual 95% CIs For Mean Based on Pooled StDev Level Blue Green White Yellow N 6 6 6 6 Pooled StDev = Mean 14.833 31.167 16.167 47.167 5.672 StDev 5.345 6.306 3.764 6.795 12 24 36 48 Figure 22.9 Minitab ANOVA output for comparing the four board colors in Example 22.6. P1: IXK/GDO PB286I-22 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO T1: IXK April 17, 2003 15:59 The ANOVA model APPLY YOUR KNOWLEDGE 22.8 Checking standard deviations. Verify that the sample standard deviations for these sets of data do allow use of ANOVA to compare the population means. (a) The counts of trees in Exercise 22.3 and Figure 22.4. (b) The heart rates of Exercise 22.4 and Figure 22.5. 22.9 Species richness after logging. Table 22.2 gives data on the species richness in rainforest plots, defined as the number of tree species in a plot divided by the number of trees in the plot. ANOVA may not be trustworthy for the richness data. Do data analysis: make side-by-side stemplots to examine the distributions of the response variable in the three groups and also compare the standard deviations. What about the data makes ANOVA risky? 22.10 Marital status and salary. Married men tend to earn more than single men. An investigation of the relationship between marital status and income collected data on all 8235 men employed as managers or professionals by a large manufacturing firm in 1976. Suppose (this is risky) we regard these men as a random sample from the population of all men employed in managerial or professional positions in large companies. Here are summary statistics for the salaries of these men:7 ni xi si Single Married Divorced Widowed 337 $21,384 $5,731 7,730 $26,873 $7,159 126 $25,594 $6,347 42 $26,936 $8,119 (a) Briefly describe the relationship between marital status and salary. (b) Do the sample standard deviations allow use of the ANOVA F test? (The distributions are skewed to the right. We expect right-skewness in income distributions. The investigators actually applied ANOVA to the logarithms of the salaries, which are more symmetric.) (c) What are the degrees of freedom of the ANOVA F test? (d) The F test is a formality for these data, because we are sure that the P-value will be very small. Why are we sure? (e) Single men earn less on the average than men who are or have been married. Do the highly significant differences in mean salary show that getting married raises men’s mean income? Explain your answer. 615 P1: IXK/GDO PB286I-22 616 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO April 17, 2003 T1: IXK 15:59 CHAPTER 22 r One-Way Analysis of Variance: Comparing Several Means Some details of ANOVA* Now we will give the actual formula for the ANOVA F statistic. We have SRSs from each of I populations. Subscripts from 1 to I tell us which sample a statistic refers to: Population Sample size Sample mean Sample std. dev. 1 2 · · · I n1 n2 · · · nI x1 x2 · · · xI s1 s2 · · · sI You can find the F statistic from just the sample sizes n i , the sample means x i , and the sample standard deviations s i . You don’t need to go back to the individual observations. The ANOVA F statistic has the form F = mean squares variation among the sample means variation among individuals within samples The measures of variation in the numerator and denominator of F are called mean squares. A mean square is a more general form of a sample variance. An ordinary sample variance s 2 is an average (or mean) of the squared deviations of observations from their mean, so it qualifies as a “mean square.” The numerator of F is a mean square that measures variation among the I sample means x 1 , x 2 , . . . , x I . Call the overall mean response (the mean of all N observations together) x. You can find x from the I sample means by x= MSG n1 x 1 + n2 x 2 + · · · + n I x I N The sum of each mean multiplied by the number of observations it represents is the sum of all the individual observations. Dividing this sum by N, the total number of observations, gives the overall mean x. The numerator mean square in F is an average of the I squared deviations of the means of the samples from x. We call it the mean square for groups, abbreviated as MSG. MSG = n 1 (x 1 − x)2 + n 2 (x 2 − x)2 + · · · + n I (x I − x)2 I −1 Each squared deviation is weighted by n i , the number of observations it represents. * This more advanced section is optional if you are using software to find the F statistic. P1: IXK/GDO PB286I-22 P2: IXK/GDO QC: IXK/GDO PB286-Moore-V5.cls T1: IXK April 17, 2003 15:59 Some details of ANOVA The mean square in the denominator of F measures variation among individual observations in the same sample. For any one sample, the sample 2 variance s i does this job. For all I samples together, we use an average of the 2 individual sample variances. It is again a weighted average in which each s i is weighted by one fewer than the number of observations it represents, n i − 1. 2 Another way to put this is that each s i is weighted by its degrees of freedom n i − 1. The resulting mean square is called the mean square for error, MSE. 2 MSE = 2 MSE 2 (n 1 − 1)s 1 + (n 2 − 1)s 2 + · · · + (n I − 1)s I N−I “Error” doesn’t mean a mistake has been made. It’s a traditional term for chance variation. Here is a summary of the ANOVA test. THE ANOVA F TEST Draw an independent SRS from each of I populations. The i th population has the N(µi , σ ) distribution, where σ is the common standard deviation in all the populations. The i th sample has size n i , sample mean x i , and sample standard deviation s i . The ANOVA F statistic tests the null hypothesis that all I populations have the same mean: H0: µ1 = µ2 = · · · = µ I Ha : not all of the µi are equal The statistic is F = MSG MSE The numerator of F is the mean square for groups MSG = n 1 (x 1 − x)2 + n 2 (x 2 − x)2 + · · · + n I (x I − x)2 I −1 The denominator of F is the mean square for error 2 MSE = 2 2 (n 1 − 1)s 1 + (n 2 − 1)s 2 + · · · + (n I − 1)s I N−I When H0 is true, F has the F distribution with I − 1 and N − I degrees of freedom. The denominators in the recipes for MSG and MSE are the two degrees of freedom I − 1 and N − I of the F test. The numerators are called sums of squares, from their algebraic form. It is usual to present the results of ANOVA in an ANOVA table. Output from software usually includes an ANOVA table. sums of squares ANOVA table 617 P1: IXK/GDO PB286I-22 618 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO T1: IXK April 17, 2003 15:59 CHAPTER 22 r One-Way Analysis of Variance: Comparing Several Means EXAMPLE 22.7 ANOVA calculations: software Look again at the three outputs in Figure 22.3 (page 603). Minitab and Excel give the ANOVA table. The TI-83, with its small screen, gives the degrees of freedom, sums of squares, and mean squares separately. Each output uses slightly different language. The basic ANOVA table is: Source of variation df SS MS F statistic Variation among samples Variation within samples 2 73 606.45 700.34 MSG = 303.22 MSE = 9.59 31.61 You can check that each mean square MS is the corresponding sum of squares SS divided by its degrees of freedom df. The F statistic is MSG divided by MSE. pooled standard deviation Because MSE is an average of the individual sample variances, it is also called 2 the pooled sample variance, written as s p . When all I populations have the same 2 population variance σ 2 (ANOVA assumes that they do), s p estimates the common variance σ 2 . The square root of MSE is the pooled standard deviation s p . It estimates the common standard deviation σ of observations in each group. The Minitab and TI-83 outputs in Figure 22.3 give the value s p = 3.097. The pooled standard deviation s p is a better estimator of the common σ than any individual sample standard deviation s i because it combines (pools) the information in all I samples. We can get a confidence interval for any of the means µi from the usual form estimate ± t ∗ SEestimate using s p to estimate σ . The confidence interval for µi is sp xi ± t∗ √ ni Use the critical value t ∗ from the t distribution with N − I degrees of freedom, because s p has N − I degrees of freedom. These are the confidence intervals that appear in Minitab ANOVA output. EXAMPLE 22.8 ANOVA calculations: without software We can do the ANOVA test comparing the gas mileages of midsize cars, SUVs, and pickup trucks using only the sample sizes, sample means, and sample standard deviations. These appear in Example 22.1, but it is easy to find them with a calculator. There are I = 3 groups with a total of N = 76 vehicles. The overall mean of the 76 gas mileages in Table 22.1 (page 598) is n1 x 1 + n2 x 2 + n3 x 3 N (31)(27.903) + (31)(22.677) + (14)(21.286) = 76 1865.984 = = 24.552 76 x= P1: IXK/GDO PB286I-22 P2: IXK/GDO QC: IXK/GDO PB286-Moore-V5.cls T1: IXK April 17, 2003 15:59 Some details of ANOVA The mean square for groups is n 1 (x 1 − x)2 + n 2 (x 2 − x)2 + n 3 (x 3 − x)2 I −1 1 = [(31)(27.903 − 24.552)2 + (31)(22.677 − 24.552)2 3−1 + (14)(21.286 − 24.552)2 ] 606.424 = = 303.212 2 MSG = The mean square for error is 2 MSE = 2 2 (n 1 − 1)s 1 + (n 2 − 1)s 2 + (n 3 − 1)s 3 N−I (30)(2.561 ) + (30)(3.6732 ) + (13)(2.7582 ) = 73 700.375 = = 9.594 73 2 Finally, the ANOVA test statistic is F = MSG 303.212 = = 31.60 MSE 9.594 Our work differs slightly from the output in Figure 22.3 because of roundoff error. We don’t recommend doing these calculations, because tedium and roundoff errors cause frequent mistakes. APPLY YOUR KNOWLEDGE The calculations of ANOVA use only the sample sizes n i , the sample means x i , and the sample standard deviations s i . You can therefore re-create the ANOVA calculations when a report gives these summaries but does not give the actual data. These optional exercises ask you to do the ANOVA calculations starting with the summary statistics. 22.11 Road rage. Exercise 22.2 describes a study of road rage. Here are the means and standard deviations for a measure of “angry/threatening driving” for random samples of drivers in three age-groups: Age-group Less than 30 years 30 to 55 years Over 55 years n x s 244 734 364 2.22 1.33 0.66 3.11 2.21 1.60 (a) The distributions of responses are somewhat right-skewed. ANOVA is nonetheless safe for these data. Why? (b) Check that the standard deviations satisfy the guideline for ANOVA inference. 619 P1: IXK/GDO PB286I-22 620 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO April 17, 2003 T1: IXK 15:59 CHAPTER 22 r One-Way Analysis of Variance: Comparing Several Means (c) Calculate the overall mean response x, the mean squares MSG and MSE, and the ANOVA F statistic. (d) What are the degrees of freedom for F ? How significant are the differences among the three mean responses? 22.12 Exercise and weight loss. What conditions help overweight people exercise regularly? Subjects were randomly assigned to three treatments: a single long exercise period 5 days per week; several 10-minute exercise periods 5 days per week; and several 10-minute periods 5 days per week on a home treadmill that was provided to the subjects. The study report contains the following information about weight loss (in kilograms) after six months of treatment:8 Treatment n x s Long exercise periods Short exercise periods Short periods with equipment 37 36 42 10.2 9.3 10.2 4.2 4.5 5.2 (a) Do the standard deviations satisfy the rule of thumb for safe use of ANOVA? (b) Calculate the overall mean response x and the mean square for groups MSG. (c) Calculate the mean square for error MSE. (d) Find the ANOVA F statistic and its approximate P-value. Is there evidence that the mean weight losses of people who follow the three exercise programs differ? 22.13 Weights of newly hatched pythons. A study of the effect of nest temperature on the development of water pythons separated python eggs at random into nests at three temperatures: cold, neutral, and hot. Exercise 20.22 (page 555) shows that the proportions of eggs that hatched at each temperature did not differ significantly. Now we will examine the little pythons. In all, 16 eggs hatched at the cold temperature, 38 at the neutral temperature, and 75 at the hot temperature. The report of the study summarizes the data in the common form “mean ± standard error” as follows:9 Temperature n Weight (grams) at hatching Propensity to strike Cold Neutral Hot 16 38 75 28.89 ± 8.08 32.93 ± 5.61 32.27 ± 4.10 6.40 ± 5.67 5.82 ± 4.24 4.30 ± 2.70 (a) We will compare the mean weights √ at hatching. Recall that the standard error of the mean is s / n. Find the standard deviations P1: IXK/GDO PB286I-22 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO April 17, 2003 T1: IXK 15:59 Statistics in Summary of the weights in the three groups and verify that they satisfy our rule of thumb for using ANOVA. (b) Starting from the sample sizes n i , the means x i , and the standard deviations s i , carry out an ANOVA. That is, find MSG, MSE, and the F statistic, and use Table D to approximate the P-value. Is there evidence that nest temperature affects the mean weight of newly hatched pythons? 22.14 Python strikes. The data in the previous exercise also describe the “propensity to strike” of the hatched pythons at 30 days of age. This is the number of taps on the head with a small brush until the python launches a strike. (Don’t try this with adult pythons.) The data are again summarized in the form “sample mean ± standard error of the mean.” Follow the outline in (a) and (b) of the previous exercise for propensity to strike. Does nest temperature appear to influence propensity to strike? Chapter 22 SUMMARY One-way analysis of variance (ANOVA) compares the means of several populations. The ANOVA F test tests the overall H0 that all the populations have the same mean. If the F test shows significant differences, examine the data to see where the differences lie and whether they are large enough to be important. The ANOVA model states that we have an independent SRS from each population; that each population has a Normal distribution; and that all populations have the same standard deviation. In practice, ANOVA inference is relatively robust when the populations are non-Normal, especially when the samples are large. Before doing the F test, check the observations in each sample for outliers or strong skewness. Also verify that the largest sample standard deviation is no more than twice as large as the smallest standard deviation. When the null hypothesis is true, the ANOVA F statistic for comparing I means from a total of N observations in all samples combined has the F distribution with I − 1 and N − I degrees of freedom. ANOVA calculations are reported in an ANOVA table that gives sums of squares, mean squares, and degrees of freedom for variation among groups and for variation within groups. In practice, we use software to do the calculations. STATISTICS IN SUMMARY After studying this chapter, you should be able to do the following. A. RECOGNITION 1. Recognize when testing the equality of several means is helpful in understanding data. 621 P1: IXK/GDO PB286I-22 622 P2: IXK/GDO QC: IXK/GDO PB286-Moore-V5.cls T1: IXK April 17, 2003 15:59 CHAPTER 22 r One-Way Analysis of Variance: Comparing Several Means 2. 3. Recognize that the statistical significance of differences among sample means depends on the sizes of the samples and on how much variation there is within the samples. Recognize when you can safely use ANOVA to compare means. Check the data production, the presence of outliers, and the sample standard deviations for the groups you want to compare. B. INTERPRETING ANOVA 1. Explain what null hypothesis F tests in a specific setting. 2. Locate the F statistic and its P-value on the output of analysis of variance software. 3. Find the degrees of freedom for the F statistic from the number and sizes of the samples. Use Table D of the F distributions to approximate the P-value when software does not give it. 4. If the test is significant, use graphs and descriptive statistics to see what differences among the means are most important. Chapter 22 EXERCISES 22.15 22.16 22.17 (Shmuel Thaler/Index Stock Imager/ PictureQuest) 22.18 22.19 (Harry Sieplinga/HMS Images/The Image Bank/Getty Images) Exercises 22.15 to 22.19 describe situations in which we want to compare the mean responses in several populations. For each setting, identify the populations and the response variable. Then give I , the n i , and N. Finally, give the degrees of freedom of the ANOVA F test. Smoking and sleep. A study of the effects of smoking classifies subjects as nonsmokers, moderate smokers, or heavy smokers. The investigators interview a sample of 200 people in each group. Among the questions is “How many hours do you sleep on a typical night?” Which package design is best? A maker of detergents wants to compare the attractiveness to consumers of six package designs. Each package is shown to 120 different consumers who rate the attractiveness of the design on a 1 to 10 scale. Growing tomatoes. Do four tomato varieties differ in mean yield? Grow 10 plants of each variety and record the yield of each plant in pounds of tomatoes. Strong concrete. The strength of concrete depends on the mixture of sand, gravel, and cement used to prepare it. A study compares five different mixtures. Workers prepare six batches of each mixture and measure the strength of the concrete made from each batch. Teaching sign language. Which of four methods of teaching American Sign Language is most effective? Assign 10 of the 42 students in a class at random to each of three methods. Teach the remaining 12 students by the fourth method. Record the students’ scores on a standard test of sign language after a semester’s study. P1: IXK/GDO PB286I-22 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO April 17, 2003 T1: IXK 15:59 Chapter 22 Exercises 22.20 Plant diversity. Does the presence of more species of plants increase the productivity of a natural area, as measured by the total mass of plant material? An experiment to investigate this question started as follows: In a 7-year experiment, we controlled one component of diversity, the number of plant species, in 168 plots, each 9 m by 9 m. We seeded the plots, in May 1994, to have 1, 2, 4, 8, or 16 species, with 39, 35, 29, 30, and 35 replicates, respectively.10 Total plant mass in each plot was measured in 2000. (a) What null and alternative hypotheses does ANOVA test here? (b) What are the values of N, I , and the n i ? (c) What are the degrees of freedom of the ANOVA F statistic? 22.21 Plants defend themselves. When some plants are attacked by leaf-eating insects, they release chemical compounds that attract other insects that prey on the leaf-eaters. A study carried out on plants growing naturally in the Utah desert demonstrated both the release of the compounds and that they not only repel the leaf-eaters but attract predators that act as the plants’ bodyguards.11 The investigators chose 8 plants attacked by each of three leaf-eaters and 8 more that were undamaged, 32 plants of the same species in all. They then measured emissions of several compounds during seven hours. Here are data (mean ± standard error of the mean for eight plants) for one compound. The emission rate is measured in nanograms (ng) per hour. Group Emission rate (ng/hr) Control Hornworm Leaf bug Flea beetle 9.22 ± 5.93 31.03 ± 8.75 18.97 ± 6.64 27.12 ± 8.62 (a) Make a graph that compares the mean emission rates for the four groups. Does it appear that emissions increase when the plant is attacked? (b) What hypotheses does ANOVA test in this setting? (c) We do not have the full data. What would you look for in deciding whether you can safely use ANOVA? (d) What is the relationship between the standard error of the mean (SEM) and the standard deviation for a sample? Do the standard deviations satisfy our rule of thumb for safe use of ANOVA? 22.22 Does polyester decay? How quickly do synthetic fabrics such as polyester decay in landfills? A researcher buried polyester strips in the 623 P1: IXK/GDO PB286I-22 624 P2: IXK/GDO QC: IXK/GDO PB286-Moore-V5.cls April 17, 2003 T1: IXK 15:59 CHAPTER 22 r One-Way Analysis of Variance: Comparing Several Means soil for different lengths of time, then dug up the strips and measured the force required to break them. Breaking strength is easy to measure and is a good indicator of decay; lower strength means the fabric has decayed. Part of the study buried 20 polyester strips in well-drained soil in the summer. Five of the strips, chosen at random, were dug up after 2 weeks; another 5 were dug up after 4 weeks, 8 weeks, and 16 weeks. Here are the breaking strengths in pounds:12 2 weeks 4 weeks 8 weeks 16 weeks (Alix Phanie, Rex Interstock/Stock Connection/PictureQuest) 118 130 122 124 126 120 136 98 126 114 128 110 120 126 146 140 129 128 140 110 (a) Find the mean strength for each group and plot the means against time. Does it appear that polyester loses strength consistently over time after it is buried? (b) Find the standard deviations for each group. Do they meet our criterion for applying ANOVA? (c) In Examples 17.2 and 17.3 (pages 441 and 445), we used the two-sample t test to compare the mean breaking strengths for strips buried for 2 weeks and for 16 weeks. The ANOVA F test extends the two-sample t to more than two groups. Explain carefully why use of the two-sample t for two of the groups was acceptable but using the F test on all four groups is not acceptable. 22.23 Can you hear these words? To test whether a hearing aid is right for a patient, audiologists play a tape on which words are pronounced at low volume. The patient tries to repeat the words. There are several different lists of words that are supposed to be equally difficult. Are the lists equally difficult when there is background noise? To find out, an experimenter had subjects with normal hearing listen to four lists with a noisy background. The response variable was the percent of the 50 words in a list that the subject repeated correctly. The data set contains 96 responses.13 Here are two study designs that could produce these data: Design A. The experimenter assigns 96 subjects to 4 groups at random. Each group of 24 subjects listens to one of the lists. All individuals listen and respond separately. Design B. The experimenter has 24 subjects. Each subject listens to all four lists in random order. All individuals listen and respond separately. Does Design A allow use of one-way ANOVA to compare the lists? Does Design B allow use of one-way ANOVA to compare the lists? Briefly explain your answers. P1: IXK/GDO PB286I-22 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO April 17, 2003 T1: IXK 15:59 Chapter 22 Exercises Analysis of Variance for Percent Source DF SS MS List 3 920.5 306.8 Error 92 5738.2 62.4 Total 95 6658.6 F 4.92 P 0.003 Individual 95% CIs For Mean Based on Pooled StDev Level 1 2 3 4 N 24 24 24 24 Pooled StDev = Mean 32.750 29.667 25.250 25.583 StDev 7.409 8.058 8.316 7.779 7.898 24.0 28.0 32.0 36.0 Figure 22.10 Minitab ANOVA output for comparing the percents heard correctly in four lists of words, for Exercise 22.24. 22.24 Can you hear these words? Figure 22.10 displays the Minitab output for one-way ANOVA applied to the hearing data described in the previous exercise. The response variable is “Percent,” and “List” identifies the four lists of words. Based on this analysis, is there good reason to think that the four lists are not all equally difficult? Write a brief summary of the study findings. 22.25 How much corn should I plant? How much corn per acre should a farmer plant to obtain the highest yield? Too few plants will give a low yield. On the other hand, if there are too many plants, they will compete with each other for moisture and nutrients, and yields will fall. To find out, plant at different rates on several plots of ground and measure the harvest. (Treat all the plots the same except for the planting rate.) Use software to analyze these data from such an experiment:14 Plants per acre 12,000 16,000 20,000 24,000 28,000 Yield (bushels per acre) 150.1 166.9 165.3 134.7 119.0 113.0 120.7 130.1 138.4 150.5 118.4 135.2 139.6 156.1 142.6 149.8 149.9 (Craig Aurness/CORBIS) 625 P1: IXK/GDO PB286I-22 626 P2: IXK/GDO QC: IXK/GDO PB286-Moore-V5.cls April 17, 2003 T1: IXK 15:59 CHAPTER 22 r One-Way Analysis of Variance: Comparing Several Means (a) Do data analysis to see what the data appear to show about the influence of plants per acre on yield and also to check the conditions for ANOVA. (b) Carry out the ANOVA F test. State hypotheses; give F and its P-value. What do you conclude? (c) The observed differences among the mean yields in the samples are quite large. Why are they not statistically significant? 22.26 Logging in the rainforest: species counts. Table 22.2 (page 604) gives data on the number of trees per forest plot, the number of species per plot, and species richness. Exercise 22.3 analyzed the effect of logging on number of trees. Exercise 22.9 concludes that it would be risky to use ANOVA to analyze richness. Use software to analyze the effect of logging on the number of species. (a) Make a table of the group means and standard deviations. Do the standard deviations satisfy our rule of thumb for safe use of ANOVA? What do the means suggest about the effect of logging on the number of species? (b) Carry out the ANOVA. Report the F statistic and its P-value and state your conclusion. 22.27 Nematodes and tomato plants. How do nematodes (microscopic worms) affect plant growth? A botanist prepares 16 identical planting pots and then introduces different numbers of nematodes into the pots. He transplants a tomato seedling into each pot. Here are data on the increase in height of the seedlings (in centimeters) 16 days after planting:15 Nematodes (Eye of Science/Photo Researchers) 0 1,000 5,000 10,000 Seedling growth 10.8 11.1 5.4 5.8 9.1 11.1 4.6 5.3 13.5 8.2 7.4 3.2 9.2 11.3 5.0 7.5 (a) Make a table of means and standard deviations for the four treatments. Make side-by-side stemplots to compare the treatments. What do the data appear to show about the effect of nematodes on growth? (b) State H0 and Ha for the ANOVA test for these data, and explain in words what ANOVA tests in this setting. (c) Use software to carry out the ANOVA. Report your overall conclusions about the effect of nematodes on plant growth. P1: IXK/GDO PB286I-22 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO T1: IXK April 17, 2003 15:59 Chapter 22 Exercises 22.28 Earnings of athletic trainers (Optional). How much do newly hired athletic trainers earn? Earnings depend on the educational and other qualifications of the trainer and on the type of job. Here are summaries from a sample survey that contacted institutions advertising for trainers and asked about the people they hired:16 Type of institution High school Clinic College n Mean salary x Std. dev. s 57 108 94 $26,470 $30,610 $30,019 $9507 $4504 $7158 Calculate the ANOVA table and the F statistic. Are there significant differences among the mean salaries for the three types of job in the population of all newly hired trainers? (Comments: The study authors did an ANOVA. As is often the case, the published work does not answer questions about the suitability of the analysis. We wonder if responses from 60% of employers who advertised positions generate a random sample of new hires. The actual data do not appear in the report. The salary distributions are no doubt right-skewed; we hope that strong skewness and outliers are absent because the subjects in each group hold similar jobs. The large samples will then justify ANOVA. Also, the sample standard deviations do not quite satisfy our rule of thumb for safe use of ANOVA.) 22.29 Plant defenses (Optional). The calculations of ANOVA use only the sample sizes n i , the sample means x i , and the sample standard deviations s i . You can therefore re-create the ANOVA calculations when a report gives these summaries but does not give the actual data. Use the information in Exercise 22.21 to calculate the ANOVA table (sums of squares, degrees of freedom, mean squares, and the F statistic). Note that the report gives the standard error of the mean (SEM) rather than the standard deviation. Are there significant differences among the mean emission rates for the four populations of plants? 22.30 F versus t (Optional). We have two methods to compare the means of two groups: the two-sample t test of Chapter 17 and the ANOVA F test with I = 2. We prefer the t test because it allows one-sided alternatives and does not assume that both populations have the same standard deviation. Let us apply both tests to the same data. There are two types of life insurance companies. “Stock” companies have shareholders, and “mutual” companies are owned by their policyholders. Take an SRS of each type of company from those listed in a directory of the industry. Then ask the annual cost per $1000 of 627 P1: IXK/GDO PB286I-22 628 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO April 17, 2003 T1: IXK 15:59 CHAPTER 22 r One-Way Analysis of Variance: Comparing Several Means insurance for a $50,000 policy insuring the life of a 35-year-old man who does not smoke. Here are the data summaries:17 ni xi si Stock companies Mutual companies 13 $2.31 $0.38 17 $2.37 $0.58 (a) Calculate the two-sample t statistic for testing H0: µ1 = µ2 against the two-sided alternative. Use the conservative method to find the P-value. (b) Calculate MSG, MSE, and the ANOVA F statistic for the same hypotheses. What is the P-value of F ? (c) How close are the two P-values? (The square root of the F statistic is a t statistic with N − I = n 1 + n 2 − 2 degrees of freedom. This is the “pooled two-sample t” mentioned on page 454. So F for I = 2 is exactly equivalent to a t statistic, but it is a slightly different t from the one we use.) 22.31 Exercising to lose weight (Optional). Exercise 22.12 describes a randomized, comparative experiment that assigned subjects to three types of exercise program intended to help them lose weight. Some of the results of this study were analyzed using the chi-square test for two-way tables, and some others were analyzed using one-way ANOVA. For each of the following excerpts from the study report, say which analysis is appropriate and explain how you made your choice. (a) “Overall, 115 subjects (78% of 148 subjects randomized) completed 18 months of treatment, with no significant difference in attrition rates between the groups (P = .12).” (b) “In analyses using only the 115 subjects who completed 18 months of treatment, there were no significant differences in weight loss at 6 months among the groups.” (c) “The duration of exercise for weeks 1 through 4 was significantly greater in the SB compared with both LB and SBEQ groups (P < .05). . . . However, exercise duration was greater in SBEQ compared with both LB and SB groups for months 13 through 18 (P < .05).” Chapter 22 MEDIA EXERCISES Analysis of variance works by comparing the differences among the sample means in several groups to the spread of observations within the groups. A P1: IXK/GDO PB286I-22 P2: IXK/GDO PB286-Moore-V5.cls QC: IXK/GDO April 17, 2003 T1: IXK 15:59 Chapter 22 Media Exercises set of means is significantly different if they are farther apart than random variation would allow to happen just by chance. The extent of random variation is assessed by the random spread among observations in the same group. These exercises use an applet to demonstrate this ANOVA idea. 22.32 ANOVA compares several means. The One-Way ANOVA applet displays the observations in three groups, with the group means highlighted by black dots. When you open or reset the applet, the scale at the bottom of the display shows that for these groups the ANOVA F statistic is F = 31.74, with P < 0.001. (a) The middle group has larger mean than the other two. Grab its mean point with the mouse. How small can you make F ? What did you do to the mean to make F -small? Roughly how significant is your small F ? (b) Starting with the three means aligned from your configuration at the end of (a), drag any one of the group means either up or down. What happens to F ? What happens to P ? Convince yourself that the same thing happens if you move any one of the means, or if you move one slightly and then another slightly in the opposite direction. 22.33 ANOVA uses within-group variation. Reset the One-Way ANOVA applet to its original state. As in Figure 22.6(b) (page 607), the differences among the three means are highly significant (large F , small P ) because the observations in each group cluster tightly about the group mean. (a) Use the mouse to slide the Pooled Standard Error at the top of the display to the right. You see that the group means do not change, but the spread of the observations in each group increases. What happens to F and P as the spread among the observations in each group increases? What are the values of F and P when the slider is all the way to the right? This is similar to Figure 22.6(a): variation within groups hides the differences among the group means. (b) Leave the Standard Error slider at the extreme right of its scale, so that spread within groups stays fixed. Use the mouse to move the group means apart. What happens to F and P as you do this? APPLET APPLET 629