ANOVA Overview
Josh Klugman
March 19th, 2009

1.0 The Logic of Significance Tests

Social science researchers are interested in proving causal relationships between variables. That is, they want to show that a change in one variable produces a change in another variable.

x → y

x = independent variable, a.k.a. predictor, a.k.a. explanatory variable
y = dependent variable, a.k.a. outcome, a.k.a. response variable

When psychologists conduct an experiment on a sample of 50 people, they often are not interested in the results for those 50 people per se. They want to show that the results generalize to a broader population, a population that we generally cannot observe directly. We use significance tests to show that an observed relationship in a sample can be generalized to a population. We may see variable x affect variable y in our sample, but that does not prove the relationship exists in the population. It could be that, by random chance, our sample was a fluke, and we have a relationship that exists in our sample but is not "real" (does not occur in the population). (We assume throughout that the data we have is a random sample of the population of interest.)

To see if we can support the notion that a relationship exists in a population, we carry out a significance test. A significance test is a thought experiment. We set up a null hypothesis that says there is NO relationship in the population. We assume the null hypothesis is right, and we calculate the p-value: the probability, under the null hypothesis, that we would see a relationship at least as strong as the one we observed in our sample. If the p-value is "low enough", we reject the null hypothesis. If the p-value is not low enough, we retain the null hypothesis.

α = the threshold for whether or not the p-value is "low enough". Conventionally it is set to .05.

Erroneously rejecting the null hypothesis is called Type I error (the probability of making a Type I error = α). Erroneously retaining the null hypothesis is called Type II error.
When you commit a Type I error, you are saying there is a relationship when in fact the relationship does NOT exist in the population. When you commit a Type II error, you are saying you cannot prove a relationship exists when in fact it does exist in the population. In social science, committing a Type I error is considered a bigger sin than committing a Type II error. This is because social science is a conservative enterprise: our default assumption is that there is no relationship between a given set of variables UNLESS we can prove otherwise. When you say a relationship does exist, you are challenging what we traditionally thought. When you say a relationship does not exist, you are upholding our default assumption. To do this incorrectly is bad, but at least the door is left open for someone else to test the assumption.

2.0 The Logic of One-Way ANOVA

We use ANOVA to test the proposition that there is a causal relationship in which a categorical variable (say, experimental condition) affects an outcome. With one-way ANOVA we are interested in testing for "significant" differences between three or more groups on some outcome.

Example: We are interested in determining whether we can induce specific moods in our subjects. We use clips from movies as the mood-induction treatments, and we have three conditions: pleasant, neutral, and unpleasant. After the subjects watch the clips, we measure their affect on a scale from 1 to 8, where 1 indicates sadness and 8 indicates happiness. We get this data:

Condition    Scores                                            ȳ_j    s_j    s_j²
Pleasant     7.4, 6, 5.1, 7.8, 7.4, 7.2, 5, 6.4, 6.7, 6.1      6.51   0.97   0.94
Neutral      3.1, 3.8, 2.9, 4.6, 3.9, 4, 3.8, 3.6, 3.9, 3.4    3.70   0.48   0.23
Unpleasant   5.1, 2.1, 2.4, 2.2, 3.3, 1.5, 3.1, 3.6, 2.3, 2.2  2.78   1.03   1.06

ȳ_j = mean for condition j in the sample
s_j = standard deviation for condition j in the sample
s_j² = variance for condition j in the sample

We see that there are differences between the means for the three conditions.
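The statistics in the table can be double-checked numerically. Here is a quick sketch in Python (assuming scipy is installed); `f_oneway` runs the omnibus test that the next sections develop by hand.

```python
# Reproducing the per-condition statistics and the omnibus test for the
# mood-induction data (scipy assumed to be installed).
from statistics import mean, stdev, variance
from scipy.stats import f_oneway

pleasant   = [7.4, 6, 5.1, 7.8, 7.4, 7.2, 5, 6.4, 6.7, 6.1]
neutral    = [3.1, 3.8, 2.9, 4.6, 3.9, 4, 3.8, 3.6, 3.9, 3.4]
unpleasant = [5.1, 2.1, 2.4, 2.2, 3.3, 1.5, 3.1, 3.6, 2.3, 2.2]

for name, g in [("Pleasant", pleasant), ("Neutral", neutral),
                ("Unpleasant", unpleasant)]:
    print(f"{name}: mean={mean(g):.2f}, sd={stdev(g):.2f}, var={variance(g):.2f}")

F, p = f_oneway(pleasant, neutral, unpleasant)
print(f"F = {F:.2f}, p = {p:.2e}")   # F ≈ 50.61
```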
The question is, are these differences real, or are they just random differences caused by sampling variability? We set up a null hypothesis that there are no differences between the true population means (μ_j). We want to knock down this null hypothesis.

H0: μ_pleasant = μ_neutral = μ_unpleasant
Ha: Not all of the population means are equal

μ_j = mean for condition j in the population

In order to see if the populations have the same mean, the logic of ANOVA is to see how far apart the sample means are from each other, relative to the variation that occurs within the groups.

[Figure: two panels, A and B, each showing three group distributions. Let's assume all six of these groups are normally distributed, so each mean equals its median.]

In A and B, the differences between the means are the same. But in B, the differences are larger relative to the variability within the groups. The logic of ANOVA says that in A, the mean differences are more likely due to random chance; in B, it is less likely that we would see these mean differences due to chance. We are more likely to reject the null hypothesis under B than under A.

3.0 Conducting a One-Way ANOVA (the omnibus F-test)

To conduct an ANOVA, we calculate an F statistic:

F = variance between groups / variance within groups
  = Mean Between-Group Sum of Squares / Mean Within-Group Sum of Squares
  = [Σ_j n_j(ȳ_.j − ȳ_..)² / (a − 1)] / [Σ_j Σ_i (y_ij − ȳ_.j)² / (N − a)]

a = number of groups
N = total number of people
n_j = number of people in group j
y_ij = value of y for person i in group j
ȳ_.j = mean value of y for group j
ȳ_.. = grand mean (mean of the whole sample)

[Figure legend: Thick blue horizontal bar: the grand mean (ȳ_..). Thin black horizontal bars: group means (ȳ_.j). Light blue vertical lines: distance between the group means and the grand mean (ȳ_.j − ȳ_..). Thin black vertical lines: distance between individuals and their respective group means (y_ij − ȳ_.j).]

For our example:

F = MSB / MSW = 37.759 / .746 = 50.61

MSB = mean between-group sum of squares
MSW = mean within-group sum of squares

The between-group sums of squares are sometimes called the "model sum of squares". They represent the differences explained by our model. The within-group sums of squares are also called the residual sum of squares or the unexplained sum of squares. These are differences unexplained by our model.

4.0 The F-Statistic

In order for us to say that there is a real difference among the groups, the F statistic has to be, at a minimum, above 1. In other words, the between-group mean sum of squares has to be bigger than the within-group mean sum of squares. This is because between-group differences in the sample are actually caused by between-group AND within-group differences in the population. If F is less than or equal to 1, we can never be sure that the observed group differences reflect TRUE group differences.

[Diagram: in the population (the truth) there are between-group differences and within-group differences; both feed into the between-group differences we actually observe in the sample, while the within-group differences in the sample reflect the within-group differences in the population.]

But in order for us to conclude that there are true group differences, we need an F that is much larger than 1. How much larger? For that, we need to look at the F distribution. The F distribution is a right-skewed distribution formed from a ratio of two chi-squares. It is specified with a numerator degrees of freedom (a − 1) and a denominator degrees of freedom (N − a).

[Figure: probability density functions for various F distributions, including the F distribution with 2, 27 degrees of freedom.]

Social scientists evaluate the calculated F-statistic in two ways.

Method 1: Calculate the p-value, the probability of getting a higher F-statistic (finding the area in the right tail). If this area is "low enough", we reject the null hypothesis and retain the alternative hypothesis. We denote the "low enough" threshold with α.
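Rather than reading areas off a plotted density, the right-tail p-value and the critical value can be pulled from the F distribution directly; a sketch in Python, assuming scipy:

```python
# The right-tail p-value and the alpha = .05 critical value for the
# F(2, 27) distribution used in the mood example (scipy assumed).
from scipy.stats import f

df_between, df_within = 2, 27    # a - 1 and N - a
F_stat = 50.608                  # the omnibus F from section 3.0

p_value = f.sf(F_stat, df_between, df_within)   # area in the right tail
F_crit  = f.ppf(0.95, df_between, df_within)    # critical value F*

print(f"p = {p_value:.2e}")   # ≈ 7.3e-10
print(f"F* = {F_crit:.2f}")   # ≈ 3.35
```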
The conventional α is .05. α should be determined at the outset.

The p-value for this example is 7.34 × 10⁻¹⁰, which we commonly report as <.001. The p-value represents the probability of observing group differences at least as big as the ones we observed if there are no true differences. The statistical package SPSS automatically gives researchers the p-value (in the "Sig." box):

ANOVA: affect3
                  Sum of Squares    df    Mean Square      F       Sig.
Between Groups        75.518         2      37.759       50.608    .000
Within Groups         20.145        27        .746
Total                 95.663        29

Method 2: Determine α, and then find the "critical value" on the F distribution that bounds α (we denote critical F values as F*). If the F statistic is larger than F*, then you can reject the null hypothesis.

Critical value for an F(2, 27) distribution with α = .05: F*_2,27;.05 = 3.35.

You can calculate p-values using the FDIST function in Excel, and you can calculate critical values using the FINV function.

5.0 Assumptions of ANOVA

- The samples are drawn from populations with normal distributions (the continuous variable is normally distributed).
- The samples are drawn from populations with equal variances.
- The cases in the samples are statistically independent of each other.

Violating the normality and equality-of-variances assumptions is usually not a big deal unless you have small groups or wildly different group sizes. If you do, you lose control of the probabilities of committing Type I and Type II errors, and you will need to use special techniques to account for these violations.

6.0 Contrasts

Let us use a different example. Here is some hypothetical data from an experiment looking at various ways to treat hypertension (the outcome is systolic blood pressure, measured in mmHg):

Condition      Scores                       Mean (ȳ_.j)   Variance (s_j²)   n_j
Drug Therapy   94, 105, 103, 114               104            67.33          4
Biofeedback    81, 84, 92, 101, 80, 108         91           132             6
Diet           98, 95, 86, 87, 94               92            27.5           5
Combination    81, 68, 75, 70, 71               73            26.5           5

H0: μ_DrugTherapy = μ_Biofeedback = μ_Diet = μ_Combo

ANOVA: sbp
                  Sum of Squares    df    Mean Square      F       Sig.
Between Groups      2246.550         3     748.850       11.115    .000
Within Groups       1078.000        16      67.375
Total               3324.550        19

Because the p-value is so low, we can reject the null hypothesis. We see there is at least one significant difference among these four groups, but maybe we are interested in testing for specific differences, say, the difference between biofeedback and drug therapy.

Mean drug therapy – biofeedback difference: 104 − 91 = 13. We see that in our sample, people receiving biofeedback have a systolic blood pressure that is lower by 13 mmHg. But again, we have to ask if this is a TRUE difference, that is, whether it really occurs in the population. Again, we turn to the F statistic.

H0: μ_DrugTherapy = μ_Biofeedback

Alternatively: H0: (1)μ_DrugTherapy − (1)μ_Biofeedback + (0)μ_Diet + (0)μ_Combo = 0

F = (ȳ_.DT − ȳ_.B)² / [MSW (1/n_DT + 1/n_B)]

F(1, 16) = (104 − 91)² / [(1078/16)(1/4 + 1/6)] = 169 / 28.07 = 6.02

p = .026

Let us test a more complicated contrast: the combination treatment versus the other three.

H0: (1/3)(μ_DT + μ_B + μ_D) − μ_C = 0

ψ̂ = (1/3)(ȳ_.DT + ȳ_.B + ȳ_.D) − ȳ_.C

F(1, 16) = ψ̂² / [MSW Σ_j c_j²/n_j]
         = (95.67 − 73)² / [67.38 × ((.33)²/4 + (.33)²/6 + (.33)²/5 + (1)²/5)]
         = 22.67² / [67.38 × .269]
         = 28.40

p = 6.78 × 10⁻⁵

In general:

F(1, df = N − a) = ψ̂² / [MSW Σ_j c_j²/n_j]

where ψ̂ = Σ_j c_j x̄_j and c_j = the coefficient for group j as specified in the null hypothesis.

7.0 The Problem of Multiple Comparisons

For contrasts, α is the pairwise error rate: the rate of making a Type I error for a particular comparison of groups. If we set α to .05, the probability that we will incorrectly reject the null hypothesis for a particular comparison is .05. If we did a hundred contrasts, we would incorrectly reject the null hypothesis five times on average.
However, the probability that we will incorrectly reject at least one null hypothesis in the whole experiment is considerably larger. The experimentwise error rate depends on the type of contrast you do (for the sake of parsimony I will not get into this). The calculation for the highest possible experimentwise error rate is:

Experimentwise error rate: α_EW = 1 − (1 − α)^C

If you do two contrasts, the highest possible experimentwise error rate is 1 − (.95)² = .0975. If we did a hundred experiments and did two contrasts in each experiment, we would incorrectly reject a null hypothesis in 9.75 of the experiments on average.

Usually we want to minimize our experimentwise error rate. There are a few approaches to doing this.

Approach #1. Ignore the omnibus F test, and just do a small number of planned, theoretically-informed contrasts (no more than three).

Approach #2. Do as many planned contrasts as you want, but use the Bonferroni adjustment. (Although if you plan on having a large number of contrasts, you are shading over into post-hoc territory.)

Approach #3. If the omnibus test is significant, test for any contrasts that look interesting (or test for all possible contrasts), but use post-hoc adjustments: Tukey's Wholly Significant Difference (WSD) for simple contrasts and the Scheffé adjustment for complex contrasts.

Planned contrast: a contrast you were interested in BEFORE the data were collected. Usually guided by theory.

Post-hoc contrast: a contrast you want to test AFTER looking at the data (or when you test for all possible comparisons). Post-hoc contrasts are more likely to lead to Type I error because you are testing for differences regardless of theory.

7.1 Bonferroni Adjustment

Bonferroni contrasts involve simply setting:

α_PW = α_EW / C

α_PW = pairwise alpha
α_EW = experimentwise alpha
C = number of planned contrasts

In practice, this usually boils down to setting α_PW equal to .05/C. If you plan on having three contrasts, then α_PW will equal .0167.
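As a sketch of how the contrast F-test and the Bonferroni criterion fit together, here is the hypertension example in Python. scipy is assumed, and `contrast_F` is a hypothetical helper (not a library function) implementing F = ψ̂² / (MSW · Σ c_j²/n_j) from section 6.0.

```python
# Contrast F-tests for the hypertension example, with a Bonferroni
# criterion. contrast_F is a hypothetical helper (scipy assumed).
from scipy.stats import f

def contrast_F(coefs, means, ns, MSW, df_within):
    """F = psi^2 / (MSW * sum(c_j^2 / n_j)), with its unadjusted p-value."""
    psi = sum(c * m for c, m in zip(coefs, means))
    F = psi**2 / (MSW * sum(c**2 / n for c, n in zip(coefs, ns)))
    return F, f.sf(F, 1, df_within)

means = [104, 91, 92, 73]            # drug therapy, biofeedback, diet, combination
ns    = [4, 6, 5, 5]
MSW, df_within, C = 67.375, 16, 3    # C = number of planned contrasts

# Drug therapy vs. biofeedback (a simple contrast).
F1, p1 = contrast_F([1, -1, 0, 0], means, ns, MSW, df_within)
print(F1, p1, "reject" if p1 < .05 / C else "retain")   # F ≈ 6.02, retain

# Combination vs. the average of the other three (a complex contrast).
F2, p2 = contrast_F([1/3, 1/3, 1/3, -1], means, ns, MSW, df_within)
print(F2, p2)   # F ≈ 28.4
```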
Example: Take the contrast we did between biofeedback and drug therapy. Let us say that it was one of three planned contrasts. We saw that F(1, 16) = 6.02 and the unadjusted p = .026. You can use one of three ways to figure out the significance of the adjusted contrast.

1. Compare the unadjusted p to α_PW. Our p is greater than .05/3 = .0167, so we have to retain the null hypothesis.
2. Create an "adjusted p" by multiplying the unadjusted p by C, and compare it to α_EW. The adjusted p = .078 is greater than .05. Retain the null hypothesis.
3. Find the adjusted critical value F*_1,N−a;.05/C. Here F*_1,16;.0167 = 7.14. F < F*, so retain the null hypothesis.

Danger of the Bonferroni adjustment: if you have a lot of planned contrasts, the Bonferroni adjustment will be less powerful (more likely to commit Type II error) than post-hoc contrasts.

7.2 Tukey's Wholly Significant Difference

We use the Tukey WSD for post-hoc contrasts involving only two groups. With the Tukey WSD, the critical value and the p-value come from a different distribution: the "studentized range distribution". The logic of the studentized range distribution is that you can get a critical value for testing the difference between the group with the lowest mean and the group with the highest mean and still keep α_PW(min–max) and α_EW at .05. If there is going to be any difference between the groups, it is definitely going to occur between the group with the lowest mean and the group with the highest mean (in our hypertension example, this would be between the drug therapy and combination groups). We use the same critical value for other pairwise comparisons, which means that for non-maximum pairwise comparisons α_PW is < .05 and α_EW is still .05.

Values from the studentized range distribution are denoted as q. To get the critical value for F from this distribution, use q²_a,N−a;.05 / 2.

Example: For the blood pressure experiment we did, we had four groups (a = 4) and 20 subjects (N − a = 20 − 4 = 16). We need to find q²_4,16;.05 / 2.
We find q by looking it up in a statistical table; q = 4.046, so q²_4,16;.05 / 2 = 8.185. For our drug therapy – biofeedback contrast, F(1, 16) = 6.02, which is less than 8.185. According to the Tukey WSD, we must retain the null hypothesis. We cannot prove a difference exists in the population.

7.3 Scheffé Adjustment for Complex Contrasts

We use the Scheffé test for all of our post-hoc contrasts if any of them are complex. If none of your post-hoc contrasts are complex, then do not use the Scheffé test, as it is much less powerful than the other techniques we have talked about for pairwise comparisons. The Scheffé adjustment has a logic similar to the Tukey adjustment: the critical values come from a probability distribution for testing the biggest possible difference among the groups.

For the Scheffé we can go back to the F distribution. The critical value for a Scheffé test is:

(a − 1) F*_a−1,N−a;.05

Example: For the complex contrast we tested above, we found that F(1, 16) = 28.40.

(a − 1) F*_a−1,N−a;.05 = 3 F*_3,16;.05 = 3(3.24) = 9.72

Our test statistic is greater than the critical value, so we can reject the null hypothesis with the Scheffé test.

8.0 Two-Way ANOVA

Most of the time, researchers are not interested in the relationship between only two variables. More often, they want to examine how multiple variables affect a particular outcome.

Hypertension experiment (outcome: systolic blood pressure):

           Control   Drug Therapy   Biofeedback   Biofeedback & Drug
           185       186            188           158
           190       191            183           163
           195       196            198           173
           200       181            178           178
           180       176            193           168
Mean       190       186            188           168
s          7.91      7.91           7.91          7.91

Grand mean = 183. (Each group has the same standard deviation, s = 7.91, so each group variance is 62.5.)

Two-way approach:

                           Biofeedback
                           Absent    Present    Average
Drug Therapy   Absent      190       188        189
               Present     186       168        177
Average                    188       178        183

In this factorial ANOVA, we are looking at: (a) the main effect of drug therapy; (b) the main effect of biofeedback; and (c) the interaction effect of the drug therapy and biofeedback treatments.
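The cell means and marginal means in the two-way layout can be reproduced from the raw scores; a minimal sketch in Python:

```python
# Reproducing the cell means and marginal means of the two-way
# hypertension layout from the raw scores above.
cells = {   # (drug therapy, biofeedback) -> scores
    ("absent",  "absent"):  [185, 190, 195, 200, 180],  # control
    ("absent",  "present"): [188, 183, 198, 178, 193],  # biofeedback only
    ("present", "absent"):  [186, 191, 196, 181, 176],  # drug therapy only
    ("present", "present"): [158, 163, 173, 178, 168],  # both treatments
}

def mean(xs):
    return sum(xs) / len(xs)

cell_means = {k: mean(v) for k, v in cells.items()}       # 190, 188, 186, 168
drug_absent  = mean(cells[("absent", "absent")] + cells[("absent", "present")])
drug_present = mean(cells[("present", "absent")] + cells[("present", "present")])
grand = mean([y for v in cells.values() for y in v])

print(cell_means)
print(drug_absent, drug_present, grand)   # 189.0 177.0 183.0
```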
Main effects are the effects of a factor averaging across all the levels of all the other factors. An interaction effect occurs when the effect of a factor is contingent on the level of another factor.

Main effect of drug therapy: Compare the SBP (systolic blood pressure) of people without drug therapy to the SBP of people with drug therapy. Subjects undergoing drug therapy see a decline in SBP of 12 mmHg (189 − 177).

Main effect of biofeedback: Compare the SBP of people without biofeedback to the SBP of people with biofeedback. Subjects undergoing biofeedback see a decline in SBP of 10 mmHg (188 − 178).

Interaction effect: We can talk about the interaction between any two variables (a and b) in two ways:

How does the effect of a differ across levels of b?
 o Effect of biofeedback without drug therapy (190 − 188 = 2)
 o Effect of biofeedback with drug therapy (186 − 168 = 18)

How does the effect of b differ across levels of a?
 o Effect of drug therapy without biofeedback (190 − 186 = 4)
 o Effect of drug therapy with biofeedback (188 − 168 = 20)

8.01 Terminology

Factor: independent variable.
Level: value of a single independent variable.

In the example, we have two factors (a two-way ANOVA). Each factor has two levels (absent/present). We designate a two-way factorial ANOVA with the notation a × b, where a is the number of levels in the first factor and b is the number of levels in the second factor. In the example, we have a 2 × 2 ANOVA.

Cell: a combination of two or more levels from different independent variables. In the example, we have 4 cells (neither biofeedback nor drug therapy; biofeedback only; drug therapy only; both biofeedback & drug therapy).
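The main effects and interaction comparisons described above are just arithmetic on the 2 × 2 table of cell means; a quick check in Python:

```python
# Main effects and the interaction, read off the 2x2 table of cell means
# (first key: drug therapy absent/present, second: biofeedback).
m = {("absent", "absent"): 190, ("absent", "present"): 188,
     ("present", "absent"): 186, ("present", "present"): 168}

# Main effects: differences between marginal means.
drug_effect = (m[("absent", "absent")] + m[("absent", "present")]) / 2 \
            - (m[("present", "absent")] + m[("present", "present")]) / 2  # 189 - 177
bfb_effect  = (m[("absent", "absent")] + m[("present", "absent")]) / 2 \
            - (m[("absent", "present")] + m[("present", "present")]) / 2  # 188 - 178

# Interaction: the effect of biofeedback depends on drug therapy.
bfb_without_drug = m[("absent", "absent")] - m[("absent", "present")]     # 2
bfb_with_drug    = m[("present", "absent")] - m[("present", "present")]   # 18

print(drug_effect, bfb_effect, bfb_without_drug, bfb_with_drug)  # 12.0 10.0 2 18
```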
8.02 Omnibus F-tests

Hypertension experiment, individual values by cell:

                           Biofeedback
                           Absent                     Present
Drug Therapy   Absent      185, 190, 195, 200, 180    188, 183, 198, 178, 193
               Present     186, 191, 196, 181, 176    158, 163, 173, 178, 168

Cell means:

                           Biofeedback
                           Absent    Present    Average
Drug Therapy   Absent      190       188        189
               Present     186       168        177
Average                    188       178        183

Sum of squares within: square each score's deviation from its cell mean and sum over everyone. For example, the first cell contributes (185 − 190)² + (190 − 190)² + (195 − 190)² + (200 − 190)² + (180 − 190)² = 250, and likewise for the other three cells:

SSW = Σ_i Σ_j Σ_k (y_ijk − ȳ_.jk)² = 1000

For the main effect of a factor, the F statistic compares that factor's marginal means to the grand mean:

F = [Σ_j n_j (ȳ_.j. − ȳ_...)² / (a − 1)] / [Σ_i Σ_j Σ_k (y_ijk − ȳ_.jk)² / (N − ab)]

For the main effect of biofeedback, each of the 10 subjects at a given level contributes the squared deviation of that level's marginal mean from the grand mean:

SS_bfb = 10(188 − 183)² + 10(178 − 183)² = 500

For the main effect of drug therapy:

SS_drug = 10(189 − 183)² + 10(177 − 183)² = 720

For the interaction effect AB, each observation contributes the squared deviation of its cell mean from what the two main effects alone would predict:

F = [Σ_j Σ_k n_jk (ȳ_.jk − ȳ_.j. − ȳ_..k + ȳ_...)² / ((a − 1)(b − 1))] / [Σ_i Σ_j Σ_k (y_ijk − ȳ_.jk)² / (N − ab)]

SS_bfb×drug = 5[(190 − 189 − 188 + 183)² + (188 − 189 − 178 + 183)² + (186 − 177 − 188 + 183)² + (168 − 177 − 178 + 183)²] = 320

i indexes individuals
j indexes groups in the A factor
k indexes groups in the B factor
a = number of levels in the A factor
b = number of levels in the B factor
n = sample size

Degrees of freedom for factor A = a − 1
Degrees of freedom for factor B = b − 1
Degrees of freedom for A×B = (a − 1)(b − 1)
Within-group degrees of freedom = N − ab

Tests of Between-Subjects Effects
Dependent Variable: sbp
Source            Type III SS     df   Mean Square        F        Sig.
Corrected Model     1540.000a      3      513.333        8.213     .002
Intercept         669780.000       1   669780.000    10716.480     .000
bfb                  500.000       1      500.000        8.000     .012
drug                 720.000       1      720.000       11.520     .004
bfb * drug           320.000       1      320.000        5.120     .038
Error               1000.000      16       62.500
Total             672320.000      20
Corrected Total     2540.000      19
a. R Squared = .606 (Adjusted R Squared = .532)

For SPSS output for two-way ANOVAs, disregard the "Intercept" and "Total" rows. In this case, we see that both main effects are significant, and there is a significant interaction between biofeedback and drug therapy.
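The sums of squares in the SPSS table can be recomputed from the raw scores following the formulas above; a sketch in Python:

```python
# Recomputing the two-way sums of squares from the raw scores; the
# results should match the SPSS table (bfb 500, drug 720, interaction
# 320, error 1000).
cells = {   # (drug j, biofeedback k) -> scores, n = 5 per cell
    (0, 0): [185, 190, 195, 200, 180],
    (0, 1): [188, 183, 198, 178, 193],
    (1, 0): [186, 191, 196, 181, 176],
    (1, 1): [158, 163, 173, 178, 168],
}
mean = lambda xs: sum(xs) / len(xs)
n = 5

cm    = {jk: mean(v) for jk, v in cells.items()}                  # cell means
grand = mean([y for v in cells.values() for y in v])              # 183
drug  = {j: mean(cells[(j, 0)] + cells[(j, 1)]) for j in (0, 1)}  # row margins
bfb   = {k: mean(cells[(0, k)] + cells[(1, k)]) for k in (0, 1)}  # col margins

SSW    = sum((y - cm[jk])**2 for jk in cells for y in cells[jk])
SSdrug = sum(2 * n * (drug[j] - grand)**2 for j in (0, 1))
SSbfb  = sum(2 * n * (bfb[k] - grand)**2 for k in (0, 1))
SSint  = sum(n * (cm[(j, k)] - drug[j] - bfb[k] + grand)**2
             for j in (0, 1) for k in (0, 1))

MSW = SSW / 16                                    # N - ab = 20 - 4
print(SSW, SSdrug, SSbfb, SSint)                  # 1000 720 500 320
print(SSdrug / MSW, SSbfb / MSW, SSint / MSW)     # F = 11.52, 8.0, 5.12
```

Each effect here has 1 degree of freedom, so its mean square equals its sum of squares, and F is just SS/MSW, matching the SPSS "F" column.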
9.0 Contrasts For Two-Way ANOVAs

Consider a new example: police job performance. The first factor is the location of the office; the second factor is training duration. Each cell has n = 5 officers.

                              Location of Office
                      Upper Class   Middle Class   Lower Class    Mean
Training   5 weeks        33            30             20         27.67
Duration   10 weeks       35            31             40         35.33
           15 weeks       38            36             52         42
Mean                      35.33         32.33          37.33

Tests of Between-Subjects Effects
Dependent Variable: jobperf
Source              Type III SS    df   Mean Square       F       Sig.
Corrected Model       2970.000a     8      371.250       5.940    .000
Intercept            55125.000      1    55125.000     882.000    .000
location               190.000      2       95.000       1.520    .232
weeks                 1543.333      2      771.667      12.347    .000
location * weeks      1236.667      4      309.167       4.947    .003
Error                 2250.000     36       62.500
Total                60345.000     45
Corrected Total       5220.000     44
a. R Squared = .569 (Adjusted R Squared = .473)

With a two-way ANOVA, you can do three different kinds of contrasts:

Main effects contrasts. Comparing two levels in a main effect. E.g., is the difference between middle- and lower-class precincts significant? (32.33 vs. 37.33)

Simple effects contrasts. Comparing levels A1 and A2 within level B1. Example: Is there a significant difference between middle- and lower-class precincts in the five-week condition?

Interaction contrasts. Comparing the A1–A2 difference in B1 to the A1–A2 difference in B2. Example: Are the middle–lower class differences the same in the 5-week and 10-week conditions?

9.1 Main Effects Contrasts

To get the F statistic for a contrast on the main effect of Factor A:

F(1, N − ab) = ψ̂² / [MSW Σ_j c_j²/n_j.]

where ψ̂ = Σ_j c_j ȳ_.j. and n_j. is the number of observations at level j of Factor A. (MSW here is the within-cell mean square from the factorial ANOVA, 62.5.)

Example: H0: μ_LC = μ_MC

Null hypothesis expressed differently: H0: (1)μ_LC − (1)μ_MC + (0)μ_UC = 0

ψ̂ = 0(35.33) − 1(32.33) + 1(37.33) = 5

F = 5² / [62.5 × (0²/15 + 1²/15 + 1²/15)] = 25 / 8.33 = 3.00

F*_1,36 = 4.11

We retain the null hypothesis that police officers from middle- and lower-class precincts have the same job performance.

Adjustments to the critical value when testing for differences between levels of Factor A:

Bonferroni: F* = F*_1,N−ab;.05/C
Tukey: F* = (q_a,N−ab;.05)² / 2
Scheffé: F* = (a − 1) F*_a−1,N−ab;.05

9.2 Simple Effects Contrasts

F-test:

F(1, N − ab) = ψ̂² / [MSW Σ_j Σ_k c_jk²/n_jk]

where ψ̂ = Σ_j Σ_k c_jk ȳ_.jk

Example: Is the difference between the upper- and middle-class precincts significant in the five-week condition?

H0: μ_MC,5W = μ_UC,5W

Null hypothesis expressed differently: H0: (1)μ_UC,5W − (1)μ_MC,5W = 0, with coefficients of 0 on the other seven cells.

ψ̂ = 1(33) − 1(30) = 3

F = 3² / [62.5 × (1/5 + 1/5)] = 9 / 25 = 0.36

F*_1,36 = 4.11. Again, we retain the null hypothesis.

Adjustments to the critical value when testing whether Factor A (the focal factor) has an effect within levels of Factor B (the moderating factor):

Bonferroni: F* = F*_1,N−ab;.05/C. C is usually b, the number of levels in Factor B, the moderating factor (it would be greater than b if you had more planned contrasts).
Tukey (use if the focal factor has only two levels): F* = (q_a,N−ab;.05/b)² / 2. In practice this is difficult to do because neither SPSS nor Excel has functions for the studentized range distribution, so alternatively: F* = (q_ab,N−ab;.05)² / 2.

Scheffé: F* = (a − 1) F*_(a−1),N−ab;.05/b

9.3 Interaction Contrasts

F-statistic:

F(1, N − ab) = ψ̂² / [MSW Σ_j Σ_k c_jk²/n_jk]

where ψ̂ = Σ_j Σ_k c_jk ȳ_.jk

Example: According to the data, officers in lower-class precincts who trained for 15 weeks have a better job performance rating than officers in middle-class precincts with the same training level (52 vs. 36). If we look at officers who trained for 5 weeks, officers in lower-class precincts have a worse job performance rating than officers in middle-class precincts (20 vs. 30). Is the difference between differences significant?

The null hypothesis for this contrast is:

H0: μ_LC,15W − μ_MC,15W = μ_LC,5W − μ_MC,5W

Using algebra, we can rework this equation so that:

H0: ψ = μ_LC,15W − μ_MC,15W − μ_LC,5W + μ_MC,5W = 0

ψ̂ = 1(52) − 1(36) − 1(20) + 1(30) = 26

F(1, 36) = 26² / [62.5 × (1/5 + 1/5 + 1/5 + 1/5)] = 676 / 50 = 13.52

13.52 > F*_1,36 = 4.11, so we reject the null hypothesis: the lower/middle difference varies significantly between the 5-week and 15-week conditions.

Adjusting the critical values:

Bonferroni: F* = F*_1,N−ab;.05/C

Tukey's WSD: Tukey's WSD is for pairwise contrasts only. Interaction contrasts are always complex, so you cannot do a Tukey's WSD for an interaction contrast.

Scheffé: F* = (a − 1)(b − 1) F*_(a−1)(b−1),N−ab;.05
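To close the loop, the interaction contrast and its adjusted criteria can be checked numerically; a sketch in Python, assuming scipy (`f.ppf` supplies the critical values a table would):

```python
# Checking the interaction contrast (section 9.3) against both the
# unadjusted and the Scheffé-adjusted criteria (scipy assumed).
from scipy.stats import f

MSW, n = 62.5, 5          # within-cell mean square and cell size
N, a, b = 45, 3, 3        # 45 officers, 3 locations, 3 training durations
df_within = N - a * b     # 36

# psi-hat = LC,15w - MC,15w - LC,5w + MC,5w
coefs_means = [(1, 52), (-1, 36), (-1, 20), (1, 30)]
psi = sum(c * m for c, m in coefs_means)                          # 26
F_stat = psi**2 / (MSW * sum(c**2 / n for c, _ in coefs_means))   # 13.52

F_unadj   = f.ppf(0.95, 1, df_within)                             # ≈ 4.11
F_scheffe = (a - 1) * (b - 1) * f.ppf(0.95, (a - 1) * (b - 1), df_within)

print(F_stat, F_unadj, F_scheffe)   # 13.52 exceeds both criteria
```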