ANOVA Overview
Josh Klugman
March 19th, 2009
1.0 The Logic of Significance Tests
Social science researchers are interested in proving causal relationships between variables. That
is, they want to prove that a change in one variable will produce a change in another variable.
x → y
x = independent variable, a.k.a. predictor, a.k.a. explanatory variable
y = dependent variable, a.k.a. outcome, a.k.a. response variable
When psychologists conduct an experiment on a sample of 50 people, they often are not
interested in the results for the 50 people per se – they want to show that the results generalize to
a broader population, a population that we generally cannot directly observe.
We use significance tests to show that an observed relationship in a sample can be generalized to
a population. We see variable x affects variable y in our sample, but that does not prove that the
relationship exists in the population. It could be that by random chance our sample was a fluke
and we have a relationship that exists in our sample but is not “real” (occurs in the population).1
To see if we can support the notion that a relationship exists in a population, we carry out a
significance test.
A significance test is a thought experiment. We set up a null hypothesis that says there is NO
relationship in the population. We assume the null hypothesis is right, and we calculate the
p-value, which is the probability that we would see a relationship at least as strong as the one we
observed in our sample. If the p-value is "low enough" we say we reject the null hypothesis. If
the p-value is not low enough we say we have to retain the null hypothesis.
α = the threshold for whether or not the p-value is "low enough". Conventionally it is set to .05.
1 We assume that the data we have is a random sample of the population of interest.
Erroneously rejecting the null hypothesis is called Type I error (probability of making Type I
error = α).
Erroneously retaining the null hypothesis is called Type II error.
When you commit a Type I error, you are saying there is a relationship when in fact the
relationship does NOT exist in the population.
When you commit a Type II error, you are saying you cannot prove a relationship exists when in
fact it does exist in the population.
In social science, committing a Type I error is considered a bigger sin than Type II error. This is
because social science is a conservative enterprise. Our default assumption is there is not a
relationship between a given set of variables UNLESS we can prove otherwise. When you say a
relationship does exist, you are challenging what we traditionally thought.
When you say a relationship does not exist, you are upholding our default assumption. To do
this incorrectly is bad, but at least the door is left open for someone else to test the assumption.
2.0 The Logic of One-Way ANOVA
We use ANOVA to test the proposition that there is a causal relationship in which a categorical variable
(say, experimental condition) affects an outcome.
With one-way ANOVA we are interested in testing for “significant” differences between three or
more groups on some outcome.
Example: We are interested in determining if we can induce specific moods in our subjects. We
use clips from movies as the mood-induction treatments, and we have three conditions: pleasant,
neutral, and unpleasant. After the subject watches the clips, we measure their affect on a scale
from 1 to 8, where 1 indicates sadness, and 8 indicates happiness. We get this data:
         Pleasant   Neutral   Unpleasant
         7.4        3.1       5.1
         6          3.8       2.1
         5.1        2.9       2.4
         7.8        4.6       2.2
         7.4        3.9       3.3
         7.2        4         1.5
         5          3.8       3.1
         6.4        3.6       3.6
         6.7        3.9       2.3
         6.1        3.4       2.2
ȳ_j      6.51       3.7       2.78
s_j      0.97       0.48      1.03
s²_j     0.94       0.23      1.06

ȳ_j = mean for condition j in the sample
s_j = standard deviation for condition j in the sample
s²_j = variance for condition j in the sample
We see that there are differences between the means for the three conditions.
The question is, are these differences real, or are they just random differences caused by
sampling variability?
We set up a null hypothesis that there are no differences between the true population means (μ_j). We
want to knock down this null hypothesis.
H0: μ_pleasant = μ_neutral = μ_unpleasant
Ha: Not all of the population means are equal
μ_j = mean for condition j in the population
In order to see if the populations have the same mean, the logic of ANOVA is to see how far
apart the sample means are from each other, relative to the variation that occurs within the
groups.
[Figure: two panels, A and B, each showing the distributions of three groups.]
(Let's assume all six of these groups are normally distributed, so that the mean equals the
median.)
In A and B, the differences between the means are the same. But in B, the differences are larger
relative to the variability within the groups.
The logic of ANOVA says that in A, the mean differences are more likely due to random chance.
In B, it is less likely we see these mean differences due to chance. We are more likely to reject
the null hypothesis under B than under A.
3.0 Conducting a One-Way ANOVA (the omnibus F-test)
To conduct an ANOVA, we calculate an F statistic:
F = variance between groups / variance within groups
  = Mean Between-Group Sum of Squares / Mean Within-Group Sum of Squares

F = \frac{\sum_j n_j (\bar{y}_{.j} - \bar{y}_{..})^2 / (a - 1)}{\sum_j \sum_i (y_{ij} - \bar{y}_{.j})^2 / (N - a)}

a = number of groups
N = total number of people
n_j = number of people in group j
y_ij = value of y for person i in group j
ȳ_.j = mean value of y for group j
ȳ_.. = grand mean (mean of the whole sample)
[Figure legend:]
Thick blue horizontal bar: the grand mean (ȳ_..)
Thin black horizontal bars: group means (ȳ_.j)
Light blue vertical lines: distance between the group means and the grand mean (ȳ_.j − ȳ_..)
Thin black vertical lines: distance between individuals and their respective group means (y_ij − ȳ_.j)
For our example,
F = MSB / MSW = 37.759 / .746 = 50.61
MSB = mean between-group sum of squares
MSW = mean within-group sum of squares
The "between-group" sum of squares is sometimes called the "model sum of squares"; it
represents the differences explained by our model.
The "within-group" sum of squares is also called the residual sum of squares or the
unexplained sum of squares; these are the differences unexplained by our model.
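To make the computation concrete, here is a minimal sketch in Python (assuming numpy is available) that reproduces the F statistic for the mood-induction data above; the variable names are purely illustrative.

```python
import numpy as np

# Affect scores from the mood-induction table above (10 subjects per condition)
groups = {
    "pleasant":   [7.4, 6, 5.1, 7.8, 7.4, 7.2, 5, 6.4, 6.7, 6.1],
    "neutral":    [3.1, 3.8, 2.9, 4.6, 3.9, 4, 3.8, 3.6, 3.9, 3.4],
    "unpleasant": [5.1, 2.1, 2.4, 2.2, 3.3, 1.5, 3.1, 3.6, 2.3, 2.2],
}

data = [np.array(v, dtype=float) for v in groups.values()]
a = len(data)                          # number of groups
N = sum(len(g) for g in data)          # total number of people
grand_mean = np.concatenate(data).mean()

# Between-group mean square: sum of n_j * (group mean - grand mean)^2, over a - 1
msb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in data) / (a - 1)

# Within-group mean square: squared deviations from each group's own mean, over N - a
msw = sum(((g - g.mean()) ** 2).sum() for g in data) / (N - a)

F = msb / msw
print(f"MSB = {msb:.3f}, MSW = {msw:.3f}, F({a - 1}, {N - a}) = {F:.2f}")
# Should print values close to the text: MSB = 37.759, MSW = 0.746, F = 50.61
```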
4.0 The F-Statistic
In order for us to say that there is a real difference among the groups, the F statistic has to be, at a
minimum, above 1. In other words, the between-group mean sum of squares has to be bigger
than the within-group mean sum of squares.
This is because between-group differences in the sample are actually caused by between-group
AND within-group differences in the population.
If F is less than or equal to 1, we can never be sure that the observed group differences reflect
TRUE group differences.
[Diagram: "Population (truth)" and "Sample (what we actually observe)", each with boxes for
"Differences Between Groups" and "Differences Within Groups"; both kinds of population
differences feed into the between-group differences observed in the sample.]
But in order for us to conclude that there are true group differences, we need an F that is much
larger than 1. How much larger? For that, we need to look at the F distribution.
The F distribution is a right-skewed distribution formed as a ratio of two chi-square variables (each
divided by its degrees of freedom). It is specified by a numerator degrees of freedom (a − 1) and a
denominator degrees of freedom (N − a).
[Figure: probability density functions for various F distributions.]
[Figure: probability density function for an F distribution with 2, 27 degrees of freedom.]
Social scientists evaluate the calculated F-statistic in two ways.
Method 1: Calculate the p-value, the probability of getting a higher F-statistic (finding the area
in the right tail). If this area is "low enough" we can say we reject the null hypothesis and retain
the alternative hypothesis. We denote the "low enough" threshold with α. The conventional α
is .05. α should be determined at the outset.
P-value for this example: 7.34 × 10^-10, which we commonly report as < .001.
The p-value represents the probability of observing group differences at least as big as the ones
we observed, if there are no true differences.
The statistical package SPSS automatically gives researchers the p-value (in the “Sig.” box).
ANOVA (dependent variable: affect3)

                  Sum of Squares   df   Mean Square   F        Sig.
Between Groups    75.518           2    37.759        50.608   .000
Within Groups     20.145           27   .746
Total             95.663           29
Method 2: Determine α, and then find the "critical value" on the F distribution that bounds
α (we denote critical F values as F*). If the F statistic is larger than F*, then you can reject the
null hypothesis.
Critical value for an F(2, 27) distribution at α = .05: F*(2, 27; .05) = 3.35.
You can calculate p-values using the FDIST function in Excel, and you can calculate critical
values using the FINV function.
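If you work in Python instead, scipy's F distribution offers rough equivalents of these Excel functions; a sketch, assuming scipy is installed:

```python
from scipy import stats

F, df1, df2 = 50.608, 2, 27

# Area to the right of the observed F (what Excel's FDIST returns)
p_value = stats.f.sf(F, df1, df2)

# Critical value bounding the upper alpha = .05 tail (what Excel's FINV returns)
F_crit = stats.f.ppf(1 - 0.05, df1, df2)

print(p_value)   # roughly 7.3e-10
print(F_crit)    # roughly 3.35
```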
5.0 Assumptions of ANOVA
• The samples are drawn from populations with normal distributions (the continuous variable is normally distributed)
• The samples are drawn from populations with equal variances
• The cases in the samples are statistically independent of each other.
Violating the normality and equality-of-variances assumptions is usually not a big deal unless you
have small groups or wildly unequal group sizes. If you do, you lose control of the probabilities of
committing Type I and Type II errors, and you will need to use special techniques to account for
these violations.
6.0 Contrasts
Let us use a different example.
Here is some hypothetical data on an experiment looking at various ways to treat hypertension
(the outcome is systolic blood pressure, measured in mmHg).
                 Drug Therapy   Biofeedback   Diet   Combination
                 94             81            98     81
                 105            84            95     68
                 103            92            86     75
                 114            101           87     70
                                80            94     71
                                108
Mean (ȳ_.j)      104            91            92     73
Variance (s²_j)  67.33          132           27.5   26.5
n_j              4              6             5      5
H0: DrugTherapy = Biofeedback = Diet= Combo
ANOVA (dependent variable: sbp)

                  Sum of Squares   df   Mean Square   F        Sig.
Between Groups    2246.550         3    748.850       11.115   .000
Within Groups     1078.000         16   67.375
Total             3324.550         19
Because the p-value is so low, we can reject the null hypothesis. We see there is at least one
significant difference among these four groups, but maybe we are interested in testing for
specific differences. Say, the differences between biofeedback and drug therapy.
Mean Drug Therapy – Biofeedback difference: 104 – 91 = 13.
We see that in our sample people with biofeedback have a lower systolic blood pressure by 13
mmHg. But again, we have to ask if this is a TRUE difference—if it really occurs in the
population.
Again, we have to turn to the F statistic.
H0: DrugTherapy = Biofeedback
Alternatively:
H0: (1)DrugTherapy – (1)Biofeedback + 0Diet + 0Combo=0
F
 y.DT  y.B 2
 1
1 

 MSW
 nDT nB 
F (1,16) 
104  912
1 1
  1078 / 16
4 6

132
 6.02
28.07
p = .026
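As a sketch of the same calculation in Python (assuming scipy; the variable names are illustrative):

```python
from scipy import stats

# Quantities from the hypertension example above
mean_dt, n_dt = 104, 4          # drug therapy group
mean_bfb, n_bfb = 91, 6         # biofeedback group
msw, df_within = 1078 / 16, 16  # within-group mean square and its df

# Pairwise contrast: (difference in means)^2 / [ (1/n1 + 1/n2) * MSW ]
F = (mean_dt - mean_bfb) ** 2 / ((1 / n_dt + 1 / n_bfb) * msw)
p = stats.f.sf(F, 1, df_within)

print(f"F(1, {df_within}) = {F:.2f}, p = {p:.3f}")   # about F = 6.02, p = .026
```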
Let us test a more complicated contrast: the combination treatment versus all three others.
H0: (1/3)(μ_DT + μ_B + μ_D) − μ_C = 0
[equivalently, H0: (1/3)(μ_DT + μ_B + μ_D) = μ_C]

F(1, 16) = \frac{\left[\frac{1}{3}(\bar{x}_{DT} + \bar{x}_B + \bar{x}_D) - \bar{x}_C\right]^2}{MSW \sum_j c_j^2 / n_j}
         = \frac{(95.67 - 73)^2}{67.38\left(\frac{(.33)^2}{4} + \frac{(.33)^2}{6} + \frac{(.33)^2}{5} + \frac{(1)^2}{5}\right)}
         = \frac{22.67^2}{67.38(.269)} = 28.40

p = 6.78 × 10^-5
In general:

F(1, df = N - a) = \frac{\hat{\psi}^2}{MSW \sum_j c_j^2 / n_j}

where

\hat{\psi} = \sum_j \bar{x}_j c_j

c_j = coefficient for group j as specified in the null hypothesis
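A small Python sketch of this general contrast formula (the function name contrast_F is hypothetical, and scipy is assumed for the p-value):

```python
from scipy import stats

def contrast_F(means, ns, coefs, msw, df_within):
    """F-test for a contrast: psi_hat^2 / (MSW * sum(c_j^2 / n_j)), df = (1, N - a)."""
    psi_hat = sum(c * m for c, m in zip(coefs, means))
    denom = msw * sum(c ** 2 / n for c, n in zip(coefs, ns))
    F = psi_hat ** 2 / denom
    return F, stats.f.sf(F, 1, df_within)

# Group means and sizes: drug therapy, biofeedback, diet, combination
means = [104, 91, 92, 73]
ns = [4, 6, 5, 5]
msw, df_w = 1078 / 16, 16

# Complex contrast: combination versus the average of the other three treatments
F, p = contrast_F(means, ns, [1 / 3, 1 / 3, 1 / 3, -1], msw, df_w)
print(F, p)   # roughly F = 28.4, p = 6.8e-05
```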
7.0 The Problem of Multiple Comparisons
For contrasts, alpha is the pairwise error rate. It is the rate of making a Type I error for a
particular comparison of groups. If we set alpha to .05, the probability that we will incorrectly reject the null
hypothesis for a particular comparison is .05. If we did a hundred contrasts (all with true null hypotheses), we
would expect to incorrectly reject the null hypothesis about five times.
However, the probability that we will incorrectly reject at least one null hypothesis in the whole
experiment is considerably larger. The experimentwise error rate depends on the type of contrast
you do (for the sake of parsimony I will not get into this). The calculation for the highest
possible experimentwise error rate is:
Experimentwise error rate: α_EW = 1 − (1 − α)^C
If you do two contrasts, the highest possible experimentwise error rate is 1 − (.95)² = .0975. If we
did a hundred experiments with two contrasts each (and all null hypotheses true), we would expect to
incorrectly reject a null hypothesis in about 9.75 of them.
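The bound is easy to compute directly; for example, in Python:

```python
# Upper bound on the experimentwise error rate for C contrasts at pairwise alpha = .05
alpha = 0.05
for C in (1, 2, 3, 10):
    print(C, 1 - (1 - alpha) ** C)   # C = 2 gives 0.0975, as in the text
```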
Usually we want to minimize our experimentwise error rate. There are a couple of approaches to
do this.
Approach #1. Ignore the omnibus F test, and just do a small number of planned, theoretically-informed contrasts (no more than three).
Approach #2. Do as many planned contrasts as you want, but use the Bonferroni adjustment.
(Although if you plan on having a large number of contrasts, you are shading over into post-hoc
territory.)
Approach #3. If the omnibus test is significant, test for any contrasts that look interesting (or test
for all possible contrasts) but use post-hoc adjustments: Tukey's Wholly Significant
Differences (WSD) for simple contrasts and the Scheffé adjustment for complex contrasts.
Planned contrast – a contrast you were interested in BEFORE the data is collected. Usually
guided by theory.
Post-hoc contrast – a contrast you want to test AFTER looking at the data (or when you test for
all possible comparisons).
(Post-hoc contrasts are more likely to lead to Type I error because you are testing for differences
regardless of theory.)
7.1 Bonferroni Adjustment
Bonferroni contrasts involve simply setting:
α_PW = α_EW / C
α_PW = pairwise alpha
α_EW = experimentwise alpha
C = number of planned contrasts
In practice, this usually boils down to setting α_PW equal to .05/C. If you plan on having three
contrasts, then α_PW will equal .0167.
Example:
Take the contrast we did between biofeedback and drug therapy. Let us say that was one of three
planned contrasts. We saw that F(1, 16) = 6.02 and unadjusted p = .026.
You can use one of three ways to figure out the significance of the adjusted contrast (a code sketch
follows the list).
1. Compare the unadjusted p to α_PW. Our p is greater than .05/3 = .0167, so we have to
retain the null hypothesis.
2. Create an "adjusted p" by multiplying the unadjusted p by C and compare it to α_EW.
Adjusted p = .078 is greater than .05. Retain the null hypothesis.
3. Find the adjusted critical value F*(1, N − a; .05/C). F*(1, 16; .0167) = 7.14. F < F*, so retain the
null hypothesis.
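A sketch of the three checks in Python (assuming scipy; the numbers are those from the example above):

```python
from scipy import stats

p_unadjusted, F = 0.026, 6.02   # drug therapy vs. biofeedback contrast
C, df_within, alpha_ew = 3, 16, 0.05

# (1) compare the unadjusted p to the pairwise alpha
print(p_unadjusted < alpha_ew / C)                      # False -> retain H0

# (2) compare the Bonferroni-adjusted p to the experimentwise alpha
print(min(p_unadjusted * C, 1.0) < alpha_ew)            # False -> retain H0

# (3) compare F to the adjusted critical value (about 7.14)
print(F > stats.f.ppf(1 - alpha_ew / C, 1, df_within))  # False -> retain H0
```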
Danger of Bonferroni adjustment: If you have a lot of planned contrasts the Bonferroni
adjustment will be less powerful (more likely to commit Type II error) than post-hoc contrasts.
7.2 Tukey’s Wholly Significant Differences
We use the Tukey WSD for post-hoc contrasts involving only two groups. With the Tukey
WSD, the critical value and the p-value come from a different distribution, the "studentized
range distribution". The logic of the studentized range distribution is that you can get a critical
value for testing the difference between the group with the lowest mean and the group with the
highest mean and still keep α_PW(min-max) and α_EW at .05. If there is going to be any difference
between the groups, it is definitely going to occur between the group with the lowest mean and
the group with the highest mean (in our hypertension example, this would be between the drug
therapy and combination groups). We use the same critical value for other pairwise
comparisons, which means for non-maximum pairwise comparisons α_PW is < .05 and α_EW is still
.05.
Values from the studentized range distribution are denoted as q. The critical value on the F scale is
F* = [q*(a, N − a; .05)]² / 2.
Example:
For the blood pressure experiment we did, we had four groups (a = 4) and we had 20 subjects
(so N − a = 20 − 4 = 16). We need [q*(4, 16; .05)]² / 2. We find q by looking it up in a statistical table; q =
4.046, and [q*(4, 16; .05)]² / 2 = 8.185.
For our drug therapy – biofeedback contrast, F(1, 16) = 6.02, which is less than 8.185.
According to the Tukey WSD, we must retain the null hypothesis. We cannot prove a difference
exists in the population.
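Recent versions of scipy (1.7 and later) do expose the studentized range distribution, so the critical value can be computed rather than looked up; a sketch, assuming such a version is available:

```python
from scipy.stats import studentized_range

a, df_within = 4, 16   # four treatment groups, N - a = 16

# Critical value from the studentized range distribution, converted to the F scale
q_crit = studentized_range.ppf(0.95, a, df_within)   # about 4.046
F_crit = q_crit ** 2 / 2                             # about 8.19

F = 6.02   # drug therapy vs. biofeedback contrast
print(F > F_crit)   # False -> retain H0 under the Tukey WSD
```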
7.3 Scheffé Adjustment For Complex Contrasts
We use the Scheffé test for all of our post-hoc contrasts if any of them are complex. If none of
your post-hoc contrasts are complex, then do not use the Scheffé test as it is much less powerful
than other techniques we have talked about for pairwise comparisons.
The Scheffé adjustment has a similar logic to the Tukey adjustment – the critical values come
from a probability distribution for testing the biggest possible difference among the groups.
For the Scheffé we can go back to the F distribution. The critical value for a Scheffé test is:
(a − 1) F*(a − 1, N − a; .05)
Example:
For the complex contrast we tested above, we found that F(1, 16) = 28.40.
(a − 1) F*(a − 1, N − a; .05) = 3 F*(3, 16; .05) = 3(3.24) = 9.72
Our test statistic is greater than the critical value, so we can reject the null hypothesis with the
Scheffé test.
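A one-line check of the Scheffé critical value in Python (assuming scipy):

```python
from scipy import stats

a, df_within = 4, 16
F_scheffe_crit = (a - 1) * stats.f.ppf(0.95, a - 1, df_within)   # 3 * 3.24, about 9.72
print(28.40 > F_scheffe_crit)   # True -> reject H0 for the complex contrast
```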
8.0 Two-Way ANOVA
Most of the time, researchers are not interested in the relationships between only two variables.
More often, they want to examine how multiple variables affect a particular outcome.
Hypertension experiment (outcome: systolic blood pressure):
        Control   Drug Therapy   Biofeedback   Biofeedback & Drug
        185       186            188           158
        190       191            183           163
        195       196            198           173
        200       181            178           178
        180       176            193           168
Mean    190       186            188           168
s_j     7.91      7.91           7.91          7.91

Grand Mean = 183
Two Way Approach:

                        Biofeedback
                   Absent   Present   Average
Drug      Absent   190      188       189
Therapy   Present  186      168       177
          Average  188      178       183
In this factorial ANOVA, we are looking at: (a) the main effect of drug therapy; (b) the main
effect of biofeedback; and (c) the interaction effect of both the drug therapy and biofeedback
treatments.
Main effects are the effect of a factor averaging across all the levels of all the other factors.
An interaction effect is when the effect of a factor is contingent on the level of another factor.
Main effect of drug therapy: Compare SBP (systolic blood pressure) of people without drug
therapy to the SBP of people with drug therapy. Subjects undergoing drug therapy see a decline
in SBP of 12 mmHg (189-177).
Main effect of Biofeedback: Compare SBP of people without biofeedback to SBP of people with
biofeedback. Subjects undergoing biofeedback see a decline in SBP of 10 mmHg (188-178).
Interaction effect:
We can talk about the interaction between any two variables (a and b) in two ways:
• How does the effect of a differ across levels of b?
  o Effect of biofeedback without drug therapy (190 − 188 = 2)
  o Effect of biofeedback with drug therapy (186 − 168 = 18)
• How does the effect of b differ across levels of a?
  o Effect of drug therapy without biofeedback (190 − 186 = 4)
  o Effect of drug therapy with biofeedback (188 − 168 = 20)
8.01 Terminology
Factor: Independent variable
Level: Value of a single independent variable
In the example, we have two factors (a two-way ANOVA). Each factor has two levels
(absent/present). We designate a two-way factorial ANOVA with this notation: a × b, where a is
the number of levels in the first factor, and b is the number of levels in the second factor.
In the example, we have a 2 × 2 ANOVA.
Cell: Combination of two or more levels from different independent variables.
In the example, we have 4 cells (neither biofeedback nor drug therapy; biofeedback only; drug
therapy only; both biofeedback & drug therapy).
8.02 Omnibus F-tests
                        Biofeedback
                   Absent   Present   Average
Drug      Absent   190      188       189
Therapy   Present  186      168       177
          Average  188      178       183
Hypertension Experiment Individual Values

                         BFB Absent                 BFB Present
Drug Therapy Absent      185, 190, 195, 200, 180    188, 183, 198, 178, 193
Drug Therapy Present     186, 191, 196, 181, 176    158, 163, 173, 178, 168
Sum of Squares Within (each observation's squared deviation from its own cell mean):

Drug therapy absent, BFB absent:
(185-190)² + (190-190)² + (195-190)² + (200-190)² + (180-190)²
Drug therapy absent, BFB present:
(188-188)² + (183-188)² + (198-188)² + (178-188)² + (193-188)²
Drug therapy present, BFB absent:
(186-186)² + (191-186)² + (196-186)² + (181-186)² + (176-186)²
Drug therapy present, BFB present:
(158-168)² + (163-168)² + (173-168)² + (178-168)² + (168-168)²

SSW = 1000
For the main effect of factor A (drug therapy):

F = \frac{\sum_j n_j (\bar{y}_{.j.} - \bar{y}_{...})^2 / (a - 1)}{\sum_i \sum_j \sum_k (y_{ijk} - \bar{y}_{.jk})^2 / (N - ab)}

Every observation contributes the squared distance between its factor A marginal mean and the grand
mean: each of the 10 subjects without drug therapy contributes (189 − 183)², and each of the 10
subjects with drug therapy contributes (177 − 183)².

SSA = 10(189 − 183)² + 10(177 − 183)² = 720

For the main effect of factor B (biofeedback):

F = \frac{\sum_k n_k (\bar{y}_{..k} - \bar{y}_{...})^2 / (b - 1)}{\sum_i \sum_j \sum_k (y_{ijk} - \bar{y}_{.jk})^2 / (N - ab)}

Each of the 10 subjects without biofeedback contributes (188 − 183)², and each of the 10 subjects
with biofeedback contributes (178 − 183)².

SSB = 10(188 − 183)² + 10(178 − 183)² = 500
For the interaction effect A*B:

F = \frac{\sum_i \sum_j \sum_k (\bar{y}_{.jk} - \bar{y}_{.j.} - \bar{y}_{..k} + \bar{y}_{...})^2 / [(a - 1)(b - 1)]}{\sum_i \sum_j \sum_k (y_{ijk} - \bar{y}_{.jk})^2 / (N - ab)}

Each observation contributes the squared interaction residual for its cell (cell mean minus both
marginal means plus the grand mean), five observations per cell:

Drug absent, BFB absent:    5 × (190 − 189 − 188 + 183)²
Drug present, BFB absent:   5 × (186 − 177 − 188 + 183)²
Drug absent, BFB present:   5 × (188 − 189 − 178 + 183)²
Drug present, BFB present:  5 × (168 − 177 − 178 + 183)²

SSAB = 320

i indexes individuals
j indexes groups in the A factor
k indexes groups in the B factor
a = number of levels in the A factor
b = number of levels in the B factor
n_j (n_k) = number of people in a given level of factor A (B); N = total number of people
Degrees of freedom for factor A = a − 1
Degrees of freedom for factor B = b − 1
Degrees of freedom for A*B = (a − 1)(b − 1)
Within-group degrees of freedom = N − a*b
Tests of Between-Subjects Effects
Dependent Variable: sbp

Source            Type III Sum of Squares   df   Mean Square   F           Sig.
Corrected Model   1540.000(a)               3    513.333       8.213       .002
Intercept         669780.000                1    669780.000    10716.480   .000
bfb               500.000                   1    500.000       8.000       .012
drug              720.000                   1    720.000       11.520      .004
bfb * drug        320.000                   1    320.000       5.120       .038
Error             1000.000                  16   62.500
Total             672320.000                20
Corrected Total   2540.000                  19

a. R Squared = .606 (Adjusted R Squared = .532)
For SPSS output for two-way ANOVAs, disregard the “Intercept” and “Total” rows.
In this case, we see that both main factors are significant, and there is a significant interaction
between biofeedback and drug therapy.
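For readers who want to reproduce this table outside SPSS, here is a sketch using Python with pandas and statsmodels (both assumed to be installed); in this balanced design the Type II sums of squares it reports should match the Type III values shown above.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Individual SBP values from the 2 x 2 hypertension table above
cells = {
    ("absent",  "absent"):  [185, 190, 195, 200, 180],   # no drug therapy, no biofeedback
    ("present", "absent"):  [186, 191, 196, 181, 176],   # drug therapy only
    ("absent",  "present"): [188, 183, 198, 178, 193],   # biofeedback only
    ("present", "present"): [158, 163, 173, 178, 168],   # both treatments
}
rows = [{"drug": d, "bfb": b, "sbp": y}
        for (d, b), ys in cells.items() for y in ys]
df = pd.DataFrame(rows)

# Two-way factorial ANOVA with the interaction term
model = smf.ols("sbp ~ C(drug) * C(bfb)", data=df).fit()
print(anova_lm(model, typ=2))
# In this balanced design the sums of squares should match the text:
# drug = 720, bfb = 500, interaction = 320, error (residual) = 1000
```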
9.0 Contrasts For Two-Way ANOVAs
Consider the following example:
New example: police job performance
First factor: location of office
Second factor: training duration
                               Training Duration
                        5 weeks   10 weeks   15 weeks   Mean
Location   Upper Class    33        35         38       35.33
of Office  Middle Class   30        31         36       32.33
           Lower Class    20        40         52       37.33
           Mean           27.67     35.33      42

(n = 5 in every cell)
Tests of Between-Subjects Effects
Dependent Variable: jobperf

Source             Type III Sum of Squares   df   Mean Square   F         Sig.
Corrected Model    2970.000(a)               8    371.250       5.940     .000
Intercept          55125.000                 1    55125.000     882.000   .000
location           190.000                   2    95.000        1.520     .232
weeks              1543.333                  2    771.667       12.347    .000
location * weeks   1236.667                  4    309.167       4.947     .003
Error              2250.000                  36   62.500
Total              60345.000                 45
Corrected Total    5220.000                  44

a. R Squared = .569 (Adjusted R Squared = .473)
With a two-way ANOVA, you can do three different kinds of contrasts:
• Main effects contrasts. Comparing two levels in a main effect. E.g., is the difference
between middle- and lower-class precincts significant? (32.33 vs. 37.33)
• Simple effects contrasts. E.g., comparing levels A1 and A2 for level B1. Example: Is
there a significant difference between middle- and lower-class precincts in the five-week
condition?
• Interaction contrasts. E.g., comparing the A1-A2 difference in B1 to the A1-A2
difference in B2. Example: Are the middle-lower class differences the same in the 5-week and
10-week conditions?
9.1 Main Effects Contrasts
To get the F-statistic (to test a contrast on the main effect of Factor A):

F(1, N - ab) = \frac{\hat{\psi}^2}{MSW_F \sum_j c_j^2 / n_{j.}}

\hat{\psi} = \sum_j c_j \bar{y}_{.j.}

(MSW_F is the within-cell mean square from the factorial ANOVA, and n_j. is the total number of
subjects at level j of Factor A.)

Example:
H0: μ_LC = μ_MC
Null hypothesis expressed differently:
H0: (1)μ_LC − (1)μ_MC + (0)μ_UC = 0
                               Training Duration
                        5 weeks   10 weeks   15 weeks   Mean
Location   Upper Class    33        35         38       35.33
of Office  Middle Class   30        31         36       32.33
           Lower Class    20        40         52       37.33
           Mean           27.67     35.33      42
ψ̂ = 0(35.33) − 1(32.33) + 1(37.33) = 5

F = \frac{5^2}{62.5\left(\frac{0^2}{15} + \frac{1^2}{15} + \frac{1^2}{15}\right)} = \frac{25}{8.33} = 3.00

F*(1, 36) = 4.11
We retain the null hypothesis that police officers from middle- and lower-class precincts have the same
job performance.
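A small Python sketch of this contrast formula (the helper name factorial_contrast_F is hypothetical; scipy is assumed). The same function covers the simple effects and interaction contrasts in the next two subsections, since they share the formula.

```python
from scipy import stats

def factorial_contrast_F(means, ns, coefs, msw, df_within):
    """F-test for a contrast on cell or marginal means:
    psi_hat^2 / (MSW * sum(c^2 / n)), with (1, N - ab) degrees of freedom."""
    psi_hat = sum(c * m for c, m in zip(coefs, means))
    denom = msw * sum(c ** 2 / n for c, n in zip(coefs, ns))
    F = psi_hat ** 2 / denom
    return F, stats.f.sf(F, 1, df_within)

msw, df_w = 62.5, 36   # error mean square and its df from the job-performance ANOVA

# Main-effects contrast: middle- vs. lower-class precincts (marginal means, n = 15 each)
F, p = factorial_contrast_F([35.33, 32.33, 37.33], [15, 15, 15],
                            [0, -1, 1], msw, df_w)
print(F, p)   # about F = 3.0, which falls short of F*(1, 36) = 4.11
```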
Adjustments to Critical Value
To test for differences between levels of Factor A:
Bonferroni:  F* = F*(1, N − ab; .05/C)
Tukey:       F* = [q*(a, N − ab; .05)]² / 2
Scheffé:     F* = (a − 1) F*(a − 1, N − ab; .05)
9.2 Simple Effects Contrasts
F-test:

F(1, N - ab) = \frac{\hat{\psi}^2}{MSW_F \sum_j \sum_k c_{jk}^2 / n_{jk}}

\hat{\psi} = \sum_j \sum_k c_{jk} \bar{y}_{.jk}

Example:
Is the difference between the upper- and middle-class precincts significant in the five-week condition?
H0: μ_MC,5W = μ_UC,5W
Null hypothesis expressed differently:
H0: (1)μ_MC,5W − (1)μ_UC,5W + (0)μ_LC,5W + (0)μ_MC,10W + (0)μ_UC,10W + (0)μ_LC,10W + (0)μ_MC,15W + (0)μ_UC,15W + (0)μ_LC,15W = 0
                               Training Duration
                        5 weeks   10 weeks   15 weeks   Mean
Location   Upper Class    33        35         38       35.33
of Office  Middle Class   30        31         36       32.33
           Lower Class    20        40         52       37.33
           Mean           27.67     35.33      42
ψ̂ = 1(33) + 0(35) + 0(38) − 1(30) + 0(31) + 0(36) + 0(20) + 0(40) + 0(52) = 3

F = \frac{3^2}{62.5\left(\frac{1}{5} + \frac{1}{5}\right)} = \frac{9}{25} = .36

F*(1, 36) = 4.11
Again, we retain the null hypothesis.
Adjustments to Critical Value
To test if Factor A (focal factor) has an effect within levels of Factor B (moderating factor):
Bonferroni:
F* = F*(1, N − ab; .05/C)
C is usually b, the number of levels in Factor B, the moderating factor (it would be greater than b if
you had more planned contrasts).
Tukey (use if the focal factor has only two levels):
F* = [q*(a, N − ab; .05/b)]² / 2
In practice this is difficult to do because neither SPSS nor Excel has functions for the studentized range
distribution, so alternatively:
F* = [q*(ab, N − ab; .05)]² / 2
Scheffé:
F* = (a − 1) F*(a − 1, N − ab; .05/b)
9.3 Interaction Contrasts
F-statistic:

F(1, N - ab) = \frac{\hat{\psi}^2}{MSW_F \sum_j \sum_k c_{jk}^2 / n_{jk}}

\hat{\psi} = \sum_j \sum_k c_{jk} \bar{y}_{.jk}
Example:
                               Training Duration
                        5 weeks   10 weeks   15 weeks   Mean
Location   Upper Class    33        35         38       35.33
of Office  Middle Class   30        31         36       32.33
           Lower Class    20        40         52       37.33
           Mean           27.67     35.33      42
According to this data, officers in lower-class precincts who trained for 15 weeks have a better job
performance rating than officers in middle-class precincts with the same training level. If we look at
officers who trained for 5 weeks, we see that officers in lower-class precincts have a worse job
performance rating than officers in middle-class precincts. Is the difference between differences
significant?
The null hypothesis for this contrast is:
H0: μ_LC,15W − μ_MC,15W = μ_LC,5W − μ_MC,5W
Using algebra, we can rework this equation so that:
H0: ψ = μ_LC,15W − μ_MC,15W − μ_LC,5W + μ_MC,5W = 0

\hat{\psi} = \sum_j \sum_k c_{jk} \bar{y}_{.jk}

F(1, N - ab) = \frac{\hat{\psi}^2}{MSW_F \sum_j \sum_k c_{jk}^2 / n_{jk}}

ψ̂ = 1(52) − 1(36) − 1(20) + 1(30) = 26

F(1, N - ab) = \frac{676}{62.5\left(\frac{1}{5} + \frac{1}{5} + \frac{1}{5} + \frac{1}{5}\right)} = 13.52

F*(1, 36) = 4.11

We reject the null hypothesis: the lower/middle difference varies significantly between the 5-week and
15-week conditions.
Adjusting the Critical Values:
Bonferroni:
F* = F*(1, N − ab; .05/C)
Tukey’s WSD
Tukey’s WSD is for pairwise contrasts only. Interaction contrasts are always complex. You cannot do a
Tukey’s WSD for an interaction contrast.
Scheffé
F* = (a − 1)(b − 1) F*((a − 1)(b − 1), N − ab; .05)