Psyc 771/772
Turkheimer
Post-Hoc Comparisons
When a factor has more than two levels it takes more than one contrast code to code for
it, which is why a factor with k levels has k-1 degrees of freedom. The PREs for the
individual single-df contrasts, as opposed to the (k-1)-df effect of the whole factor, are
called planned comparisons, and provide specific tests of the differences between the
relevant groups or means of groups. The multi-df effect of the whole factor is called the
omnibus effect. For a factor with k levels we can make k-1 independent comparisons among
the groups, because that's all the independent information there is to work with.
Sometimes, however, we might go into an analysis not knowing which groups we want to
compare, and not wanting to limit ourselves to k-1 comparisons. We just want to look at
pairs of groups and have some way of knowing whether the differences among the pairs
are "significant." And not necessarily pairs: we might also want to compare one group to
the mean of two others, or make any of a very large number of possible comparisons.
So here is an example: Five treatments for fever blisters, including a placebo, were
randomly assigned to 30 patients. The data and SAS program on the master page describe
the number of days to healing in the five groups.
Here is the output from the main part of PROC GLM:
General Linear Models Procedure

Dependent Variable: DAYS

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               4         36.466667       9.116667       3.90    0.0136
Error              25         58.500000       2.340000
Corrected Total    29         94.966667

R-Square      C.V.        Root MSE    DAYS Mean
0.383994      27.15454      1.5297       5.6333

Source    DF    Type I SS      Mean Square    F Value    Pr > F
GROUP      4    36.466667        9.116667       3.90     0.0136

Source    DF    Type III SS    Mean Square    F Value    Pr > F
GROUP      4    36.466667        9.116667       3.90     0.0136
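As a quick check, the R-Square printed in the table is just the model sum of squares as a proportion of the total sum of squares:

R^2 = \frac{SS_{Model}}{SS_{Total}} = \frac{36.467}{94.967} \approx .384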
So sr² equals .38. But we learned last week that we are not done as long as there are 4 df
in the numerator. If we had specific hypotheses about which of the pairwise mean
differences we were interested in, we could test them with a set of planned comparisons
using a CONTRAST statement in SAS. But sometimes we don't; what we want to do is
look at the five means in a more exploratory way, and see where the large differences are.
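If we did have such hypotheses, the planned comparisons could be requested with CONTRAST statements along these lines. This is only a sketch: the two contrasts shown, group 1 against the average of the other four and group 2 against group 3, are hypothetical illustrations, since nothing above says which comparisons would actually have been planned.

proc glm;
class group;
model days=group;
/* hypothetical planned comparisons -- the weights depend on the hypotheses */
contrast 'group 1 vs. average of groups 2-5' group 4 -1 -1 -1 -1;
contrast 'group 2 vs. group 3' group 0 1 -1 0 0;
run;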
Here are the means:
GROUP    N    Mean         Std Dev      Minimum      Maximum
1        6    7.5000000    1.6431677    5.0000000    10.0000000
2        6    5.0000000    1.2649111    3.0000000     6.0000000
3        6    4.3333333    1.0327956    3.0000000     6.0000000
4        6    5.1666667    1.4719601    3.0000000     7.0000000
5        6    6.1666667    2.0412415    3.0000000     9.0000000
We could compute d or t between any pair of these means we chose, and in fact for
descriptive purposes this would be a very useful thing to do. But from the point of view
of significance testing it is problematic, because for five means there are 10 pairwise
comparisons, and many, many more comparisons based on more than two groups (e.g.,
group 1 vs. groups 2 and 3 combined). So if we were going to test the significance of all
the comparisons we would do a whole lot of tests, which would inflate our experiment-wise
Type I error rate substantially.
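To put a rough number on the inflation (a back-of-the-envelope calculation that treats the tests as independent, which they are not quite):

\binom{5}{2} = \frac{5 \cdot 4}{2} = 10 \text{ pairwise tests}, \qquad 1 - (1 - .05)^{10} \approx .40

So if every pair were tested at α = .05, the chance of at least one spurious "significant" difference would be on the order of 40%, not 5%.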
The most general approach to the problem of multiple hypothesis tests is called the
Bonferroni correction. It is based on the fact that if you do k tests with individual
error rates of α/k, the overall error rate can't be any worse than k(α/k) = α. This means
that you're always OK if you divide the total error rate you want to maintain (i.e., .05) by
the total number of tests you are doing. In this case you are doing 10 pairwise tests, so you
would be safe if you used an individual α of .005.
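As an aside, SAS will compute Bonferroni-adjusted pairwise t tests from the same MEANS statement we use below for the other post-hoc methods; a minimal sketch:

proc glm;
class group;
model days=group;
/* Bonferroni t tests of all pairwise differences among the group means */
means group / bon lines;
run;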
The problem with this is that it is very conservative, often to the point of being
ridiculous. Check the power of testing pairwise hypotheses with an n of six at α = .005. In
addition, the Bonferroni correction is stricter than it needs to be, in that you can derive
less severe corrections that still do the job. There are a great many of these, and they all
do pretty much the same thing. We will learn about two.
Most methods of this type work by estimating a minimum difference between group
means that is significant at some level. You can then simply compare a difference to the
minimum difference to see if it makes the grade.
The next most general (and, by the same token, next least powerful) method is called the
Scheffé method. The Scheffé method can be applied either to pairs of means or to groups
of means, and it doesn't matter whether the group sizes are equal. If the group sizes are all
equal and you only want to compare pairs of means, the somewhat more powerful Tukey
method can be used.
SAS computes these post-hoc tests (and a wide variety of others) in the MEANS
statement included in PROC GLM.
proc glm;
class group;
model days=group;
means group/tukey scheffe lines;
run;
We have seen the first part of the statement before; it just generates the means of the five
groups. Following the slash, we ask for Tukey and Scheffé post-hoc comparisons; the
LINES option requests a format for the output that I find useful.
General Linear Models Procedure

Tukey's Studentized Range (HSD) Test for variable: DAYS

NOTE: This test controls the type I experimentwise error rate,
but generally has a higher type II error rate than REGWQ.

Alpha= 0.05  df= 25  MSE= 2.34
Critical Value of Studentized Range= 4.153
Minimum Significant Difference= 2.5938

Means with the same letter are not significantly different.

Tukey Grouping      Mean    N    GROUP
           A      7.5000    6    1
           A
       B   A      6.1667    6    5
       B   A
       B   A      5.1667    6    4
       B   A
       B   A      5.0000    6    2
       B
       B          4.3333    6    3
General Linear Models Procedure

Scheffe's test for variable: DAYS

NOTE: This test controls the type I experimentwise error rate but
generally has a higher type II error rate than REGWF for
all pairwise comparisons.

Alpha= 0.05  df= 25  MSE= 2.34
Critical Value of F= 2.75871
Minimum Significant Difference= 2.9338

Means with the same letter are not significantly different.

Scheffe Grouping      Mean    N    GROUP
             A      7.5000    6    1
             A
         B   A      6.1667    6    5
         B   A
         B   A      5.1667    6    4
         B   A
         B   A      5.0000    6    2
         B
         B          4.3333    6    3
You see that both the Tukey and Scheffé methods compute a minimum difference
between means that is "significant." The Tukey difference is a little smaller because the
Tukey method is more powerful when its requirements are met. The A's and B's are the
result of the LINES option: a set of letters groups together means that are not significantly
different from each other, so a pair of means is significantly different if they do not share
any letters. In this case that is true only of groups 1 and 3.
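Where do these minimum significant differences come from? For equal group sizes n they follow from the critical values printed in the output. The usual formulas (my reconstruction; SAS does not print them) are

\text{Tukey: } q_{.05;\,5,\,25}\sqrt{\frac{MSE}{n}} = 4.153\sqrt{\frac{2.34}{6}} \approx 2.59,
\qquad
\text{Scheffé: } \sqrt{(k-1)\,F_{.05;\,4,\,25}}\,\sqrt{\frac{2\,MSE}{n}} = \sqrt{4(2.75871)}\,\sqrt{\frac{2(2.34)}{6}} \approx 2.93,

which reproduce the 2.5938 and 2.9338 in the output.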
I don't much like post-hoc tests, because they are too significance-testing oriented.
They promote what I consider to be the worst way to think about your data: dichotomizing
the comparisons into pairs that are "different" and "not different." But that is silly; the
differences between the groups are whatever they are, big or small, and it is best, IMHO,
to just describe them as such.
To help us think about this it will be useful to develop a new measure of effect
size for categorical-variable ANOVAs. Basically it is an extension of d, which was the
standardized difference between two groups, i.e.,

d = \frac{\bar{X}_1 - \bar{X}_2}{s_p}
Of course, a design like this is a simple one-way ANOVA with two levels for the single
factor. Another way to express the effect would be as the average difference between the
group means and the grand mean, but you would have to take into account that one group is
on one side of the grand mean and the other is on the other side, so the signs of the two
differences would cancel each other out. So you could take the average squared
difference between the group means and the grand mean, and take the square root when
you are done, i.e.,
f = \frac{1}{s_p}\sqrt{\frac{\sum_i (\bar{X}_i - \bar{X}_G)^2}{k}}
Note that in the two-group case this f would be one half the value of the d we started with,
because we are measuring the distance between the group means and the grand mean rather
than the difference between the means themselves.
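To see why, note that with two equal-sized groups the grand mean lies halfway between the two group means, so each group mean sits half the between-group difference away from it:

f = \frac{1}{s_p}\sqrt{\frac{\left(\frac{\bar{X}_1-\bar{X}_2}{2}\right)^2 + \left(\frac{\bar{X}_2-\bar{X}_1}{2}\right)^2}{2}} = \frac{\left|\bar{X}_1-\bar{X}_2\right|}{2\,s_p} = \frac{d}{2}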
This way of expressing d generalizes nicely to the multigroup case, where it seems
natural to compute the average deviation of the group means from the grand mean no
matter how many of them there are. What’s more, all the information we need to figure
this out is included in a standard source table. First of all, although we haven’t really
focused on interpreting it this way, one way to express SS_G is as a kind of sum of squares
of the group means around the grand mean, weighted by the common group size n_j,
SS_G = n_j \sum_i (\bar{X}_i - \bar{X}_G)^2
So SS_G/n_j is the total of the squared deviations of the group means from the grand mean,
and SS_G/(k n_j) is the average squared deviation across groups. And k times n_j is the
number of groups times the number of subjects per group, or N. Now we need to standardize
by the pooled standard deviation s_p, and it turns out that s_p² is equal to MSE. So we
divide by MSE, take the square root, and get
f = \sqrt{\frac{SS_G}{N \cdot MSE}}

One good way to express this formula is as

f = \sqrt{F \cdot \frac{k-1}{N}}
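Working this through with the values from the source table above (SS_G = 36.47, MSE = 2.34, N = 30, or equivalently F = 3.90 and k = 5):

f = \sqrt{\frac{36.47}{30(2.34)}} \approx .72, \qquad f = \sqrt{3.90 \cdot \frac{4}{30}} = \sqrt{.52} \approx .72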
So the square root of the average squared difference between the group means and the
grand mean is 0.72. Let's also get this number from the actual group means to see what it
means. Here is a table of the group means from the SAS output above:
(Ȳ_G = 5.633, MSE = 2.34)

Group    Ȳ       Ȳ - 5.633    (Ȳ - 5.633)²    (Ȳ - 5.633)²/2.34    √[(Ȳ - 5.633)²/2.34]
1        7.50       1.867          3.49              1.49                  1.22
5        6.17       0.537          0.29              0.12                  0.35
4        5.17      -0.46           0.21              0.09                  0.30
2        5.00      -0.633          0.40              0.17                  0.41
3        4.33      -1.30           1.70              0.73                  0.85
The mean of the (Ȳ - 5.633)²/2.34 column is .52, and the square root of .52 is .72, as above.
So what do we conclude? In general, the ANOVA showed that the group means differ from the
grand mean by .72 SDs on average. An examination of the rightmost column allows you to
compare the individual group deviations to this average: group 1 had a considerably larger
deviation, groups 2, 4, and 5 had smaller deviations, and group 3 was pretty close to
average.
I can’t resist showing you something else I figured out. What if you are interested in
solving the problem as post hoc comparisons do, in terms of differences between pairs of
groups? It turns out that there is a relationship between the total of the squared deviations
of the group means from the grand mean and the total of the k(k-1)/2 squared pairwise
differences, as follows:
\sum_{i<j} \left(\bar{Y}_i - \bar{Y}_j\right)^2 = k \sum_{i=1}^{k} \left(\bar{Y}_i - \bar{Y}_G\right)^2
With a little algebra, you can then show that the square root of the average squared
pairwise difference (call it f’) is equal to:
f' = \sqrt{\frac{2k\,F}{N}}
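Plugging in k = 5, N = 30, and F = 3.90 from the source table:

f' = \sqrt{\frac{2(5)(3.90)}{30}} = \sqrt{1.3} \approx 1.14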
You can then use that value, about 1.14 SDs in our case, as a basis for examining the
various pairwise differences.