Psychological Methods, 2001, Vol. 6, No. 2, 135–146. Copyright 2001 by the American Psychological Association, Inc. DOI: 10.1037//1082-989X.6.2.135

Review of Assumptions and Problems in the Appropriate Conceptualization of Effect Size

Robert J. Grissom and John J. Kim
San Francisco State University

Estimation of the effect size parameter, Δ, the standardized difference between population means, is sensitive to heterogeneity of variance (heteroscedasticity), which seems to abound in psychological data. Pooling s²s assumes homoscedasticity, as do methods for constructing a confidence interval for Δ, estimating Δ from t or analysis of variance results, formulas that adjust estimates for inflation by main effects or covariates, and the Q statistic. The common language effect size statistic as an estimate of Pr(X_1 > X_2), the probability that a randomly sampled member of Population 1 will outscore a randomly sampled member of Population 2, also assumes normality and homoscedasticity. Various proposed solutions are reviewed, including measures that do not make these assumptions, such as the probability of superiority estimate of Pr(X_1 > X_2). Ways to reconceptualize effect size when treatments may affect moments such as the variance are also discussed.

Robert J. Grissom and John J. Kim, Department of Psychology, San Francisco State University. We gratefully acknowledge Rebecca Ray and Julie Gorecki for assistance with the preparation of an earlier version of this article, and Jeffrey Keyes for assistance with data analysis. Correspondence concerning this article should be addressed to Robert J. Grissom, 564 Apple Grove Lane, Santa Barbara, California 93105. Electronic mail may be sent to [email protected].

In univariate research that compares the means of two independent groups, the parameter that is typically estimated to measure effect size is Δ = (μ_1 − μ_2)/σ (Cohen, 1988). Estimation of Δ is problematic when there is heterogeneity of variance (heteroscedasticity) because there is no common σ, and different estimators of σ can lead to very different estimates of Δ. Moreover, heteroscedasticity should prompt different conceptions of effect size. As discussed later, if two treatments produce distributions with different variances and/or different shapes, measures of effect size other than the standardized difference between means would be informative. This article describes the problems and reviews proposed solutions.

Numerous results in samples suggest heteroscedasticity in a wide variety of areas of research (Grissom, 2000; Keppel, 1991; Lix, Cribbie, & Keselman, 1996; Tomarken & Serlin, 1986; Wilcox, 1987). When group means differ, group s²s often differ in the same direction (Fisher & van Belle, 1993; Fleiss, 1986; Norusis, 1995; Raudenbush & Bryk, 1987; Sawilowsky & Blair, 1992; Snedecor & Cochran, 1989). Samples with greater positive skew often have the larger means and variances. Reaction time, latency, difference-threshold, and galvanic skin response data provide some examples. In a randomized design, homoscedasticity is to be expected a priori, but the treatment after randomization can produce heteroscedasticity, as described in Bryk and Raudenbush (1988). However, research using nonrandomly formed groups may well involve a priori heteroscedasticity (Grissom, 2000).

In educational research, Wilcox (1987) reported that ratios of largest to smallest sample variances (maximum sample variance ratios [VRs]) exceeding 16 are not uncommon, and VR values exceeding 400 have been reported (Lix et al., 1996). Maximum VRs of up to 12 are considered to be realistic by Tomarken and Serlin (1986). In clinical outcome research with children and adolescents, variances have been found to be significantly different in therapy and control groups (Weisz, Weiss, Han, Granger, & Morton, 1995). In comparing a systematic desensitization group with an implosive therapy group and a control group, Hekmat (1973) found sample VRs of over 12 and nearly 29, respectively, on the Behavior Avoidance Test. Inspection of several articles in one recent issue of the Journal of Consulting and Clinical Psychology yielded sample VR values of 3.24, 4.00 (several), 6.48, 6.67, 7.32, 7.84, 25.00, and 281.79. The latter VR involved skewed distributions of the number of drinks per day under two different treatments for depression in alcoholism, data that the authors transformed to normalize (R. A. Brown, Evans, Miller, Burgess, & Mueller, 1997). When control and therapy conditions for number of posttest panic attacks were compared, VRs of 8.02 and 6.65 were found for control s²/treated s² and Therapy 1 s²/Therapy 2 s², respectively (Feske & Goldstein, 1997). When comparing activity levels of drugged rats with those of a control group, Conti and Musty (1984) found a maximum sample VR of 6.87 across dosages of the active component of marijuana.

The magnitudes of reported sample VRs strongly suggest heteroscedasticity. However, because of the great sampling variability of s², estimates of population VRs are needed. It would also be useful to have Monte Carlo studies of the "robustness" to heteroscedasticity of the many measures of effect size that assume homoscedasticity. However, under heteroscedasticity, a major issue beyond robustness of traditional measures of effect size is the conceptualization of appropriate alternative measures of effects of treatments on moments about the mean. It is hoped that the present review will prompt (a) more cautious interpretation of current measures and conversion formulas of effect size that assume homoscedasticity; (b) use of measures of overall effect of treatments on centers, tails, spreads, and shapes of distributions; and (c) research on the estimation of population VRs.

Choice of Denominator for Effect Size When Comparing Two Means and Assuming Normality

In cases where there is heteroscedasticity and a control group, Glass, McGaw, and Smith (1981) have proposed using s_c, the standard deviation of the control group, to estimate σ. In this case the effect size estimator, d_c = (X̄_e − X̄_c)/s_c, is estimating (μ_e − μ_c)/σ_c, an informative parameter under normality that indicates the location of an average experimental population (e) participant with respect to the control population's (c) distribution. Unfortunately for the measures of effect size that do so, assuming normality may not be realistic (Micceri, 1989; Wilcox, 1996). When estimating Δ_e = (μ_e − μ_c)/σ_e, using the standard deviation of an experimental group, s_e, instead of s_c can result in a numerically very different estimate as well as a different definition of effect size. Treatment × Subject interaction or a ceiling or floor effect could cause heteroscedasticity and, therefore, Δ_c ≠ Δ_e.
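To make the divergence concrete, here is a minimal sketch (ours, not the authors'; the simulated scores and group parameters are hypothetical) that computes d_c, d_e, and the pooled estimate from the same pair of samples:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical groups: same 5-point mean shift, but the experimental
# group is three times as variable as the control group.
control = rng.normal(loc=50.0, scale=10.0, size=30)
experimental = rng.normal(loc=55.0, scale=30.0, size=30)

mean_diff = experimental.mean() - control.mean()
s_c = control.std(ddof=1)        # Glass: standardize by the control-group SD
s_e = experimental.std(ddof=1)   # or by the experimental-group SD
n_c, n_e = len(control), len(experimental)
# Pooled SD, as in Hedges' g (meaningful only under homoscedasticity).
s_p = np.sqrt(((n_c - 1) * s_c**2 + (n_e - 1) * s_e**2) / (n_c + n_e - 2))

print(f"d_c = {mean_diff / s_c:.2f}")  # estimates (mu_e - mu_c) / sigma_c
print(f"d_e = {mean_diff / s_e:.2f}")  # estimates (mu_e - mu_c) / sigma_e
print(f"g   = {mean_diff / s_p:.2f}")
```

With a large variance ratio, the three estimates can differ by a factor of two or more even though the numerator is identical.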
The problem of widely varying values of d depending on the choice of s exists even if there is equality of population variances, because sample variances are still likely to vary greatly. With df < 20, point estimates of σ can be in error by hundreds of percent (Hoaglin, Mosteller, & Tukey, 1991). In addition, although there is a case for using some standard measure to estimate σ, such as s_c, one could question whether it is always appropriate to use σ_c just because there is a comparison (control) group. The comparison group may represent the current standard treatment, but it may not turn out to be essentially different from the "experimental" treatment.

Under heteroscedasticity, Cohen (1988) suggested estimating the parameter Δ′ that uses the denominator σ′ = [(σ_1² + σ_2²)/2]^1/2. Huynh's (1989) Monte Carlo simulations showed that d′ = (X̄_1 − X̄_2)/[(s_1² + s_2²)/2]^1/2, which uses n_j − 1 in the denominator to calculate each s², is generally a biased estimator of Δ′, but is only slightly biased when Δ′ = 1 and is not at all biased when Δ′ = 0. (There is also some bias when the usual pooled estimator is used.) The estimator d′ was also found to be unstable, the variance of d′ increasing with Δ′ and, not surprisingly, decreasing monotonically with increasing ns. The simulation supported the use of h = c_f d′, an adjusted estimator of Δ′ that corrects the bias, where c_f = Γ(f/2)/[(f/2)^1/2 Γ((f − 1)/2)], Γ is the gamma function, and f = [(n_1 − 1)(n_2 − 1)(s_1² + s_2²)²]/[(n_1 − 1)s_1⁴ + (n_2 − 1)s_2⁴]. The simulation assumed normality of X_1 and X_2. See Huynh (1989) for a detailed discussion of a method for addressing the instability of d′. However, issues of bias and instability aside, Δ′ is difficult to interpret because it uses the σ of a contrived population.

In the case of homoscedasticity, the pooled estimator, s_p, is a less biased and less variable estimator of the common σ than is s_c, an estimator that is only available when there is a control group. Thus, in two-group studies that rightly or wrongly assume homoscedasticity, Hedges' g = (X̄_1 − X̄_2)/s_p (or the version that removes small-sample bias) is the most widely used estimator of Δ (Hedges & Olkin, 1984). However, one could argue that, if μs differ, homoscedasticity is unlikely and pooling is inappropriate. The formulas provided by Hedges and Olkin (1985) for confidence limits for Δ also assume homoscedasticity.

When comparing two groups by estimating a Δ in a multigroup study that assumes homoscedasticity, the commendable goal of attempting to obtain a less biased and less variable estimate of σ results in pooling within-group variances, s_p = (MS_W)^1/2. However, some researchers may be assuming homoscedasticity without testing for it, or may have tested for it with one of the relatively low-power tests that are often provided in statistical packages (e.g., Levene, 1960), or with one of the possibly more powerful but still not very powerful tests (M. B. Brown & Forsythe, 1974; R. G. O'Brien, 1978; Weston & Hopkins, 1998), that nonetheless failed to detect heteroscedasticity. Moreover, the amount of heteroscedasticity that would be troublesome for estimating Δ is likely much less than what would be troublesome for F testing in ANOVA. Under heteroscedasticity, the estimator of Δ_p based on pooling, d_p = (X̄_1 − X̄_2)/(MS_W)^1/2, is again problematic because there is no common σ to be estimated (the noncentrality parameter is not defined). In summary, when homoscedasticity is assumed, the most efficient estimator of the common σ is s_p.
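A sketch of d′ and the adjusted estimator h, following our reading of the formulas above rather than a verified transcription of Huynh (1989); note that f is reproduced as printed, and that a Satterthwaite-style version would swap the (n_1 − 1) and (n_2 − 1) weights, although the two coincide when n_1 = n_2:

```python
import numpy as np
from scipy.special import gammaln

def d_prime(x1, x2):
    """Cohen's heteroscedastic d' = (mean1 - mean2) / sqrt((s1^2 + s2^2)/2)."""
    s2_1, s2_2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
    return (np.mean(x1) - np.mean(x2)) / np.sqrt((s2_1 + s2_2) / 2)

def huynh_h(x1, x2):
    """Adjusted estimator h = c_f * d' (our reading of Huynh, 1989)."""
    n1, n2 = len(x1), len(x2)
    s2_1, s2_2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
    # f as printed in the text; the two weights coincide when n1 = n2.
    f = ((n1 - 1) * (n2 - 1) * (s2_1 + s2_2) ** 2
         / ((n1 - 1) * s2_1**2 + (n2 - 1) * s2_2**2))
    # c_f = Gamma(f/2) / [(f/2)^(1/2) Gamma((f-1)/2)], via log-gamma for stability.
    log_cf = gammaln(f / 2) - 0.5 * np.log(f / 2) - gammaln((f - 1) / 2)
    return np.exp(log_cf) * d_prime(x1, x2)
```

The correction factor c_f is slightly less than 1, shrinking d′ toward zero, in the same way that Hedges' small-sample correction shrinks g.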
However, using an s_p that weights the two ss (approximately) proportionally to n_1 and n_2 is problematic under heteroscedasticity. In the latter case, one can use the s of whichever group is to be the comparison group to estimate that population's σ, or use s_1 and s_2 to estimate the square root of the mean of σ_1² and σ_2², so as to measure Δ′ by using σ′ (or the mean of σ_1 and σ_2). Finally, under heteroscedasticity, the choice of denominator should depend on the research context. For example, if the two genders are being compared, it may not be sensible to use σ_M or σ_F as the denominator for the purpose of estimating a Δ_M or Δ_F. On the other hand, using σ′ results in a Δ′ in a contrived population whose σ is the average of the male and female populations' σs.

Multigroup and Factorial Designs

Two overall estimators of standardized effect size for one-way ANOVA, d_mm = (X̄_max − X̄_min)/s_p and f = s_X̄/s_p, where s_X̄ is the standard deviation of all of the sample means, assume homoscedasticity. The f is related to estimating η, the correlation between group membership and the dependent variable (Winer, Brown, & Michels, 1991), which also assumes homoscedasticity. Maxwell (1998) presented a promising measure of effect size for a powerful randomized longitudinal design for comparing two or more groups, but this method also assumes homoscedasticity, and the sensitivity of this measure to heteroscedasticity also is not known.

When factorial designs are used, various choices for estimating σ arise, depending on which two means are being compared. For example, consider a 2 × 2 design in which one factor compares treatment and control and the other factor compares men and women. One might want to estimate Δ for treatment versus control overall. Alternatively, one might want to estimate treatment effect sizes separately for men and women as if only the single treatment factor had been involved. The latter makes sense if there is interaction. However, in a factorial design in which there is a classification factor, such as gender, and a treatment factor with respect to which effect sizes are to be estimated as if treatment were the only factor, a main effect of the classification factor reduces within-cell MS values. Oliver and Hyde (1995) cite some meta-analyses that have inflated estimates of Δ by not using formulas to correct for such reduction of MS_W values for the purpose at hand. Correction formulas are provided by Glass et al. (1981) and Smith, Glass, and Miller (1980), and are extended to larger designs by Ray and Shadish (1996). However, the correction formulas assume homoscedasticity. The correction formulas by Glass and his coworkers that attempt to render estimates of Δ from factorial designs comparable to estimates from single-factor designs require SS values that are typically not available in meta-analysis. Morris and DeShon (1997) provided a modified correction formula that requires values of F and df instead of SS values, but the modification assumes homoscedasticity. Abelson and Prentice (1997) presented a measure of effect size for contrasts in two-way designs, but homoscedasticity was again assumed and sensitivity to heteroscedasticity is not known.

Meta-Analysis: Converting Statistics to Estimates of Δ and Other Problems of Synthesis

The problem of heteroscedasticity is compounded in meta-analysis. When attempting to cumulate effect sizes, meta-analysts are often averaging different kinds of estimates of different kinds of Δ arising from the underlying primary studies.
This is because primary researchers usually provide the values of test statistics, such as t or F, but do not provide sufficient information to calculate d_c or d_p directly. When using the primary researcher's t or two-group F to estimate Δ by means of t(1/n_e + 1/n_c)^1/2 or [F(1/n_e + 1/n_c)]^1/2, meta-analysts indirectly assume homoscedasticity because these formulas so assume. When s_e² > s_c², d_c > d_p, and when s_e² < s_c², d_c < d_p. (Glass et al., 1981, provide numerical values for the relationship between s_e²/s_c² and d_c/d_p.) Therefore, whether attempting to estimate Δ_c or Δ_p, meta-analysts are often unwittingly averaging values of d, among some of which d_p > d_c and, among others, d_p < d_c. Unfortunately, the averaging procedure is not likely to result in overestimations canceling underestimations (Glass et al., 1981). The problem is compounded by the fact that one often has to convert to, and average, values of d_p from a mix of primary studies that collectively provide a variety of forms of results, such as d_p itself (rare), t, two-group F, one-way or factorial ANOVA F, one-way or factorial repeated measures ANOVA F, analysis of covariance F, change-score t, change-score F, reported p value, and inferable-only p value. Ray and Shadish (1996) empirically demonstrated that the different calculated conversions to d_p arising from the different forms of results reported in primary studies can vary, sometimes greatly.

Also, heteroscedasticity can affect the power and rate of Type I error of the Hedges and Olkin (1985) Q statistic that is often used to test for the significance of the variability of the set of estimates of Δ that arise from the underlying primary studies. See Harwell's (1997) Monte Carlo study for details on the sensitivity of Q to heteroscedasticity under varying conditions, within the primary studies, of population VR, distributional shape, and ns.

When different primary studies underlying a meta-analysis use different measures of the same latent dependent variable, the primary effect sizes may vary simply because varying measurement reliabilities result in varying values of s_c. The dependent variable can be corrected for the unreliability that can reduce an uncorrected estimate of Δ by increasing s (Becker, 1996; Schmidt & Hunter, 1996). However, the correction yields an estimate of a Δ that is theoretically possible but not currently realizable in practice. Different estimates of primary Δ_ps or Δ_es can also arise because of varying values of s_e when, in some primary studies, subjects show greater variation in responsiveness to a treatment than in other studies of the same treatment.

It is not surprising that meta-analysts rarely explicitly state the nature of the Δ parameter that they are estimating with the various kinds of d values that are being averaged. Moreover, if the type of estimate of Δ is confounded with varying substantive characteristics of the primary studies, interpretation of statistically significant moderators of effect size will be misleading. Editors of journals would greatly enhance the quality of primary studies and meta-analyses if they would require that primary studies include all ns, X̄s, and ss. Because of heteroscedasticity, the Publication Manual of the American Psychological Association (American Psychological Association [APA], 1994) oversimplifies when it states that estimates of effect size are readily obtainable from statistics such as t or F with ns or dfs. The APA should consider endorsing the movement toward requiring researchers to archive data as a condition for research funding and publication (Eagly, 1997). Finally, because values of s can vary greatly when dfs are small, varying ns across studies can exacerbate the indeterminacy of σ and, therefore, of Δ.
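For concreteness, a minimal sketch (ours) of the two conversions just described; both build the homoscedasticity assumption in, because t and the two-group F already pool the within-group variances:

```python
import math

def d_from_t(t, n_e, n_c):
    """d_p from an independent-samples t; assumes homoscedasticity."""
    return t * math.sqrt(1 / n_e + 1 / n_c)

def d_from_f(f, n_e, n_c, sign=1):
    """d_p from a two-group one-way ANOVA F (= t^2). F carries no sign,
    so the direction of the mean difference must be supplied."""
    return sign * math.sqrt(f * (1 / n_e + 1 / n_c))

# A primary study reporting t(58) = 2.10 with n_e = n_c = 30:
print(round(d_from_t(2.10, 30, 30), 3))  # 0.542
```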
Some Additional Proposed Solutions to Heteroscedasticity in Two-Group Comparisons

There are various additional proposed solutions to the problem of heteroscedasticity. Some solutions attempt to improve estimation of Δ, whereas more radical solutions offer measures of effect size that are conceptually different from Δ. A modest suggestion is that researchers in the two-sample case reduce the problem of heteroscedasticity by using ns > 10 and either equal ns or ns in which neither group contains more than 60% of the total number of participants (Huynh, 1989; Kraemer, 1983).

In research domains in which the dependent variable is a score on a test that has been normed on a vast sample, such as some clinical and educational outcome studies, a possible solution to the estimation problem exists. In this case one can divide (X̄_e − X̄_n) by the s_n of the normative (n) group to estimate Δ_n = (μ_e − μ_n)/σ_n (Kendall & Grove, 1988; Kendall, Marss-Garcia, Nath, & Sheldrick, 1999). The use of such a constant s is beneficial because, otherwise, values of (X̄_1 − X̄_2) may actually vary only slightly among primary studies whose values of d vary greatly simply because of varying values of s.

In the two-group case Rosenthal (1991) suggested transforming the data to equate variances before calculating d, but he came to prefer a reconceptualization of effect size as a point-biserial correlation, ρ_pb, a measure of effect size in terms of the strength and direction of the relationship between the grouping variable and the dependent variable. (A variance-stabilizing transformation of the data may greatly alter the values of Δ and possibly their interpretation, but this may be less of a problem for primary researchers who are using scales that are already contrived.) Trimming to reduce heteroscedasticity is possible before calculating r_pb, but this may result in a value of r_pb that estimates an ill-defined parameter (Wilcox, 1994). Although heteroscedasticity does not necessarily directly influence the value of ρ_pb, it can influence the t test that is used to test the significance of r_pb, and converting d or t to r_pb by the usual formulas assumes homoscedasticity. Also, if heteroscedasticity is associated with one or more outliers, the latter can greatly affect ρ_pb. Even slightly heavy tails in the distribution of the dependent variable scores can greatly affect ρ_pb and its associated confidence interval.

Although the formula for r_pb should be corrected for attenuation caused by unequal ns (r_pbc), this seems to be rarely done in practice. The attenuation increases with increasing disproportionality of ns and increasing magnitude of ρ_pb (McNemar, 1969). The corrected r_pbc = a r_pb/[(a² − 1)r_pb² + 1]^1/2, where a² = .25/(pq) and p and q are the proportions of total sample size in each group (Hunter & Schmidt, 1990). When a meta-analyst averages uncorrected r_pbs, some of the variability in r_pbs may arise artifactually from varying degrees of inequality of sample sizes from primary study to primary study. Wilcox (1996) critiqued alternative measures of correlation in terms of resistance to outliers and provided Minitab macros for them. These rs may prove to be useful alternative robust estimators of correlational effect size measures.
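A sketch of the unequal-n correction as reconstructed above. We read the garbled correction term as a² = .25/(pq), which correctly equals 1 under an equal split; verify against Hunter and Schmidt (1990) before relying on it:

```python
import math

def r_pb_corrected(r_pb, n1, n2):
    """Correct a point-biserial r for attenuation due to unequal ns:
    r_pbc = a * r_pb / [(a^2 - 1) * r_pb^2 + 1]^(1/2), with a^2 = .25/(p*q).
    Our reconstruction of the Hunter & Schmidt (1990) formula."""
    p = n1 / (n1 + n2)
    q = 1 - p
    a = math.sqrt(0.25 / (p * q))  # a = 1 when p = q = .5 (no correction)
    return a * r_pb / math.sqrt((a**2 - 1) * r_pb**2 + 1)

# An 80/20 split attenuates r_pb; the correction moves it back up.
print(round(r_pb_corrected(0.30, 80, 20), 3))  # 0.366
```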
Measures Using Estimators That Are More Robust

We have seen that, if normality and homoscedasticity are assumed, Δ can appropriately be estimated by using X̄_c, X̄_e, and s_p. Assuming normality, the usual interpretation of Δ_c locates μ_e as Δ_c × σ_c units away from μ_c. Thus, if μ_e > μ_c and, for example, d_c = +1, it is estimated that the average experimental population subject scores at the 84th percentile (1 σ_c unit above μ_c) of the control group's population distribution. However, if there is heteroscedasticity associated with one or more outliers, normality does not hold and such interpretation of the estimate d_c is invalid. Also, because σ can be very sensitive to shape, nonnormality can greatly affect the value of a Δ type of effect size, as compellingly demonstrated by Wilcox and Muska (1999) for heavy-tailed distributions.

One of the solutions offered for this problem of estimation and interpretation is from Hedges and Olkin (1985), who suggested trimming the highest and lowest scores from the control group, replacing (X̄_e − X̄_c) with (Mdn_e − Mdn_c), and replacing s_c with the range of the trimmed data or some other measure of variability. One such alternative measure of variability is the median absolute deviation from the median (MAD). A statistic should not only be resistant to outliers, as is MAD, but should also have relatively small sampling variability to increase power and narrow confidence intervals. In this latter regard, the biweight standard deviation, s_bw (Goldberg & Iglewicz, 1992; Lax, 1985), is superior to MAD. Calculation of s_bw² by hand is laborious. First, calculate Y_i = (X_i − Mdn)/[9(MAD)] and set a_i = 1 if |Y_i| < 1 and a_i = 0 if |Y_i| ≥ 1. Then

s_bw = n^1/2 [Σ a_i (X_i − Mdn)² (1 − Y_i²)⁴]^1/2 / |Σ a_i (1 − Y_i²)(1 − 5Y_i²)|.    (1)

Wilcox (1997) presented an S-PLUS software function for calculating biweight midvariances (s_bw²). Because raw scores are required to calculate s_bw², primary researchers, but not meta-analysts who are synthesizing unarchived data, can thus use (Mdn_e − Mdn_c)/s_bw,c as an estimate of effect size.

Laird and Mosteller (1990) suggested dividing (Mdn_e − Mdn_c) by .75(interquartile range of the control group) to provide resistance to outliers while maintaining a denominator that approximates s_c. Under normality, .75(interquartile range) approximates s. However, the interquartile range is just one example of an interquantile range. Quantiles may be generally defined as scores that are equal to or greater than certain specified proportions of the other scores in the distribution. Results from Shoemaker (1999) indicate that a superior robust measure of spread may sometimes be obtained by using interquantile ranges of more extreme quantiles than are used in the interquartile range. These results suggest additional possible denominators for measuring effect size, but the method may perform poorly under extreme skew.

Kraemer and Andrews (1982) presented a nonparametric estimator of effect size, but this estimator requires a pretest–posttest design. Hedges and Olkin (1984, 1985) presented several nonparametric estimators, one of which, d*_c, does not require homoscedasticity or pretest scores. Assuming normality, this method estimates Δ_c by using d*_c = Φ⁻¹(p_c), where Φ⁻¹ is the inverse normal cumulative distribution function and p_c is the proportion of the control group scores that are below Mdn_e. Because it requires raw scores, this method is of use to primary researchers but, again, not to meta-analysts who are synthesizing unarchived data. Moreover, the sampling distribution of this estimator is not known, so significance testing and construction of confidence intervals are not developed.
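Equation 1 is straightforward to script; a sketch (ours; the authors point instead to Wilcox's S-PLUS function):

```python
import numpy as np

def mad(x):
    """Median absolute deviation from the median (unscaled)."""
    return np.median(np.abs(x - np.median(x)))

def s_bw(x):
    """Biweight standard deviation, following Equation 1."""
    x = np.asarray(x, dtype=float)
    m = np.median(x)
    y = (x - m) / (9 * mad(x))
    a = (np.abs(y) < 1).astype(float)  # a_i = 1 inside the biweight window, else 0
    num = np.sqrt(len(x)) * np.sqrt(np.sum(a * (x - m) ** 2 * (1 - y**2) ** 4))
    den = abs(np.sum(a * (1 - y**2) * (1 - 5 * y**2)))
    return num / den

def robust_effect_size(experimental, control):
    """(Mdn_e - Mdn_c) / s_bw of the control group, as in the text."""
    return (np.median(experimental) - np.median(control)) / s_bw(control)
```

For normal data, s_bw approximates the ordinary s, so the robust estimate stays on roughly the same scale as d_c while resisting outliers.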
The methods reviewed in this section address heteroscedasticity and nonnormality as problems to be circumvented when estimating a Δ rather than as characteristics of data that can be accommodated by new conceptual approaches to measuring effect size. All of the suggested formulas are estimating parameters that are conceptually similar to Δ. Replacing σ with a more resistant measure of scale is only a computational solution. A more radical solution would involve reconceptualization of effect size as a measure of the effect of a treatment throughout a distribution, not just at its center. Heteroscedasticity would then no longer be a mere nuisance, as it is when attempting to measure effect size in the traditional manner, but a utilized reflection of a treatment's effect on moments. The next two sections address this issue.

Robust Estimation of Pr(X_1 > X_2)

An intuitively appealing measure of effect size in the two-group case is the probability that a randomly sampled member of a population given one treatment will have a score (X_1) that is superior to the score (X_2) of a randomly sampled member of a population given another treatment. A theoretically unbiased and consistent estimator of this probability has been called the probability of superiority (PS; Grissom, 1994a, 1994b, 1996). Theoretically, the PS has the smallest variance of all unbiased estimators of Pr(X_1 > X_2). The PS can be based on raw data that are continuous, ordinal, or ordinal-categorical (Grissom, 1994a, 1994b). The applicability of the PS estimator to ordinal data is an advantage in much psychological research, in which dependent variables are often likely to be monotonically, but not necessarily linearly, related to underlying latent variables. Monotonic transformations leave Pr(X_1 > X_2) invariant.

The PS = U/mn, where U is the Mann–Whitney statistic and m and n are the sample sizes. The value of U indicates the number of times that the m subjects given Treatment 1 have scores that outrank those of the n subjects given Treatment 2, assuming no ties or equal allocation of ties, in all possible comparisons of X_1 and X_2 values. There are mn such head-to-head comparisons possible. Therefore, the proportion of times that subjects given Treatment 1 are superior in their scores to subjects given Treatment 2 is U/mn. For example, if .7 of the comparisons of all treated subjects with all control subjects result in better scores being observed in the treated group, PS = .7. A proportion in a sample estimates a probability in a population.

Estimating Pr(X_1 > X_2) from the PS is robust to nonnormality. Wilcox presented a Minitab macro (Wilcox, 1996) and S-PLUS software functions (Wilcox, 1997) for constructing a confidence interval for Pr(X_1 > X_2) based on Fligner and Policello's (1981) adjusted U′ statistic and on a method by Mee (1990) that appears to yield fairly accurate confidence levels. The Mann–Whitney U test is nonrobust to heteroscedasticity when testing for the equality of centers (Murphy, 1976; Pratt, 1964; Zimmerman & Zumbo, 1993) because values of U are functions of higher moments as well as means. However, sensitivity to the effect of a treatment on variance and on higher moments may be considered to be an advantage for a broader measure of effect size.
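A sketch of the PS computation by direct counting of the mn head-to-head comparisons, allocating ties equally (one of the conventions the text mentions); SciPy's mannwhitneyu would give the same U:

```python
import numpy as np

def probability_of_superiority(x1, x2):
    """PS = U / (m * n): the proportion of all m*n head-to-head comparisons
    in which an X1 score beats an X2 score, with ties counted as half."""
    x1, x2 = np.asarray(x1), np.asarray(x2)
    diffs = x1[:, None] - x2[None, :]  # all m*n pairwise differences
    u = np.sum(diffs > 0) + 0.5 * np.sum(diffs == 0)
    return u / (len(x1) * len(x2))

treated = [12, 15, 18, 20, 22]
control = [10, 11, 14, 16, 19]
print(probability_of_superiority(treated, control))  # 19 of 25 comparisons: 0.76
```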
In this context we are not testing H_0: Mdn_1 = Mdn_2 against H_1: Mdn_1 ≠ Mdn_2 but a broader H_0: Pr(X_1 > X_2) = .5 against H_1: Pr(X_1 > X_2) ≠ .5. The alternative H_1: Mdn_1 ≠ Mdn_2 (any other location parameter could have been used) is an example of a special-case model, a shift model in which a treatment merely adds a constant to what the score of each participant would have been without the treatment. In this model, homoscedasticity is assumed because adding a constant to each score merely shifts a distribution without changing its spread. However, the alternative H_1: Pr(X_1 > X_2) ≠ .5, tested against H_0: Pr(X_1 > X_2) = .5, is the more general case of stochastic superiority of one treatment over another. Under this H_1, the effect of a treatment need not be constant, so that the treatment may affect spread and, therefore, homoscedasticity need not be assumed. This general case is perhaps more realistic because treatments may well affect spread, and possibly shape, as well as the center of a distribution. Using the PS to test H_0: Pr(X_1 > X_2) = .5 against H_1: Pr(X_1 > X_2) ≠ .5, or against a one-tailed alternative, is a consistent test in the sense that its power approaches unity as sample sizes approach infinity.

When raw data are not available, the parameter Pr(X_1 > X_2) can be estimated by using the common language effect size statistic (CL; McGraw & Wong, 1992), which is calculated from sample means and variances. The CL is based on a z score, z_CL = (X̄_1 − X̄_2)/(s_1² + s_2²)^1/2. The value of Pr(X_1 > X_2) is estimated by the proportion of the area of the normal curve below z_CL. For example, if z_CL = +1.00 or −1.00, CL = .84 or .16, respectively. The value of Pr(X_1 > X_2) is estimated to be .84 or .16 in these cases. This CL method assumes normality of X_1 and X_2, but it may not seem to some readers to require homoscedasticity because the variance of (X_1 − X_2) would be σ²_(X1−X2) = σ_1² + σ_2² regardless of the values of σ_1² and σ_2². However, in theory, CL only estimates Pr(X_1 > X_2) under normality and homoscedasticity (Pratt & Gibbons, 1981). McGraw and Wong (1992), who introduced the CL to psychologists, assumed homoscedasticity and reported Monte Carlo simulations that indicated fairly good performance in the face of nonnormality but somewhat poorer performance under joint nonnormality and heteroscedasticity. In theory, the CL is not quite an unbiased estimator unless it is adjusted (Pratt & Gibbons, 1981).

Various statistics and estimates of effect size can be converted to approximate values of z_CL (Grissom, 1994a). For example, if n_1 = n_2 = n, z_CL = −t/n^1/2, z_CL = −.707d, and z_CL = −r_pb[2(n − 1)]^1/2/[n(1 − r_pb²)]^1/2. For conversion formulas when n_1 ≠ n_2, see Grissom (1994a). All of these conversion formulas assume homoscedasticity and normality, and their sensitivity to violation of these assumptions is unknown. Estimates of Pr(X_1 > X_2) have been used in a meta-analysis (Mosteller & Chalmers, 1992), as well as in a meta-meta-analysis that was compelled to rely on the previously mentioned possibly nonrobust conversion formulas for z_CL because of the unavailability of raw data (Grissom, 1996). Again, the archiving of raw data would benefit meta-analyses that attempt to estimate Pr(X_1 > X_2). For a meta-analysis using PS, the mean of the values of PS should be obtained by weighting each PS by the reciprocal of its variance. For the variance, use (1/12)[(1/m) + (1/n) + (1/mn)]. For an alternative approach, see Colditz, Miller, and Mosteller (1988).
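A sketch of the CL computation from summary statistics alone, which is what makes it usable when raw scores are unavailable (normality assumed and, per Pratt & Gibbons, 1981, homoscedasticity in theory):

```python
import math
from statistics import NormalDist

def common_language_es(mean1, mean2, s1, s2):
    """CL (McGraw & Wong, 1992): the normal-curve area below
    z_CL = (mean1 - mean2) / (s1^2 + s2^2)^(1/2)."""
    z_cl = (mean1 - mean2) / math.sqrt(s1**2 + s2**2)
    return NormalDist().cdf(z_cl)

# As in the text: z_CL = +1.00 gives CL = .84.
print(round(common_language_es(60, 50, 8, 6), 2))  # z = 10/10 = 1.0 -> 0.84
```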
In the two-sample case, Pr(X_1 > X_2) is a useful measure of effect size, but this measure may not be transitive in the k > 2 case. Group 1 may score stochastically higher than Group 2, and Group 2 may score higher than Group 3, but it is possible in this case that Pr(X_3 > X_1) > .5. For an approximate CL approach to estimating Pr(X_1 > X_2 and X_1 > X_3 and . . .), see McGraw and Wong (1992). Also see Whitney's (1951) extension of the U test for k > 2 and a discussion and tables of critical values for this extension in Mosteller and Bush (1954). For more discussion of Pr(X_1 > X_2) and its estimators, see Lehmann (1975), Laird and Mosteller (1990), Pratt and Gibbons (1981), and Vargha and Delaney (1998). Pratt and Gibbons (1981) discuss various methods for dealing with tied scores.

Empirical Comparisons of PS and CL

Wolfe and Hogg (1971) asserted that the estimates from PS and CL do not differ "too much" in practice. To test this assertion with real and then simulated data, we first searched two published manuals of actual data to find sets of data that compared two independent groups with respect to some psychologically relevant dependent variable. Eleven such sets were found, and PS and CL were used to calculate estimates of Pr(X_1 > X_2) for each set. In the four least discrepant of the 11 pairs of estimates from the PS and the CL, the larger estimate in each case was less than 3% larger than the smaller of the two estimates. However, in the three comparisons that resulted in the greatest discrepancies, the larger estimate was 12.1%, 15.8%, or 15.9% larger than the smaller estimate.

Next, we inspected preliminary results from Monte Carlo comparisons of the behaviors of the PS and the CL. Results as of this writing are based on n_1 = n_2, normality, population VRs ranging from 1 to 8, and 25,000 samples for each simulated condition. Generally, when μ_1 = μ_2, the means of the sampling distributions of the PS and the CL are very similar and the correlation between the two sets of estimates is well over .9. However, as the difference between μ_1 and μ_2 increases, this correlation sometimes decreases to a value as low as approximately .2, and the CL tends to exhibit more sampling error than does the PS. Even when assuming normality and homoscedasticity, primary researchers who are estimating Pr(X_1 > X_2) should consider reporting both the CL and the PS.
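A minimal simulation in the spirit of the comparison just described (ours; normal populations, equal ns, one arbitrary population VR, and far fewer replications, so the output will only roughly echo the reported patterns):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
n, reps = 25, 5000
mu1, mu2, sd1, sd2 = 0.5, 0.0, np.sqrt(8.0), 1.0  # population VR = 8

ps_vals, cl_vals = [], []
for _ in range(reps):
    x1 = rng.normal(mu1, sd1, n)
    x2 = rng.normal(mu2, sd2, n)
    diffs = x1[:, None] - x2[None, :]
    ps_vals.append(np.mean(diffs > 0))  # PS = U/(m*n); ties have probability 0 here
    z = (x1.mean() - x2.mean()) / np.sqrt(x1.var(ddof=1) + x2.var(ddof=1))
    cl_vals.append(NormalDist().cdf(z))  # CL estimate of Pr(X1 > X2)

print("mean PS:", round(np.mean(ps_vals), 3), " mean CL:", round(np.mean(cl_vals), 3))
print("SD PS:  ", round(np.std(ps_vals), 3), " SD CL:  ", round(np.std(cl_vals), 3))
print("r(PS, CL):", round(np.corrcoef(ps_vals, cl_vals)[0, 1], 3))
```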
Multiple-Valued Measures

Measures of effect size for comparing two groups can be misleading when there is heteroscedasticity, different shapes of distributions, or both. For example, suppose that a treatment makes some experimental subjects perform better and some perform worse than they would have had they been in the comparison group. In this case, the experimental group's variability will be greater or less than that of the comparison group, depending on whether it is the better or poorer performing subjects who are helped or harmed by the experimental treatment, but the two means or medians may be nearly the same. In this case, using (X̄_e − X̄_c) or (Mdn_e − Mdn_c) in the numerator yields an estimated effect size near zero, but the treatment clearly has had an effect in the tails, if not in the center, of the distribution. In another case, the two centers and the two variabilities may differ. For example, if the more variable group has the higher mean (not uncommon), the proportions of its members among the high scorers and among the low scorers can be different than what is implied by the estimate of Δ. Consider a case in which, compared with controls, experimental participants have a small superiority, say, Δ = +.3 (using the σ of the combined distribution), and a σ_e² that is only 15% greater than that of the control group. Even in this unexceptional case, if normality is assumed, approximately 2.5 times as many treated as control subjects would be in the highest 5% of the combined distribution (Hedges & Nowell, 1995). For further discussion and examples see Feingold (1992b, 1995) and P. C. O'Brien (1988). The kinds of problems that have been discussed can arise even under homoscedasticity if there is nonhomomerity (inequality of shapes of distributions).

Informative methods have been proposed for expressing group differences at different portions of a distribution, but at the cost of greater complexity. For example, Hedges and Friedman (1993), assuming normality for both populations, defined effect size in a portion of a tail beyond a fixed value, X_a, in a distribution of scores combined from Group 1 and Group 2. This Δ_a is the difference between μ_a1 and μ_a2 for those scores that are beyond X_a, divided by the σ for those scores, σ_a. The value of X_a is chosen as a score that is exceeded by some k% of the scores in the combined distribution. Thus, if X_a is the score at the 100a percentile point of the combined distribution, then the effect in the tail beyond X_a is Δ_a = (μ_a1 − μ_a2)/σ_a. Extensive computational details are found in the appendix of Hedges and Friedman (1993). The computations of the estimates of Δ_a are repeated for various values of k that are of interest to the researcher. Note that when the overall μ_1 ≠ μ_2, the combined distribution departs from normality.

Wilcox (1995, 1996, 1997) presented references for a variety of graphical methods and illustrated a method, based on the work of Doksum (1977), for comparing two groups separately at various quantiles. This method results in a graph of "shift functions" that indicates whether a treatment becomes more or less effective as we observe from the poorer performing to the better performing of the control group's members. Each shift function indicates how far the comparison group has to be shifted to attain the outcome of the treated group at each quantile of interest. The control group's scores at their various qth quantiles, X_qc values, are plotted against the differences between X_qc and X_qe, the quantiles of the experimental group's scores, defining the shift function ΔX_qc = X_qe − X_qc. It is more informative to compare distributions at various quantiles, such as deciles, than to compare them only at their centers, such as their means or medians (.50 quantile, 5th decile). Wilcox (1996) provided a Minitab macro for estimating shift functions and another for constructing confidence intervals for the difference between the two groups' deciles at various deciles throughout the comparison group's distribution. Wilcox (1997) also provided S-PLUS software functions for making robust inferences about shift functions and for constructing robust simultaneous confidence intervals for them. Lunneborg (1986) provided discussion and a method for constructing a confidence interval for the difference between corresponding quantiles of two distributions.
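A sketch of decile shift-function point estimates (ours; Wilcox's macros add the simultaneous confidence bands, which this omits, and the simulated groups are hypothetical):

```python
import numpy as np

def shift_function(experimental, control, quantiles=np.arange(0.1, 1.0, 0.1)):
    """For each quantile q, the shift X_qe - X_qc: how far the control
    distribution must be shifted to match the treated group at q."""
    x_qc = np.quantile(control, quantiles)
    x_qe = np.quantile(experimental, quantiles)
    return list(zip(quantiles, x_qc, x_qe - x_qc))

rng = np.random.default_rng(2)
control = rng.normal(50, 10, 200)
treated = rng.normal(53, 18, 200)  # shifts the center and spreads the tails
for q, x_qc, shift in shift_function(treated, control):
    print(f"q = {q:.1f}  X_qc = {x_qc:6.1f}  shift = {shift:+6.1f}")
```

For these inputs the shift is negative at the lower deciles and strongly positive at the upper ones, the kind of pattern that a single d near zero would conceal.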
A general definition of quantiles suits the purpose of the present article. However, statistical packages use a variety of definitions of sample quantiles (Hyndman & Fan, 1996). For an introduction to quantiles, see Hoaglin, Mosteller, and Tukey (1985). See Dielman, Lowry, and Pfaffenberger (1994) for results of Monte Carlo studies of the behavior of 10 supposedly distribution-free estimators of quantiles for a variety of distributions. For discussion of distribution-free univariate and multivariate methods for comparing the probability density functions of two groups, see Silverman (1986) and Izenman (1991).

Descriptive graphical methods for depicting differences between two distributions at levels in addition to their centers are the Wilk and Gnanadesikan (1968) percentile comparison graph, the Tukey (1984) sum-difference graph, both discussed by Cleveland (1985), and the ordinal dominance curve (Darlington, 1973) that is similar to the percentile comparison graph. Development of additional methods for graphic depiction of effect sizes may promote the greater use of measures of effect size. A simple example is the depiction of two boxplots, one above the other, using the Minitab STACK command.

Postscripts

There is an embarrassment of riches in the variety of solutions to the problem of heteroscedasticity in significance testing (Grissom, 2000; Wilcox, 1996, 1997). Therefore, future primary studies may become a mix of reports of means, transformed means, trimmed means, medians, various other resistant measures of location, shift functions, CLs, and point-biserial and other measures of correlation. If so, some of these studies' tests of significance may be better at controlling rates of Type I and Type II error under heteroscedasticity by avoiding the use of means, as encouraged by Wilcox (1996, 1997), but the effect size synthesizing problems of meta-analysts will be greatly exacerbated.

Workers in each area of research should learn about the degree of heteroscedasticity typically arising from its types of participants, measures, and treatments. Therefore, estimation of population VRs should be encouraged. In addition, development of a method for constructing simultaneous confidence intervals for population VRs that is robust to nonnormality would be very useful. Descriptive meta-analyses of sample VRs may be conducted (Feingold, 1992a). However, one should not use the arithmetic mean of s_a²/s_b² directly because, even if there is equality of population variances, variability of sample variances will cause the mean of the ratio s_a²/s_b² to be equal to or greater than 1. For example, suppose that σ_a² = σ_b² and, because of sampling variability, s_a² = 2s_b² and s_b² = 2s_a² equally often across primary studies, yielding primary VR values of 1/2 = .5 and 2/1 = 2.0, respectively, equally often. The arithmetic mean of .5 and 2 is not 1, but 1.25. Feingold (1992a) and Shaffer (1992) discussed the proper method of cumulation of VRs through the use of the median VR or the mean of the logarithms of the variance ratios.

Sampling distributions for additional heteroscedastic estimators of effect size that do not assume normality are also needed so that significance of estimates can be tested and confidence intervals can be constructed, as they can be for the Pr(X_1 > X_2) measure. Monte Carlo simulations show that coverage for lower confidence bounds for Pr(X_1 > X_2) is generally very close to the nominal .95 level when the PS estimator is used (Mee, 1990).
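The artifact and the safer cumulations are easy to demonstrate (a sketch; the three hypothetical primary-study VRs are invented to be log-symmetric around 1, as when the population variances are actually equal):

```python
import numpy as np

vrs = np.array([0.5, 1.0, 2.0])  # invented VRs from three primary studies

print(np.mean(vrs))                  # 1.167 -- arithmetic mean is pulled above 1
print(np.median(vrs))                # 1.0   (median VR)
print(np.exp(np.mean(np.log(vrs))))  # 1.0   (mean log VR, back-transformed)
```

The same artifact appears in the text's two-study example: the arithmetic mean of .5 and 2.0 is 1.25, whereas the mean of their logarithms back-transforms to exactly 1.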
References

Abelson, R. P., & Prentice, D. A. (1997). Contrast tests of interaction hypotheses. Psychological Methods, 2, 315–328.
American Psychological Association. (1994). Publication manual of the American Psychological Association (4th ed.). Washington, DC: Author.
Becker, G. (1996). Bias in the assessment of gender differences. American Psychologist, 51, 154–155.
Brown, M. B., & Forsythe, A. B. (1974). Robust tests for equality of variances. Journal of the American Statistical Association, 69, 364–367.
Brown, R. A., Evans, D. M., Miller, I. W., Burgess, E. S., & Mueller, T. I. (1997). Cognitive-behavioral treatment for depression in alcoholism. Journal of Consulting and Clinical Psychology, 65, 715–726.
Bryk, A. S., & Raudenbush, S. W. (1988). Heterogeneity of variance in experimental studies: A challenge to conventional interpretations. Psychological Bulletin, 104, 396–404.
Cleveland, W. S. (1985). The elements of graphing data. Monterey, CA: Wadsworth.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Academic Press.
Colditz, G. A., Miller, J. N., & Mosteller, F. (1988). Measuring gain in the evaluation of medical technology: The probability of a better outcome. International Journal of Technology Assessment in Health Care, 4, 637–642.
Conti, L., & Musty, R. E. (1984). The effects of delta-9-tetrahydrocannabinol injections to the nucleus accumbens on the locomotor activity of rats. In S. Agurell, W. L. Dewey, & R. E. White (Eds.), The cannabinoids: Chemical, pharmacologic, and therapeutic aspects. New York: Academic Press.
Darlington, M. L. (1973). Comparing two groups by simple graphs. Psychological Bulletin, 79, 110–116.
Dielman, T., Lowry, C., & Pfaffenberger, R. (1994). A comparison of quantile estimators. Communications in Statistics B: Simulation and Computation, 23, 355–371.
Doksum, K. A. (1977). Some graphical methods in statistics: A review and some extensions. Statistica Neerlandica, 31, 53–68.
Eagly, A. (1997). Data archiving in psychology. Science Agenda, 10, 14.
Feingold, A. (1992a). Cumulation of variance ratios. Review of Educational Research, 62, 433–434.
Feingold, A. (1992b). Sex differences in variability in intellectual abilities: A new look at an old controversy. Review of Educational Research, 62, 61–84.
Feingold, A. (1995). The additive effects of differences in central tendency and variability are important in comparisons between groups. American Psychologist, 50, 5–13.
Feske, U., & Goldstein, A. J. (1997). Eye movement desensitization and reprocessing treatment for panic disorder: A controlled outcome and partial dismantling study. Journal of Consulting and Clinical Psychology, 65, 1026–1035.
Fisher, L. D., & van Belle, G. (1993). Biostatistics: A methodology for the health sciences. New York: Wiley.
Fleiss, J. L. (1986). The design and analysis of clinical experiments. New York: Wiley.
Fligner, M. A., & Policello, G. E., II. (1981). Robust rank procedures for the Behrens–Fisher problem. Journal of the American Statistical Association, 76, 162–168.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Thousand Oaks, CA: Sage.
Goldberg, K. M., & Iglewicz, B. (1992). Bivariate extensions of the boxplot. Technometrics, 34, 307–320.
Grissom, R. J. (1994a). Probability of the superior outcome of one treatment over another. Journal of Applied Psychology, 79, 314–316.
Grissom, R. J. (1994b). Statistical analysis of ordinal categorical status after therapies. Journal of Consulting and Clinical Psychology, 62, 281–284.
Grissom, R. J. (1996). The magical number .7 ± .2: Meta-meta-analysis of the probability of superior outcome in comparisons involving therapy, placebo, and control. Journal of Consulting and Clinical Psychology, 64, 973–982.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and Clinical Psychology, 68, 155–165.
Harwell, M. (1997). An empirical study of Hedges's homogeneity test. Psychological Methods, 2, 219–231.
Hedges, L. V., & Friedman, L. (1993). Gender differences in variability in intellectual abilities: A reanalysis of Feingold's results. Review of Educational Research, 63, 94–105.
Hedges, L. V., & Nowell, A. (1995). Sex differences in mental test scores, variability, and numbers of high-scoring individuals. Science, 269, 41–45.
Hedges, L. V., & Olkin, I. (1984). Nonparametric estimators of effect size in meta-analysis. Psychological Bulletin, 96, 573–580.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. San Diego, CA: Academic Press.
Hekmat, H. (1973). Systematic versus semantic desensitization and implosive therapy: A comparative study. Journal of Consulting and Clinical Psychology, 40, 202–209.
Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (Eds.). (1985). Exploring data tables, trends, and shapes. New York: Wiley.
Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (Eds.). (1991). Fundamentals of exploratory analysis of variance. New York: Wiley.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis. Newbury Park, CA: Sage.
Huynh, C. L. (1989, March). A unified approach to the estimation of effect size in meta-analysis. Paper presented at the annual meeting of the American Educational Research Association, San Francisco. (ERIC Document Reproduction Service No. ED 306 248)
Hyndman, R. J., & Fan, Y. (1996). Sample quantiles in statistical packages. The American Statistician, 50, 361–365.
Izenman, A. J. (1991). Recent developments in nonparametric density estimation. Journal of the American Statistical Association, 86, 205–224.
Kendall, P. C., & Grove, W. M. (1988). Normative comparisons in therapy outcome. Behavioral Assessment, 10, 147–158.
Kendall, P. C., Marss-Garcia, A., Nath, S. R., & Sheldrick, R. C. (1999). Normative comparisons for the evaluation of clinical significance. Journal of Consulting and Clinical Psychology, 67, 285–299.
Keppel, G. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Kraemer, H. C. (1983). Theory of estimation and testing of effect sizes: Use in meta-analysis. Journal of Educational Statistics, 8, 93–101.
Kraemer, H. C., & Andrews, G. (1982). A nonparametric technique for meta-analysis effect size calculation. Psychological Bulletin, 91, 404–412.
Laird, N. M., & Mosteller, F. (1990). Some statistical methods for combining experimental results. International Journal of Technology Assessment in Health Care, 6, 5–30.
Lax, D. A. (1985). Robust estimators of scale: Finite sample performance in long-tailed symmetric distributions. Journal of the American Statistical Association, 80, 736–741.
Lehmann, E. L. (1975). Nonparametrics. San Francisco: Holden-Day.
Levene, H. (1960). Robust tests for equality of variances. In I. Olkin (Ed.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 278–292). Stanford, CA: Stanford University Press.
Lix, L. M., Cribbie, R., & Keselman, H. J. (1996, June). The analysis of completely randomized univariate designs. Paper presented at the annual meeting of the Psychometric Society, Banff, Alberta, Canada.
Lunneborg, C. E. (1986). Confidence intervals for a quantile contrast: Application of the bootstrap. Journal of Applied Psychology, 71, 451–456.
Maxwell, S. E. (1998). Longitudinal designs in randomized group comparisons: When will intermediate observations increase statistical power? Psychological Methods, 3, 275–290.
McGraw, K. O., & Wong, S. P. (1992). A common language effect size statistic. Psychological Bulletin, 111, 361–365.
McNemar, Q. (1969). Psychological statistics (4th ed.). New York: Wiley.
Mee, R. W. (1990). Confidence intervals for probabilities and tolerance regions based on a generalization of the Mann–Whitney statistic. Journal of the American Statistical Association, 85, 793–800.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156–166.
Morris, S. B., & DeShon, R. P. (1997). Correcting effect sizes computed from factorial analysis of variance for use in meta-analysis. Psychological Methods, 2, 192–199.
Mosteller, F., & Bush, R. R. (1954). Selected quantitative techniques. In G. Lindzey & E. Aronson (Eds.), Handbook of social psychology (Vol. 1, pp. 289–334). Cambridge, MA: Addison-Wesley.
Mosteller, F., & Chalmers, T. C. (1992). Some progress and problems in meta-analysis of clinical trials. Statistical Science, 7, 227–236.
Murphy, B. P. (1976). Comparison of some two sample means tests by simulation. Communications in Statistics B: Simulation and Computation, 5, 23–32.
Norusis, M. J. (1995). SPSS 6.1 guide to data analysis. Englewood Cliffs, NJ: Prentice Hall.
O'Brien, P. C. (1988). Comparing two samples: Extensions of the t, rank-sum, and log-rank tests. Journal of the American Statistical Association, 83, 52–61.
O'Brien, R. G. (1978). Robust techniques for testing heterogeneity of variance. Psychometrika, 43, 327–342.
Oliver, M. B., & Hyde, J. S. (1995). Gender differences in attitudes toward homosexuality: A reply to Whitely and Kite. Psychological Bulletin, 117, 155–158.
Pratt, J. W. (1964). Obustness [sic] of some procedures for the two-sample location problem. Journal of the American Statistical Association, 59, 665–680.
Pratt, J. W., & Gibbons, J. D. (1981). Concepts of nonparametric theory. New York: Springer-Verlag.
Raudenbush, S. W., & Bryk, A. S. (1987). Examining correlates of diversity. Journal of Educational Statistics, 12, 241–269.
Ray, J. W., & Shadish, W. R. (1996). How interchangeable are different measures of effect size? Journal of Consulting and Clinical Psychology, 64, 1316–1325.
Rosenthal, R. (1991). Meta-analytic procedures for social research. Newbury Park, CA: Sage.
Sawilowsky, S. S., & Blair, R. C. (1992). A more realistic look at the robustness and Type II error properties of the t test to departures from population normality. Psychological Bulletin, 111, 352–360.
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199–223.
Shaffer, J. P. (1992). Caution on the use of variance ratios: A comment. Review of Educational Research, 62, 429–432.
Shoemaker, L. H. (1999). Interquantile tests for dispersion in skewed distributions. Communications in Statistics B: Simulation and Computation, 28, 189–205.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. New York: Chapman & Hall.
Smith, M. L., Glass, G. V., & Miller, T. I. (1980). The benefits of psychotherapy. Baltimore: Johns Hopkins University Press.
Snedecor, G. W., & Cochran, W. G. (1989). Statistical methods (8th ed.). Ames: Iowa State University Press.
Tomarken, A. J., & Serlin, R. C. (1986). Comparison of ANOVA alternatives under variance heterogeneity and specific noncentrality structures. Psychological Bulletin, 99, 90–99.
Tukey, J. W. (1984). The collected works of John W. Tukey. Monterey, CA: Wadsworth.
Vargha, A., & Delaney, H. D. (1998). The Kruskal–Wallis test and stochastic homogeneity. Journal of Educational and Behavioral Statistics, 23, 170–192.
Weisz, J. R., Weiss, B., Han, S. S., Granger, D. A., & Morton, T. (1995). Effects of psychotherapy with children and adolescents revisited: A meta-analysis of treatment outcome studies. Psychological Bulletin, 117, 450–468.
Weston, T., & Hopkins, K. D. (1998). Testing for homogeneity of variance: An evaluation of current practice. Unpublished manuscript, University of Colorado at Boulder.
Whitney, D. R. (1951). A bivariate extension of the U statistic. Annals of Mathematical Statistics, 22, 274–282.
Wilcox, R. R. (1987). New designs in analysis of variance. Annual Review of Psychology, 38, 29–60.
Wilcox, R. R. (1994). The percentage bend correlation coefficient. Psychometrika, 59, 601–616.
Wilcox, R. R. (1995). Comparing two independent groups via multiple quantiles. The Statistician, 44, 91–99.
Wilcox, R. R. (1996). Statistics for the social sciences. San Diego, CA: Academic Press.
Wilcox, R. R. (1997). Introduction to robust estimation and hypothesis testing. San Diego, CA: Academic Press.
Wilcox, R. R., & Muska, J. (1999). Measuring effect size: A non-parametric analogue of ω². British Journal of Mathematical and Statistical Psychology, 52, 93–110.
Wilk, M. B., & Gnanadesikan, R. (1968). Probability plotting methods for the analysis of data. Biometrika, 55, 1–17.
Winer, B. J., Brown, D. R., & Michels, K. M. (1991). Statistical principles in experimental design (3rd ed.). New York: McGraw-Hill.
Wolfe, D. A., & Hogg, R. V. (1971). On constructing statistics and reporting data. The American Statistician, 25, 27–30.
Zimmerman, D. W., & Zumbo, B. D. (1993). Rank transformations and the power of the Student t test and Welch t′ test for non-normal populations with unequal variances. Canadian Journal of Experimental Psychology, 47, 523–539.

Received January 20, 1999
Revision received August 28, 2000
Accepted December 21, 2000