CHAPTER 4

Effect Sizes Based on Means

Introduction
Raw (unstandardized) mean difference D
Standardized mean difference, d and g
Response ratios

INTRODUCTION

When the studies report means and standard deviations, the preferred effect size is usually the raw mean difference, the standardized mean difference, or the response ratio. These effect sizes are discussed in this chapter.

RAW (UNSTANDARDIZED) MEAN DIFFERENCE D

When the outcome is reported on a meaningful scale and all studies in the analysis use the same scale, the meta-analysis can be performed directly on the raw difference in means (henceforth, we will use the more common term, raw mean difference). The primary advantage of the raw mean difference is that it is intuitively meaningful, either inherently (for example, blood pressure, which is measured on a known scale) or because of widespread use (for example, a national achievement test for students, where all relevant parties are familiar with the scale).

Consider a study that reports means for two groups (Treated and Control) and suppose we wish to compare the means of these two groups. Let $\mu_1$ and $\mu_2$ be the true (population) means of the two groups. The population mean difference is defined as

$$\Delta = \mu_1 - \mu_2. \tag{4.1}$$

In the two sections that follow we show how to compute an estimate $D$ of this parameter and its variance from studies that used two independent groups and from studies that used paired groups or matched designs.

Introduction to Meta-Analysis. Michael Borenstein, L. V. Hedges, J. P. T. Higgins and H. R. Rothstein © 2009 John Wiley & Sons, Ltd. ISBN: 978-0-470-05724-7
Effect Size and Precision

Computing D from studies that use independent groups

We can estimate the mean difference $\Delta$ from a study that used two independent groups as follows. Let $\bar{X}_1$ and $\bar{X}_2$ be the sample means of the two independent groups.
The sample estimate of $\Delta$ is just the difference in sample means, namely

$$D = \bar{X}_1 - \bar{X}_2. \tag{4.2}$$

Note that uppercase $D$ is used for the raw mean difference, whereas lowercase $d$ will be used for the standardized mean difference (below).

Let $S_1$ and $S_2$ be the sample standard deviations of the two groups, and $n_1$ and $n_2$ be the sample sizes in the two groups. If we assume that the two population standard deviations are the same (as is assumed to be the case in most parametric data analysis techniques), so that $\sigma_1 = \sigma_2 = \sigma$, then the variance of $D$ is

$$V_D = \frac{n_1 + n_2}{n_1 n_2} S_{\text{pooled}}^2, \tag{4.3}$$

where

$$S_{\text{pooled}} = \sqrt{\frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}}. \tag{4.4}$$

If we don't assume that the two population standard deviations are the same, then the variance of $D$ is

$$V_D = \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}. \tag{4.5}$$

In either case, the standard error of $D$ is then the square root of $V_D$,

$$\mathit{SE}_D = \sqrt{V_D}. \tag{4.6}$$

For example, suppose that a study has sample means $\bar{X}_1 = 103.00$, $\bar{X}_2 = 100.00$, sample standard deviations $S_1 = 5.5$, $S_2 = 4.5$, and sample sizes $n_1 = n_2 = 50$.
The raw mean difference $D$ is

$$D = 103.00 - 100.00 = 3.00.$$

If we assume that $\sigma_1 = \sigma_2$ then the pooled standard deviation within groups is

$$S_{\text{pooled}} = \sqrt{\frac{(50 - 1) \times 5.5^2 + (50 - 1) \times 4.5^2}{50 + 50 - 2}} = 5.0249.$$

The variance and standard error of $D$ are given by

$$V_D = \frac{50 + 50}{50 \times 50} \times 5.0249^2 = 1.0100,$$

and

$$\mathit{SE}_D = \sqrt{1.0100} = 1.0050.$$

If we do not assume that $\sigma_1 = \sigma_2$ then the variance and standard error of $D$ are given by

$$V_D = \frac{5.5^2}{50} + \frac{4.5^2}{50} = 1.0100$$

and

$$\mathit{SE}_D = \sqrt{1.0100} = 1.0050.$$

In this example formulas (4.3) and (4.5) yield the same result, but this will be true only if the sample size and/or the estimate of the variances is the same in the two groups.

Computing D from studies that use matched groups or pre-post scores

The previous formulas are appropriate for studies that use two independent groups. Another study design is the use of matched groups, where pairs of participants are matched in some way (for example, siblings, or patients at the same stage of disease), with the two members of each pair then being assigned to different groups. The unit of analysis is the pair, and the advantage of this design is that each pair serves as its own control, reducing the error term and increasing the statistical power. The magnitude of the impact depends on the correlation between (for example) siblings, with a higher correlation yielding a lower variance (and increased precision).

The sample estimate of $\Delta$ is just the sample mean difference, $D$.
If we have the difference score for each pair, which gives us the mean difference $\bar{X}_{\text{diff}}$ and the standard deviation of these differences ($S_{\text{diff}}$), then

$$D = \bar{X}_{\text{diff}}, \tag{4.7}$$

$$V_D = \frac{S_{\text{diff}}^2}{n}, \tag{4.8}$$

where $n$ is the number of pairs, and

$$\mathit{SE}_D = \sqrt{V_D}. \tag{4.9}$$

For example, if the mean difference is 5.00 with standard deviation of the difference of 10.00 and $n$ of 50 pairs, then

$$D = 5.0000,$$

$$V_D = \frac{10.00^2}{50} = 2.0000, \tag{4.10}$$

and

$$\mathit{SE}_D = \sqrt{2.00} = 1.4142. \tag{4.11}$$

Alternatively, if we have the mean and standard deviation for each set of scores (for example, siblings A and B), the difference is

$$D = \bar{X}_1 - \bar{X}_2. \tag{4.12}$$

The variance is again given by

$$V_D = \frac{S_{\text{diff}}^2}{n}, \tag{4.13}$$

where $n$ is the number of pairs, and the standard error is given by

$$\mathit{SE}_D = \sqrt{V_D}. \tag{4.14}$$

However, in this case we need to compute the standard deviation of the difference scores from the standard deviation of each sibling's scores. This is given by

$$S_{\text{diff}} = \sqrt{S_1^2 + S_2^2 - 2 \times r \times S_1 \times S_2} \tag{4.15}$$

where $r$ is the correlation between 'siblings' in matched pairs. If $S_1 = S_2$, then (4.15) simplifies to

$$S_{\text{diff}} = \sqrt{2 \times S_{\text{pooled}}^2 (1 - r)}. \tag{4.16}$$

In either case, as $r$ moves toward 1.0 the standard error of the paired difference will decrease, and when $r = 0$ the standard error of the difference is the same as it would be for a study with two independent groups, each of size $n$.

For example, suppose the means for siblings A and B are 105.00 and 100.00, with standard deviations 10 and 10, the correlation between the two sets of scores is 0.50, and the number of pairs is 50.
Then

$$D = 105.00 - 100.00 = 5.0000,$$

$$V_D = \frac{10.00^2}{50} = 2.0000,$$

and

$$\mathit{SE}_D = \sqrt{2.00} = 1.4142.$$

In the calculation of $V_D$, the $S_{\text{diff}}$ is computed using

$$S_{\text{diff}} = \sqrt{10^2 + 10^2 - 2 \times 0.50 \times 10 \times 10} = 10.0000$$

or

$$S_{\text{diff}} = \sqrt{2 \times 10^2 (1 - 0.50)} = 10.0000.$$

The formulas for matched designs apply to pre-post designs as well. The pre and post means correspond to the means in the matched groups, $n$ is the number of subjects, and $r$ is the correlation between pre-scores and post-scores.

Calculation of effect size estimates from information that is reported

When a researcher has access to a full set of summary data such as the mean, standard deviation, and sample size for each group, the computation of the effect size and its variance is relatively straightforward. In practice, however, the researcher will often be working with only partial data. For example, a paper may publish only the p-value, means and sample sizes from a test of significance, leaving it to the meta-analyst to back-compute the effect size and variance. For information on computing effect sizes from partial information, see Borenstein et al. (2009).

Including different study designs in the same analysis

Sometimes a systematic review will include studies that used independent groups and also studies that used matched groups. From a statistical perspective the effect size ($D$) has the same meaning regardless of the study design. Therefore, we can compute the effect size and variance from each study using the appropriate formula, and then include all studies in the same analysis.
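
The two sets of formulas for the raw mean difference, (4.2) to (4.6) for independent groups and (4.12) to (4.15) for matched groups, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `raw_md_independent` and `raw_md_matched` and the `equal_var` switch are hypothetical and do not come from the text. The inputs are the two worked examples from this chapter.

```python
import math


def raw_md_independent(mean1, mean2, sd1, sd2, n1, n2, equal_var=True):
    """Return (D, V_D, SE_D) for two independent groups, formulas (4.2)-(4.6)."""
    d_raw = mean1 - mean2
    if equal_var:
        # Pooled within-group standard deviation, formula (4.4).
        s_pooled = math.sqrt(
            ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
        )
        var = (n1 + n2) / (n1 * n2) * s_pooled**2  # formula (4.3)
    else:
        var = sd1**2 / n1 + sd2**2 / n2  # formula (4.5)
    return d_raw, var, math.sqrt(var)


def raw_md_matched(mean1, mean2, sd1, sd2, r, n_pairs):
    """Return (D, V_D, SE_D) for matched pairs, formulas (4.12)-(4.15)."""
    d_raw = mean1 - mean2
    # SD of the difference scores from the two SDs and their correlation (4.15).
    s_diff = math.sqrt(sd1**2 + sd2**2 - 2 * r * sd1 * sd2)
    var = s_diff**2 / n_pairs  # formula (4.13)
    return d_raw, var, math.sqrt(var)


independent = raw_md_independent(103.0, 100.0, 5.5, 4.5, 50, 50)
matched = raw_md_matched(105.0, 100.0, 10.0, 10.0, 0.50, 50)
print([round(v, 4) for v in independent])  # [3.0, 1.01, 1.005]
print([round(v, 4) for v in matched])      # [5.0, 2.0, 1.4142]
```

Both functions return the effect size and variance in the same form, which is what allows studies of either design to enter the same analysis.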
While there is no technical barrier to using different study designs in the same analysis, there may be a concern that studies which used different designs might differ in substantive ways as well (see Chapter 40).

For all study designs (whether using independent or paired groups) the direction of the effect ($\bar{X}_1 - \bar{X}_2$ or $\bar{X}_2 - \bar{X}_1$) is arbitrary, except that the researcher must decide on a convention and then apply this consistently. For example, if a positive difference will indicate that the treated group did better than the control group, then this convention must apply for studies that used independent designs and for studies that used pre-post designs. In some cases it might be necessary to reverse the computed sign of the effect size to ensure that the convention is followed.

STANDARDIZED MEAN DIFFERENCE, d AND g

As noted, the raw mean difference is a useful index when the measure is meaningful, either inherently or because of widespread use. By contrast, when the measure is less well known (for example, a proprietary scale with limited distribution), the use of a raw mean difference has less to recommend it. In any event, the raw mean difference is an option only if all the studies in the meta-analysis use the same scale. If different studies use different instruments (such as different psychological or educational tests) to assess the outcome, then the scale of measurement will differ from study to study and it would not be meaningful to combine raw mean differences.

In such cases we can divide the mean difference in each study by that study's standard deviation to create an index (the standardized mean difference) that would be comparable across studies. This is the same approach suggested by Cohen (1969, 1987) in connection with describing the magnitude of effects in statistical power analysis.

The standardized mean difference can be considered as being comparable across studies based on either of two arguments (Hedges and Olkin, 1985).
If the outcome measures in all studies are linear transformations of each other, the standardized mean difference can be seen as the mean difference that would have been obtained if all data were transformed to a scale where the standard deviation within-groups was equal to 1.0.

The other argument for comparability of standardized mean differences is the fact that the standardized mean difference is a measure of overlap between distributions. In this telling, the standardized mean difference reflects the difference between the distributions in the two groups (and how each represents a distinct cluster of scores) even if they do not measure exactly the same outcome (see Cohen, 1987; Grissom and Kim, 2005).

Consider a study that uses two independent groups, and suppose we wish to compare the means of these two groups. Let $\mu_1$ and $\sigma_1$ be the true (population) mean and standard deviation of the first group and let $\mu_2$ and $\sigma_2$ be the true (population) mean and standard deviation of the other group. If the two population standard deviations are the same (as is assumed in most parametric data analysis techniques), so that $\sigma_1 = \sigma_2 = \sigma$, then the standardized mean difference parameter or population standardized mean difference is defined as

$$\delta = \frac{\mu_1 - \mu_2}{\sigma}. \tag{4.17}$$

In the sections that follow, we show how to estimate $\delta$ from studies that used independent groups, and from studies that used pre-post or matched group designs. It is also possible to estimate $\delta$ from studies that used other designs (including clustered designs) but these are not addressed here (see resources at the end of this Part). We make the common assumption that $\sigma_1^2 = \sigma_2^2$, which allows us to pool the estimates of the standard deviation, and do not address the case where these are assumed to differ from each other.

Computing d and g from studies that use independent groups

We can estimate the standardized mean difference ($\delta$) from studies that used two independent groups as

$$d = \frac{\bar{X}_1 - \bar{X}_2}{S_{\text{within}}}. \tag{4.18}$$

In the numerator, $\bar{X}_1$ and $\bar{X}_2$ are the sample means in the two groups. In the denominator $S_{\text{within}}$ is the within-groups standard deviation, pooled across groups,

$$S_{\text{within}} = \sqrt{\frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}} \tag{4.19}$$

where $n_1$ and $n_2$ are the sample sizes in the two groups, and $S_1$ and $S_2$ are the standard deviations in the two groups. The reason that we pool the two sample estimates of the standard deviation is that even if we assume that the underlying population standard deviations are the same (that is, $\sigma_1 = \sigma_2 = \sigma$), it is unlikely that the sample estimates $S_1$ and $S_2$ will be identical. By pooling the two estimates of the standard deviation, we obtain a more accurate estimate of their common value.

The sample estimate of the standardized mean difference is often called Cohen's $d$ in research synthesis. Some confusion about the terminology has resulted from the fact that the index $\delta$, originally proposed by Cohen as a population parameter for describing the size of effects for statistical power analysis, is also sometimes called $d$. In this volume we use the symbol $\delta$ to denote the effect size parameter and $d$ for the sample estimate of that parameter.

The variance of $d$ is given (to a very good approximation) by

$$V_d = \frac{n_1 + n_2}{n_1 n_2} + \frac{d^2}{2(n_1 + n_2)}. \tag{4.20}$$

In this equation the first term on the right of the equals sign reflects uncertainty in the estimate of the mean difference (the numerator in (4.18)), and the second reflects uncertainty in the estimate of $S_{\text{within}}$ (the denominator in (4.18)).
The standard error of $d$ is the square root of $V_d$,

$$\mathit{SE}_d = \sqrt{V_d}. \tag{4.21}$$

It turns out that $d$ has a slight bias, tending to overestimate the absolute value of $\delta$ in small samples. This bias can be removed by a simple correction that yields an unbiased estimate of $\delta$, with the unbiased estimate sometimes called Hedges' $g$ (Hedges, 1981). To convert from $d$ to Hedges' $g$ we use a correction factor, which is called $J$. Hedges (1981) gives the exact formula for $J$, but in common practice researchers use an approximation,

$$J = 1 - \frac{3}{4\,df - 1}. \tag{4.22}$$

In this expression, $df$ is the degrees of freedom used to estimate $S_{\text{within}}$, which for two independent groups is $n_1 + n_2 - 2$. This approximation always has error of less than 0.007 and less than 0.035 percent when $df \ge 10$ (Hedges, 1981). Then,

$$g = J \times d, \tag{4.23}$$

$$V_g = J^2 \times V_d, \tag{4.24}$$

and

$$\mathit{SE}_g = \sqrt{V_g}. \tag{4.25}$$

For example, suppose a study has sample means $\bar{X}_1 = 103$, $\bar{X}_2 = 100$, sample standard deviations $S_1 = 5.5$, $S_2 = 4.5$, and sample sizes $n_1 = n_2 = 50$. We would estimate the pooled-within-groups standard deviation as

$$S_{\text{within}} = \sqrt{\frac{(50 - 1) \times 5.5^2 + (50 - 1) \times 4.5^2}{50 + 50 - 2}} = 5.0249.$$

Then,

$$d = \frac{103 - 100}{5.0249} = 0.5970,$$

$$V_d = \frac{50 + 50}{50 \times 50} + \frac{0.5970^2}{2(50 + 50)} = 0.0418,$$

and

$$\mathit{SE}_d = \sqrt{0.0418} = 0.2044.$$

The correction factor ($J$), Hedges' $g$, its variance and standard error are given by

$$J = 1 - \frac{3}{4 \times 98 - 1} = 0.9923,$$

$$g = 0.9923 \times 0.5970 = 0.5924,$$

$$V_g = 0.9923^2 \times 0.0418 = 0.0411,$$

and

$$\mathit{SE}_g = \sqrt{0.0411} = 0.2028.$$

The correction factor ($J$) is always less than 1.0, and so $g$ will always be less than $d$ in absolute value, and the variance of $g$ will always be less than the variance of $d$.
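
The independent-groups computation of $d$ and $g$, formulas (4.18) to (4.25), can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names `hedges_j` and `smd_independent` and the dictionary keys are hypothetical, and $J$ uses the approximation (4.22), not the exact formula from Hedges (1981). The inputs are those of the worked example.

```python
import math


def hedges_j(df):
    """Approximate small-sample correction factor J, formula (4.22)."""
    return 1 - 3 / (4 * df - 1)


def smd_independent(mean1, mean2, sd1, sd2, n1, n2):
    """Return d, V_d, g and V_g for two independent groups, (4.18)-(4.24)."""
    # Within-groups standard deviation, pooled across groups (4.19).
    s_within = math.sqrt(
        ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    )
    d = (mean1 - mean2) / s_within                        # (4.18)
    v_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))  # (4.20)
    j = hedges_j(n1 + n2 - 2)                             # df = n1 + n2 - 2
    return {"d": d, "V_d": v_d, "g": j * d, "V_g": j**2 * v_d}


result = smd_independent(103, 100, 5.5, 4.5, 50, 50)
for name, value in result.items():
    print(f"{name} = {value:.4f}")
# d = 0.5970
# V_d = 0.0418
# g = 0.5924
# V_g = 0.0411
```

The standard errors follow by taking `math.sqrt` of `V_d` and `V_g`.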
In practice, $J$ will be very close to 1.0 unless $df$ is very small (say, less than 10) and so (as in this example) the difference is usually trivial (Hedges, 1981).

Some slightly different expressions for the variance of $d$ (and $g$) have been given by different authors and even the same authors at different times. For example, the denominator of the second term of the variance of $d$ is given here as $2(n_1 + n_2)$. This expression is obtained by one method (assuming the $n$'s become large with $\delta$ fixed). An alternate derivation (assuming $n$'s become large with $\sqrt{n}\,\delta$ fixed) leads to a denominator in the second term that is slightly different, namely $2(n_1 + n_2 - 2)$. Unless $n_1$ and $n_2$ are very small, these expressions will be almost identical.

Similarly, the expression given here for the variance of $g$ is $J^2$ times the variance of $d$, but many authors ignore the $J^2$ term because it is so close to unity in most cases. Again, while it is preferable to include this correction factor, the inclusion of this factor is likely to make little practical difference.

Computing d and g from studies that use pre-post scores or matched groups

We can estimate the standardized mean difference ($\delta$) from studies that used matched groups or pre-post scores in one group. The formula for the sample estimate of $d$ is

$$d = \frac{\bar{Y}_{\text{diff}}}{S_{\text{within}}} = \frac{\bar{Y}_1 - \bar{Y}_2}{S_{\text{within}}}. \tag{4.26}$$

This is the same formula as for independent groups (4.18). However, when we are working with independent groups the natural unit of deviation is the standard deviation within groups and so this value is typically reported (or easily imputed). By contrast, when we are working with matched groups, the natural unit of deviation is the standard deviation of the difference scores, and so this is the value that is likely to be reported. To compute $d$ from the standard deviation of the differences we need to impute the standard deviation within groups, which would then serve as the denominator in (4.26).
Concretely, when working with a matched study, the standard deviation within groups can be imputed from the standard deviation of the difference, using

$$S_{\text{within}} = \frac{S_{\text{diff}}}{\sqrt{2(1 - r)}}, \tag{4.27}$$

where $r$ is the correlation between pairs of observations (e.g., the pretest-posttest correlation). Then we can apply (4.26) to compute $d$. The variance of $d$ is given by

$$V_d = \left(\frac{1}{n} + \frac{d^2}{2n}\right) 2(1 - r), \tag{4.28}$$

where $n$ is the number of pairs. The standard error of $d$ is just the square root of $V_d$,

$$\mathit{SE}_d = \sqrt{V_d}. \tag{4.29}$$

Since the correlation between pre- and post-scores is required to impute the standard deviation within groups from the standard deviation of the difference, we must assume that this correlation is known or can be estimated with high precision. Otherwise we may estimate the correlation from related studies, and possibly perform a sensitivity analysis using a range of plausible correlations.

To compute Hedges' $g$ and associated statistics we would use formulas (4.22) through (4.25). The degrees of freedom for computing $J$ is $n - 1$, where $n$ is the number of pairs.

For example, suppose that a study has pre-test and post-test sample means $\bar{X}_1 = 103$, $\bar{X}_2 = 100$, sample standard deviation of the difference $S_{\text{diff}} = 5.5$, sample size $n = 50$, and a correlation between pre-test and post-test of $r = 0.7$.
The standard deviation within groups is imputed from the standard deviation of the difference by

$$S_{\text{within}} = \frac{5.5}{\sqrt{2(1 - 0.7)}} = 7.1005.$$

Then $d$, its variance and standard error are computed as

$$d = \frac{103 - 100}{7.1005} = 0.4225,$$

$$V_d = \left(\frac{1}{50} + \frac{0.4225^2}{2 \times 50}\right) \bigl(2(1 - 0.7)\bigr) = 0.0131,$$

and

$$\mathit{SE}_d = \sqrt{0.0131} = 0.1143.$$

The correction factor $J$, Hedges' $g$, its variance and standard error are given by

$$J = 1 - \frac{3}{4 \times 49 - 1} = 0.9846,$$

$$g = 0.9846 \times 0.4225 = 0.4160,$$

$$V_g = 0.9846^2 \times 0.0131 = 0.0127,$$

and

$$\mathit{SE}_g = \sqrt{0.0127} = 0.1126.$$

Including different study designs in the same analysis

As we noted earlier, a single systematic review can include studies that used independent groups and also studies that used matched groups. From a statistical perspective the effect size ($d$ or $g$) has the same meaning regardless of the study design. Therefore, we can compute the effect size and variance from each study using the appropriate formula, and then include all studies in the same analysis. While there are no technical barriers to using studies with different designs in the same analysis, there may be a concern that these studies could differ in substantive ways as well (see Chapter 40).

For all study designs the direction of the effect ($\bar{X}_1 - \bar{X}_2$ or $\bar{X}_2 - \bar{X}_1$) is arbitrary, except that the researcher must decide on a convention and then apply this consistently. For example, if a positive difference indicates that the treated group did better than the control group, then this convention must apply for studies that used independent designs and for studies that used pre-post designs. It must also apply for all outcome measures.
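
The matched-groups computation of $d$ and $g$, formulas (4.26) to (4.28) with $df = n - 1$ for $J$, can be sketched as below, together with a small sensitivity analysis over plausible correlations and a sign reversal. This is an illustrative sketch only: the function name `smd_matched`, its `sign` argument (set to -1 to reverse the direction of the effect) and the correlations 0.5 and 0.9 are assumptions made for the example, not values from the text.

```python
import math


def smd_matched(mean1, mean2, sd_diff, r, n_pairs, sign=1):
    """Return d, V_d, g and V_g for matched or pre-post data, (4.26)-(4.28).

    r must be below 1, otherwise (4.27) divides by zero.
    sign=-1 reverses the direction of the effect (hypothetical argument).
    """
    s_within = sd_diff / math.sqrt(2 * (1 - r))               # (4.27)
    d = sign * (mean1 - mean2) / s_within                     # (4.26)
    v_d = (1 / n_pairs + d**2 / (2 * n_pairs)) * 2 * (1 - r)  # (4.28)
    j = 1 - 3 / (4 * (n_pairs - 1) - 1)                       # (4.22), df = n - 1
    return {"d": d, "V_d": v_d, "g": j * d, "V_g": j**2 * v_d}


# Worked example from the text: r = 0.7.
base = smd_matched(103, 100, 5.5, 0.7, 50)
print({k: round(v, 4) for k, v in base.items()})
# {'d': 0.4225, 'V_d': 0.0131, 'g': 0.416, 'V_g': 0.0127}

# Sensitivity analysis over assumed correlations when r is not reported.
for r in (0.5, 0.7, 0.9):
    res = smd_matched(103, 100, 5.5, r, 50)
    print(f"r = {r}: g = {res['g']:.4f}, SE_g = {math.sqrt(res['V_g']):.4f}")

# Reversing the sign changes the direction of d and g but not their variances.
flipped = smd_matched(103, 100, 5.5, 0.7, 50, sign=-1)
```

Because $S_{\text{within}}$ is imputed from $S_{\text{diff}}$ through $r$, both the estimate and its variance move with the assumed correlation, which is why the loop reports both.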
In some cases (for example, if some studies defined outcome as the number of correct answers while others defined outcome as the number of mistakes) it will be necessary to reverse the computed sign of the effect size to ensure that the convention is applied consistently.

RESPONSE RATIOS

In research domains where the outcome is measured on a physical scale (such as length, area, or mass) and is unlikely to be zero, the ratio of the means in the two groups might serve as the effect size index. In experimental ecology this effect size index is called the response ratio (Hedges, Gurevitch, & Curtis, 1999). It is important to recognize that the response ratio is only meaningful when the outcome is measured on a true ratio scale. The response ratio is not meaningful for studies (such as most social science studies) that measure outcomes such as test scores, attitude measures, or judgments, since these have no natural scale units and no natural zero points.

For response ratios, computations are carried out on a log scale (see the discussion under risk ratios, below, for an explanation). We compute the log response ratio and the standard error of the log response ratio, and use these numbers to perform all steps in the meta-analysis. Only then do we convert the results back into the original metric. This is shown schematically in Figure 4.1.

Figure 4.1 Response ratios are analyzed in log units. (Schematic: the response ratios of Studies A, B and C are each converted to log response ratios, these are combined into a summary log response ratio, and the summary is converted back to a summary response ratio.)

The response ratio is computed as

$$R = \frac{\bar{X}_1}{\bar{X}_2} \tag{4.30}$$

where $\bar{X}_1$ is the mean of group 1 and $\bar{X}_2$ is the mean of group 2. The log response ratio is computed as

$$\mathit{lnR} = \ln(R) = \ln\left(\frac{\bar{X}_1}{\bar{X}_2}\right) = \ln(\bar{X}_1) - \ln(\bar{X}_2). \tag{4.31}$$

The variance of the log response ratio is approximately

$$V_{\mathit{lnR}} = S_{\text{pooled}}^2 \left(\frac{1}{n_1 \bar{X}_1^2} + \frac{1}{n_2 \bar{X}_2^2}\right), \tag{4.32}$$

where $S_{\text{pooled}}$ is the pooled standard deviation.
The approximate standard error is

$$\mathit{SE}_{\mathit{lnR}} = \sqrt{V_{\mathit{lnR}}}. \tag{4.33}$$

Note that we do not compute a variance for the response ratio in its original metric. Rather, we use the log response ratio and its variance in the analysis to yield a summary effect, confidence limits, and so on, in log units. We then convert each of these values back to response ratios using

$$R = \exp(\mathit{lnR}), \tag{4.34}$$

$$LL_R = \exp(LL_{\mathit{lnR}}), \tag{4.35}$$

and

$$UL_R = \exp(UL_{\mathit{lnR}}), \tag{4.36}$$

where $LL$ and $UL$ represent the lower and upper limits, respectively.

For example, suppose that a study has two independent groups with means $\bar{X}_1 = 61.515$, $\bar{X}_2 = 51.015$, pooled within-group standard deviation 19.475, and sample size $n_1 = n_2 = 10$. Then $R$, its variance and standard error are computed as

$$R = \frac{61.515}{51.015} = 1.2058,$$

$$\mathit{lnR} = \ln(1.2058) = 0.1871,$$

$$V_{\mathit{lnR}} = 19.475^2 \left(\frac{1}{10 \times 61.515^2} + \frac{1}{10 \times 51.015^2}\right) = 0.0246,$$

and

$$\mathit{SE}_{\mathit{lnR}} = \sqrt{0.0246} = 0.1568.$$

SUMMARY POINTS

• The raw mean difference ($D$) may be used as the effect size when the outcome scale is either inherently meaningful or well known due to widespread use. This effect size can only be used when all studies in the analysis used precisely the same scale.
• The standardized mean difference ($d$ or $g$) transforms all effect sizes to a common metric, and thus enables us to include different outcome measures in the same synthesis. This effect size is often used in primary research as well as meta-analysis, and therefore will be intuitive to many researchers.
• The response ratio ($R$) is often used in ecology. This effect size is only meaningful when the outcome has a natural zero point, but when this condition holds, it provides a unique perspective on the effect size.
• It is possible to compute an effect size and variance from studies that used two independent groups, from studies that used matched groups (or pre-post designs) and from studies that used clustered groups.
These effect sizes may then be included in the same meta-analysis.
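
The response-ratio computations in formulas (4.30) to (4.36) can be sketched in Python as follows. This is a hedged illustration: the function name `log_response_ratio` is hypothetical, the guard against non-positive means is an added safeguard reflecting the ratio-scale requirement, and the 95% limits built with a normal quantile of 1.96 are an assumption introduced only to show the back-transformation. The inputs are those of the worked example.

```python
import math


def log_response_ratio(mean1, mean2, s_pooled, n1, n2):
    """Return (lnR, V_lnR, SE_lnR) from formulas (4.31)-(4.33)."""
    if mean1 <= 0 or mean2 <= 0:
        # The log is undefined otherwise; the text requires a true ratio scale.
        raise ValueError("response ratios need positive means")
    ln_r = math.log(mean1) - math.log(mean2)                          # (4.31)
    var = s_pooled**2 * (1 / (n1 * mean1**2) + 1 / (n2 * mean2**2))   # (4.32)
    return ln_r, var, math.sqrt(var)                                  # (4.33)


ln_r, var, se = log_response_ratio(61.515, 51.015, 19.475, 10, 10)
print(f"lnR = {ln_r:.4f}, V_lnR = {var:.4f}")  # lnR = 0.1871, V_lnR = 0.0246

# All analysis happens in log units; only the final values are converted back.
z = 1.96  # assumption: normal-theory 95% limits, not specified in the text
ratio = math.exp(ln_r)           # (4.34)
lower = math.exp(ln_r - z * se)  # (4.35)
upper = math.exp(ln_r + z * se)  # (4.36)
print(f"R = {ratio:.4f}")        # R = 1.2058
```

Because the limits are exponentiated after being formed symmetrically in log units, the interval around $R$ is asymmetric in the original metric.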