J. Paediatr. Child Health (2003) 39, 309–311

Statistics for Clinicians 8: Non-parametric methods for continuous or ordered data

JB Carlin1,3 and LW Doyle2,3,4

1Clinical Epidemiology and Biostatistics Unit, Murdoch Childrens Research Institute, Departments of 2Obstetrics and Gynaecology and 3Paediatrics, University of Melbourne and 4Division of Newborn Services, Royal Women’s Hospital, Melbourne, Victoria, Australia

In previous articles in this series, we have discussed the use of the t-test and related confidence intervals for comparing (two) groups on the basis of a continuous outcome measure.1,2 These procedures rely on certain assumptions: in particular, that the data follow a normal distribution, unless we are dealing with large samples, where normality of the sample means can be assumed regardless of the distribution of the individual values. When we use the t-test methods with small samples, we are making a parametric assumption that the outcome values can be modelled by a normal distribution, characterized by two parameters: the mean and standard deviation (SD). In general, non-parametric methods avoid specific modelling assumptions such as this.

In the present article, we discuss non-parametric methods commonly used for comparing continuous outcomes between two groups. These are generally used with variables that are badly skewed. As a quick check, if the SD of a continuous variable that cannot take negative values is similar to or greater than its mean, the distribution must be skewed.
However, there are times when the mean exceeds the SD, but the data are still skewed (this is nearly always obvious when plotting data, as recommended as a first step in the second article in the series).3

The most common non-parametric method is the Mann–Whitney U-test (or Wilcoxon rank sum test, which is equivalent), a version of which appeared in 15% of all original articles in our previous survey of statistics appearing in the Journal of Paediatrics and Child Health.4 The underlying logic of this test involves ranking the data values, usually from lowest to highest, and then comparing the sums of the ranks in each group. This comparison provides an intuitive measure of the extent to which one group tends to have higher values than another, and can be regarded as the ‘signal’ component of the signal : noise ratio for this procedure.

We illustrate the Wilcoxon rank sum computation using the data in Table 1, which gives the days of assisted ventilation in boys and girls in a modified subset of very low birthweight (VLBW) infants from the dataset described in the second article in the series.3 Note that boys needed longer durations of assisted ventilation than girls. This is reflected in the mean values for the two groups, but we also see that the SDs are high relative to the means, indicating substantial skewness in the distributions.

To calculate the test statistic, we first need to rank the observations across both groups; the ranks are shown in the table adjacent to the data values. The difference between the groups is reflected in the fact that the ranks for boys are generally higher than the ranks for girls. This is summarized in the sums of the ranks, although these need to be interpreted in light of the number of observations in each group. The sum of the ranks in one of the groups is used to calculate a test statistic for which a P-value can be computed, under the null hypothesis that the two groups have the same distribution of values.
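The ranking step can be reproduced with a short script. The following is a minimal sketch in Python, using the Table 1 values; the function name `rank_pooled` is our own illustrative choice, and ties are given their average rank (a common convention, although there are no ties in these data).

```python
# Days of assisted ventilation, copied from Table 1.
boys = [13, 17, 26, 27, 40, 51, 62, 116, 204]
girls = [1, 2, 3, 5, 8, 19, 20, 23, 29, 30, 41, 52]


def rank_pooled(first, second):
    """Rank the pooled values from lowest to highest (hypothetical helper).

    Tied values, if any, share the average of the positions they occupy.
    """
    pooled = sorted(first + second)
    # Collect the 1-based positions occupied by each distinct value.
    positions = {}
    for index, value in enumerate(pooled, start=1):
        positions.setdefault(value, []).append(index)
    average_rank = {v: sum(p) / len(p) for v, p in positions.items()}
    return [average_rank[v] for v in first], [average_rank[v] for v in second]


boys_ranks, girls_ranks = rank_pooled(boys, girls)
rank_sum_boys = sum(boys_ranks)
rank_sum_girls = sum(girls_ranks)
print(rank_sum_boys, rank_sum_girls)
```

The two sums should agree with the footnote to Table 1, and together they must equal 21 × 22 / 2 = 231, the sum of the ranks 1 to 21.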
The ‘signal’ is the difference between the observed sum of the ranks and its expected value if there were no difference between the groups. For moderate to large sample sizes, a P-value is obtained by comparing the ‘signal’ with its standard error (or ‘noise’ measure); for small samples, an exact probability is available in tables. We omit the details of these calculations, but they are available in standard texts5 and statistical packages.

Although we have described the Wilcoxon version of the non-parametric test that was independently developed in a different form by Mann and Whitney, we recommend referring to the method as the Mann–Whitney U-test to avoid confusion with the paired Wilcoxon test that we describe below. Interested readers are referred elsewhere for details of the Mann–Whitney U version of the calculation.6

For the data in Table 1, the sum of the ranks for the boys is 128. It can be shown that with nine observations in the first group (boys) and 12 in the second, the expected sum of ranks for boys is 99. The difference, therefore, is 29. Dividing by its standard error, which turns out to be 14.1, gives a z-value of 2.06, with a corresponding P-value of 0.04. The conclusion would be that we have moderately strong evidence that boys need longer durations of assisted ventilation.

The Mann–Whitney U-test is often presented as a comparison of medians. Although the median is generally a better summary of skewed data than the mean, the test, as already stated, compares distributions rather than medians.7 Unlike with other tests described in previous articles in the series, such as the t-test and the χ2 test, we cannot check the accuracy of someone else’s Mann–Whitney U-test from summary statistics (such as medians) alone; the raw data are needed. As with means, the difference between medians can be expressed along with a 95% confidence interval (CI).
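The rank sum arithmetic quoted above for Table 1 can be checked with a few lines of Python. This is a sketch only: the article omits the formulas, so the expected value n1(N + 1)/2 and standard error sqrt(n1 n2 (N + 1)/12) used here are the standard textbook expressions for the normal approximation, assumed to apply without a tie or continuity correction. Variable names are our own.

```python
import math

# Group sizes and observed rank sum for boys, as reported for Table 1.
n_boys, n_girls = 9, 12
rank_sum_boys = 128

n_total = n_boys + n_girls

# Expected rank sum for boys under the null hypothesis (textbook formula).
expected = n_boys * (n_total + 1) / 2

# Standard error of the rank sum, assuming no ties (textbook formula).
standard_error = math.sqrt(n_boys * n_girls * (n_total + 1) / 12)

# Signal-to-noise ratio, referred to the standard normal distribution.
z = (rank_sum_boys - expected) / standard_error

# Two-sided P-value: 2 * (1 - Phi(|z|)), written with the complementary
# error function so that only the standard library is needed.
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"expected={expected:.0f}, SE={standard_error:.1f}, "
      f"z={z:.2f}, P={p_two_sided:.2f}")
```

With only 9 and 12 observations, an exact P-value from tables or a statistical package may differ slightly from this approximation.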
Such an interval conveys both the size of the difference, reflecting its clinical significance, and its statistical significance at the 5% level. In the present example, the difference in medians is 20.5 days, and the 95% CI ranges from 3 to 60 days. This should be reported more often in clinical journals, and the calculation is available in some statistical packages, as well as in some stand-alone programs.8

Correspondence: Associate Professor LW Doyle, Department of Obstetrics & Gynaecology, University of Melbourne, Parkville, Vic. 3010, Australia. Fax: +61 3 9347 1761; email: [email protected]
Accepted for publication 3 February 2003.

Table 1 Data on days of assisted ventilation for very low birthweight infants, modified for illustrative purposes

Group     Ventilation (days)                    Rank†
Boys‡     13 17 26 27 40 51 62 116 204          6 7 11 12 15 17 19 20 21
Girls§    1 2 3 5 8 19 20 23 29 30 41 52        1 2 3 4 5 8 9 10 13 14 16 18

† Sum of ranks (boys) 128, (girls) 103; ‡ median 40, mean 61.8, SD 61.9; § median 19.5, mean 19.4, SD 16.5.

PAIRED COMPARISONS: SIGN TEST AND WILCOXON SIGNED-RANK TEST

Just as was discussed in a previous article in relation to the t-test for comparing means,1 when a study uses a paired design to make comparisons between two treatments or conditions, the analysis should reflect the pairing, as this removes a potentially important source of variation. For example, it is natural to expect that observations taken on the same patient before and after administration of a drug will be more alike than observations taken on different patients. There are two commonly used non-parametric tests for making paired comparisons of continuous outcome data: the sign test and the Wilcoxon signed-rank test.
The logic of the sign test is very simple: it examines the number of pairs in which one member of the pair (say the first) has a higher value than the other (say the second), and compares this number of pairs with what would be expected under the null hypothesis of no difference, which is just half the total number of pairs. The variation or ‘noise’ surrounding this difference is determined by the variance of the binomial distribution with probability 0.5, and a P-value can be obtained directly using this binomial distribution.

An example of paired data arising from our VLBW study is in twins, where we might be interested in asking if the second twin has a need for longer durations of assisted ventilation than the first twin. Figure 1 illustrates the data for 11 pairs of VLBW twins (note the logarithmic plot on the vertical axis; we will explain the logic of this plot in the next article in the series). In four sets, the first twin required a longer duration of assisted ventilation, and in seven sets, the second twin required more assisted ventilation. To work out the probability of finding at least seven sets where one twin had a longer duration of assisted ventilation, we calculate the probability from the binomial distribution of finding a 7:4 split, and add to it the probabilities of 8:3, 9:2, 10:1 and 11:0 splits. Fortunately, computer programs will do this readily, and the answer happens to be a total probability (P-value) of 0.27.

Fig. 1 Days of assisted ventilation for 11 sets of very low birthweight twins. Note logarithmic scale on vertical axis. Separate symbols mark twin 1 and twin 2.

One problem with the sign test is that it ignores the size of the difference between the pairs of twins, whereas the Wilcoxon signed-rank test uses this information as well as the direction of the difference between pairs. In the Wilcoxon test the differences between pairs are ranked from highest to lowest, ignoring the direction of the effect.
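Before turning to the signed-rank calculation, the sign test probability quoted above can be checked directly. The sketch below sums the binomial probabilities for the 7:4, 8:3, 9:2, 10:1 and 11:0 splits under a probability of 0.5; variable names are our own, and `math.comb` assumes Python 3.8 or later.

```python
from math import comb

n_pairs = 11        # sets of twins
second_longer = 7   # sets in which the second twin was ventilated longer

# Number of equally likely sign patterns giving a split at least as
# extreme as 7:4 in this direction: C(11,7) + C(11,8) + ... + C(11,11).
favourable = sum(comb(n_pairs, k) for k in range(second_longer, n_pairs + 1))

# Under the null hypothesis each of the 2**11 sign patterns is equally likely.
p_value = favourable / 2 ** n_pairs

print(f"{favourable} of {2 ** n_pairs} patterns, P = {p_value:.2f}")
```

Note that this sum covers splits in one direction only, which is what the description in the text adds up; a package that reports a two-sided sign test would double it.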
The sum of the ranks for the pairs with differences in one direction is then compared with what would be expected from chance alone. In the present example, the expected sum of ranks is 33 for each direction, and the observed sum of ranks where the second twin received a longer duration of assisted ventilation was 45. An approximate P-value can be obtained from tables, or statistical programs can calculate an exact probability; in the present example, this is 0.29, which is similar to the P-value for the sign test.

In the present article we have focused on non-parametric methods for continuous or ordered variables. Commonly used tests for dichotomous variables, such as the χ2 and Fisher’s exact test, which we have discussed previously,9 can also be regarded as non-parametric. There is a wide range of other non-parametric methods for which interested readers are referred elsewhere.5,6

In the next article in the series, we will discuss alternative approaches to handling skewed distributions based on transforming data. In particular, we will explain the central role of the logarithmic transformation in much statistical analysis.

REFERENCES

1 Carlin JB, Doyle LW. Statistics for clinicians 4: Basic concepts of statistical reasoning: Hypothesis tests and the t-test. J. Paediatr. Child Health 2001; 37: 72–7.
2 Carlin JB, Doyle LW. Statistics for clinicians 6: Comparison of means and proportions using confidence intervals. J. Paediatr. Child Health 2001; 37: 583–6.
3 Carlin J, Doyle L. Statistics for clinicians 2: Describing and displaying data. J. Paediatr. Child Health 2000; 36: 270–4.
4 Doyle LW, Carlin JB. Statistics for clinicians 1: Introduction. J. Paediatr. Child Health 2000; 36: 74–5.
5 Altman DG. Practical Statistics for Medical Research. Chapman & Hall, London, 1991.
6 Bland M. An Introduction to Medical Statistics. Oxford University Press, Oxford, 1995.
7 Hart A. Mann–Whitney test is not just a test of medians: differences in spread can be important. BMJ 2001; 323: 391–3.
8 Altman DG, Machin D, Bryant TN, Gardner MJ. Statistics with Confidence. BMJ Books, London, 2000.
9 Carlin JB, Doyle LW. Statistics for clinicians 5: Comparing proportions using the chi-squared test. J. Paediatr. Child Health 2001; 37: 392–4.