5. Testing for differences - t-tests and their non-parametric equivalents

In this section we review some of the most important tests for differences in the distributions of two sets of measurements. This is an extremely common situation in psychology and in general. For example, we may wish to investigate:

- potential differences in IQ between children who were breast-fed and those who were not
- differences in degree of independence for first- and second-born siblings
- differences in degree of psychological trauma experienced by stroke victims and victims of road accidents.

The kind of tests we apply will depend on whether we are comparing measurements taken on individuals sampled independently from two distinct populations, or paired measurements on a sample of individuals from a single population (as in the example in 3.3.2). The tests will also depend on how much you know (or can assume) about the distributions of the observations, and on the size of the sample. If we can assume that the distributions are normal, or that sample sizes are large, then a number of standard parametric tests can be applied. The t-test of 3.3.2 is one such example. However, if the observations do not appear to be normally distributed, then such tests may not be valid (particularly when sample sizes are small). In these situations we can use a class of methods known as non-parametric (or distribution-free) tests. These enable us to detect differences in the distributions of sets of measurements without making strong assumptions about what these distributions actually are. Because they make fewer assumptions than parametric tests, they tend to be less powerful (less likely to detect differences when they actually exist).

5.1. The one-sample t-test (see Howell Chapter 7)

5.1.1 Suppose that we take a set of observations X1, ..., Xn on a random sample of size n from a population with mean μ and variance σ².
(This means that we can think of the X's as a set of independent random variables all with the same distribution.) We wish to test the null hypothesis H0: μ = μ0, where μ0 is some specified value. Suppose that we believe that the Xi are normally distributed, or that the sample size n is large. The one-sample t-test is a simple way of quantifying the evidence against H0 from the observed values in our random sample. We have essentially seen it at the end of the last section, where we applied it to the caffeine data. We discuss it in a little more detail here.

Suppose that we observe the particular values x1, ..., xn in our sample. Let x̄ denote the sample mean and s² the sample variance. Intuitively, the further x̄ is away from μ0, the more evidence there is in the data against H0. However, we also need to qualify this by taking account of the variability in the sample, as measured by s².

[Figure: two dot-plots of observed values with the same sample mean but different spreads.]

The two situations above show the effect of sample variability on the strength of the evidence against H0. In both cases the sample mean x̄ is the same, but the evidence against H0 is intuitively less in the lower case because there is much more variability in the observed values, and hence the sample mean will vary more between experiments.

The appropriate test statistic to use is the t-statistic

    T = (X̄ - μ0) / (S/√n),

where S denotes the sample standard deviation. Under H0 this statistic has a t distribution with n - 1 degrees of freedom (written tn-1), and its distribution can be found in the statistical tables (Table 9).

The form of the test depends on whether we are carrying out a 1-sided or 2-sided test, and this in turn depends on the alternative hypothesis that we are considering. Our alternative hypothesis is 1-sided if it is either H1: μ > μ0 or H1: μ < μ0. The 2-sided alternative is H1: μ ≠ μ0. The steps are as follows.

1. Calculate t = (x̄ - μ0) / (s/√n) from the sample.
2. For a 2-sided test calculate Pr(|tn-1| ≥ |t|), where |.| denotes the modulus of a number.
This probability is 2(1 - Pr(tn-1 ≤ |t|)), and Pr(tn-1 ≤ |t|) can be read from Table 9.

3. For a 1-sided test you calculate Pr(tn-1 ≥ t) if the alternative is μ > μ0, and Pr(tn-1 ≤ t) if the alternative is μ < μ0.

5.1.2. Example: Suppose our sample size is n = 10 and we observe x̄ = 4.5, s² = 1.5, and we wish to calculate a p-value for a 2-sided test of H0: μ = 5. In this case

    t = (4.5 - 5.0) / (√1.5/√10) = -1.29,

and (from tables) Pr(t9 ≥ 1.29) = 1 - 0.887 = 0.113. Our p-value for the 2-sided test is therefore 0.226, which does not represent any real evidence against H0. For a test of H0 against the 1-sided alternative μ < 5, the p-value would be 0.113. Again this does not represent significant evidence against H0.

5.1.3 Discussion of p-values and their interpretation. In summary, a p-value calculated from a particular experiment tells you the frequency with which you would obtain a value of the test statistic at least as extreme as the one you have obtained when H0 is true. If a p-value is very small then either:

- H0 is true and your particular experimental data represent some kind of 'freak' occurrence; or
- H0 is false, and your experimental data are not so unusual.

In practice a p-value of more than 0.1 is generally not seen as representing any real evidence against H0. A p-value in the range 0.05-0.1 might be seen as weak evidence against H0, while p-values in the range 0.01-0.05 can be claimed to represent some evidence. A p-value of less than 0.01 may be interpreted as substantial evidence against H0, and one of 0.001 or less as overwhelming evidence.

It is common practice to accept/reject a null hypothesis according to whether a calculated p-value is greater than or less than 0.05. (This is known as accepting or rejecting at level α = 0.05.) Many statisticians don't like this practice since:

- the conclusion is sensitive to small changes in the data; and
- it can give the misleading impression that H0 has been shown to be true.
Some prefer simply to report the p-value, or to give a confidence interval summarising a plausible range for the value of μ, rather than trying to draw conclusions about the validity of any particular value.

5.2. Paired-samples t-test (see Howell section 7.4)

We have essentially already seen this test for the caffeine data in Section 3.3.2, where we had 8 subjects who took a test on 2 separate occasions, with and without the aid of caffeine. This was an example of what is often referred to as paired data, matched samples, or repeated measures. In such experiments we may have n subjects on which measurements are taken under two different conditions (or treatments).

We may also have a situation in which our observations take the form of pairs of measurements which are naturally related but are not taken on the same subjects, e.g. first- and second-born siblings from the same family, or partners in married couples. Each member of the pair contributes a single measurement, e.g. a measure of independence, or of satisfaction with marriage, but the measurements may be strongly related to each other and the data should be analysed as pairs.

We can represent the data as a set of pairs ((x11, x12), (x21, x22), ..., (xn1, xn2)), where xij denotes the jth measurement on subject i, j = 1, 2, i = 1, 2, ..., n. In this situation we wish to test for differences between the distributions of the first and second measurements in the population. This can be done using the paired-samples t-test, and the methodology is essentially that of the 1-sample t-test of 5.1. We first reduce the problem to a 1-sample situation by considering the differences between the paired measurements, di = xi1 - xi2. We then use the 1-sample t-test to test the hypothesis H0: μD = 0, where μD denotes the population mean difference. For this to be valid we require that the differences are normally distributed or that the sample size is large.

See the practical in week 3 for further examples of 1-sample and paired-samples t-tests.
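As an aside (not part of the notes' SPSS workflow), the calculation of Example 5.1.2 can be reproduced in Python using scipy.stats; the figures below are the summary statistics from that example, and the same calculation applied to the differences di is exactly the paired-samples t-test of 5.2.

```python
from math import sqrt
from scipy.stats import t

# Summary statistics from Example 5.1.2.
n = 10
xbar, s2, mu0 = 4.5, 1.5, 5.0

# t-statistic: t = (xbar - mu0) / (s / sqrt(n))
t_stat = (xbar - mu0) / (sqrt(s2) / sqrt(n))  # -1.29

# 2-sided p-value: Pr(|t_{n-1}| >= |t|) = 2 * (1 - Pr(t_{n-1} <= |t|))
p_two = 2 * (1 - t.cdf(abs(t_stat), df=n - 1))

# 1-sided p-value against the alternative mu < mu0
p_one = t.cdf(t_stat, df=n - 1)

print(round(t_stat, 2))
print(round(p_two, 3), round(p_one, 3))
```

This replaces the table look-up of Pr(t9 ≥ 1.29) with the exact cumulative distribution function, so the p-values agree with the tabulated values 0.226 and 0.113 up to rounding.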
5.3 Wilcoxon's matched-pairs signed-ranks test

This can be considered as the non-parametric equivalent of the paired-samples t-test and can be applied in the situation of paired data discussed in 5.2. We can use it if we feel that it is invalid to assume normality of the differences in scores, particularly when sample sizes are small. Essentially, this test tests the null hypothesis that the differences d1, ..., dn are a random sample from a distribution whose density is symmetric about 0. We illustrate it with an example (see Howell, p. 653).

Suppose we take a sample of 8 subjects and measure their blood pressure before and after a 6-month programme of running. The data from this experiment are shown below.

Subject, i   Before (Bi)   After (Ai)   Difference (Bi - Ai)   Rank of |difference|   Signed rank
1            130           120                 10                      5                   5
2            170           163                  7                      4                   4
3            125           120                  5                      2                   2
4            170           135                 35                      7                   7
5            130           143                -13                      6                  -6
6            130           136                 -6                      3                  -3
7            145           144                  1                      1                   1
8            160           120                 40                      8                   8

Let's suppose that running does serve to reduce blood pressure. Then we would expect that most of the differences in the above table would be positive, and that any negative differences would tend to have small magnitudes. At first sight this might seem to be the case - but is there sufficient evidence to reject the null hypothesis that running has no effect on blood pressure? Wilcoxon's matched-pairs signed-ranks test seeks to answer this question using the following steps.

1. Rank the observations according to the magnitude of the difference, from smallest to largest (column 5 of the table).
2. Assign a sign to each rank and calculate the sums of the positive ranks and the negative ranks. Call these sums T+ = 27 and T- = -9.

What you do now depends on whether you're doing a 1-sided or 2-sided test. In the case where, at the outset, we had stated an alternative hypothesis that running tended to reduce blood pressure (in which case we should expect the magnitude of T- to be particularly small), we use the value |T-| as our test statistic.
The p-value we would report would be the frequency with which we would obtain a value of |T-| less than or equal to 9 under H0, i.e. p-value = P(|T-| ≤ 9).

For a 2-sided test, we are testing against a general alternative hypothesis that says that running could increase or decrease blood pressure. Therefore our reported p-value should be P(|T-| ≤ 9) + P(T+ ≤ 9) = 2P(|T-| ≤ 9). In general for the 2-sided test one computes T = min(T+, |T-|), calculates the probability that |T-| is less than or equal to this value under H0, and then doubles this probability to get the 2-sided p-value.

Note:
- If you have n pairs of observations then T+ + |T-| = n(n + 1)/2.
- T+ and |T-| have exactly the same distribution under H0. (See discussion in lectures.)
- E(T+) = n(n + 1)/4 under H0.
- Var(T+) = n(n + 1)(2n + 1)/24 under H0.

For larger sample sizes it is approximately true that, under H0,

    T+ (or |T-|) ~ N( n(n + 1)/4, n(n + 1)(2n + 1)/24 ),

and we can use this fact to estimate the corresponding p-values. For small sample sizes (e.g. 8, as in this case) you would want to quote the exact p-values as computed in SPSS.

Let's look at the results of analysing these data in SPSS. We will apply both the paired-samples t-test and Wilcoxon's matched-pairs signed-ranks test. First the measurements are entered into the SPSS data window (see practical 1).

Before   130.00   170.00   125.00   170.00   130.00   130.00   145.00   160.00
After    120.00   163.00   120.00   135.00   143.00   136.00   144.00   120.00
Diff      10.00     7.00     5.00    35.00   -13.00    -6.00     1.00    40.00

To carry out a paired-samples t-test, the tool-bar options you want in the data window are:

Analyse -> Compare Means -> Paired-Samples T Test

Then put 'Before' and 'After' into the right-hand panel in the dialogue box and click 'OK'. Do not worry about the 'Options' button - this relates to the handling of missing values and the level of confidence for the confidence interval that is automatically quoted. Just stick with the default settings.
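The ranking calculation and the normal approximation of 5.3 can also be checked by hand; the short Python sketch below (an illustration, not part of the SPSS workflow in these notes) recomputes T+, T- and the approximate Z-score for the blood-pressure data.

```python
from math import sqrt

before = [130, 170, 125, 170, 130, 130, 145, 160]
after = [120, 163, 120, 135, 143, 136, 144, 120]
d = [b - a for b, a in zip(before, after)]  # differences Bi - Ai

# Rank the differences by magnitude (no ties here, so ranks are 1..n).
order = sorted(range(len(d)), key=lambda i: abs(d[i]))
rank = {i: r + 1 for r, i in enumerate(order)}

# Sums of positive and negative signed ranks.
T_plus = sum(rank[i] for i in range(len(d)) if d[i] > 0)    # 27
T_minus = -sum(rank[i] for i in range(len(d)) if d[i] < 0)  # -9

n = len(d)
assert T_plus + abs(T_minus) == n * (n + 1) // 2  # 8 * 9 / 2 = 36

# Normal approximation: T+ ~ N(n(n+1)/4, n(n+1)(2n+1)/24) under H0.
mean = n * (n + 1) / 4                 # 18
var = n * (n + 1) * (2 * n + 1) / 24   # 51
z = (abs(T_minus) - mean) / sqrt(var)  # (9 - 18)/sqrt(51) = -1.26

print(T_plus, T_minus, round(z, 2))
```

For n = 8 this gives Z = -1.26, matching the asymptotic significance of .208 in the SPSS output; as noted above, though, for such a small sample the exact p-value (.250) is the one to quote.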
In the output window you should see:

T-Test

Paired Samples Statistics
                   Mean       N   Std. Deviation   Std. Error Mean
Pair 1   Before    145.0000   8   19.08627         6.74802
         After     135.1250   8   15.14159         5.35336

Paired Samples Correlations
                          N   Correlation   Sig.
Pair 1   Before & After   8   .428          .291

Paired Samples Test
                          Paired Differences
                          Mean      Std.        Std. Error   95% CI of the Difference   t       df   Sig.
                                    Deviation   Mean         Lower       Upper                       (2-tailed)
Pair 1   Before - After   9.87500   18.61211    6.58038      -5.68512    25.43512       1.501   7    .177

Now the non-parametric approach using Wilcoxon's signed-ranks test:

Analyze -> Nonparametric Tests -> 2 Related Samples

Place 'Before' and 'After' into the right-hand panel, check 'Wilcoxon' and click OK.

Wilcoxon Signed Ranks Test

Ranks
                                  N      Mean Rank   Sum of Ranks
After - Before   Negative Ranks   6(a)   4.50        27.00
                 Positive Ranks   2(b)   4.50        9.00
                 Ties             0(c)
                 Total            8
a After < Before
b After > Before
c After = Before

Test Statistics(b)
                         After - Before
Z                        -1.260(a)
Asymp. Sig. (2-tailed)   .208
Exact Sig. (2-tailed)    .250
Exact Sig. (1-tailed)    .125
Point Probability        .027
a Based on positive ranks.
b Wilcoxon Signed Ranks Test

We can check the asymptotic significance calculated by SPSS by hand using the normal approximation above. For n = 8, T+ ~ N(18, 51). We can get the Z-score by computing

    Z = (9 - 18)/√51 = -1.26.

The corresponding 2-sided p-value is 2P(Z < -1.26), which can be computed from the tables to be 2(1 - 0.8962) = 0.208.

Tied ranks. When applying Wilcoxon's matched-pairs signed-ranks test we may end up with tied ranks - when two or more differences have the same magnitude. There are various ways to resolve this. One way is to assign each of the tied observations the average of the tied ranks. For example, if two equal magnitudes are tied in 4th position, then they can both be assigned a rank of 4.5. This is generally the simplest thing to do and is what SPSS does by default.

5.4 The sign test

This is an even cruder non-parametric test that can be applied to paired data.
It tests a more general null hypothesis than the t-test or the Wilcoxon matched-pairs signed-ranks test. The null hypothesis is simply that the difference in measurement for any subject is equally likely to be positive or negative. Under H0, the number of positive differences must follow a Binomial(n, 0.5) distribution, where n is the number of pairs. For the blood pressure data we observe 2 out of 8 negative differences (taking differences as Bi - Ai; SPSS works with After - Before, so its table below shows 6 negative differences). The p-value for a 2-sided test is P(X ≤ 2) + P(X ≥ 6) = 2 × 0.1445 = 0.289 from tables.

To analyse these data in SPSS:

Analyze -> Nonparametric Tests -> 2 Related Samples

Place 'Before' and 'After' into the right-hand panel, check 'Sign' and click OK.

Sign Test

Frequencies
                                           N
After - Before   Negative Differences(a)   6
                 Positive Differences(b)   2
                 Ties(c)                   0
                 Total                     8
a After < Before
b After > Before
c After = Before

Test Statistics(b)
                         After - Before
Exact Sig. (2-tailed)    .289(a)
Exact Sig. (1-tailed)    .145
Point Probability        .109
a Binomial distribution used.
b Sign Test

5.5. When to use which test?

Generally speaking, most statisticians would prefer to use the paired-samples t-test of 5.2 when they believe that the differences in the paired measurements will be normally distributed, with zero mean when the treatment has no effect. If sample sizes are large (say bigger than 30 or so) then the t-test is also considered to be valid, since the sample mean will be approximately normally distributed in such cases, even if the distribution of the individual observations isn't. If your sample size is small and the data do not appear to be normally distributed then you should consider using a non-parametric method. In general these will be less powerful than the t-test, i.e. less likely to detect differences when these are present. However, there is no universal agreement among statisticians regarding which analysis to do.
You should be aware of the assumptions that underlie any particular test and check that they are not obviously contradicted by the data. This often boils down to looking at dot-plots and checking for any obvious deviations from normality.
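As a closing illustration (a sketch, not part of the notes' SPSS workflow), the exact sign-test p-value of 5.4 can be computed directly from the Binomial(8, 0.5) distribution using only the Python standard library:

```python
from math import comb

n, k = 8, 2  # 8 pairs, 2 negative differences (taking Bi - Ai)

# P(X = j) for X ~ Binomial(n, 0.5)
def binom_pmf(j, n):
    return comb(n, j) * 0.5 ** n

# 1-sided: P(X <= 2).  2-sided: P(X <= 2) + P(X >= 6), which
# equals 2 * P(X <= 2) by the symmetry of Binomial(n, 0.5).
p_one = sum(binom_pmf(j, n) for j in range(k + 1))
p_two = p_one + sum(binom_pmf(j, n) for j in range(n - k, n + 1))

print(round(p_one, 4), round(p_two, 3))  # 0.1445 0.289
```

These match the exact 1-sided and 2-sided significances (.145 and .289) reported by SPSS in 5.4.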