5. Testing for differences - t-tests and their non-parametric equivalents
In this section we review some of the most important tests for differences in the
distributions of two sets of measurements. This is an extremely common situation in
psychology and in general. For example, we may wish to investigate:

- potential differences in IQ between children who were breast-fed and those who were not;
- differences in degree of independence for first- and second-born siblings;
- differences in degree of psychological trauma experienced by stroke victims and victims of road accidents.
The kind of tests we apply will depend on whether we are comparing measurements
taken on individuals sampled independently from two distinct populations, or paired
measurements on a sample of individuals from a single population (as in the case of
the example in 3.3.2.)
The tests will also depend on how much you know (or can assume) about the
distributions of the observations and the size of the sample. If we can assume that
distributions are normal, or sample sizes are large, then a number of standard
parametric tests can be applied (see 5.3, later). The t-test of 3.3.2 is one such
example. However, if the observations do not appear to be normally distributed, then
such tests may not be valid (particularly when sample sizes are small). In these
situations we can use a class of methods known as non-parametric (or distribution-free) tests. These enable us to detect differences in the distributions of sets of
measurements without making strong assumptions on what these distributions
actually are. Because they make fewer assumptions than parametric tests, they tend to be less powerful (less likely to detect differences when they actually exist).
5.1. The one-sample t-test (see Howell Chapter 7)
5.1.1 Suppose that we take a set of observations X1, ..., Xn on a random sample of size n from a population with mean μ and variance σ². (This means that we can think of the X's as a set of independent random variables all with the same distribution.) We wish to test the null hypothesis that μ = μ0, where μ0 is some specified value.
Suppose that we believe that the Xi are normally distributed, or that the sample size n
is large. The one-sample t-test is a simple way of quantifying the evidence against H0
from the observed values in our random sample. We have essentially seen it at the
end of the last section where we applied it to the caffeine data. We discuss it in a little
more detail here.
Suppose that we observe the particular values x1, ..., xn in our sample. Let x̄ denote the sample mean and s² the sample variance. Intuitively, the further x̄ is away from μ0, the more evidence there is in the data against H0. However, we also need to qualify this by taking account of the variability in the sample as measured by s².
[Figure: two distributions of sample values, each with the same observed sample mean x̄; the lower panel shows much greater variability.]
The two situations above show the effect of sample variability on the strength of the evidence against H0. In both cases the sample mean x̄ is the same, but the evidence against H0 is intuitively less in the lower case because there is much more variability in the values of x, and hence the sample mean will vary more between experiments.
The appropriate test statistic to use is the t-statistic T = (X̄ - μ0)/(S/√n), where S denotes the sample standard deviation. This statistic is distributed as tn-1, and its distribution can be found in the statistical tables (Table 9).
The form of the test depends on whether we are carrying out a 1-sided or 2-sided test, and this in turn depends on the alternative hypothesis that we are considering. Our alternative hypothesis is 1-sided if it is either H1: μ > μ0 or H1: μ < μ0. The 2-sided alternative is H1: μ ≠ μ0. The steps are as follows.
1. Calculate
   t = (x̄ - μ0)/(s/√n) = √n (x̄ - μ0)/s
   from the sample.
2. For a 2-sided test calculate Pr(|tn-1| ≥ |t|), where |.| denotes the modulus of a number. This probability is 2(1 - Pr(tn-1 ≤ |t|)), and Pr(tn-1 ≤ |t|) can be read from Table 9.
3. For a 1-sided test you calculate Pr(tn-1 ≥ t) if the alternative is μ > μ0 and Pr(tn-1 ≤ t) if the alternative is μ < μ0.
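As a sketch, the three steps above can be carried out in Python using only the standard library. The function names here are my own, and the t-distribution probabilities (which the notes read from Table 9) are approximated by numerically integrating the t density, so the p-values are approximate:

```python
import math

def t_pdf(x, df):
    # Density of Student's t-distribution with df degrees of freedom
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=20000):
    # Pr(t_df <= x), via trapezoidal integration of the density over [0, |x|]
    h = abs(x) / steps
    area = 0.5 * (t_pdf(0.0, df) + t_pdf(abs(x), df)) * h
    area += sum(t_pdf(i * h, df) for i in range(1, steps)) * h
    return 0.5 + area if x >= 0 else 0.5 - area

def one_sample_t(xbar, s2, n, mu0, alternative="two-sided"):
    # Step 1: the t-statistic t = sqrt(n) * (xbar - mu0) / s
    t = math.sqrt(n) * (xbar - mu0) / math.sqrt(s2)
    df = n - 1
    # Steps 2-3: the p-value for the chosen alternative
    if alternative == "two-sided":
        p = 2 * (1 - t_cdf(abs(t), df))
    elif alternative == "greater":
        p = 1 - t_cdf(t, df)
    else:  # "less"
        p = t_cdf(t, df)
    return t, p

# The example of 5.1.2: n = 10, xbar = 4.5, s2 = 1.5, H0: mu = 5
t, p = one_sample_t(4.5, 1.5, 10, 5.0)
print(round(t, 2), round(p, 2))  # -1.29 0.23
```

The small discrepancy from the 0.226 quoted in 5.1.2 is because the tables round t to 1.29 before looking up the probability.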
5.1.2. Example: Suppose our sample size n = 10 and we observe x̄ = 4.5, s² = 1.5, and we wish to calculate a p-value for a 2-sided test of H0: μ = 5. In this case
   t = √10 (4.5 - 5.0)/√1.5 = -1.29,
and (from tables) Pr(t9 ≥ 1.29) = (1 - 0.887) = 0.113. Our p-value for the 2-sided test is therefore 0.226, which does not represent any real evidence against H0. For a test of H0 against the 1-sided alternative μ < μ0, the p-value would be 0.113. Again this does not represent significant evidence against H0.
5.1.3 Discussion of p-values and their interpretation.
In summary, a p-value calculated from a particular experiment tells you the frequency
with which you would obtain a value of a test statistic which is at least as extreme as
the one you have obtained when H0 is true. If a p-value is very small then either:
- H0 is true and your particular experimental data represent some kind of 'freak' occurrence; or
- H0 is false, and your experimental data are not so unusual.
In practice a p-value of more than 0.1 is generally not seen as representing any real
evidence against H0. A p-value in the range 0.05 - 0.1 might be seen as weak
evidence against H0, while p-values in the range 0.01-0.05 can be claimed to represent
some evidence. A p-value which is less than 0.01 may be interpreted as substantial evidence against H0, and a p-value of 0.001 or less can be seen as overwhelming evidence.
It is common practice to accept/reject a null hypothesis depending on whether a
calculated p-value is greater than or less than 0.05. (This is known as accepting or rejecting at level α = 0.05.) Many statisticians don't like this practice since:
- the conclusion is sensitive to the data;
- it can give the misleading impression that H0 has been shown to be true.
Some prefer simply to report the p-value, or give a confidence interval, to summarise
a plausible range for the value of μ, rather than try to draw conclusions about the
validity of any particular value.
5.2. Paired-samples t-test (see Howell section 7.4).
We have essentially already seen this test for the caffeine test in Section 3.3.2, where
we had 8 subjects who took a test on 2 separate occasions with and without the aid of
caffeine. This was an example of what is often referred to as paired data, matched
samples, or repeated measures. In such experiments, we may have n subjects on
which two measurements are taken under two different conditions (or treatments).
We may have a situation in which our observations take the form of pairs of
measurements which are naturally related but are not taken on the same subjects, e.g.
- first and second siblings from the same family;
- partners in married couples.
Each member of the pair can contribute a single measurement e.g. a measure of
independence, or satisfaction with marriage, but the measurements may be strongly
related to each other and the data should be analysed as pairs. We can represent the
data as a set of pairs
((x11, x12), (x21, x22), ...., (xn1, xn2))
where xij denotes the jth measurement on subject i, j = 1, 2, i = 1, 2, ..., n.
In this situation we wish to test for differences between the distributions of the first and second measurements in the population. This can be done using the paired-sample t-test, and the methodology is essentially similar to that of the 1-sample t-test of 5.1.
We first reduce the problem to a 1-sample situation by considering the differences
between the paired measurements
di = xi1 - xi2.
We then use the 1-sample t-test to test the hypothesis H0: μD = 0. For this to be valid
we require that the differences are normally distributed or the sample size is large.
See practical in week 3 for further example of 1-sample and paired sample t-tests.
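As a quick sketch, the reduction of paired data to a one-sample problem can be checked in Python (standard library only), using the blood-pressure data analysed in 5.3:

```python
import math
from statistics import mean, stdev

# Paired t-test on the blood-pressure data of section 5.3
before = [130, 170, 125, 170, 130, 130, 145, 160]
after = [120, 163, 120, 135, 143, 136, 144, 120]

# Reduce to a one-sample problem on the differences d_i = x_i1 - x_i2
d = [b - a for b, a in zip(before, after)]
n = len(d)
t = mean(d) / (stdev(d) / math.sqrt(n))  # stdev uses the n-1 divisor
print(round(t, 3))  # 1.501, matching the SPSS output below
```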
5.3 Wilcoxon's matched-pairs signed-ranks test
This can be considered as the non-parametric equivalent of the paired-sample t-test
and can be applied in the situation of paired data discussed in 5.2. We can use it if we
feel that it is invalid to assume normality of the differences in scores, particularly
when sample sizes are small. Essentially, the test assesses the null hypothesis that the differences d1, ..., dn are a random sample from a distribution whose density is symmetric about 0. We illustrate it with an example (see Howell, p. 653).
Suppose we take a sample of 8 subjects and measure their blood pressure before and
after a 6-month programme of running. The data from this experiment are shown
below.
Subject, i              1     2     3     4     5     6     7     8
Before (Bi)           130   170   125   170   130   130   145   160
After (Ai)            120   163   120   135   143   136   144   120
Difference (Bi-Ai)     10     7     5    35   -13    -6     1    40
Rank of |difference|    5     4     2     7     6     3     1     8
Signed rank             5     4     2     7    -6    -3     1     8
Let's suppose that running does serve to reduce blood pressure. Then we would
expect that most of the differences in the above table would be positive and that any
negative differences would tend to have small magnitudes. At first sight this might
seem to be the case - but is there sufficient evidence to reject the null hypothesis that
running has no effect on blood pressure? Wilcoxon's matched-pairs signed-rank test
seeks to answer this question using the following steps.
1. Rank the observations according to magnitude of difference from smallest to
largest (row 4 of table)
2. Assign a sign to each rank and calculate the sums of the positive ranks and the
negative ranks. Call these sums T+ = 27 and T- = -9.
What you do now depends on whether you're doing a 1-sided or 2-sided test. In the
case where, at the outset, we had stated an alternative hypothesis that running tended
to reduce blood pressure (in which case we should expect the magnitude of T- to be
particularly small) we use the value |T-| as our test statistic. The p-value we would report would be the frequency with which we would obtain a value of |T-| less than or equal to 9 under H0, i.e. p-value = P(|T-| ≤ 9).
For a 2-sided test, we are testing against a general alternative hypothesis that says that running could increase or decrease blood pressure. Therefore our reported p-value should be P(|T-| ≤ 9) + P(T+ ≤ 9) = 2P(|T-| ≤ 9).
In general for the 2-sided test one computes T = min(T+, |T-|), calculates the probability that |T-| is less than or equal to this value under H0, and then doubles this probability to get the 2-sided p-value.
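For small n the exact p-value can be found by brute force: under H0 each rank is attached to a positive or negative difference with probability 1/2, so we can enumerate all 2^n sign patterns. A Python sketch for the blood-pressure data:

```python
# Exact Wilcoxon signed-rank p-value by enumerating all 2^n sign patterns.
# Differences (Before - After) from the blood-pressure example:
d = [10, 7, 5, 35, -13, -6, 1, 40]
n = len(d)

# Rank the absolute differences from smallest to largest (no ties here)
order = sorted(range(n), key=lambda i: abs(d[i]))
rank = [0] * n
for r, i in enumerate(order, start=1):
    rank[i] = r

t_minus = sum(rank[i] for i in range(n) if d[i] < 0)  # |T-| = 9

# Under H0, P(|T-| <= 9) = (#subsets of {1,...,n} with rank sum <= 9) / 2^n
count = sum(1 for m in range(2 ** n)
            if sum(r for j, r in enumerate(range(1, n + 1)) if (m >> j) & 1) <= t_minus)
p_one = count / 2 ** n
p_two = 2 * p_one
print(t_minus, p_one, p_two)  # 9 0.125 0.25
```

These match the exact 1-tailed and 2-tailed significance values reported by SPSS below.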
Note:
- If you have n pairs of observations then T+ + |T-| = n(n+1)/2.
- T+ and |T-| have exactly the same distribution under H0. (See discussion in lectures.)
- E(T+) = n(n+1)/4 under H0.
- Var(T+) = n(n+1)(2n+1)/24.
For larger sample sizes, it is approximately true that under H0,
   |T-| or T+ ~ N( n(n+1)/4 , n(n+1)(2n+1)/24 ),
and we can use this fact to estimate the corresponding p-values. For small sample sizes (e.g. 8, as in this case) you would want to quote the exact p-values as computed in SPSS.
Let's look at the results of analysing these data in SPSS. We will apply both the
paired sample t-test and Wilcoxon's matched pairs signed ranks test.
First the measurements are entered into the SPSS data window (see practical 1).
Before    After     Diff
130.00    120.00    10.00
170.00    163.00     7.00
125.00    120.00     5.00
170.00    135.00    35.00
130.00    143.00   -13.00
130.00    136.00    -6.00
145.00    144.00     1.00
160.00    120.00    40.00
To carry out a paired-samples T-test:
In the data window, the tool-bar options you want are:
Analyse -> compare means -> paired-samples t-test
Then put 'Before' and 'After' into the right-hand panel in the dialogue box and click
'OK.'
Do not worry about the 'options' button - this relates to handling missing values and
the level of confidence for the confidence interval that is automatically quoted. Just
stick with the default settings.
In the output window you should see:
T-Test

Paired Samples Statistics
                       Mean      N   Std. Deviation   Std. Error Mean
Pair 1   Before    145.0000     8         19.08627            6.74802
         After     135.1250     8         15.14159            5.35336

Paired Samples Correlations
                           N   Correlation    Sig.
Pair 1   Before & After    8          .428    .291

Paired Samples Test
                          Paired Differences
                          Mean      Std.        Std. Error   95% CI of the Difference       t   df   Sig. (2-tailed)
                                    Deviation   Mean         Lower        Upper
Pair 1   Before - After   9.87500   18.61211    6.58038      -5.68512     25.43512     1.501    7   .177

Now the non-parametric approach using Wilcoxon's signed-ranks test:
Analyze -> Nonparametric tests -> 2 related samples
Place 'Before' and 'After' into the right hand panel, check 'Wilcoxon' and click OK.
Wilcoxon Signed Ranks Test

Ranks
                                     N     Mean Rank   Sum of Ranks
After - Before   Negative Ranks   6(a)          4.50          27.00
                 Positive Ranks   2(b)          4.50           9.00
                 Ties             0(c)
                 Total               8
a After < Before
b After > Before
c After = Before

Test Statistics(b)
                          After - Before
Z                               -1.260(a)
Asymp. Sig. (2-tailed)              .208
Exact Sig. (2-tailed)               .250
Exact Sig. (1-tailed)               .125
Point Probability                   .027
a Based on positive ranks.
b Wilcoxon Signed Ranks Test
We can check the asymptotic significance calculated by SPSS by hand using the
normal approximation above. For n = 8, T+ ~ N(18, 51). We can get the Z-score by
computing
Z = (9 - 18)/√51 = -1.26.
The corresponding 2-sided p-value is 2P(Z < -1.26) and can be computed from the
tables to be 2(1 - 0.8962) = 0.208.
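The same hand check can be scripted with the standard library's NormalDist (a minimal sketch of the normal approximation, not the exact test):

```python
import math
from statistics import NormalDist

# Normal approximation to the Wilcoxon signed-rank statistic for n = 8
n = 8
mean_T = n * (n + 1) / 4                 # E(T+) = 18 under H0
var_T = n * (n + 1) * (2 * n + 1) / 24   # Var(T+) = 51 under H0

z = (9 - mean_T) / math.sqrt(var_T)      # observed |T-| = 9
p_two = 2 * NormalDist().cdf(z)          # 2-sided p-value
print(round(z, 2), round(p_two, 3))      # -1.26 0.208
```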
Tied ranks. When applying Wilcoxon's matched-pairs signed-ranks test we may end up with tied ranks - when two or more differences have the same magnitude. There are various ways to resolve this. One way is to assign each of these ranks the average of the tied ranks. For example, if you get two equal magnitudes tied in 4th position, then they can both be assigned a rank of 4.5. This is generally the simplest thing to do and is what SPSS does by default.
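A sketch of this averaging rule in Python (the function name midranks is mine, not SPSS terminology):

```python
def midranks(values):
    # Rank |values| from smallest to largest, averaging ranks over ties
    s = sorted(range(len(values)), key=lambda i: abs(values[i]))
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(s):
        end = pos
        # Extend the run of ties sharing the same absolute value
        while end + 1 < len(s) and abs(values[s[end + 1]]) == abs(values[s[pos]]):
            end += 1
        avg = (pos + 1 + end + 1) / 2  # average of positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[s[j]] = avg
        pos = end + 1
    return ranks

# The two values of magnitude 5 tie for ranks 3 and 4, so each gets 3.5
print(midranks([3, -5, 5, 1]))  # [2.0, 3.5, 3.5, 1.0]
```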
5.4 The sign test
This is an even cruder non-parametric test that can be applied to paired data. It tests a more general null hypothesis than the t-test or the Wilcoxon matched-pairs signed-ranks test. The null hypothesis is simply that the difference in measurement for any subject is equally likely to be positive or negative. Under H0, the number of positive differences must follow a Binomial(n, 0.5) distribution, where n is the number of observations. For the blood pressure data we observe 2 out of 8 negative differences. The p-value for a 2-sided test is P(X ≤ 2) + P(X ≥ 6) = 2 × 0.1445 = 0.289 from tables.
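The binomial calculation can be reproduced in a few lines of Python (standard library only):

```python
from math import comb

# Exact sign test: under H0 the number of negative differences is Binomial(n, 0.5)
d = [10, 7, 5, 35, -13, -6, 1, 40]  # Before - After differences
n = len(d)
k = sum(1 for x in d if x < 0)      # 2 negative differences

p_lower = sum(comb(n, i) for i in range(k + 1)) / 2 ** n  # P(X <= 2)
p_two = 2 * p_lower                 # the distribution is symmetric, so double
print(round(p_lower, 4), round(p_two, 3))  # 0.1445 0.289
```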
To analyse these data in SPSS, Analyze -> Nonparametric tests -> 2 related samples
Place 'Before' and 'After' into the right hand panel, check 'Sign' and click OK.
Frequencies
                                             N
After - Before   Negative Differences(a)     6
                 Positive Differences(b)     2
                 Ties(c)                     0
                 Total                       8
a After < Before
b After > Before
c After = Before

Test Statistics(b)
                         After - Before
Exact Sig. (2-tailed)           .289(a)
Exact Sig. (1-tailed)           .145
Point Probability               .109
a Binomial distribution used.
b Sign Test
5.5. When to use which test?
Generally speaking, most statisticians would prefer to use the paired-samples t-test of 5.2 when they believe that the differences in the paired measurements will be normally distributed, with zero mean when the treatment has no effect. If sample
sizes are large (say bigger than 30, or so) then the t-test is also considered to be valid,
since the sample mean will be approximately normally distributed in such cases, even
if the distribution of the individual observations isn't.
If your sample size is small and the data do not appear to be normally distributed then you should consider using a non-parametric method. In general these will be less powerful than the t-test, i.e. less likely to detect differences when these are present.
However, there is no universal agreement among statisticians regarding which
analysis to do. You should be aware of the assumptions that underlie any particular
test and check that these are not obviously contradicted by the data. This often boils
down to looking at dot-plots and checking for any obvious deviations from normality.