Chapter 7: Testing Hypotheses about the Difference between the
Means of Two Populations
1. The Standard Error of the Difference
A lot of research questions involve trying to decide whether two population means differ from
one another, and we have to make this decision based on the data from two samples. For
example, what if you wanted to know how much your test score would suffer if you went out
until 3 A.M. on Sunday night instead of going to bed at 10 P.M.? To see if partying until the wee
hours truly hurts your GPA, we could take a random sample of students who typically get As on
their stats exams, break them randomly into two groups, and then find the mean test score of each
group on the exam the following day. Then, we can use NHT to determine whether there is a
significant difference, based on how much these two sample means differ from each other.
To decide whether your two sample means are significantly different, you need to find out the
amount by which two sample means (the same sizes as yours) typically differ based on only
random sampling. This amount is called the standard error of the difference (SED), and we
will show you exactly how to calculate it from the means, SDs, and sizes of the samples in your
study. As you will see, the SED is larger when the variation within your groups is larger, but it
gets smaller when you use larger groups. The means of large samples do not tend to stray much
from the population mean, so when you are dealing with large groups, a fairly large difference
between the sample means is very likely to be significant.
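If you'd like to see that behavior for yourself, here is a small Python simulation (our own sketch, not part of the original chapter, assuming NumPy is installed and using made-up population values). It estimates the SED by repeatedly drawing two samples from the same population and recording how far apart their means land just by chance:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_sed(n_per_group, sd, n_trials=10_000):
    """Estimate the standard error of the difference: draw many pairs of
    samples from one normal population and see how much the two sample
    means typically differ due to random sampling alone."""
    diffs = []
    for _ in range(n_trials):
        group1 = rng.normal(loc=100, scale=sd, size=n_per_group)
        group2 = rng.normal(loc=100, scale=sd, size=n_per_group)
        diffs.append(group1.mean() - group2.mean())
    return np.std(diffs)

# More within-group variation -> bigger SED; bigger groups -> smaller SED.
print(simulated_sed(n_per_group=10, sd=5))    # roughly 2.2
print(simulated_sed(n_per_group=100, sd=5))   # roughly 0.7
print(simulated_sed(n_per_group=100, sd=15))  # roughly 2.1
```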
2. Pooling the Sample Variances
Just like the standard error of the mean, the standard error of the difference usually must be
estimated based on the values you have from your samples. There are two main ways to do this,
but the one that is much more common involves taking a weighted average of your sample
variances (dubbed the pooled variance).
Here is the pooled variance formula:
$$ s_p^2 \;=\; \frac{(N_1 - 1)\,s_1^2 + (N_2 - 1)\,s_2^2}{N_1 + N_2 - 2} $$
Then insert this value into the next formula to find the standard error of the difference:

$$ s_{\bar{X}_1 - \bar{X}_2} \;=\; \sqrt{\frac{s_p^2}{N_1} + \frac{s_p^2}{N_2}} $$
Last, it’s time to plug everything into the t test formula. Because we are now pooling the
variances, we can use the t distribution that has df = N1+ N2 – 2. The most basic two-sample t
test formula is:
$$ t \;=\; \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_{\bar{X}_1 - \bar{X}_2}} $$
But given that we most often test the null hypothesis that the population means are equal to one
another (i.e., that $\mu_1 - \mu_2$ equals zero), we can just drop the $\mu_1 - \mu_2$ term from the formula. Also, if
you want to skip the step of finding the standard error of the difference separately, you can first
find the pooled variance, and then plug it directly into this bad boy:
$$ t \;=\; \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2\left(\dfrac{1}{N_1} + \dfrac{1}{N_2}\right)}} $$
(For computational convenience, we have factored out the pooled variance instead of dividing it
by each sample size.) Once you’ve found your t value, just go by the degrees of freedom to look
up a critical value in your t distribution table, and figure out if your calculated t is greater or less
than the critical t. But remember, finding a significant t simply informs you that the population
means seem to be different. Moreover, the difference of your sample means is just a point
estimate of that population difference, so we will show you how to make it much more
informative as the center of an interval estimate.
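Before trying an example by hand, here is the whole recipe condensed into a short Python function (a sketch of our own; the function name and argument names are just our choices, and it works from summary statistics only):

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance two-sample t test from summary statistics,
    testing the usual null hypothesis that mu1 - mu2 = 0."""
    # Weighted average of the two sample variances (the pooled variance).
    s2_pooled = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    # Standard error of the difference, built from the pooled variance.
    sed = math.sqrt(s2_pooled * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / sed
    df = n1 + n2 - 2
    return t, df
```

You can check its output against the worked example that follows.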
Let’s try an example together:
Everyone has been complaining that taking statistics at 9 A.M. on Friday morning is ruining their
GPA. Professor Salvatore doesn’t believe that there is much of a difference between his two
sections (the other meets at 1 P.M. on Wednesdays), and so he wants to determine if the mean
scores of the midterm exam for the two sections differ significantly from one another at the .05,
two-tailed level. Here are the statistics he has on hand: X̄Wednesday = 88, sWednesday = 4.3, NWednesday = 18,
X̄Friday = 84, sFriday = 5.5, NFriday = 14.
Step one: Determine the pooled variance.
spooled² = [(18 – 1)(4.3²) + (14 – 1)(5.5²)]/(18 + 14 – 2) = (314.33 + 393.25)/30 = 23.586
Step two: Determine the standard error of the difference.
SED = √[(23.586/18) + (23.586/14)] = √2.995 = 1.731
Step three: Figure out the t value.
$$ t \;=\; \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_{\bar{X}_1 - \bar{X}_2}} $$
t = (88 – 84) / 1.731 = 2.311
Step four: Make your statistical decision.
Figure out the critical t value you need, with df = N1+ N2 – 2 = 30, by looking at the t value table.
You will see that t.05 (30) = 2.042.
Because our calculated t = 2.311 > tcrit (30) = 2.042, Professor Salvatore is going to have to
accept the fact that there is a statistically significant difference between the average midterm
grades of the two class sections. Looks like it’s time to rethink the schedule for next semester.
(And really, whether it’s statistics, psychology, economics, etc.—all 9 A.M. classes on Fridays
should be outlawed!)
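If you happen to have SciPy available, you can double-check Professor Salvatore's arithmetic straight from the summary statistics; this is just a verification sketch, and setting equal_var=True is what asks for the pooled-variance version of the test:

```python
from scipy import stats

result = stats.ttest_ind_from_stats(mean1=88, std1=4.3, nobs1=18,
                                     mean2=84, std2=5.5, nobs2=14,
                                     equal_var=True)  # pool the variances
print(result.statistic)  # about 2.31, matching the hand calculation
print(result.pvalue)     # below .05, so the difference is significant
```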
Now you try an example:
1. Alexis is running a psychological experiment to determine whether classical music helps people
concentrate on a cognitive task. She divides her 31 participants into two groups (N1= 15 and N2 =
16) and finds that those who worked on the problem in silence have an average test score of
74 (M1) with s1 = 3.8, and those who listened to Mozart while working had an average test
score of 79 (M2) with s2 = 4.1. Do the two groups differ significantly from one another?
3. Confidence Interval for the Difference of Two Population Means
To gain more information for a two-sample study, we can revisit the confidence interval formula.
The formula for the two-group case is very akin to the one we used in the prior chapter to
determine the likely values of the mean of a single population, but we do need to rework the
formula a bit when trying to estimate the difference between two population means. The CI
formula for two samples looks like you’re staring at the one-sample formula and seeing double:

$$ \mu_1 - \mu_2 \;=\; (\bar{X}_1 - \bar{X}_2) \;\pm\; t_{\text{crit}}\, s_{\bar{X}_1 - \bar{X}_2} $$
Keep in mind that when zero is not included in the 95% confidence interval range, you know
that you can reject the null at the .05 level with a two-tailed test.
As usual, the 99% confidence interval is going to be somewhat larger than the 95% confidence
interval. Remember, the larger the interval, the more confident we are that the true difference
between the population means lies somewhere within that range.
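Here is one way to compute such an interval in Python (a sketch of our own under the same pooled-variance assumptions as above; SciPy is used only to look up the critical t value):

```python
import math
from scipy import stats

def ci_for_difference(mean1, sd1, n1, mean2, sd2, n2, confidence=0.95):
    """Confidence interval for mu1 - mu2, centered on the difference of the
    sample means and extending t_crit standard errors in each direction."""
    df = n1 + n2 - 2
    s2_pooled = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    sed = math.sqrt(s2_pooled * (1 / n1 + 1 / n2))
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df)  # two-tailed critical t
    diff = mean1 - mean2
    return diff - t_crit * sed, diff + t_crit * sed
```

If zero falls inside the interval you get back, the corresponding two-tailed test would not reject the null.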
Try the following example, using your stellar expertise with confidence intervals from the
last chapter:
2. Colin wants to determine whether there is a significant difference between the average ticket
price for his band, Momma Lovin’, and his rival band, Smack Daddy. For the past eight shows for
Momma Lovin’, the mean ticket price was $18.40, with s = 3.2. Smack Daddy’s sales figures showed an
average ticket price of $19.60 for the past six shows, with s = 4.3. Since Smack Daddy thinks they are
abundantly better based on the $1.20 difference, Colin would like to estimate the true difference in ticket
price with a 95% confidence level.
4. Measuring the Size of an Effect
When the scale to measure a certain variable isn’t familiar to you (e.g., a researcher arbitrarily
made up a scale for his/her experiment), it helps immensely to standardize the measure to create
an effect size that can be easily interpreted. And, just because a difference is deemed to be
significant by NHT, it doesn’t necessarily mean that it has practical value. Looking at a standard
effect-size measure can help you make that call. The most common measure of effect size for
samples is the one that is often called d or Cohen’s d, but which we will call g just to confuse
you (just kidding—your text calls it g to avoid confusion with its counterpart in the population).
This effect size measure looks kinda like a z score:
$$ g \;=\; \frac{\bar{X}_1 - \bar{X}_2}{s_p} $$
Note that sp is just the square root of the pooled variance (√s²p). If you already have g, you have
most of what goes into the two-sample t value; to calculate t from g you can use the following
formula:
$$ t \;=\; g\,\sqrt{\frac{n_h}{2}} $$
Unless your two samples are the same size (in which case you can substitute n, the size of either
sample, for nh in this formula), you must calculate the harmonic mean of the two sample sizes to
use the formula. The harmonic mean is used in a number of other statistical formulas, so it is
worth learning. Here is a simplified version of the harmonic mean formula that works when you
have only two numbers:
$$ n_{\text{harmonic}} \;=\; \frac{2\,N_1 N_2}{N_1 + N_2} $$
For example, the harmonic mean of 10 and 15 is 12 [(2 · 10 · 15)/(10 + 15) = 300/25 = 12],
whereas the arithmetic mean is 12.5.
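In Python, the harmonic mean and the g-to-t conversion take only a few lines (again a sketch of our own, reusing the pooled variance from earlier; the function names are just our choices):

```python
import math

def harmonic_mean(n1, n2):
    """Harmonic mean of two sample sizes: 2*N1*N2 / (N1 + N2)."""
    return 2 * n1 * n2 / (n1 + n2)

def effect_size_g(mean1, sd1, n1, mean2, sd2, n2):
    """Effect size g: difference of the sample means divided by the pooled SD."""
    s2_pooled = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(s2_pooled)

def t_from_g(g, n1, n2):
    """Recover the two-sample t value from g via t = g * sqrt(n_h / 2)."""
    return g * math.sqrt(harmonic_mean(n1, n2) / 2)

print(harmonic_mean(10, 15))  # 12.0, as in the example above
```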
Since size matters quite a bit when it comes to effect size, it helps to have a general rule of
thumb about how large is large. Jack Cohen devised the following guidelines for psychological
research:
.8 = large
.5 = moderate
.2 = small
Try these examples:
3. You are asked to assess the general contentment (based on results from the Subjective Happiness
Scale) of psychology majors in comparison to biology majors at Michigan State by determining the effect
size of the difference (i.e., “g”). The ratings given by a random sample of majors for each group are as
follows:
Psychology: 4.6, 3.2, 5.7, 6.8, 6.2, 5.1, 4.2, 6.3, 5.9, 5.4, 5.5, 6.0
Biology:    5.0, 3.4, 4.3, 5.6, 3.2, 3.3, 2.9, 6.0, 4.0, 4.2, 5.3, 4.1
4. For the data in the previous example, compute the t value by using the value you just
calculated for g, and determine whether this t value is significant at the .05 level. Are you
surprised by your statistical determination, given the size of the samples?
5. The Separate-Variance t Test
If the sample sizes are not equal AND if one variance is more than twice the size of the other
variance, you should not assume that you can pool the variances from the two groups. Instead,
you should calculate what is sometimes called a separate-variances t test, for reasons that should
be obvious from looking at the following formula:
$$ t \;=\; \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} $$
Note that when the sample sizes are equal, the pooled and separate-variances tests will always
produce exactly the same t value, and everyone just uses N1+ N2 – 2 degrees of freedom to look
up the critical value. Unfortunately, when both the Ns and SDs differ a good deal, not only
should the separate-variance formula be used to find the t value but also a complex formula
should be used to find the df for looking up the critical t value. The good news is that a lot of
statistical programs will do that work for you. However, when the variances of the two samples
are very different, there are usually other problems with the data as well, so researchers usually
turn to an alternative to the pooled-variance t test that is more drastic than the separate-variances
test (e.g., a data transformation or a nonparametric test).
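In practice you would rarely wrestle with that df formula by hand. For instance, SciPy's ttest_ind_from_stats with equal_var=False runs the separate-variances (Welch) test and computes the adjusted degrees of freedom for you; the summary statistics below are made up purely for illustration:

```python
from scipy import stats

# Separate-variances (Welch) t test from summary statistics.
# equal_var=False means the variances are NOT pooled, and SciPy applies
# the adjusted (Welch-Satterthwaite) degrees of freedom internally.
welch = stats.ttest_ind_from_stats(mean1=75.0, std1=4.0, nobs1=40,
                                    mean2=71.0, std2=11.0, nobs2=15,
                                    equal_var=False)
print(welch.statistic, welch.pvalue)
```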
6. The Matched-Pairs (aka Correlated or Dependent) t Test
Sometimes, you’re lucky enough to deal with samples that match up with one another. And the
bottom line is, the better the matching, the better chance you have of finding significance for
your t test. So, whether the matching is based on using the same people twice or by matching two
groups of individuals with one another on some relevant characteristics, if you can match
samples, you probably should!
Luckily for us, the matched t test formula is essentially the formula for a one-sample t test, which
makes life abundantly easier, provided that you’ve learned the material from the previous
chapter. The main change between the two formulas is that in the matched t test formula, you
will use the mean of the difference scores in the numerator, and then use the standard
deviation of the difference scores in the denominator of the formula.
To make it even clearer, look at the two formulas side by side:

One-sample t test:
$$ t \;=\; \frac{\bar{X} - \mu}{s/\sqrt{N}} $$

Matched t test:
$$ t \;=\; \frac{\bar{D} - \mu_D}{s_D/\sqrt{n}} $$

(Here D̄ is the mean of the difference scores, sD is the standard deviation of those difference
scores, n is the number of pairs, and μD is the population mean difference under the null
hypothesis, which is usually zero.)
Keep in mind that, when you use the matched t test, your degrees of freedom correspond with the
number of matched pairs you are using, not the total number of scores. For example, if you are
matching 16 participants into 8 pairs, your df = 8 – 1 = 7, not 14 (i.e., 16 – 2). The fact that your
df decreases means that your critical t increases when you perform a matched t-test, so don’t take
matching your participants lightly. There is a downside. If the matching doesn’t really work well,
you’ve just tossed out a bunch of dfs for nothing!
Let’s try an example together . . .
Jared wants to see if pulse rates differ significantly for students one hour after taking the GRE as
compared to one hour before taking it. He can see from the Before and After means that the
pulse rates are quite a bit higher before the test (most of his friends have been freaking out in the
morning whenever they take the exam), but when he performs a two-sample t test on his data, he
fails to get even close to significance. Then he finally notices that the variance within his groups
is huge (since the students in his samples are at very different fitness levels), greatly inflating his
denominator. That’s when he realizes that he was doing the wrong t test; he was not taking
advantage of the reduction in error variance that occurs when you use the before-after difference
scores for each student’s pulse rate. With the difference scores in place, his data set looks like
this:
Pretest:    88  79  90  86  75  80  79  92  100  67  70  83
Posttest:   80  75  85  82  70  76  72  86   91  65  67  76
Difference:  8   4   5   4   5   4   7   6    9   2   3   7

X̄pre = 82.42, s = 9.434;  X̄post = 77.083, s = 7.971;  X̄diff = 5.33, sD = 2.103
Using the data from the table, let’s try both t tests and see how much of a difference matching
scores can make. First, let’s try the ordinary (i.e., independent-samples) t test, using the
simplified formula for groups with the same sample size.
$$ t \;=\; \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2 + s_2^2}{n}}} $$
t = (82.42 – 77.083)/√[(9.434² + 7.971²)/12] = 1.496; df = 12 + 12 – 2 = 22; t.05 (22) = 2.074;
because 1.5 << 2.074, we are not even close to significance with this test.
Now, let’s see what happens when we take the matching into account and test the difference
scores.
t = (5.33 – 0)/(2.103/√12) = 8.784 (note that we are using X̄diff, or X̄D, to mean the same thing as
D̄, the mean of the difference scores). For this matched test, df = 11 (# of pairs – 1), so the critical t increases to t.05
(11) = 2.201. But, as usual, the change in the critical t (2.074 to 2.201) is small compared to the
change in the calculated t (1.496 to 8.784). Finally, Jared can back up his claim that his friends
are in a physiological frenzy before taking the GRE!
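Both of Jared's analyses are easy to reproduce with SciPy, if you have it, using the data from the table above (a verification sketch: ttest_rel is the matched-pairs test, ttest_ind the independent-samples test):

```python
from scipy import stats

pretest  = [88, 79, 90, 86, 75, 80, 79, 92, 100, 67, 70, 83]
posttest = [80, 75, 85, 82, 70, 76, 72, 86,  91, 65, 67, 76]

# Matched-pairs (dependent) t test on the before-after difference scores.
paired = stats.ttest_rel(pretest, posttest)
print(paired.statistic)       # about 8.78, matching the hand calculation

# The independent-samples t test (the wrong choice for this design), for contrast.
independent = stats.ttest_ind(pretest, posttest)
print(independent.statistic)  # about 1.50, nowhere near significance
```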
7. Matched and Repeated-Measures Designs
So as you can see, matching individuals who have correlated data, or asking people to perform
a task multiple times (repeated measures) with related outcomes, can help immensely when you
are trying to attain significance with a t test. Keep in mind, however, that sometimes there is no
relevant basis for matching participants, and repeating conditions on the same person more than
once can be seriously problematic (imagine trying to teach the same person a foreign language
twice, in order to compare two different teaching methods!). And, when it is reasonable to test
the same person twice (memorizing a list of words while listening to sad music and a similar list
while listening to happy music), you will usually have to counterbalance the conditions to
prevent practice or fatigue from affecting your results, and even counterbalancing can present
problems as well (e.g., carry-over effects). Just remember: in research, unlike in a statistics
course, practice isn’t always a good thing.
Now you try a matched example:
5. Emily is measuring the effect of cognitive behavioral therapy (CBT) on patients with panic
disorder. She uses the number of panic attacks that occurred during the week before each
patient began CBT and during the week following the completion of eight sessions of CBT
as her Before and After measures, respectively. The data are as follows:
Before CBT: 14   8   6  14  20  13   9  12
After CBT:  13   4   1   8  10  12   8   6

a. Determine whether the difference between the two groups is significant when using a
dependent t-test.
b. Perform an independent two-sample t-test, and compare the results with those from the
dependent t-test.
c. Do you think there are any reasons not to use a dependent t-test design?
Additional t test examples:
Participants in a study were taught SPSS by either one-on-one tutoring or sitting in a 200-person
lecture hall and were classified into one of two groups (undergraduates versus graduate students).
Mean performance (along with SD and n) on the computer task is given for each of the four
subgroups.
            Undergraduate                 Graduate
            Tutoring   Lecture Hall      Tutoring   Lecture Hall
Mean        36.74      32.14             29.63      26.04
SD           6.69       7.19              8.51       7.29
n              52         45                20         30
6. Calculate the pooled-variance t test for each of the four comparisons that make sense (i.e.,
compare the two academic levels for each method and compare the two methods for each
academic level), and test for significance.
7. Calculate g for each of the t tests in Exercise #6. Comment on the size of g in each case.
8. a. Find the 95% CI for the difference of the two methods for the undergraduate participants.
b. Find the 99% CI for the difference of the two methods for the graduate participants.
Answers to Exercises
1. Yes, there is a significant difference at both the .05 and .01 levels, with t = 5/√2.0235 =
5/1.4225 = 3.515 > t.05 (29) = 2.045 and t.01 (29) = 2.756.
2. spooled² = 13.678, tcv .05(12) = 2.179, SED = 1.9973, the 95% CI: –5.552 ≤ μ1 – μ2 ≤ 3.152.
Because 0 is contained in the 95% CI we must retain the null (at the .05 level, two-tailed) that
there is no difference between the ticket prices of the two bands. So Colin can keep living the
dream and tell Smack Daddy they still have a viable rival out there!
3. g = 1.125 (rather large effect), which is based on: X 1 = 5.408, s1 = 1.0059, X 2 = 4.275, s2 =
1.0083, spooled = 1.0071
4. t = 1.125 * √(12/2) = 2.756 > t.05 (22) = 2.074. Therefore, the two majors differ significantly
at the .05 level. Even with the smallish sample sizes, this result is not surprising, given that the
effect size was so large.
5. a. t = 3.7612; df = 7; tcv .05 (7) = 2.365; X diff = 4.25; s = 3.196; 3.7612 > 2.365, so these results
can be declared statistically significant.
b. t = 2.022; df = 14; tcv .05 (14) = 2.145;—does not attain significance
c. One possible reason to not use the dependent t test is that your critical value gets larger (from
2.145 to 2.365 in this example). However, the increase in the calculated t (from 2.022 to 3.7612
in this example) usually more than compensates for the increase in critical t, so unless the
matching is very poor, it pays to do the dependent t test.
6.
Undergrad: Tutor versus Lecture
n1 = 52, n2 = 45, s1 = 6.69, s2 = 7.19, s1² = 44.76, s2² = 51.70

$$ s_p^2 \;=\; \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \;=\; \frac{(52 - 1)(44.76) + (45 - 1)(51.70)}{52 + 45 - 2} \;=\; \frac{4557.56}{95} \;=\; 47.97 $$

$$ t \;=\; \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \;=\; \frac{36.74 - 32.14}{\sqrt{47.97\left(\dfrac{1}{52} + \dfrac{1}{45}\right)}} \;=\; \frac{4.6}{\sqrt{1.988}} \;=\; \frac{4.6}{1.41} \;=\; 3.26 $$

df = 95, α = .05, two-tailed tcrit = 1.99 < 3.26; therefore, the difference is significant.
Graduate: Tutor versus Lecture
n1 = 20, n2 = 30, s1 = 8.51, s2 = 7.29, s1² = 72.42, s2² = 53.14

$$ s_p^2 \;=\; \frac{(20 - 1)(72.42) + (30 - 1)(53.14)}{20 + 30 - 2} \;=\; \frac{2917.04}{48} \;=\; 60.77 $$

$$ t \;=\; \frac{29.63 - 26.04}{\sqrt{60.77\left(\dfrac{1}{20} + \dfrac{1}{30}\right)}} \;=\; \frac{3.59}{\sqrt{5.0642}} \;=\; \frac{3.59}{2.25} \;=\; 1.60 $$

df = 48, α = .05, two-tailed tcrit = 2.01 > 1.60; therefore, the difference is not significant.
Tutor: Undergrad versus Graduate
n1 = 52, n2 = 20, s1 = 6.69, s2 = 8.51, s1² = 44.76, s2² = 72.42

$$ s_p^2 \;=\; \frac{(52 - 1)(44.76) + (20 - 1)(72.42)}{52 + 20 - 2} \;=\; \frac{3658.74}{70} \;=\; 52.27 $$

$$ t \;=\; \frac{36.74 - 29.63}{\sqrt{52.27\left(\dfrac{1}{52} + \dfrac{1}{20}\right)}} \;=\; \frac{7.11}{\sqrt{3.6187}} \;=\; \frac{7.11}{1.9023} \;=\; 3.74 $$

df = 70, α = .05, two-tailed tcrit = 2.0 < 3.74; therefore, the difference is significant.
Lecture: Undergrad versus Graduate
n1 = 45, n2 = 30, s1 = 7.19, s2 = 7.29, s1² = 51.70, s2² = 53.14

$$ s_p^2 \;=\; \frac{(45 - 1)(51.70) + (30 - 1)(53.14)}{45 + 30 - 2} \;=\; \frac{3815.86}{73} \;=\; 52.27 $$

$$ t \;=\; \frac{32.14 - 26.04}{\sqrt{52.27\left(\dfrac{1}{45} + \dfrac{1}{30}\right)}} \;=\; \frac{6.10}{\sqrt{2.9038}} \;=\; \frac{6.10}{1.704} \;=\; 3.58 $$

df = 73, α = .05, two-tailed tcrit = 2.0 < 3.58; therefore, the difference is significant.
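If you want to double-check these four hand calculations, the same comparisons can be run from the summary statistics with SciPy (a verification sketch; the dictionary labels are just names we made up):

```python
from scipy import stats

# (mean, SD, n) for each subgroup, taken from the table in the exercises.
groups = {
    "undergrad tutoring": (36.74, 6.69, 52),
    "undergrad lecture":  (32.14, 7.19, 45),
    "graduate tutoring":  (29.63, 8.51, 20),
    "graduate lecture":   (26.04, 7.29, 30),
}

comparisons = [
    ("undergrad tutoring", "undergrad lecture"),
    ("graduate tutoring", "graduate lecture"),
    ("undergrad tutoring", "graduate tutoring"),
    ("undergrad lecture", "graduate lecture"),
]

for a, b in comparisons:
    m1, s1, n1 = groups[a]
    m2, s2, n2 = groups[b]
    res = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=True)
    print(f"{a} vs {b}: t = {res.statistic:.2f}, p = {res.pvalue:.4f}")
```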
7.
$$ g \;=\; \frac{\bar{X}_1 - \bar{X}_2}{s_p} $$

Undergrad: Tutor versus Lecture: g = 4.6/√47.97 = 4.6/6.926 = 0.66; between moderate and large
Graduate: Tutor versus Lecture: g = 3.59/√60.77 = 3.59/7.796 = 0.46; moderate
Tutor: Undergrad versus Graduate: g = 7.11/√52.27 = 7.11/7.229 = 0.98; quite large
Lecture: Undergrad versus Graduate: g = 6.10/√52.27 = 6.10/7.229 = 0.84; large
8. a.

$$ \mu_1 - \mu_2 \;=\; (\bar{X}_1 - \bar{X}_2) \;\pm\; t_{.05}\, s_{\bar{X}_1 - \bar{X}_2} \;=\; 4.6 \pm 1.99\,(1.41) \;=\; 4.6 \pm 2.8 $$

Therefore, the 95% CI goes from +1.8 to +7.4.

b.

$$ \mu_1 - \mu_2 \;=\; (\bar{X}_1 - \bar{X}_2) \;\pm\; t_{.01}\, s_{\bar{X}_1 - \bar{X}_2} \;=\; 3.59 \pm 2.68\,(2.25) \;=\; 3.59 \pm 6.03 $$

Therefore, the 99% CI goes from –2.44 to +9.62.