Chapter 7: Testing Hypotheses about the Difference between the
Means of Two Populations
1. The Standard Error of the Difference
A lot of research questions involve trying to decide whether two population means differ from
one another, and we have to make this decision based on the data from two samples. For
example, what if you wanted to know how much your test score would suffer if you went out
until 3 A.M. on Sunday night instead of going to bed at 10 P.M.? To see if partying until the wee
hours truly hurts your GPA, we could take a random sample of students who typically get As on
their stats exams, break them randomly into two groups, and then find the mean test score of each
group on the exam the following day. Then, we can use NHT to determine whether there is a
significant difference, based on how much these two sample means differ from each other.
To decide whether your two sample means are significantly different, you need to find out the
amount by which two sample means (the same sizes as yours) typically differ based on only
random sampling. This amount is called the standard error of the difference (SED), and we
will show you exactly how to calculate it from the means, SDs, and sizes of the samples in your
study. As you will see, the SED is larger when the variation within your groups is larger, but it
gets smaller when you use larger groups. The means of large samples do not tend to stray much
from the population mean, so when you are dealing with large groups, a fairly large difference
between the sample means is very likely to be significant.
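If you'd like to see that behavior for yourself, here is a small Python simulation (our own sketch, not part of the original chapter, assuming NumPy is installed and using made-up population values). It estimates the SED by repeatedly drawing two samples from the same population and recording how far apart their means land just by chance:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_sed(n_per_group, sd, n_trials=10_000):
    """Estimate the standard error of the difference: draw many pairs of
    samples from one normal population and see how much the two sample
    means typically differ due to random sampling alone."""
    diffs = []
    for _ in range(n_trials):
        group1 = rng.normal(loc=100, scale=sd, size=n_per_group)
        group2 = rng.normal(loc=100, scale=sd, size=n_per_group)
        diffs.append(group1.mean() - group2.mean())
    return np.std(diffs)

# More within-group variation -> bigger SED; bigger groups -> smaller SED.
print(simulated_sed(n_per_group=10, sd=5))    # roughly 2.2
print(simulated_sed(n_per_group=100, sd=5))   # roughly 0.7
print(simulated_sed(n_per_group=100, sd=15))  # roughly 2.1
```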
2. Pooling the Sample Variances
Just like the standard error of the mean, the standard error of the difference usually must be
estimated based on the values you have from your samples. There are two main ways to do this,
but the one that is much more common involves taking a weighted average of your sample
variances (dubbed the pooled variance).
Here is the pooled variance formula:
$$ s_p^2 \;=\; \frac{(N_1 - 1)\,s_1^2 + (N_2 - 1)\,s_2^2}{N_1 + N_2 - 2} $$
Then insert this value into the next formula to find the standard error of the difference:

$$ s_{\bar{X}_1 - \bar{X}_2} \;=\; \sqrt{\frac{s_p^2}{N_1} + \frac{s_p^2}{N_2}} $$
Last, it’s time to plug everything into the t test formula. Because we are now pooling the
variances, we can use the t distribution that has df = N1+ N2 – 2. The most basic two-sample t
test formula is:
$$ t \;=\; \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_{\bar{X}_1 - \bar{X}_2}} $$
But given that we most often test the null hypothesis that the population means are equal to one
another (i.e., that $\mu_1 - \mu_2$ equals zero), we can just drop the $\mu_1 - \mu_2$ term from the formula. Also, if
you want to skip the step of finding the standard error of the difference separately, you can first
find the pooled variance, and then plug it directly into this bad boy:
$$ t \;=\; \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2\left(\dfrac{1}{N_1} + \dfrac{1}{N_2}\right)}} $$
(For computational convenience, we have factored out the pooled variance instead of dividing it
by each sample size.) Once you’ve found your t value, just go by the degrees of freedom to look
up a critical value in your t distribution table, and figure out if your calculated t is greater or less
than the critical t. But remember, finding a significant t simply informs you that the population
means seem to be different. Moreover, the difference of your sample means is just a point
estimate of that population difference, so we will show you how to make it much more
informative as the center of an interval estimate.
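Before trying an example by hand, here is the whole recipe condensed into a short Python function (a sketch of our own; the function name and argument names are just our choices, and it works from summary statistics only):

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance two-sample t test from summary statistics,
    testing the usual null hypothesis that mu1 - mu2 = 0."""
    # Weighted average of the two sample variances (the pooled variance).
    s2_pooled = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    # Standard error of the difference, built from the pooled variance.
    sed = math.sqrt(s2_pooled * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / sed
    df = n1 + n2 - 2
    return t, df
```

You can check its output against the worked example that follows.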
Let’s try an example together:
Everyone has been complaining that taking statistics at 9 A.M. on Friday morning is ruining their
GPA. Professor Salvatore doesn’t believe that there is much of a difference between his two
sections (the other meets at 1 P.M. on Wednesdays), and so he wants to determine if the mean
scores of the midterm exam for the two sections differ significantly from one another at the .05,
two-tailed level. Here are the statistics he has on hand: X̄Wednesday = 88, sWednesday = 4.3, NWednesday = 18,
X̄Friday = 84, sFriday = 5.5, NFriday = 14.
Step one: Determine the pooled variance.
spooled² = [(18 – 1)(4.3²) + (14 – 1)(5.5²)]/(18 + 14 – 2) = (314.33 + 393.25)/30 = 23.586
Step two: Determine the standard error of the difference.
SED = √[(23.586/18) + (23.586/14)] = √2.995 = 1.731
Step three: Figure out the t value.
$$ t \;=\; \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_{\bar{X}_1 - \bar{X}_2}} $$
t = (88 – 84) / 1.731 = 2.311
Step four: Make your statistical decision.
Figure out the critical t value you need, with df = N1+ N2 – 2 = 30, by looking at the t value table.
You will see that t.05 (30) = 2.042.
Because our calculated t = 2.311 > tcrit (30) = 2.042, Professor Salvatore is going to have to
accept the fact that there is a statistically significant difference between the average midterm
grades of the two class sections. Looks like it’s time to rethink the schedule for next semester.
(And really, whether it’s statistics, psychology, economics, etc.—all 9 A.M. classes on Fridays
should be outlawed!)
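If you happen to have SciPy available, you can double-check Professor Salvatore's arithmetic straight from the summary statistics; this is just a verification sketch, and setting equal_var=True is what asks for the pooled-variance version of the test:

```python
from scipy import stats

result = stats.ttest_ind_from_stats(mean1=88, std1=4.3, nobs1=18,
                                     mean2=84, std2=5.5, nobs2=14,
                                     equal_var=True)  # pool the variances
print(result.statistic)  # about 2.31, matching the hand calculation
print(result.pvalue)     # below .05, so the difference is significant
```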
Now you try an example:
1. Alexis is running a psychological experiment to determine whether classical music helps people
concentrate on a cognitive task. She divides her 31 participants into two groups (N1= 15 and N2 =
16) and finds that those who worked on the problem in silence have an average test score of
74 (M1) with s1 = 3.8, and those who listened to Mozart while working had an average test
score of 79 (M2) with s2 = 4.1. Do the two groups differ significantly from one another?
3. Confidence Interval for the Difference of Two Population Means
To gain more information for a two-sample study, we can revisit the confidence interval formula.
The formula for the two-group case is very akin to the one we used in the prior chapter to
determine the likely values of the mean of a single population, but we do need to rework the
formula a bit when trying to estimate the difference between two population means. The CI
formula for two samples looks like you’re staring at the one-sample formula and seeing double:

$$ \mu_1 - \mu_2 \;=\; (\bar{X}_1 - \bar{X}_2) \;\pm\; t_{\text{crit}}\, s_{\bar{X}_1 - \bar{X}_2} $$
Keep in mind that when zero is not included in the 95% confidence interval range, you know
that you can reject the null at the .05 level with a two-tailed test.
As usual, the 99% confidence interval is going to be somewhat larger than the 95% confidence
interval. Remember, the larger the interval, the more confident we are that the true difference
between the population means lies somewhere within that range.
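Here is one way to compute such an interval in Python (a sketch of our own under the same pooled-variance assumptions as above; SciPy is used only to look up the critical t value):

```python
import math
from scipy import stats

def ci_for_difference(mean1, sd1, n1, mean2, sd2, n2, confidence=0.95):
    """Confidence interval for mu1 - mu2, centered on the difference of the
    sample means and extending t_crit standard errors in each direction."""
    df = n1 + n2 - 2
    s2_pooled = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    sed = math.sqrt(s2_pooled * (1 / n1 + 1 / n2))
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df)  # two-tailed critical t
    diff = mean1 - mean2
    return diff - t_crit * sed, diff + t_crit * sed
```

If zero falls inside the interval you get back, the corresponding two-tailed test would not reject the null.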
Try the following example, using your stellar expertise with confidence intervals from the
last chapter:
2. Colin wants to determine whether there is a significant difference between the average ticket
price for his band, Momma Lovin’, and his rival band, Smack Daddy. For the past eight shows for
Momma Lovin’, the mean ticket price was $18.40, with s = 3.2. Smack Daddy’s sales figures showed an
average ticket price of $19.60 for the past six shows, with s = 4.3. Since Smack Daddy thinks they are
abundantly better based on the $1.20 difference, Colin would like to estimate the true difference in ticket
price with a 95% confidence level.
4. Measuring the Size of an Effect
When the scale to measure a certain variable isn’t familiar to you (e.g., a researcher arbitrarily
made up a scale for his/her experiment), it helps immensely to standardize the measure to create
an effect size that can be easily interpreted. And, just because a difference is deemed to be
significant by NHT, it doesn’t necessarily mean that it has practical value. Looking at a standard
effect-size measure can help you make that call. The most common measure of effect size for
samples is the one that is often called d or Cohen’s d, but which we will call g just to confuse
you (just kidding—your text calls it g to avoid confusion with its counterpart in the population).
This effect size measure looks kinda like a z score:
$$ g \;=\; \frac{\bar{X}_1 - \bar{X}_2}{s_p} $$
Note that sp is just the square root of the pooled variance (√s²p). If you already have g, you have
most of what goes into the two-sample t value; to calculate t from g you can use the following
formula:
$$ t \;=\; g\,\sqrt{\frac{n_h}{2}} $$
Unless your two samples are the same size (in which case you can substitute n, the size of either
sample, for nh in this formula), you must calculate the harmonic mean of the two sample sizes to
use the formula. The harmonic mean is used in a number of other statistical formulas, so it is
worth learning. Here is a simplified version of the harmonic mean formula that works when you
have only two numbers:
$$ n_{\text{harmonic}} \;=\; \frac{2\,N_1 N_2}{N_1 + N_2} $$
For example, the harmonic mean of 10 and 15 is 12 [(2 · 10 · 15)/(10 + 15) = 300/25 = 12],
whereas the arithmetic mean is 12.5.
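In Python, the harmonic mean and the g-to-t conversion take only a few lines (again a sketch of our own, reusing the pooled variance from earlier; the function names are just our choices):

```python
import math

def harmonic_mean(n1, n2):
    """Harmonic mean of two sample sizes: 2*N1*N2 / (N1 + N2)."""
    return 2 * n1 * n2 / (n1 + n2)

def effect_size_g(mean1, sd1, n1, mean2, sd2, n2):
    """Effect size g: difference of the sample means divided by the pooled SD."""
    s2_pooled = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(s2_pooled)

def t_from_g(g, n1, n2):
    """Recover the two-sample t value from g via t = g * sqrt(n_h / 2)."""
    return g * math.sqrt(harmonic_mean(n1, n2) / 2)

print(harmonic_mean(10, 15))  # 12.0, as in the example above
```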
Since size matters quite a bit when it comes to effect size, it helps to have a general rule of
thumb about how large is large. Jack Cohen devised the following guidelines for psychological
research:
.8 = large
.5 = moderate
.2 = small
Try these examples:
3. You are asked to assess the general contentment (based on results from the Subjective Happiness
Scale) of psychology majors in comparison to biology majors at Michigan State by determining the effect
size of the difference (i.e., “g”). The ratings given by a random sample of majors for each group are as
follows:
Psychology: 4.6, 3.2, 5.7, 6.8, 6.2, 5.1, 4.2, 6.3, 5.9, 5.4, 5.5, 6.0
Biology:    5.0, 3.4, 4.3, 5.6, 3.2, 3.3, 2.9, 6.0, 4.0, 4.2, 5.3, 4.1
4. For the data in the previous example, compute the t value by using the value you just
calculated for g, and determine whether this t value is significant at the .05 level. Are you
surprised by your statistical determination, given the size of the samples?
5. The Separate-Variance t Test
If the sample sizes are not equal AND if one variance is more than twice the size of the other
variance, you should not assume that you can pool the variances from the two groups. Instead,
you should calculate what is sometimes called a separate-variances t test, for reasons that should
be obvious from looking at the following formula:
$$ t \;=\; \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} $$
Note that when the sample sizes are equal, the pooled and separate-variances tests will always
produce exactly the same t value, and everyone just uses N1+ N2 – 2 degrees of freedom to look
up the critical value. Unfortunately, when both the Ns and SDs differ a good deal, not only
should the separate-variance formula be used to find the t value but also a complex formula
should be used to find the df for looking up the critical t value. The good news is that a lot of
statistical programs will do that work for you. However, when the variances of the two samples
are very different, there are usually other problems with the data as well, so researchers usually
turn to an alternative to the pooled-variance t test that is more drastic than the separate-variances
test (e.g., a data transformation or a nonparametric test).
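In practice you would rarely wrestle with that df formula by hand. For instance, SciPy's ttest_ind_from_stats with equal_var=False runs the separate-variances (Welch) test and computes the adjusted degrees of freedom for you; the summary statistics below are made up purely for illustration:

```python
from scipy import stats

# Separate-variances (Welch) t test from summary statistics.
# equal_var=False means the variances are NOT pooled, and SciPy applies
# the adjusted (Welch-Satterthwaite) degrees of freedom internally.
welch = stats.ttest_ind_from_stats(mean1=75.0, std1=4.0, nobs1=40,
                                    mean2=71.0, std2=11.0, nobs2=15,
                                    equal_var=False)
print(welch.statistic, welch.pvalue)
```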
6. The Matched-Pairs (aka Correlated or Dependent) t Test
Sometimes, you’re lucky enough to deal with samples that match up with one another. And the
bottom line is, the better the matching, the better chance you have of finding significance for
your t test. So, whether the matching is based on using the same people twice or by matching two
groups of individuals with one another on some relevant characteristics, if you can match
samples, you probably should!
Luckily for us, the matched t test formula is essentially the formula for a one-sample t test, which
makes life abundantly easier, provided that you’ve learned the material from the previous
chapter. The main change between the two formulas is that in the matched t test formula, you
will use the mean of the difference scores in the numerator, and then use the standard
deviation of the difference scores in the denominator of the formula.
To make it even clearer, look at the two formulas side by side:

One-sample t test:
$$ t \;=\; \frac{\bar{X} - \mu}{s/\sqrt{N}} $$

Matched t test:
$$ t \;=\; \frac{\bar{D} - \mu_D}{s_D/\sqrt{n}} $$

(Here D̄ is the mean of the difference scores, sD is the standard deviation of those difference
scores, n is the number of pairs, and μD is the population mean difference under the null
hypothesis, which is usually zero.)
Keep in mind that, when you use the matched t test, your degrees of freedom correspond with the
number of matched pairs you are using, not the total number of scores. For example, if you are
matching 16 participants into 8 pairs, your df = 8 – 1 = 7, not 14 (i.e., 16 – 2). The fact that your
df decreases means that your critical t increases when you perform a matched t-test, so don’t take
matching your participants lightly. There is a downside. If the matching doesn’t really work well,
you’ve just tossed out a bunch of dfs for nothing!
Let’s try an example together . . .
Jared wants to see if pulse rates differ significantly for students one hour after taking the GRE as
compared to one hour before taking it. He can see from the Before and After means that the
pulse rates are quite a bit higher before the test (most of his friends have been freaking out in the
morning whenever they take the exam), but when he performs a two-sample t test on his data, he
fails to get even close to significance. Then he finally notices that the variance within his groups
is huge (since the students in his samples are at very different fitness levels), greatly inflating his
denominator. That’s when he realizes that he was doing the wrong t test; he was not taking
advantage of the reduction in error variance that occurs when you use the before-after difference
scores for each student’s pulse rate. With the difference scores in place, his data set looks like
this:
Pretest:    88  79  90  86  75  80  79  92  100  67  70  83
Posttest:   80  75  85  82  70  76  72  86   91  65  67  76
Difference:  8   4   5   4   5   4   7   6    9   2   3   7

X̄pre = 82.42, s = 9.434;  X̄post = 77.083, s = 7.971;  X̄diff = 5.33, sD = 2.103
Using the data from the table, let’s try both t tests and see how much of a difference matching
scores can make. First, let’s try the ordinary (i.e., independent-samples) t test, using the
simplified formula for groups with the same sample size.
$$ t \;=\; \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2 + s_2^2}{n}}} $$
t = (82.42 – 77.083)/√[(9.434² + 7.971²)/12] = 1.496; df = 12 + 12 – 2 = 22; t.05 (22) = 2.074;
because 1.5 << 2.074, we are not even close to significance with this test.
Now, let’s see what happens when we take the matching into account and test the difference
scores.
t = (5.33 – 0)/(2.103/√12) = 8.784 (note that we are using X̄diff, or X̄D, to mean the same thing as
D̄, the mean of the difference scores). For this matched test, df = 11 (# of pairs – 1), so the critical t increases to t.05
(11) = 2.201. But, as usual, the change in the critical t (2.074 to 2.201) is small compared to the
change in the calculated t (1.496 to 8.784). Finally, Jared can back up his claim that his friends
are in a physiological frenzy before taking the GRE!
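Both of Jared's analyses are easy to reproduce with SciPy, if you have it, using the data from the table above (a verification sketch: ttest_rel is the matched-pairs test, ttest_ind the independent-samples test):

```python
from scipy import stats

pretest  = [88, 79, 90, 86, 75, 80, 79, 92, 100, 67, 70, 83]
posttest = [80, 75, 85, 82, 70, 76, 72, 86,  91, 65, 67, 76]

# Matched-pairs (dependent) t test on the before-after difference scores.
paired = stats.ttest_rel(pretest, posttest)
print(paired.statistic)       # about 8.78, matching the hand calculation

# The independent-samples t test (the wrong choice for this design), for contrast.
independent = stats.ttest_ind(pretest, posttest)
print(independent.statistic)  # about 1.50, nowhere near significance
```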
7. Matched and Repeated-Measures Designs
So as you can see, matching individuals who have correlated data, or asking people to perform
a task multiple times (repeated measures) with related outcomes, can help immensely when you
are trying to attain significance with a t test. Keep in mind, however, that sometimes there is no
relevant basis for matching participants, and repeating conditions on the same person more than
once can be seriously problematic (imagine trying to teach the same person a foreign language
twice, in order to compare two different teaching methods!). And, when it is reasonable to test
the same person twice (memorizing a list of words while listening to sad music and a similar list
while listening to happy music), you will usually have to counterbalance the conditions to
prevent practice or fatigue from affecting your results, and even counterbalancing can present
problems as well (e.g., carry-over effects). Just remember: in research, unlike in a statistics
course, practice isn’t always a good thing.
Now you try a matched example:
5. Emily is measuring the effect of cognitive behavioral therapy (CBT) on patients with panic
disorder. She uses the number of panic attacks that occurred during the week before each
patient began CBT and during the week following the completion of eight sessions of CBT
as her Before and After measures, respectively. The data are as follows:
Before CBT: 14   8   6  14  20  13   9  12
After CBT:  13   4   1   8  10  12   8   6

a. Determine whether the difference between the two groups is significant when using a
dependent t-test.
b. Perform an independent two-sample t-test, and compare the results with those from the
dependent t-test.
c. Do you think there are any reasons not to use a dependent t-test design?
Additional t test examples:
Participants in a study were taught SPSS by either one-on-one tutoring or sitting in a 200-person
lecture hall and were classified into one of two groups (undergraduates versus graduate students).
Mean performance (along with SD and n) on the computer task is given for each of the four
subgroups.
            Undergraduate                 Graduate
            Tutoring   Lecture Hall      Tutoring   Lecture Hall
Mean        36.74      32.14             29.63      26.04
SD           6.69       7.19              8.51       7.29
n              52         45                20         30
6. Calculate the pooled-variance t test for each of the four comparisons that make sense (i.e.,
compare the two academic levels for each method and compare the two methods for each
academic level), and test for significance.
7. Calculate g for each of the t tests in Exercise #6. Comment on the size of g in each case.
8. a. Find the 95% CI for the difference of the two methods for the undergraduate participants.
b. Find the 99% CI for the difference of the two methods for the graduate participants.
Answers to Exercises
1. Yes, there is a significant difference at both the .05 and .01 levels, with t = 5/√2.0235 =
5/1.4225 = 3.515 > t.05 (29) = 2.045 and t.01 (29) = 2.756.
2. spooled² = 13.678, tcv .05(12) = 2.179, SED = 1.9973, the 95% CI: –5.552 ≤ μ1 – μ2 ≤ 3.152.
Because 0 is contained in the 95% CI we must retain the null (at the .05 level, two-tailed) that
there is no difference between the ticket prices of the two bands. So Colin can keep living the
dream and tell Smack Daddy they still have a viable rival out there!
3. g = 1.125 (rather large effect), which is based on: X 1 = 5.408, s1 = 1.0059, X 2 = 4.275, s2 =
1.0083, spooled = 1.0071
4. t = 1.125 * √(12/2) = 2.756 > t.05 (22) = 2.074. Therefore, the two majors differ significantly
at the .05 level. Even with the smallish sample sizes, this result is not surprising, given that the
effect size was so large.
5. a. t = 3.7612; df = 7; tcv .05 (7) = 2.365; X diff = 4.25; s = 3.196; 3.7612 > 2.365, so these results
can be declared statistically significant.
b. t = 2.022; df = 14; tcv .05 (14) = 2.145;—does not attain significance
c. One possible reason to not use the dependent t test is that your critical value gets larger (from
2.145 to 2.365 in this example). However, the increase in the calculated t (from 2.022 to 3.7612
in this example) usually more than compensates for the increase in critical t, so unless the
matching is very poor, it pays to do the dependent t test.
6.
Undergrad: Tutor versus Lecture
n1 = 52, n2 = 45, s1 = 6.69, s2 = 7.19, s1² = 44.76, s2² = 51.70

$$ s_p^2 \;=\; \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \;=\; \frac{(52 - 1)(44.76) + (45 - 1)(51.70)}{52 + 45 - 2} \;=\; \frac{4557.56}{95} \;=\; 47.97 $$

$$ t \;=\; \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \;=\; \frac{36.74 - 32.14}{\sqrt{47.97\left(\dfrac{1}{52} + \dfrac{1}{45}\right)}} \;=\; \frac{4.6}{\sqrt{1.988}} \;=\; \frac{4.6}{1.41} \;=\; 3.26 $$

df = 95, α = .05, two-tailed tcrit = 1.99 < 3.26; therefore, the difference is significant.
Graduate: Tutor versus Lecture
n1 = 20, n2 = 30, s1 = 8.51, s2 = 7.29, s1² = 72.42, s2² = 53.14

$$ s_p^2 \;=\; \frac{(20 - 1)(72.42) + (30 - 1)(53.14)}{20 + 30 - 2} \;=\; \frac{2917.04}{48} \;=\; 60.77 $$

$$ t \;=\; \frac{29.63 - 26.04}{\sqrt{60.77\left(\dfrac{1}{20} + \dfrac{1}{30}\right)}} \;=\; \frac{3.59}{\sqrt{5.0642}} \;=\; \frac{3.59}{2.25} \;=\; 1.60 $$

df = 48, α = .05, two-tailed tcrit = 2.01 > 1.60; therefore, the difference is not significant.
Tutor: Undergrad versus Graduate
n1 = 52, n2 = 20, s1 = 6.69, s2 = 8.51, s1² = 44.76, s2² = 72.42

$$ s_p^2 \;=\; \frac{(52 - 1)(44.76) + (20 - 1)(72.42)}{52 + 20 - 2} \;=\; \frac{3658.74}{70} \;=\; 52.27 $$

$$ t \;=\; \frac{36.74 - 29.63}{\sqrt{52.27\left(\dfrac{1}{52} + \dfrac{1}{20}\right)}} \;=\; \frac{7.11}{\sqrt{3.6187}} \;=\; \frac{7.11}{1.9023} \;=\; 3.74 $$

df = 70, α = .05, two-tailed tcrit = 2.0 < 3.74; therefore, the difference is significant.
Lecture: Undergrad versus Graduate
n1 = 45, n2 = 30, s1 = 7.19, s2 = 7.29, s1² = 51.70, s2² = 53.14

$$ s_p^2 \;=\; \frac{(45 - 1)(51.70) + (30 - 1)(53.14)}{45 + 30 - 2} \;=\; \frac{3815.86}{73} \;=\; 52.27 $$

$$ t \;=\; \frac{32.14 - 26.04}{\sqrt{52.27\left(\dfrac{1}{45} + \dfrac{1}{30}\right)}} \;=\; \frac{6.10}{\sqrt{2.9038}} \;=\; \frac{6.10}{1.704} \;=\; 3.58 $$

df = 73, α = .05, two-tailed tcrit = 2.0 < 3.58; therefore, the difference is significant.
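If you want to double-check these four hand calculations, the same comparisons can be run from the summary statistics with SciPy (a verification sketch; the dictionary labels are just names we made up):

```python
from scipy import stats

# (mean, SD, n) for each subgroup, taken from the table in the exercises.
groups = {
    "undergrad tutoring": (36.74, 6.69, 52),
    "undergrad lecture":  (32.14, 7.19, 45),
    "graduate tutoring":  (29.63, 8.51, 20),
    "graduate lecture":   (26.04, 7.29, 30),
}

comparisons = [
    ("undergrad tutoring", "undergrad lecture"),
    ("graduate tutoring", "graduate lecture"),
    ("undergrad tutoring", "graduate tutoring"),
    ("undergrad lecture", "graduate lecture"),
]

for a, b in comparisons:
    m1, s1, n1 = groups[a]
    m2, s2, n2 = groups[b]
    res = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=True)
    print(f"{a} vs {b}: t = {res.statistic:.2f}, p = {res.pvalue:.4f}")
```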
7.
$$ g \;=\; \frac{\bar{X}_1 - \bar{X}_2}{s_p} $$

Undergrad: Tutor versus Lecture: g = 4.6/√47.97 = 4.6/6.926 = 0.66; between moderate and large
Graduate: Tutor versus Lecture: g = 3.59/√60.77 = 3.59/7.796 = 0.46; moderate
Tutor: Undergrad versus Graduate: g = 7.11/√52.27 = 7.11/7.229 = 0.98; quite large
Lecture: Undergrad versus Graduate: g = 6.10/√52.27 = 6.10/7.229 = 0.84; large
8. a.

$$ \mu_1 - \mu_2 \;=\; (\bar{X}_1 - \bar{X}_2) \;\pm\; t_{.05}\, s_{\bar{X}_1 - \bar{X}_2} \;=\; 4.6 \pm 1.99\,(1.41) \;=\; 4.6 \pm 2.8 $$

Therefore, the 95% CI goes from +1.8 to +7.4.

b.

$$ \mu_1 - \mu_2 \;=\; (\bar{X}_1 - \bar{X}_2) \;\pm\; t_{.01}\, s_{\bar{X}_1 - \bar{X}_2} \;=\; 3.59 \pm 2.68\,(2.25) \;=\; 3.59 \pm 6.03 $$

Therefore, the 99% CI goes from –2.44 to +9.62.