CHAPTER 2: SOME TRULY USEFUL BASIC TESTS FOR QUANTITATIVE
VARIABLES
TOPICS
RANDOM SAMPLES
INDEPENDENT AND DEPENDENT VARIABLES
PAIRED T-TESTS (Before/After comparisons)
INDEPENDENT SAMPLE T-TESTS (Separate group comparisons)
A. Large samples or unequal variances
B. Equal variances
TESTS TO COMPARE VARIANCES
NORMALITY ASSUMPTION
Section 1. Introduction
Many important experimental results are based on statistical analyses no more difficult
than those we will review in this chapter. They are among the most useful test statistics ever
devised, simply because the experimental designs which they match are easy, powerful, and
popular.
All statistical tests require that the sample studied be a random sample from the
population of interest. This is an extremely stringent requirement. It means that every person or
item in the population had an equal chance of making it into the sample. These are the ONLY
conditions under which sampling variability can be calculated. It ensures that sampling
variability is the sole source of error in your results. If samples are not selected
randomly, biases can easily contaminate the results. At the very least, if random samples are not
possible, randomization has to be used to create comparison groups.
We should note that most hypothesis tests are loosely stated as questions of the form
"Does variable X affect variable Y?" For instance, we might ask if Gender affects a person's
opinion on Abortion, or whether Age affects a person's Blood Pressure. In questions where one
variable can be thought of as "affecting" the other variable, the "cause" is referred to as the
independent variable. The outcome is referred to as the dependent variable. Hence, in the
preceding examples, Gender and Age are independent variables possibly affecting the dependent
variables Opinion and Blood Pressure.
The tools discussed in this chapter are only applicable when the dependent variable is
quantitative. Basically, that is because all these methods focus on the effect of the independent
variable on the mean and standard deviation of the dependent variable. These parameters only
make sense if the variable is quantitative. There are further mathematical requirements, which
we will list at the end of the chapter.
Section 2. Paired t-test
The most common experimental designs are in the form of a comparison. Often, the
comparison is on values collected on the same experimental subjects. For instance, we may have
reading proficiency scores for children before and after they undergo a six-week training
program. We may have strength scores on right and left arms of the same person. We may have
yields from tomato plants of types A and B, when one of each was planted in the same pot. In
each of these examples, the key feature is that there is a "matching" mechanism which pairs an
observation of one type unambiguously with an observation of the other type. The statistical
technique we discuss will preserve the information due to the pairing by using one of the
observations as a "baseline" against which the other is measured.
The method is simple. Consider the before and after scores for reading proficiency. If we
are interested in whether the program (independent variable) affected the reading proficiency
(dependent variable), we are really interested in whether there was typically a change in the
scores from before to after. We will calculate the individual changes for each child and use the
one-sample t-test (Chapter 1) to test the null hypothesis that the mean change is zero (no effect
exists.)
Recipe for Paired t-test
Data structure: For n individuals, we have measurement 1 (X1) and measurement 2 (X2)
which we wish to compare. X1 and X2 must be quantitative variables. Form a new
column D = X1 - X2 which contains the differences in the two measurements for each
individual.
Perform one sample t-test on D
1) Ho: μ_D = 0 (typical values do not differ for measurements 1 and 2)
H1: μ_D ≠ 0 (typical values do differ for measurements 1 and 2)
2) Since the sample of D which we have observed has n observations, the t-statistic will have n-1 d.f.
t = (x̄_D - 0) / (s_D / √n) = x̄_D / (s_D / √n)
The subscript "D" is to remind you that these statistics are calculated from the column of
D=Differences. Form your critical region using the table of the t-distribution with n-1 df
where n is the number of pairs.
3) Calculate the value of t for your sample. If you use a statistical computer package, it
may give you the p-value for this test automatically.
4) Write the appropriate conclusion.
Example of a paired comparison. Notice the presence of a natural pairing mechanism between
observations with the different "treatments". What are the advantages of such a mechanism?
The data below are from Darwin's study of cross- and self-fertilization.
Pairs of seedlings of the same age, one produced by cross-fertilization and
the other by self-fertilization, were grown together so that the members of
each pair were reared under nearly identical conditions. The data are the
final heights of each plant after a fixed period of time, in inches. Darwin
consulted the famous 19th century statistician Francis Galton about the
analysis of these data. The summary information was produced by the
statistical package SAS for Windows.
PAIR    CROSS    SELF    DIFF = cross-self
  1      23.5    17.4          6.1
  2      12.0    20.4         -8.4
  3      21.0    20.0          1.0
  4      22.0    20.0          2.0
  5      19.1    18.4          0.7
  6      21.5    18.6          2.9
  7      22.1    18.6          3.5
  8      20.4    15.3          5.1
  9      18.3    16.5          1.8
 10      21.6    18.0          3.6
 11      23.3    16.3          7.0
 12      21.0    18.0          3.0
 13      22.1    12.8          9.3
 14      23.0    15.5          7.5
 15      12.0    18.0         -6.0

Summary on variable DIFF
   N        Mean         Std Dev       Minimum       Maximum
-------------------------------------------------------------
  15     2.6066667     4.7128194    -8.4000000     9.3000000
The null hypothesis is that the mean difference in the population is 0,
implying that mean heights of cross and self-fertilized plants would not
differ. In symbols,
Ho: μ_D = 0    vs    H1: μ_D ≠ 0
There are 15 observations in the data set, so 14 d.f. If we use α = 5%, then
the critical region would be "Reject Ho if t < -2.145 or t > 2.145". In this
sample, t = 2.142. Hence, there is no significant evidence, at α = 5%, that
cross- and self-fertilized seedlings differ in mean height.
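For readers who want to reproduce this arithmetic, the sketch below (Python with scipy is assumed here; the course itself uses SAS and SPSS) forms the DIFF column and runs the one-sample t-test on D exactly as the recipe describes.

from scipy import stats

# Darwin's data: final heights (inches) of cross- and self-fertilized seedlings, by pair.
cross = [23.5, 12.0, 21.0, 22.0, 19.1, 21.5, 22.1, 20.4, 18.3, 21.6, 23.3, 21.0, 22.1, 23.0, 12.0]
self_ = [17.4, 20.4, 20.0, 20.0, 18.4, 18.6, 18.6, 15.3, 16.5, 18.0, 16.3, 18.0, 12.8, 15.5, 18.0]

diff = [c - s for c, s in zip(cross, self_)]   # DIFF = cross - self
t, p = stats.ttest_1samp(diff, 0.0)            # one-sample t-test on the differences, 14 d.f.
print(t, p)   # t is about 2.14; the p-value is just over .05, so Ho is not rejected at alpha = 5%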
Section 3. Two-sample t-test (also called the independent samples t-test)
Frequently, we have two separate groups on which we wish to make comparisons. We
may be interested in comparing mean salaries for male and female entry-level employees, or
length of hospital stays for HMO and PPO plan insurees. In the first case, our independent
variable is gender while the dependent variable is salary. Salary is a quantitative variable for
which we summarize typical values using the mean. Unlike the paired t-test, where the values in
each group are naturally matched, here we assume the two groups are completely independent.
Diagram 1 gives a schematic of the statistical situation. We have two populations
summarized by the means in each (μ1 and μ2) and the standard deviations (σ1 and σ2). Our
hypotheses are
Ho: μ1 = μ2 (μ1 - μ2 = 0) "group" has no effect on mean
Ha: μ1 ≠ μ2 (μ1 - μ2 ≠ 0) "group" has an effect on mean
Since we cannot observe μ1 and μ2, we must use our sample data to reach conclusions. Looking
at the hypotheses, our natural move is to compare the two sample means to each other, or
equivalently, their difference to 0.
Population 1                              Population 2
Parameters μ1, σ1                         Parameters μ2, σ2

                    Ho: μ1 = μ2

Sample 1                                  Sample 2
Stats: n1, mean x̄1, s1                    Stats: n2, mean x̄2, s2

DIAGRAM 1. Comparing two populations
If the population variances are known, probability theory shows that the appropriate statistic
would be
Z = (x̄1 - x̄2) / √(σ1²/n1 + σ2²/n2)
The two sample t-test has two versions, which differ in how they "doctor up" the
denominator of this statistic since the population variance is hardly ever known. The versions
differ depending on whether the two population variances can be assumed equal or unequal. In
section 4 we cover a method for checking this assumption.
Section 3A. Large samples or unequal variances
When the variances (or standard deviations) in the two groups appear very dissimilar, the
best method may be the unequal variance version. This method does not require the assumption
of equal variances. The disadvantage of this method is that the degrees of freedom are
sometimes small, and they are always difficult to calculate (this is referred to as Satterthwaite's
approximation). While the test statistic itself is easy to calculate, the degrees of freedom are best
computed by a statistical package. Without the computer, it helps to know that the d.f. are
always between n_s - 1, where n_s is the size of the smallest sample, and n1 + n2 - 2, so if you get
the same conclusion using both those d.f., you are safe. If both samples are large (at least 50), it
is probably safe to use infinite (∞) d.f. The value of the test statistic is computed by:
t = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2)
Section 3B. Equal variance t-test
When the variances in the two samples appear similar, it is advantageous to "pool" the
two estimates into an estimate of the alleged single underlying variance. This allows us to pool
the degrees of freedom in the two groups as well, giving more sensitive critical regions.
s_p² = [(n1 - 1)s1² + (n2 - 1)s2²] / (n1 + n2 - 2) = pooled variance

t = (x̄1 - x̄2) / (s_p √(1/n1 + 1/n2))    (note s_p, not s_p²)

d.f. = n1 + n2 - 2
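The following sketch (hypothetical data; Python is assumed, not part of the original notes) applies the pooled-variance formulas above by hand and then checks the result against a library routine.

import math
from scipy import stats

g1 = [12.1, 14.3, 11.8, 13.5, 12.9, 13.0]        # hypothetical data
g2 = [10.2, 11.9, 10.8, 12.4, 11.1, 10.5, 11.6]  # hypothetical data

n1, n2 = len(g1), len(g2)
xbar1, xbar2 = sum(g1) / n1, sum(g2) / n2
s1_sq = sum((x - xbar1) ** 2 for x in g1) / (n1 - 1)   # sample variances
s2_sq = sum((x - xbar2) ** 2 for x in g2) / (n2 - 1)

sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)         # pooled variance
t = (xbar1 - xbar2) / (math.sqrt(sp_sq) * math.sqrt(1 / n1 + 1 / n2))
print(t, n1 + n2 - 2)                                  # t and d.f. = n1 + n2 - 2
print(stats.ttest_ind(g1, g2, equal_var=True))         # should reproduce the same t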
Section 4. Comparing two standard deviations
Some authors now argue that we should always use the unequal variance version of the
test. Traditionally, however, the equal variance version was preferred both because of the
potentially greater degrees of freedom and because its relation to more advanced topics (like the
one-way ANOVA) is well understood. In this tradition, before we decide which version of the
two-sample t-test to use, we need a tool for deciding whether the variances in two groups are
equal or different. This amounts to a hypothesis test for the hypotheses
Ho: σ1 = σ2 (or equivalently, σ1² = σ2²)
vs
Ha: σ1 ≠ σ2 (or equivalently, σ1² ≠ σ2²)
Note the restatement of the hypotheses in terms of the variances. There are many tests available
for testing these hypotheses. The most commonly cited are Fisher's test (F-test classic!) and
Levene's test.
Section 4a. Fisher's test
The test statistic used to compare the variances is the F-statistic. F is for Sir R. A. Fisher,
who pioneered many classic statistical techniques.
F  s12 / s 22
or
F'  s 2max / s 2min
F' differs from F only in that it always places the larger of the two sample variances in the
numerator. If the null hypothesis is true, we expect F (or F') to be near 1. If F is either very
much larger or very much smaller than 1 (F' very much larger than 1), we would believe Ha is
true. As always, the question is where to draw the line (critical region).
The table of the F-distribution is provided in most statistics texts. Most tables only give
the cutpoint which marks off the lower 1-A of area from the upper A of area in the righthand tail.
It can be quite confusing to understand how to use this to get the critical values for all the
varieties of test.
[Figure: generic shape of an F distribution, with area 1-A to the left of the cutpoint and area A in the righthand tail.]
a)
F is explicitly a two-tailed test. So we need cutpoints which mark off the lower α/2 area
in the lefthand tail, and α/2 in the righthand tail. Most tables only give the righthand cutpoint.
To get the lefthand cutpoint, you use
Lefthand cutpoint for lower α/2 with M,N df = 1 / (Righthand cutpoint for upper α/2 with N,M df)
Example: Suppose we are using α = 5%, and sample 1 has n=10 while sample 2 has n=6 (9 and
5 df, respectively). We should put 2.5% in each tail. From the table, we see that the cutpoint for
the upper tail is 6.68. To get the lower cutpoint, we need to reverse the order of the d.f. (now 5
and 9), then take the reciprocal of the upper cutpoint. That is, the lower cutpoint is 1/4.48 = .223.
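If software is available, both cutpoints can be taken directly from the F distribution instead of a printed table. A minimal sketch for the example above, assuming Python with scipy (table values are rounded, so expect tiny differences):

from scipy import stats

alpha = 0.05
df1, df2 = 9, 5                                     # numerator and denominator d.f.

upper = stats.f.ppf(1 - alpha / 2, df1, df2)        # righthand cutpoint, about 6.68
lower = 1 / stats.f.ppf(1 - alpha / 2, df2, df1)    # 1 / (righthand cutpoint with the d.f. reversed)
print(upper, lower)                                 # about 6.68 and 0.223

print(stats.f.ppf(alpha / 2, df1, df2))             # the lefthand cutpoint directly; same value as 'lower'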
b)
F' is also two-tailed, but it finesses the problem of getting the lower cutpoint by arranging
to always put the largest variance on top. Hence, if we had sample 1 with n=10 and sample 2
with n=6, our critical region would be: If s1 is the largest, reject if F' > 6.68 (9 and 5 df, with
α/2 in the upper tail); if s2 is the largest, reject if F' > 4.48 (5 and 9 df, with α/2 in the upper
tail).
Example of an F test.
Notice that sometimes hypothesis tests about the variances (or standard deviations) are of
interest in their own right.
Drill press operators in a manufacturing plant must drill holes of specified
diameter in sheets of metal. One goal is that all holes should have the same
diameter (small variability in the individual diameters). Actual diameters
are measured for 20 holes drilled by inexperienced operators, and 10 holes
drilled by experienced operators. The data is summarized below. Is there
significant evidence, at α = 5%, that the population variances differ for
experienced and inexperienced operators?

    Inexperienced:  n = 20,  s = .52 mm
    Experienced:    n = 10,  s = .21 mm
1) Ho: 1 = E
(variability is the same for experienced and inexperienced
operators)
Ha: 1  E
(variability is not the same)
2) We will reject Ho if F' > 3.69, using F table for upper area of .025, 19
df in numerator and 9 in denominator.
2
2
3) F' = .52 / .21 = 6.13
4) There is significant evidence that the variability in the diameters
differs for experienced and inexperienced operators. Inexperienced operators
have larger variability (less consistency) in the diameters of the holes.
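The same F' calculation is easy to check with software. A sketch under the assumption that Python with scipy is available (the critical value comes from the F distribution rather than the printed table, so it may differ slightly from 3.69):

from scipy import stats

s_inexp, n_inexp = 0.52, 20    # inexperienced operators (summary statistics from the example)
s_exp, n_exp = 0.21, 10        # experienced operators

f_prime = max(s_inexp, s_exp) ** 2 / min(s_inexp, s_exp) ** 2   # larger sample variance on top
cutpoint = stats.f.ppf(1 - 0.05 / 2, n_inexp - 1, n_exp - 1)    # upper 2.5% point with 19 and 9 d.f.
print(f_prime, cutpoint)   # about 6.13 and roughly 3.69; F' exceeds the cutpoint, so reject Ho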
Section 4b. Levene's Test (used by SPSS)
Levene's test actually tests the null hypothesis that the mean magnitude of the
distances from the individual observations to their group mean is the same in each group. Instead of defining
dispersion in terms of 'squared distances' as variances do, it uses absolute values of distances.
The actual algorithm is as follows:
1) Within each group, compute the difference between the individual observations and
the group mean.
2) Take the absolute value of these differences.
3) Do an independent sample t-test (equal variance version) of the null hypothesis that the
means of the absolute differences are equal.
4) Square the t-value from the t-test. (Under Ho, the square of a t should have an F
distribution with 1 df in the numerator and n1 + n2 - 2 in the denominator.) Compare it to the
cutpoint which places α (usually 5%) area in the upper tail of the distribution. You are only
interested in large values of F, because only large values would indicate that the variances are
different. (Note the difference between this and the cutpoints for Fisher's test, which place α/2 in
each tail.)
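A small sketch of this four-step algorithm follows (two hypothetical samples; Python with scipy is assumed and is not part of the original notes). scipy's built-in levene routine with center='mean' computes the same statistic directly.

from scipy import stats

g1 = [14.2, 15.1, 13.8, 16.0, 14.9, 15.5]   # hypothetical data
g2 = [13.0, 17.8, 12.1, 18.4, 11.5, 16.9]   # hypothetical data

# Steps 1-2: absolute deviations of each observation from its own group mean.
m1, m2 = sum(g1) / len(g1), sum(g2) / len(g2)
d1 = [abs(x - m1) for x in g1]
d2 = [abs(x - m2) for x in g2]

# Step 3: equal-variance t-test on the absolute deviations.
t, p = stats.ttest_ind(d1, d2, equal_var=True)

# Step 4: square t; under Ho this is F with 1 and n1 + n2 - 2 d.f.  Use all of alpha in the upper tail.
F = t ** 2
cutpoint = stats.f.ppf(0.95, 1, len(g1) + len(g2) - 2)
print(F, cutpoint)

print(stats.levene(g1, g2, center='mean'))   # cross-check: same F, with its p-value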
Large values of F indicate that one of the means must be different from the other (Ha
true). Bear in mind at this point that we are no longer talking about the means of the raw data,
but of the distance of the raw values around their group means. In the example above, a large
value for F would indicate that the typical (mean) distance of individual diameters from the
group mean was larger in one group than in another, indicating more variability in one group.
Levene's Test and Fisher's Test do not give exactly the same result. Except in borderline
cases, however, they usually give comparable values. There is some intuitive evidence that
Levene's Test is less sensitive to departures from the normality assumption, and I think that is
why it is the default in SPSS.
Example of Levene's Test
The following data show test scores for five freshmen and five juniors
on an assessment test for critical thinking. Does variability differ in the
two groups, using α = 5%?

    Freshmen:  28  32  21  36  33    (sample mean = 30.0)
    Juniors:   34  49  43  32  27    (sample mean = 37.0)
Ho: σ1² = σ2², vs Ha: σ1² ≠ σ2².
Reject Ho if F > 5.32 (using F table with 1 and 8 d.f., and 5% in the upper tail.)
Absolute values of differences from mean within each group:

    Freshmen:  2   2   9   6   3    (sample mean = 4.4, s = 3.05)
    Juniors:   3  12   6   5  10    (sample mean = 7.2, s = 3.70)

Sp = 3.39, df = 8, t = 1.31, F = 1.71. Since 1.71 is less than the
cutpoint of 5.32, there is no significant evidence that the variances are
different in the two groups. When we compute the t-test to compare the means
of the test scores, we can use the equal variance version.
Section 5. A Full Example!!
Recipe for a two sample t-test:
Data structure: Two separate groups are measured for a quantitative variable Y.
Performing the two sample t-test
1) State the null and alternative hypotheses in terms of the means in the two groups.
2) Decide which version of the t-test to use by using the F-test or Levene's test to examine the
variances within the two samples.
3) Calculate the number of degrees of freedom for the appropriate version of the test. Use your α
and a table of the t-distribution to set the critical region.
4) Calculate the appropriate version of t.
5) Write your conclusion.
Most computer programs will automatically calculate both versions of t as well as F, along with their p-values, saving you a lot of effort.
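As an illustration of the whole recipe (hypothetical data; Python with scipy is assumed rather than the SAS and SPSS output shown in these notes), the variance check and both versions of the t-test are each one call:

from scipy import stats

g1 = [245, 232, 268, 251, 240, 259, 248, 236]        # hypothetical data for group 1
g2 = [210, 205, 231, 198, 224, 215, 208, 219, 202]   # hypothetical data for group 2

# Step 2: examine the variances (Levene's test here; Fisher's F' would also work).
print(stats.levene(g1, g2, center='mean'))

# Steps 3-5: both versions of t; report the one the variance check supports, then state a conclusion.
print(stats.ttest_ind(g1, g2, equal_var=True))    # pooled (equal variance) version
print(stats.ttest_ind(g1, g2, equal_var=False))   # unequal variance (Satterthwaite) version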
Example for two-sample test (two independent samples)
Notice that the two groups of patients are completely separate, with no natural pairing.
In small to moderate samples, the particular version of the two-sample t-test depends on whether
the variances within the two groups seem similar. SAS for Windows computes an F-test to help
you decide which version is appropriate.
The data summarized below show cholesterol values for the 39 heaviest
men in the Western Collaborative Group Study. (This study was carried out in
California in 1960-1961 and involved 3,154 middle-aged men. The purpose was
to study behaviour patterns and risk of coronary heart disease.) All the
cholesterols summarized below are for men weighing more than 225 pounds.
Cholesterols are given in mg per 100 ml. Each man was rated as generally
having Behaviour Type A (urgency, aggression, ambition) or Behaviour Type B
(relaxed, non-competitive, less hurried.) In heavy, middle-aged men, is
cholesterol level related to behaviour type?
1) The null hypothesis is that behavior type has no effect on mean
cholesterol, while the alternative hypothesis is that it does have an effect
on mean cholesterol. In symbols:
Ho: A = B
vs
Ha: A  B
2) Since the hypotheses concern the means in two separate groups, we will use
the two sample t-test. To decide which version, we notice that the program
has printed the value of F', along with the p-value (labeled Prob>F').
Recall that this statistic tests the null hypothesis that the two population
variances are equal. Since the p-value of .2927 is greater than any
reasonable α (.1 to .01), it is reasonable to assume that the variances are
equal and use that version of the t-test.
3) For the pooled (equal) variance version, the d.f.=37. With a significance
level of 5%, we would reject Ho if t < -2.021 or t > 2.021. Alternatively, we
reject if the p-value is less than .05.
4) For the equal variance version, t=2.5191, df=37 and the p-value is .0162.
You should use the table of sample means and standard deviations to check
these results.
5) If we are using a significance level of .05, we would reject the null
hypothesis. Hence, we can say there is significant evidence that behaviour
type is associated with differences in mean cholesterol.
COMPUTER PRINTOUT

                          TTEST PROCEDURE

Variable: CHOL

TYPE        N          Mean            Std Dev        Std Error
-----------------------------------------------------------------
A          19      245.36842105     37.61384279      8.62920735
B          20      210.30000000     48.33991486     10.80913356

Variances       T        DF     Prob>|T|
------------------------------------------
Unequal      2.5355     35.7     0.0158
Equal        2.5191     37.0     0.0162        <----- note how SAS labels the p-values

For H0: Variances are equal, F' = 1.65    DF = (19,18)    Prob>F' = 0.2927    <----- note how SAS labels the p-values
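As step 4 suggests, the printed equal-variance t can be checked from the summary statistics alone. A sketch (Python assumed; only the means, standard deviations, and sample sizes from the printout are used):

import math

n_a, mean_a, sd_a = 19, 245.36842105, 37.61384279   # Type A, from the SAS printout
n_b, mean_b, sd_b = 20, 210.30000000, 48.33991486   # Type B

sp = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2))   # pooled s
t = (mean_a - mean_b) / (sp * math.sqrt(1 / n_a + 1 / n_b))
print(t, n_a + n_b - 2)   # about 2.519 with 37 d.f., matching the Equal-variances line above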
Boxplot for CHOL by TYPE
[Figure: side-by-side boxplots of cholesterol (CHOL) for behaviour types A and B.]
CASE STUDY
Jerrold et al. (2009) compared typically developing children to young adults who had Down
Syndrome, with respect to a number of psychological measures thought to be related to the ability to learn
new words. Data on two of the measures is summarized in the table below. Recall Score is a measure of verbal
short-term memory. Raven’s CPM is a task in which the participant must correctly identify an image which
completes a central pattern.
The authors used the pooled t test to compare the typical scores in the two groups. For Raven’s
CPM, t = .485, p-value = .629. For Recall Score, t = 7.007, p-value < .0001. Hence, the two
groups did not differ significantly with respect to mean Raven’s CPM, but the Down’s Syndrome group
scored significantly differently (apparently lower) on Recall Score. Based on this and a number of other
comparisons, the authors conclude that verbal short-term memory is a primary factor in the ability to learn
new words.
The authors' choice of the pooled t test rather than the unequal-variance t-test appears
reasonable here. For Raven’s CPM, F = 0.700, p-value = .379. For Recall Score,
F = 0.691, p-value = .361. Neither variable showed a significant difference in the variances within the
groups.
The other distributional assumption underlying t tests is that the data comes from normal
distributions. Journal publications rarely have space in which to present graphical evidence with which the
reader can check this assumption. However, the discussion will often include a sentence addressing this
issue, and remark on any transformations (e.g. logarithms) used to make the variable more nearly normal.
The authors actually presented the results of the pooled t test (with 80 degrees of freedom) as an
F test with 1 degree of freedom in the numerator and 80 in the denominator. The relation between these
two test statistics will be explained in Chapter 4.
Summary statistics from Jerrold (2009).

                   Down Syndrome young adults     Typically developing children
                           (n = 21)                         (n = 61)
                      Mean        S.D.                 Mean        S.D.
Raven’s CPM          19.33        4.04                19.90        4.83
Recall Score         12.00        3.05                18.25        3.67
(Source: Jerrold, C., Thorn, A. S. C., and Stephens, E. (2009). The relationship among verbal short-term
memory, phonological awareness, and new word learning: evidence from typical development and Down
syndrome. J. Experimental Child Psychology, 102(2), 196-218.)
Section 6. Nasty mathematical assumptions
We already know of two fundamental assumptions underlying the tests in this chapter,
and that of the t-test in Chapter 1.
1) The sample must be random.
2) The dependent variable must be quantitative.
In addition, the derivations of the t and F-distributions have a nasty mathematical assumption:
that the distribution of the variable in the population must follow a normal distribution. In plain
language, if you could draw a histogram of the values for all the observations in the entire
population, you should see the famous "bell curve". So we have a third assumption:
3) The distribution for the individual values is normal.
It is not very likely that we will ever know for sure whether assumption 3 is met. What can we
do to check and how important is it anyway? There are several graphical techniques we can use
to check for normality. So far, we have seen dotplots and boxplots, though we have not
discussed them. (See your elementary text.) In chapter 4 we will meet a tool called a normal
probability plot which gives a more sensitive check. What are we really looking for? An
immediate cause of trouble in a small or moderate data set would be when one or two values are
very far away from the rest. The self/cross-fertilization data used as an example of the paired t-test may be a case where the data contains two "outliers". Outliers should be rare in normally
distributed data. Outliers can cause the p-values and critical regions to be only approximate.
The most common problem is to make the p-value larger than what it should be.
If there are no outliers, and the data show a nearly symmetric pattern with the points
clustering in the middle of the range, then it is unlikely that non-normality is a serious problem.
If you do seriously suspect nonnormality in your data, consult a statistician on a variety of
"nonparametric" statistical tests which do not require the normality assumption.
There is one frequent case in social science data where normality is very questionable. If
you have data collected on an ordinal scale (e.g. 0 = strongly disagree to 4 = strongly agree), it is
unlikely to be normally distributed. Recall that the normal distribution is for continuous or
nearly continuous random variables, and data on a five point scale is quite discrete. This is
especially true if the values cluster at one end or the other end of the scale (e.g. almost all agree
or strongly agree). In this case, one of the techniques of Chapter 3 might be appropriate.
Furthermore, it is questionable as to whether one can legitimately average values on this kind of
scale — does a ("disagree" + "strongly agree")/2 = "agree"? Nevertheless, treating this ordinal
data AS IF it were numerical on a 0-4 scale, and conducting averaging operations, is a sloppy but
common practice in the social sciences. Averages over several questions frequently produce
values which appear reasonably normally distributed.
Finally, when comparing two population means, the choice of the version of the test
depends on whether variances can be assumed equal. Since this assumption, referred to as
"homogeneity of variance", underlies much of the Analysis of Variance, we list it as a fourth
assumption:
4) variances in the two groups are equal.
EXERCISES FOR CHAPTER 2
*Exercise 1. Data below show blood pressures for 5 subjects. The first value was taken while the
subject was resting. The second was taken while the subject was resting, but asked to work a mental
arithmetic problem. Does math affect mean blood pressure? Use α = 5%.

Subject number      1     2     3     4     5
Resting BP         115   125   110   120   110
During Math BP     125   125   130   115   125
*Exercise 2. Occupancy rates (average annual percentage of beds filled) are compared for randomly
selected urban and suburban hospitals in a state.
a. Is there evidence of difference in variability between the two groups? Compute both Levene's Test
and Fisher's Test to answer this. Use α = 5%.
b. Is there evidence of a difference in the mean occupancy rates? Use the results of A to help you
decide on an appropriate version of the t-test. Use α = 5%.

Urban:     76.5   79.6   77.5   79.4   79.3   78.1
Suburban:  71.5   73.4   71.2   67.8   63.0   76.5
Exercise 3. Eight students volunteer to participate in a test of the effect of caffeine on the speed with
which they can respond to a flashing light. Each student takes the test on a morning when they have
had no caffeine, then again a week later on a morning after having had the equivalent of two cups of
coffee. The data is given below, in hundredths of seconds to respond to the light.
Subject             1    2    3    4    5    6    7    8
Without caffeine   12   18   22    9   14   24   21   16
With caffeine      10   14   20    8   14   21   19   14

Does caffeine have an effect? Use α = 5%.
Exercise 4. Do HMO's really reduce costs of care? 40 adults aged 55-60 enrolled in HMO's are
questioned on their health care within the last 2 years. They report an average of 1.19 days hospitalized
during that period, with a standard deviation of 1.4 days. A similar sample of 40 adults
with ordinary healthcare insurance reports an average of 1.35 days with a standard deviation of 1.7
days.
a) Is there evidence of a difference in the variability within the groups? (You don't have enough
information to do Levene's test here, you must use Fisher's.)
b) Is there evidence of a difference in the means for the groups?
Use α = 5% for each test.
Exercise 5. We are comparing math FCAT scores for rural and urban high schools. We have a
random sample of 20 urban high schools and 20 rural high schools. Their school-aggregated math
FCAT scores for 10th graders are summarized below.

Location    n    sample mean    sample standard deviation
Rural      20       1925                  252
Urban      20       1982                  212

a. Use the F' test to say whether it is reasonable to assume that the two populations have the same
variance. Why can you not prove that the variances are equal?
b. Do the means differ significantly in the two groups? Use α = 1%.
Exercise 6. From each of 4 different litters of mice, a researcher chooses two female mice (for a total
of 8 mice). Within each pair of sisters, one is chosen to be fed a standard diet, and the other is fed a
high-protein diet. Their weight in grams, at the end of 6 weeks, is shown below.
Diet            Pair 1   Pair 2   Pair 3   Pair 4
Standard         19.4     18.2     18.5     19.8
High-protein     17.6     19.4     17.2     19.2

Do the different diets seem to affect mean weight? Use α = 5%.
Exercise 7. (A development from Exercise 1.) The researcher wishes to know whether girls and boys
differ in their reaction to arithmetic. 5 girls are recruited, and their blood pressures are tested
resting, and again resting but doing mental arithmetic. 5 boys are tested under the same
circumstances. The data is given below. Is there significant evidence, at α = 5%, that boys and girls
differ in the mean change in BP experienced while doing arithmetic? Note: this experimental design
uses ideas from both paired and two-sample experiments!
             Girls                        Boys
    Resting    During Math      Resting    During Math
      120          130            120          125
      110          115            110          125
      115          115            105          115
      110          120            110          120
      110          115            120          115
Exercise 8. Pedersen (2007, Perceptual and Motor Skills, 104(1), pp 201-211) interviewed a sample
of students enrolled in psychology courses in a large private university in the western U.S. regarding
their attitudes towards sports. Each student was asked to self-rate his or her degree of sport
participation, on a scale of 1 to 5. The 112 men in the sample had M = 4.3 and SD = 1.7. The 173
women had M = 3.6 and S.D. = 1.7. (M is a common abbreviation for the sample mean, and SD a
common abbreviation for the standard deviation.) Is there significant evidence, at α = 1%, that men
and women at this university differ in their mean self-rankings of sport participation?
Exercise 9. Martinussen et al. (2007, J. Criminal Justice 35, 239-249) compared 'burnout' among a
sample of Norwegian police officers to a comparison group of air traffic controllers, journalists and
building constructors. Burnout was measured on three scales: exhaustion, cynicism, and efficacy. The
data is summarized in the table below. The authors state
The overall level of burnout was not high among police compared to other occupational
groups sampled from Norway. In fact, police scored significantly lower on exhaustion and
cynicism than the comparison group, and the difference between groups was largest for
exhaustion.
Substantiate the authors' claim regarding Exhaustion. That is, check that it does show a significant
difference between the two groups.
Summary Statistics for Exercise 9

              Police, n = 222       Comparison group, n = 473
              Mean     std dev        mean      std dev
Exhaustion    1.38      1.14          2.20       1.46
Cynicism      1.50      1.33          1.75       1.34
Efficacy      4.72      0.97          4.69       0.89
SOLUTIONS TO STARRED PROBLEMS
Exercise 1. Notice the existence of a pairing mechanism between items. Each experimental unit (a subject) has two
blood pressures--a resting and a 'during math' blood pressure. This should be done via a paired t-test.
a) D = mean difference in during math – resting blood pressure in the population
Ho: D = 0 versus Ha: D  0.
b) The 5 differences in the sample are: 10 0 20 –5 15. There will be 4 degrees of freedom. We will reject Ho if t
< -2.776 or t> 2.776
c) d̄ = 8 and s_D = 10.368, so t = 8.0 / (10.368 / √5) = 1.725
d) Do not reject Ho. There is no significant evidence that math affects mean blood pressure, at α = 5%.
Further Note. A computer package would not tell you the cutpoints for t. Instead, it would report that the p-value for this
data was .1595. Since .1595 > .05 (your α), you would not reject Ho.
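For readers working in Python rather than a point-and-click package, the same paired test is a single call to scipy's ttest_rel (equivalent to the one-sample t-test on the differences); this is only a cross-check, not part of the original solution.

from scipy import stats

during_math = [125, 125, 130, 115, 125]
resting = [115, 125, 110, 120, 110]

print(stats.ttest_rel(during_math, resting))   # t about 1.725, p about .16; do not reject Ho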
Exercise 2.
Urban group had mean = 78.4, s.d. 1.2458
Suburban group had mean = 70.5667 and s.d. = 4.6779
a) Difference in variability: Ho: σ_S² = σ_U² versus Ha: σ_S² ≠ σ_U²
Fisher's test. Reject if F' > F-table value with 6-1=5 and 6-1=5 df and 2.5% in tail.
F' = 4.6779² / 1.2458² = 14.1. Cutpoint in table is 7.15. Since F' > 7.15, we reject Ho. There is significant evidence of
a difference in variability, at α = 5%. (Note, the tail value is half the desired α for the F' version.)
Levene's test. Absolute values of difference of individual scores from group mean—
Urban: 1.9  1.2  0.9  1.0  0.9  0.3
Suburban: .93  2.83  .63  2.77  7.57  5.93
Running an independent samples t-test (equal variance version) on this data gives t = 2.099. F = 2.099² = 4.406 with 1 and
10 df. Since the cutpoint in the F-table is 4.96 for α = 5% (don't split the α!) we would not reject Ho: that is, we have no
significant evidence of a difference in variability.
b) Fisher's test and Levene's test differ on the advisability of using the equal variance/unequal variance version.
Fortunately, in this case, the answers don't differ. Both versions of t come out to 3.96, which would be significant
whether you use df=10 (equal variance version) or df=5 (smallest df possible under unequal variance version).
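A sketch that reproduces these Exercise 2 calculations, assuming Python with scipy (not part of the original solution):

from scipy import stats

urban = [76.5, 79.6, 77.5, 79.4, 79.3, 78.1]
suburban = [71.5, 73.4, 71.2, 67.8, 63.0, 76.5]

# Fisher's F': larger sample variance over the smaller.
v_u, v_s = stats.tvar(urban), stats.tvar(suburban)   # sample variances
f_prime = max(v_u, v_s) / min(v_u, v_s)
print(f_prime, stats.f.ppf(0.975, 5, 5))             # about 14.1 versus the cutpoint 7.15

# Levene's test with center='mean' matches the hand calculation (F about 4.41).
print(stats.levene(urban, suburban, center='mean'))

# Both versions of the t-test give t of about 3.96.
print(stats.ttest_ind(urban, suburban, equal_var=True))
print(stats.ttest_ind(urban, suburban, equal_var=False))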
SPSS NOTES IF YOU WANT TO GET STARTED ON YOUR OWN!
Step 1. Deciding how to set up your data.
When you double-click on the SPSS icon, the first thing you see is a spreadsheet-like grid for entering your data. This is
called the Data Editor. Before you charge in and start typing, you have to think about how the data is structured. The
format in which you enter the data must follow that structure.
The basic rule-of-thumb is that entries on the same row, or line, are from the SAME subject, or experimental unit.
Things on different lines are from different subjects. Things in different columns are different measurements. Let's see
how that plays out in the Starred Exercises. By double-clicking on the heading of each column, you can change the name
to something sensible, and also indicate whether your data is nominal ('string') or numerical.
Exercise 1. There are 5 different subjects, so our data entry will have five rows. There should be one column for subject
number ('SUBJ'), one for the resting BP ('REST') and one for the during-math BP ('MATH'). In other words, the data
entry will look very like the table given in the problem.
Exercise 2. There are 14 different hospitals. Each will have its own row in the data entry. In addition to a column for
occupancy rate ('O_RATE'), I will need a column which tells me whether it is an Urban or Suburban hospital. I will call
this column 'LOCATION'. Many of the ANOVA and T-test routines in SPSS want group variables to be coded AS IF
they were numeric. I am going to code Urban=0, Suburban=1. Keep notes of the codes you define.
LOCATION    O_RATE
    0        76.5
    0        79.6
   ...        ...
    1        76.5
Step 2. Request the appropriate t-test
Exercise 1.
In SPSS, click on the ANALYZE option at the top. From the drop-down menu, request COMPARE MEANS. Choose the
type of T-test you need, in this case, the PAIRED SAMPLES T-TEST. You will see a 'Dialog Box'.
You need to click on the column names with the two variables you are trying to compare (REST and MATH), and move
them into the big box on the right using the key that looks like an arrow >. Then hit the OK button. You will see printout
like that shown below.
Paired Samples Statistics

                      Mean       N    Std. Deviation    Std. Error Mean
Pair 1   REST      116.0000      5       6.51920            2.91548
         MATH      124.0000      5       5.47723            2.44949

Paired Samples Correlations

                          N    Correlation    Sig.
Pair 1   REST & MATH      5       -.490       .402

Paired Samples Test

Pair 1: REST - MATH
  Paired Differences:  Mean = -8.0000,  Std. Deviation = 10.36822,  Std. Error Mean = 4.63681
  95% Confidence Interval of the Difference:  Lower = -20.8738,  Upper = 4.8738
  t = -1.725,   df = 4,   Sig. (2-tailed) = .160
The first panel gives you some of the summary statistics within each group. The last panel reports the results of the t-test.
The p-value is labeled Sig., which is short for 'Observed Significance Level'. Since the .16 is greater than your α, you do
not have significant evidence of a math effect.
Exercise 2. From the ANALYZE / COMPARE MEANS menu, choose INDEPENDENT SAMPLES T-TEST. You need
to click on O_RATE and use the > key to move it into the Test Variable(s) box. You need to click on LOCATION and move it
into the Grouping Variable box. Then hit OK.
Group Statistics

          LOCATION     N      Mean      Std. Deviation    Std. Error Mean
O_RATE       0         6    78.4000        1.24579            .50859
             1         6    70.5667        4.67789           1.90974

Independent Samples Test

Levene's Test for Equality of Variances:  F = 4.406,  Sig. = .062

t-test for Equality of Means (O_RATE):
                                                                                    95% Confidence Interval
                              t       df     Sig. (2-tailed)   Mean        Std. Error    of the Difference
                                                               Difference  Difference    Lower       Upper
Equal variances assumed     3.964     10          .003          7.8333      1.97630     3.42985    12.23681
Equal variances not assumed 3.964    5.706        .008          7.8333      1.97630     2.93638    12.73029
Note that SPSS automatically gives you Levene's test to help you choose the version of the t-test. The p-value is once
again labeled Sig. in SPSS.
BOXPLOTS
Simple boxplots use a box to mark off the middle 50% of the data. The box extends from the first
quartile to the third quartile, with a thick mark at the median. The purpose of the box is to draw your
eye to the central 'typical' half of the data. The lowest 25% of the data is marked off by a whisker that
extends from the minimum value to the first quartile. The highest 25% of the data is marked off by a
whisker that extends from the third quartile (75th percentile) to the maximum.
Modified boxplots alter the whiskers to draw your attention to outliers, or wild values in the data set.
To define outliers, the computer calculates a value called the hinge width, which is 1.5 x (75th
percentile - 25th percentile ) = 1.5 x length of 'box'. Any value lying more than one hinge width
ABOVE the 75th percentile is an outlier on the high side. Any value lying more than one hinge width
BELOW the 25th percentile is an outlier on the low side. Modified boxplots draw the whiskers from
the quartile to the most extreme value that is not an outlier. Any outliers are marked off with a
separate symbol.
Boxplots give you a quick view as to whether typical values (denoted by the boxes) are changing.
They also help you see whether the spreads (variances) are relatively stable. They can also help you
diagnose non-normality, by helping you spot asymmetries or outliers.
Outliers are wild, unusual values. Normally distributed data should have very few, if any, outliers.
A very large data set might reasonably have a few (1%?) outliers without causing harm, but very
severe or frequent outliers can cause statistical trouble. Moreover, outliers are of interest in their
own right -- what causes these people to be so different from the rest?
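The outlier rule above is easy to apply by hand or in software. A sketch with a small hypothetical data set, assuming Python with numpy (packages differ slightly in how they compute the quartiles themselves):

import numpy as np

x = np.array([3.1, 4.2, 4.8, 5.0, 5.3, 5.5, 5.9, 6.2, 6.4, 14.0])   # hypothetical data

q1, q3 = np.percentile(x, [25, 75])            # 25th and 75th percentiles
hinge_width = 1.5 * (q3 - q1)                  # 1.5 x length of the 'box'
low_fence, high_fence = q1 - hinge_width, q3 + hinge_width

outliers = x[(x < low_fence) | (x > high_fence)]
print(q1, q3, low_fence, high_fence, outliers)   # 14.0 is flagged as a high-side outlier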
Example of Boxplots The typical values are higher in Group 1 than in Group 2. The spreads are
similar, except that Group 2 has an outlier with an unusually large value.
[Figure: SPSS side-by-side boxplots of X by GROUP, n = 15 in each group.]