Selected Nonparametric Statistics
Categorical Data Analysis Packet CD05
Dale Berger, Claremont Graduate University ([email protected])
Statistics website: http://wise.cgu.edu

Contents:
Counting Rules
Binomial Distribution
D11: Wilcoxon Ws and Mann-Whitney U
D12: Comparing two groups with SPSS (t, Wilcoxon Ws, Mann-Whitney U, Median)
D13: Wilcoxon T for paired data
D14a: SPSS CROSSTABS Statistics for 2x2 Contingency Tables
D14b: SPSS CROSSTABS analyses for larger contingency tables
D15: McNemar's test of related proportions
D16: Spearman r and SPSS
Table D (Binomial) and Table P (Spearman r) from Siegel (1956) Nonparametric Statistics
Critical values for Spearman r from Ramsey (1989) Journal of Educational Statistics
Mann-Whitney U Table from Kirk (1978) Introductory Statistics
Wilcoxon T Table from Kirk (1978) Introductory Statistics
Table F (Runs tests – too few or too many) from Siegel (1956) Nonparametric Statistics

Counting Rules

Rule 1: If any one of k mutually exclusive and exhaustive events can occur on each of n trials, then there are k^n different sequences that may result from a set of trials.

Example: Toss a coin 4 times. How many possible outcomes are there? One possible outcome may be represented by HTTH. The total number of possible outcomes can be illustrated by a branching diagram. There are two possibilities for the first coin. For each of these possibilities there are 2 possibilities for the second coin, giving a total of 4 distinct two-coin sequences (HH, HT, TH, and TT). For each of these two-coin sequences, there are 2 possible outcomes for the third coin, giving 4x2 = 8 possible three-coin outcomes. Similarly, for each of the 8 three-coin sequences, there are 2 possible outcomes for the fourth coin, giving 8x2 = 16 distinct four-coin sequences.
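Rule 1 is easy to verify by brute force. Here is a minimal Python sketch (my addition, not part of the original handout) that enumerates the coin sequences with itertools.product:

```python
from itertools import product

# Rule 1: with k = 2 outcomes (H or T) on each of n = 4 tosses,
# there are k**n distinct sequences.
sequences = ["".join(seq) for seq in product("HT", repeat=4)]
print(len(sequences))       # 16, i.e., 2**4
print("HTTH" in sequences)  # True

# If all sequences are equally likely, each has probability 1/k**n.
print(1 / 2**4)             # 0.0625
```

The same enumeration idea scales to the multiple-choice example, although with 4^10 sequences you would compute the count rather than list them.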
If we apply Rule 1 with k=2 (i.e., heads or tails) and n=4 (four coin tosses) we obtain k^n = 2^4 = 2 x 2 x 2 x 2 = 16. Suppose each outcome is equally likely to occur. Then the probability of any particular sequence is 1/k^n. What is the probability of four heads on four coin tosses? (1/k^n = 1/2^4 = 1/16 = .0625.)

Example: How many distinct ways are there to answer a 10-item multiple choice test with four alternatives on each item? (Answer: k=4 and n=10, k^n = 4^10 = 1,048,576.) What is the probability that someone who is purely guessing will score all 10 correct on this test? (Only one sequence is totally correct, so the probability is 1/1,048,576 = .00000095367.)

Rule 2: If we have n trials where the number of different events that can occur on trials 1, 2, 3, …, n are k1, k2, k3, …, kn respectively, then the number of distinct outcomes from the n trials is (k1)(k2)(k3)…(kn).

Example: Suppose we have a task with a sequence of three choice points. At the first point we have two choices, at the second point we have three choices, and at the third point we have four choices. How many different ways might we complete the task? [Answer: (2)(3)(4) = 24.] What is the probability of any one specific sequence if all choices are equally likely at each step? [Answer: 1/24 = .0417.] Note that Rule 1 is a special case of Rule 2, where k1 = k2 = k3 = … = kn = k.

Rule 3: The number of different ways that n distinct objects may be arranged in order is (n)(n-1)(n-2)…(3)(2)(1). This product is called n-factorial, symbolized by n!. 0! is defined to be equal to 1. Any particular arrangement of n objects is called a permutation. Thus, the total number of permutations for n objects is n!.

Example: You are the judge for a pie baking contest, and it is your task to rank the three finalists: Apple, Banana, and Cream. How many distinct orders are possible? Applying the formula, n! = 3! = 3x2x1 = 6. The six possible orders are ABC, ACB, BAC, BCA, CAB, and CBA.
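Rules 2 and 3 can be checked the same way; a short Python sketch (mine, not from the handout):

```python
from itertools import permutations
from math import factorial, prod

# Rule 2: 2, 3, and 4 choices at three successive choice points.
print(prod([2, 3, 4]))   # 24 distinct ways to complete the task
print(round(1 / 24, 4))  # 0.0417, probability of any one sequence

# Rule 3: permutations (orderings) of the three pies.
orders = list(permutations("ABC"))
print(len(orders), factorial(3))  # 6 6
```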
There are three possible ways to fill the first place. After first place is assigned, there are two pies left, or two ways to fill the second place. Thus, for each of the three possible ways to fill first place, there are two ways to fill second place, giving 3x2 ways to fill the first two places. For each choice of first and second place, there is only one pie left for third place; the total number of ways to rank the three pies is 3x2x1 = 3! = 6.

Rule 4: The number of ways of selecting and arranging r objects from N distinct objects is NPr = N! / (N-r)! = "Permutations of N objects r at a time."

Example: Given a set of 5 different cards, how many ways can you and I each choose one card? The first card chosen can be any one of the 5 cards. After I have chosen my card, there are 4 cards left for you to choose from, so there is a total of 5x4 = 20 ways in which we can choose two cards. We have selected and arranged r=2 objects from a set of N=5 objects, so we can calculate NPr = 5! / (5-2)! = 5! / 3! = (5x4x3x2x1) / (3x2x1) = 5x4 = 20.

Example: How many ways can we select 3 students from a class of 10 to fill the offices of President, Vice-President, and Secretary? The first office can be filled by any one of 10 people. After this office is filled, we must choose from among the remaining 9 people to fill the second office. Finally, the third office can be filled by any one of 8 people. This gives a total of 10x9x8 = 720 ways to fill the three offices. Notice that the descending product is the first part of 10!, with the last part (7!) missing. The expression 10x9x8 can be written as 10! / 7!, which is 10P3.

Rule 5: The number of ways of selecting a sample of r objects from a set of N distinct objects is NCr = N! / [r!(N-r)!] = "Combinations of N objects r at a time."

Example: How many ways can you select two cards from a deck of 5 distinct cards, with no concern for order?
Let us look again at the situation described in Rule 4 where you and I each selected a card from a deck of 5 distinct cards. There are 5!/3! = 20 ordered pairs. If the Ace and King were drawn, we counted AK and KA as two separate outcomes because we were concerned with order. Thus, each pair of cards was counted twice. If we want only the number of possible pairs with no concern for order, we must divide the number of ordered pairs by 2!, the number of different orders for the two cards. This gives us 20/2 = 10 distinct pairs. If we apply Rule 5 we get 5C2 = 5! / [(2!)(3!)] = 120 / [(2)(6)] = 10. Example: Given a group of 10 people, how many ways can we choose 3 to form a committee with no regard for order of selection? We have already found the number of ordered groups of 3 people. By applying Rule 4 we found 10!/7! = 720 ordered groups of 3. The number of unordered committees must be less than this number because for every unordered group of 3 people, there are 3! = 6 orders. In general, because a group of r objects can be ordered in r! ways, there are r! times as many ordered groupings as there are different groups not considering order. If we divide the number of ordered groups given to us by Rule 4 by r!, we have Rule 5, the number of groups of size r not considering order, N!/[(N-r)!r!]. Note that the number of groups of r objects selected from N objects is equal to the number of groups of N-r objects. This is because for each group of size r, the remaining objects form a group of size N-r. Thus, 100C3 = 100C97 = 100! / [97! x 3!]. Rule 6: The number of distinct orders of N objects consisting of k groups of N1, N2, …, Nk indistinguishable objects is N! / [N1! N2! … Nk!]. Example: How many ways can the letters A,B,B,C,C,C be arranged if we can’t distinguish among like letters? If all of the 6 letters were distinct, there would be 6! distinct arrangements. But for any specific arrangement of the letters, such as ACBCBC, the Bs can be arranged in 2! 
ways and the Cs can be arranged in 3! ways. Each of the now indistinguishable orders is counted when we use 6!, so we must divide by both 2! and 3!, giving 6!/[2!3!] = 720/[2x6] = 60.

Example: How many distinct ways can you rearrange the letters in MISSISSIPPI? N=11; NM=1; NI=4; NS=4; NP=2. 11!/[1!4!4!2!] = 34,650.

Example: How many distinct ways could you have 4 items correct on a 10-item test? That is, how many distinct arrangements are there of 4 Cs and 6 Ws? 10!/[6!4!] = (10x9x8x7x6x5x4x3x2x1) / [(6x5x4x3x2x1)x(4x3x2x1)] = 3,628,800 / (720x24) = 210.

Binomial Distribution

When we have a dichotomous event (two mutually exclusive possibilities) such as male or female, success or failure, Republican or Democrat, open or closed mind, etc. and we have multiple independent observations, we may be able to make use of the Binomial Distribution (bi = two; nominal = names).

Consider a multiple-choice examination where each item has 4 choices, only one of which is correct. If all choices are equally likely, then we would expect a student who knows absolutely nothing about the subject to be correct on about one item in four simply by guessing. In general, we let p be the probability of a success and 1-p = q be the probability of a failure. If the test has four choices on each item, is in Russian, and a student taking the test knows no Russian, we would expect p=1/4 and q=3/4.

(1) For a test with a single item (n=1), there are two possible outcomes:
He is correct (C) with probability p (here p = 1/4)
He is wrong (W) with probability q (here q = 3/4)

(2) If each item is independent of all others, then for a test with n=2 items, we have four possible outcomes:

Item 1  Item 2  Probability
C       C       p x p = p^2 = 1/4 x 1/4 = 1/16
C       W       p x q = pq  = 1/4 x 3/4 = 3/16
W       C       q x p = pq  = 3/4 x 1/4 = 3/16
W       W       q x q = q^2 = 3/4 x 3/4 = 9/16

Because these four outcomes are mutually exclusive and exhaustive, the sum of their probabilities is equal to 1.000.
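The permutation, combination, and arrangement counts from Rules 4 through 6, and the two-item probabilities, can all be reproduced with Python's math module (a sketch of mine, not part of the handout):

```python
from math import comb, factorial, perm

# Rule 4: permutations of N objects r at a time, N!/(N-r)!.
print(perm(5, 2), perm(10, 3))        # 20 720

# Rule 5: combinations divide out the r! orders within each group.
print(comb(5, 2), comb(10, 3))        # 10 120
print(comb(100, 3) == comb(100, 97))  # True

# Rule 6: distinct orders of MISSISSIPPI (1 M, 4 I's, 4 S's, 2 P's).
n_orders = factorial(11) // (factorial(1) * factorial(4) * factorial(4) * factorial(2))
print(n_orders)                       # 34650
print(comb(10, 4))                    # 210 arrangements of 4 C's and 6 W's

# Two-item test with p = 1/4: the four outcome probabilities sum to 1.
p, q = 1/4, 3/4
print(p*p + p*q + q*p + q*q)          # 1.0
```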
p^2 + 2pq + q^2 = (p + q)^2 = 1 because (p + q) = 1. Also, 1/16 + 3/16 + 3/16 + 9/16 = 16/16 = 1.00.

Suppose we are concerned with the number of items a student might have correct on this two-item test. Let's call this number X. Then X is a random variable which may take on the values 0, 1, or 2, and there are probabilities associated with each of these values. We could construct a sampling distribution for X as follows:

Number of Successes x   P(X=x)   This example
0                       q^2      9/16 = .5625
1                       2pq      6/16 = .3750
2                       p^2      1/16 = .0625

Question: What is the probability that a person who is guessing on this two-item test will have exactly one item correct? Answer: p(X=1) = 2pq = 6/16 = .3750 in this example. Note that this probability comes from summing the probabilities associated with the outcomes CW and WC, the two ways of having exactly one success.

Let's look at the 8 possible outcomes from a three-item test (n=3).

Items 1 2 3   Sequence probability
C C C         p x p x p = p^3
W C C         q x p x p = p^2 q
C W C         p x q x p = p^2 q
C C W         p x p x q = p^2 q
C W W         p x q x q = p q^2
W C W         q x p x q = p q^2
W W C         q x q x p = p q^2
W W W         q x q x q = q^3

Number correct   Number of ways   Prob.    This example
3                1                p^3      .0156
2                3                3p^2q    .1406
1                3                3pq^2    .4219
0                1                q^3      .4219

The probabilities for these mutually exclusive events can be added to show that they sum to 1: p^3 + 3p^2q + 3pq^2 + q^3 = (p + q)^3 = 1 because (p + q) = 1. The first term in the above sum gives the probability of observing three successes. The second term has two parts; the p^2q is the probability of observing any particular sequence with two successes and one failure, and the coefficient of 3 represents the number of different sequences consisting of two successes and one failure. This coefficient could also be obtained by calculating the number of ways in which we could choose x=2 items to be correct out of the n=3 items. This can be expressed as the number of combinations of 2 chosen from a group of 3, or 3C2 = 3!/(2!1!) = 6/(2x1) = 3.
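The coefficient-times-probability pattern just described can be computed directly for any x; a small Python check of mine (not from the handout), using p = 1/4 as in the guessing example:

```python
from math import comb

def binom_pmf(x, n, p):
    # Probability of exactly x successes in n independent trials.
    return comb(n, x) * p**x * (1 - p)**(n - x)

p = 1/4  # chance of guessing one four-choice item correctly

# Three-item test: matches the 8-outcome table collapsed by number correct.
for x in (3, 2, 1, 0):
    print(x, round(binom_pmf(x, 3, p), 4))
# 3 0.0156 / 2 0.1406 / 1 0.4219 / 0 0.4219

# The four probabilities are mutually exclusive and exhaustive:
print(sum(binom_pmf(x, 3, p) for x in range(4)))  # 1.0
```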
Similarly, the coefficient for the third term in the sum represents the number of ways 1 item from a group of 3 can be correct, or 3!/(1!2!) = 3. To continue the pattern, the coefficient for the last term is 1, which is the number of ways of getting 0 correct on a 3-item test, or 3!/(0!3!) = 6/(1x6) = 1. Thus, each term in the sum gives the probability of getting x successes and can be expressed in the general form 3Cx p^x q^(3-x). We could thus write our sum: 3C0 p^0 q^3 + 3C1 p^1 q^2 + 3C2 p^2 q^1 + 3C3 p^3 q^0 = q^3 + 3pq^2 + 3p^2q + p^3.

In general, the exact probability associated with exactly x successes out of n trials in any situation where trials are independent and p remains the same on each trial can be calculated with the following expression:

p(X=x) = nCx p^x q^(n-x)  [** This is the very useful Binomial Formula]

where nCx is n!/[x!(n-x)!], "combinations of n objects taken x at a time" (Counting Rule 5).

Example: Find the probability that a person guesses correctly on every item in a five-item test where there are four choices on each item. With n=5, x=5, and p=1/4, we find that nCx p^x q^(n-x) = 5C5 (1/4)^5 (3/4)^0 = (1/4)^5 = 1/1024 = .0010.

Example: Find the probability that our person is correct exactly four times and wrong only once. Applying the formula with n=5, x=4, and p=1/4: nCx p^x q^(n-x) = 5C4 (1/4)^4 (3/4)^1 = 5 (1/4)^4 (3/4)^1 = 15/1024 = .0146. Note that for any one particular sequence of 4 correct and 1 wrong, the probability is p^4 q^1, and there are 5 ways to get such a sequence: (CCCCW, CCCWC, CCWCC, CWCCC, WCCCC).

If we let X represent the number correct on the five-item test, then X is a random variable. We have just calculated the probabilities associated with two of the possible values of X, p(X=5) = .0010 and p(X=4) = .0146. To complete the sampling distribution for X when n=5, we must calculate the probabilities that X = 3, 2, 1, or 0.

Number of Successes x   Probability of Sequence   Number of Ways   Binomial Dist. p(X=x)   Probability if p=1/4, q=3/4
5                       p^5                       1                p^5                     .0010
4                       p^4 q^1                   5                5p^4q                   .0146
3                       p^3 q^2                   10               10p^3q^2                .0879
2                       p^2 q^3                   10               10p^2q^3                .2637
1                       p^1 q^4                   5                5pq^4                   .3955
0                       q^5                       1                q^5                     .2373
                                                                                          1.0000

This table can be used to answer questions regarding the likelihood of getting any particular number of items right by chance. For example, what is the probability that a person will get exactly three items correct on this test if she is guessing? Answer: p(X=3) = .0879.

This table can also be helpful in making inferences about performance on the test. For example, suppose a job applicant says she can read Russian. You give her this 5-item Russian test, and she gets all five correct. Do you think she can read Russian? (How likely is it that she would get such a good score if she were guessing?) [Answer: The probability that someone would get all five items correct by chance is only .0010. It seems likely that she can read Russian.]

D11: Wilcoxon Ws and Mann-Whitney U

Sometimes we wish to compare performance of two groups but our data do not satisfy the normality assumptions of the parametric t-test, and we have a small sample size, so the sampling distribution of means may not be close to normal. A nonparametric test may do the job for us.

Suppose we wish to compare an Experimental group (E) with a Control group (C) on the number of downloads from a research site in the past week. We have three randomly sampled observations from E (7, 12, 86) and four randomly selected observations from C (0, 4, 6, 10). Can we conduct a t-test for independent groups? Sure, the computer won't care. SPSS gives us the mean for sample E = 35.0 with SD = 44.2, and the mean for sample C = 5.0 with SD = 4.16, and t(5) = 1.395, p = .222. Have we satisfied the mathematical assumptions of the t-test? If we use the SPSS estimate for t with variances not assumed equal we get t(df = 2.027) = 1.171, p = .361. Is this test valid? The plot below may be helpful to assess assumptions.
Do we believe that the sampling distribution of the difference between means is reasonably normal? Obviously, that is not likely because of the outlier and small sample sizes.

[Number line plot, 0 to 80+: C scores at 0, 4, 6, and 10; E scores at 7, 12, and far to the right at 86.]

The Mann-Whitney U test is based on the rank order of the observations, not on the scale values. The null hypothesis is that if a score is randomly chosen from each population, p(E > C) = 1/2. That is, the two randomly chosen scores are as likely to be ordered E > C as C > E. We observed the order C C C E C E E. What is the probability that the seven scores would be ordered with E having as much or more of an advantage over C than this if the null hypothesis is true, i.e., p(E > C) = 1/2?

Let's calculate this by hand. How many distinct ways can we order three Es and four Cs? Counting Rule 6, described earlier in this handout, shows why this is (N1 + N2)! / (N1!*N2!) = 7!/(3!4!) = (7*6*5*4!)/(3!*4!) = 35.

What is the probability that the seven scores would be ordered C C C C E E E? 1/35 = .0286
What is the probability that the seven scores would be ordered C C C E C E E? 1/35 = .0286

This gives us a one-tailed probability of .0286 + .0286 = .0572 and two-tailed p = .1144.

To compute the Mann-Whitney U statistic, we count the number of C scores that exceed each E score, and total them. Here we find U=1. This can also be seen as the number of reversals of adjacent pairs needed to reach perfect separation, CCCCEEE. Alternatively, if we counted the number of E values that exceed each C, we would find UE = 3+3+3+2 = 11. UC = (N1*N2) – UE = (3*4) – 11 = 12 – 11 = 1. Only one C exceeds one E. Mann-Whitney U is the smaller of UE and UC. There are tables for U in many books. For example, when N1=3 and N2=4, Siegel (1956) gives p=.057 for U=1, and p=.028 for U=0. These are one-tailed p values. A descriptive effect size is 'Probability of Superiority' = 1 – U/(N1N2).
PS = 1 – 1/(3*4) = 1 – 1/12 = .917.

Wilcoxon Ws is an alternate and equivalent statistical test based on ranks. We simply order all of the scores and find the sum of ranks for scores in the smaller group. A source of confusion is that ranking can be done from either end, with the largest value given the rank of 1 or the smallest value given the rank of 1. Ws is defined as the smaller of these two values. If we made the wrong choice we can apply a formula to our sum to find Ws. Important note: SPSS doesn't necessarily report the smallest Wilcoxon Ws value, but it reports Mann-Whitney U correctly. SPSS gives a wrong Ws when the sum of ranks is smaller for the larger group.

Score  Group  Rank1  Rank2
0      C      1      7
4      C      2      6
6      C      3      5
7      E      4      4
10     C      5      3
12     E      6      2
86     E      7      1

Sums of ranks: R1C = 1+2+3+5 = 11; R1E = 4+6+7 = 17; R2C = 7+6+5+3 = 21; R2E = 4+2+1 = 7. (Each ranking sums to 28 in total.)

We have four possible values for the sum of ranks: 11, 17, 21, and 7. We define Ws as the smallest sum, so Ws = 7. This will be the smaller of the two possible sums of ranks for the smaller of the two groups. SPSS erroneously reports Ws=11, computed as the smaller sum of ranks for the two groups (R1C), where the smallest number is ranked 1 (see Rank1, with corresponding R1C and R1E). If the smaller group has larger ranks, as we have here, we can find the correct Ws value by reversing the ranking, assigning a rank of 1 to the largest number (see Rank2). Then the smaller sum of ranks is R2E = 7, the correct value for Ws.

Conventionally, we call the smaller sample size N1 and the larger sample size N2. Here N1=3 and N2=4, and the sum of ranks for the smaller group = R1E = 17 = Ws'. We can convert Ws' to the correct Ws value with the formula Ws = 2W̄ – Ws', where 2W̄ (twice the mean of Ws) = N1*(N1+N2+1). Thus, Ws = N1*(N1+N2+1) – Ws' = 3*(3+4+1) – 17 = 24 – 17 = 7.

Here is Table Ws from Howell for N1=3, showing the critical values for Ws (a dash means no Ws value attains that significance level):

       One-tailed p values
N2     .010   .025   .05   .10    2W̄
3       -      -      6     7     21
4       -      -      6     7     24
5       -      6      7     8     27
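All of the hand computations for this example (U, the exact one-tailed p, Ws, and the large-sample z) can be reproduced in a few lines of Python; the sketch below is my illustration, not part of the handout:

```python
from itertools import combinations
from math import sqrt

E = [7, 12, 86]    # Experimental scores
C = [0, 4, 6, 10]  # Control scores
n1, n2 = len(E), len(C)

# U: count C scores exceeding each E, and E scores exceeding each C.
u_c = sum(c > e for e in E for c in C)
u_e = sum(e > c for e in E for c in C)
U = min(u_c, u_e)
print(u_e, u_c, U)  # 11 1 1

# Exact one-tailed p: of the 7C3 = 35 equally likely placements of the
# three E ranks among seven positions, count those at least as extreme.
extreme = sum(
    1 for e_pos in combinations(range(7), 3)
    if sum(c > e for e in e_pos for c in set(range(7)) - set(e_pos)) <= U
)
print(extreme, round(extreme / 35, 4))  # 2 0.0571

# Wilcoxon Ws from U: Ws = N1(N1+1)/2 + U.
Ws = n1 * (n1 + 1) // 2 + U
print(Ws)  # 7

# Large-sample normal approximation (not valid for N1 = 3; shown only to
# reproduce the 'asymptotic' z that SPSS reports).
mean_ws = n1 * (n1 + n2 + 1) / 2
sd_ws = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
print(mean_ws, round(sd_ws, 3), round((Ws - mean_ws) / sd_ws, 3))  # 12.0 2.828 -1.768
```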
Smaller values for Ws have smaller p values. Our Ws of 7 with N2=4 gives one-tailed p<.10 but not p<.05. The minimum Ws value when N1=3 is 6 (i.e., 1+2+3).

With large samples (e.g., N1>25), Ws approaches a normal distribution with mean N1(N1+N2+1)/2 and standard deviation sqrt[N1*N2*(N1+N2+1)/12]. In our example, the mean of Ws = 3(3+4+1)/2 = 12 and SD(Ws) = sqrt[3*4*(3+4+1)/12] = sqrt(8) = 2.828. However, with N1=3, which is much smaller than 25, the normal approximation is not valid. Here, this invalid computation yields Z = (7 – 12)/2.828 = -1.768, p = .077 two-tailed. This is what SPSS reports as the 'asymptotic significance.' The correct value, which we computed earlier, is p = .0572, one-tailed, which is p = .1144 two-tailed. We could also find the correct value from a U table, which shows p = .057 one-tailed.

SPSS gives the correct value for U, so we can use the following formula to compute the correct value for Ws: Ws = (N1)(N1+1)/2 + U. In our example, U=1. This gives Ws = (3)(3+1)/2 + 1 = 6 + 1 = 7. This is the correct value for Ws, though SPSS reports W=11. SPSS also reports the Exact Sig. [2*(1-tailed Sig.)] = .114, which is correct.

Lesson 1a: Be careful reporting Wilcoxon W from SPSS.
Lesson 1b: Even major computer programs can be wrong.

D12: Comparing two groups with SPSS: parametric t, nonparametric Wilcoxon Ws and Mann-Whitney U, Median test

One goal of an employment skills training program is to increase the number of applications to potential employers. Data on the number of applications submitted in the past two weeks are available from 12 graduates of the program and 16 comparable control cases who have not yet taken the training.
Here are the data on number of applications:
Control group: 0, 0, 0, 0, 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 6, 134
Training group: 0, 2, 5, 7, 8, 9, 10, 10, 10, 10, 11, 11

Our task is to enter the data into SPSS and conduct appropriate analyses (and maybe some inappropriate analyses for comparison). First, let us enter the data into SPSS. Call up SPSS. Check the circle that says we will Type in data. A spreadsheet opens and we are ready to begin. Following the usual SPSS protocol, we enter data into the spreadsheet so that each case is on a separate line, and the columns are the variables. In our example, we will define three variables. The first is an ID code, so that we can locate and refer to any specific case easily. The second variable is the group code, indicating which of the two groups the case is in. The third variable is the dependent measure, the number of applications.

On the bottom of the spreadsheet are two tabs. One is labeled Data View and the other Variable View. Click the Variable View tab. Let us begin by entering a name for each of our three variables. Under the column headed Name, enter id in the first row, group in the second row, and applic in the third row (a limitation of SPSS 12 and earlier versions is that variable names can be no longer than 8 characters, giving rise to arcane variable names known as SPSSese). Next we supply information about each of the variables. SPSS supplies some default information, but we must make sure the information is correct for our application. The default is that each variable is 'numeric,' 8 characters wide, with 2 decimals. We don't need any decimal places, so we can set the decimals to zero. Click on a box under Decimals and enter 0. You can also use the little up and down arrows to increase or decrease the number of decimals. We will provide a label for each variable. Under the Label heading, click a box to open.
Enter the label ID number in the first row, Training Group in the second row, and Number of applications in the third row. Value labels are useful for categorical variables, such as Training Group. Click the box in the second row under Values, and click on the little gray box with three dots that appears. This will open the Value Labels window. Let us use 0 to indicate the control group and 1 to indicate the training group. The cursor begins in the Value window. Enter 0 and press Tab or the down arrow key to move the cursor to the Value Label window. Don't press Enter or click OK before you are ready to leave this window. Enter Control and click the Add button. Enter 1, press Tab, enter Training, click Add, and then click OK.

Now we are ready to enter our data. Click the Data View tab at the bottom of the work sheet. Under ID, enter 1, press Enter, enter 2, etc. sequentially down to 28 in row 28. Next we enter the group code for each case. The first 16 cases are in the control group, so they each have the value of 0. Under the group column, in row 1 enter a 0. You can enter a 0 in each row down to row 16. An easier method is to click on another cell, then right-click on the 0, select Copy, highlight the column down to row 16, right-click on the highlight, and select Paste. Enter 1 in row 17 and copy 1 into rows 18 through 28. Finally, we enter the number of applications for each case in turn. Important: be sure to check all of your data to make sure they are correct before you go on.

Now we are ready to analyze the data and work the magic of SPSS. First, let's do a Bumble. Our friend Bumble did this analysis in his sleep. (He often does analyses when less than fully awake, and sometimes he gets it right.) There are two independent groups, and we wish to compare the number of applications submitted – Bumble always does a t-test with data like this.
t-test for independent groups

From the menu bar at the top, click Analyze, select Compare Means, select Independent-Samples T Test… to open a new window. Here we specify the variables we will use in the t-test. In the left window, click on Number of applications to highlight the variable. Then click the arrow to move the variable into the Test variable(s): window. Next click on Training Group, and move it into the bottom Grouping Variable: box. Click the Define Groups… button to open a new window. Enter 0 for Group 1 and enter 1 for Group 2, and click Continue. Click OK, and watch SPSS do its thing, automatically opening the Output - SPSS Viewer window and displaying your output.

Group Statistics (Number of applications)

Training Group   N    Mean    Std. Deviation   Std. Error Mean
Control          16   10.63   32.956           8.239
Training         12    7.75    3.621           1.045

The first table provides summary statistics. We see that the number of cases is correct for each group. The average number of applications is actually larger for the control group (10.63) than for the training group (7.75) in our sample, and we see that the standard deviation also is larger in the control group.

Independent Samples Test (Number of applications)

                              Levene's F   Sig.   t      df       Sig. (2-tailed)   Mean Diff.   Std. Error Diff.   95% CI (Lower, Upper)
Equal variances assumed       2.25         .145   .299   26       .767              2.88         9.602              (-16.861, 22.611)
Equal variances not assumed                       .346   15.481   .734              2.88         8.305              (-14.779, 20.529)

The second table gives the results for Levene's Test for Equality of Variances and two different t-tests. Levene's test is not statistically significant (p=.145), so the assumption of equal variance cannot be rejected statistically. [Note: This test is of limited value because it is most sensitive with very large samples, when violation of the assumption of equal variance is less important.
Furthermore, SPSS provides an adjusted t-test for which we do not assume equal variance.] The first t-test is the standard t-test which assumes that the population variances are equal in the two populations represented by the two samples. The standard t-test has df=26, and the resulting t is not statistically significant, t(26) = .299, p=.767. The difference in means is 10.63-7.75 = 2.88, with a standard error of 9.602. This gives the t-value of 2.88/9.602 = .299. The 95% confidence interval ranges from -16.861 to +22.611. The adjusted t-test not assuming equal variances gives similar results, with df=15.481. Given the large difference in variance, we should use the test that does not assume equal variance. Bumble reported that there is no significant effect of training, t(26) = .299, p=.767. Bumble concluded that although the sample mean was larger for the control group, we can’t be confident that the population mean for the control group is larger than the population mean for the training group, because the confidence interval for the difference in population means includes both positive and negative values. Q1: Is Bumble’s conclusion correct? Q2: What advice do you have for Bumble? (Hint: What is the first thing you would do as a data analyst?) Ans Q1: If all assumptions of the t-test are satisfied, then Bumble’s conclusion is correct. Although the training group produced a lower mean in this sample, the groups do not differ sufficiently for us to be confident that training leads to poorer performance. It could well be that training actually leads to better performance in the population. However, if assumptions are violated, then the statistical test is suspect, and the results may be misleading. Because Bumble did not check the validity of assumptions, his reasoning is invalid (although his conclusion may be correct). Ans Q2: The first thing Bumble should do is look at the data! 
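For reference, Bumble's equal-variance t can be reproduced without SPSS. The following Python sketch (my addition, standard library only) computes the group statistics and the pooled t from the raw data:

```python
from math import sqrt
from statistics import mean, stdev

control = [0, 0, 0, 0, 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 6, 134]
training = [0, 2, 5, 7, 8, 9, 10, 10, 10, 10, 11, 11]
n1, n2 = len(control), len(training)

m1, m2 = mean(control), mean(training)
s1, s2 = stdev(control), stdev(training)
print(m1, m2)                      # 10.625 7.75
print(round(s1, 3), round(s2, 3))  # 32.956 3.621

# Standard independent-groups t with pooled (equal) variances.
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se = sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (m1 - m2) / se
print(round(se, 3), round(t, 3))   # 9.602 0.299
```

This matches the SPSS output above (t(26) = .299, SE = 9.602), which confirms where those numbers come from; it does not, of course, fix the violated normality assumption.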
A fundamental principle of data analysis is that your statistical models should be appropriate for the data. We need to look at the data carefully to make sure that our models are appropriate.

Simple graphs and summary statistics provide useful diagnostics. SPSS Frequencies is an especially useful tool. On the top menu bar, click Analyze, select Descriptive Statistics, select Frequencies… In the new window, click on applic, click the arrow to move it into the Variable(s) window. Click the Statistics… button to open a new window. Let's select Mean, Median, Skewness, Kurtosis, Minimum, Maximum, and Std. Deviation and click Continue. Now click Charts…, select Histograms, click Continue. Now click Format… and select Suppress tables with many categories and click Continue. If there are many possible values for a variable (e.g., >10), we may wish to suppress the long frequency list and instead focus on the histogram. Click OK. To judge whether the population distributions are reasonably normal, we can apply an 'intra-ocular trauma' test to the histogram.

Statistics: Number of applications
N: 28 valid, 0 missing
Mean: 9.39     Median: 4.50     Std. Deviation: 24.715
Skewness: 5.092 (Std. Error: .441)
Kurtosis: 26.538 (Std. Error: .858)
Minimum: 0     Maximum: 134

OUCH! Our eyes are traumatized by this histogram. The distribution of the Number of Applications is clearly far from normal. The summary statistics show that the maximum value is 134, while the other scores are below 20. Bumble's t-test is quite suspect because of the gross violation of the assumption of normality of the sampling distribution of means. What should we do?

First, let's find a better way to look at the data. The automatic scaling in SPSS obscures details of the shape of the data. There is one very large score that causes SPSS to form large bin intervals for the plot, and we lose the shape for cases that are closer together.
The cases at the lower end of the distribution all fall into one interval. To get a better look at the shape of the distribution of cases with values less than 100, we can select Data, Select Cases…, select If condition is satisfied, If…, and define a selection rule. Click on applic, click the arrow to move applic into the window on the right, <, 100, Continue, and then OK.

It is useful to generate separate histograms for the two groups. In the SPSS Data Editor window, go to Data in the menu at the top, click Split File…, Compare groups, click group and move it into the Groups Based on: box, click Sort the file…, OK. Now run Frequencies again as described earlier, and SPSS will produce a separate histogram for each group. However, notice that the scaling is not the same, which makes the histograms hard to compare. SPSS automatically scales so that the distributions cover the complete X axis. For the control group the X axis ranges from -2 to 8 while for the training group the range is -5 to 15. So, by default, SPSS graphs may differ in range and interval size.

We can modify the histograms from the SPSS defaults to make them more comparable. To override the defaults, double-click on the graph for the Training group in the SPSS output window to open the Chart Editor. Double-click the numbers on the X axis to open a Properties window. Select Scale, uncheck Auto for Minimum and set the Custom minimum value to 0. Similarly, uncheck Auto and set the Maximum to 12 and the Major Increment to 1. Click Apply, select Scale, set minimum to -1 and maximum to 12. Click Apply, Close. Click on a bin to open a Properties window where you can select Binning. Click Custom, Interval width, set to 1; Apply; Close. Do the same for the Control group histogram.

What do we see in these histograms? The control group has many cases with zero applications, and no case with more than six (other than the one excluded extreme case).
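Outside SPSS, the same side-by-side comparison can be sketched quickly. This Python snippet (my addition, not from the handout) prints text histograms on a common axis with bin width 1, excluding the extreme case as above:

```python
from collections import Counter

control = [0, 0, 0, 0, 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 6, 134]
training = [0, 2, 5, 7, 8, 9, 10, 10, 10, 10, 11, 11]

for name, data in (("Control", control), ("Training", training)):
    counts = Counter(v for v in data if v < 100)  # exclude the outlier at 134
    print(name)
    for value in range(0, 12):                    # same axis for both groups
        print(f"{value:3d} {'#' * counts[value]}")
```

Using the same range for both groups makes the pile of zeros in the control group and the lump at ten in the training group easy to see at a glance.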
The treatment group has a wider range, with a lump at ten. Maybe participants were trained to submit one application every work day over the two-week period? The more we know about the program and the control group, the more focused and useful our analyses can be.

It may be helpful to look at a cross tabulation of the data, if there are not too many categories. First, we should remove the split file and the selection of cases <100. In the Data Editor window, click Data in the top menu, Split File…, Analyze all cases…, OK. Then click Data, Select Cases…, All cases…, OK. Now we will create a crosstabulation table. Click Analyze, Descriptive Statistics, Crosstabs…, select group for Rows and applic for Columns, and click OK.

Group * Number of applications Crosstabulation (Count)
[The full table of counts did not survive conversion. Its margins are Control n = 16, Training n = 12, N = 28; each column gives the count of cases at one observed value of Number of applications (0 through 11, plus 134). The Training group has four cases at 10 and two at 11, and the single case at 134 is in the Control group.]

This gives us the exact number of cases with each value, and we can see that the outlier in the Control group has a value of 134.

The outlier. How should we deal with the outlier in the control group? There are many options to consider, including the following: 1) track down the outlier to learn more about the case; 2) omit the case from further analysis; 3) Winsorize, by setting the value equal to the next most extreme case; 4) transform with a log or square root; 5) use an alternate analysis that is less sensitive to extreme scores.

By all means, begin with option 1). If the recorded value is an error, find the correct value, determine an estimate for the value, or drop the case. If the value is valid, the outlier may be the most interesting case in the sample. Although we have only one case in that range, we should try to understand what is special about it. Maybe this person submitted poor applications to many inappropriate places – this might cue us to set some criteria on what we count as an application.
We may not be able to use this one case to generalize to the population that includes few cases in that range, but that isn’t a reason to ignore the extreme case. Thus, Option 2) should not be used mechanically, although omitting the case may be best. Option 3) has the appeal that a case with an extremely large value is retained in the sample, still with a relatively large value (tied with the largest value). With the smaller value, the case is less likely to be unduly influential in the analysis. A disadvantage is that the observed value is changed and information about parameters such as the population mean is lost. Theoretically, one should Winsorize an equal number of cases from each end of the distribution. In practice, if there is an outlier at only one end, the impact of Winsorizing at the other end is likely to be negligible. Option 4) is often useful, but it is important to check the shape of the distribution to make sure that a transformation will be helpful, and that the best transformation is chosen. In Bumble’s case, the distributions are not far from normal with the exception of the single extreme value. A log transform would bring the extreme case in and make it less of an outlier, but the transformation also would distort the rest of the data, which currently do not look bad. The extreme outlier would still be an outlier. If there is negative skew with an outlier at the lower end of the distribution, a log or square root transformation will make the distribution worse, increasing skew and kurtosis. In that case, one can reverse the scale prior to transforming. Option 5) provides many choices, including nonparametric tests and resampling tests. In general, tests that require fewer assumptions also provide less power, but there are exceptions. Also, alternate tests do not test the same hypothesis as the t-test. 
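Option 3) is easy to demonstrate. Here is a minimal sketch in Python (illustrative only, not part of the original handout; the sample scores below are hypothetical apart from the outlying value of 134):

```python
# Minimal one-sided Winsorizing sketch: replace the largest value with the
# next most extreme distinct value, so the case is retained but is now tied
# with the largest remaining score.

def winsorize_top(values):
    """Return a copy with every occurrence of the maximum replaced
    by the second-largest distinct value."""
    ordered = sorted(set(values))
    if len(ordered) < 2:
        return list(values)
    top, runner_up = ordered[-1], ordered[-2]
    return [runner_up if v == top else v for v in values]

# Hypothetical scores echoing Bumble's control group: most values are
# small, and one case sits at 134.
scores = [0, 0, 1, 2, 3, 4, 5, 6, 134]
print(winsorize_top(scores))
```

The Winsorized case keeps a relatively large value, as described above, while losing its undue influence on the mean.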
Wilcoxon W and Mann-Whitney U: Ws = [N1*(N1+1)/2] + U

The Wilcoxon W and Mann-Whitney U tests are often mentioned together as Wilcoxon-Mann-Whitney because they are really the same test. The formula to translate from W to U and vice versa is U = Ws - [N1*(N1+1)/2], where N1 is the smaller of N1 and N2 and Ws is the smaller sum of ranks for the smaller group. U is the smaller of U and U', where U + U' = N1*N2.

The null hypothesis for the Wilcoxon W and Mann-Whitney U tests asserts that if we pool observations from the two populations and rank them, the average rank of scores is the same for each population. This does not imply that the means are hypothesized to be equal, though if the two populations have the same shape and dispersion, then W gives a test of the equality of medians. If both populations are also symmetric, then W gives a test of the equality of the means. Thus, a test of W is generally interpreted as a test of equality of central tendency. Because these tests are based on ranks rather than means, outliers have much less influence. The largest score is ranked 1 whether it is 10 or 10,000,000. These tests are nearly as powerful as the parametric t-test when the assumptions of the t-test are satisfied, and they often are more powerful when the assumptions of the t-test are violated.

To run these tests, click Analyze, Nonparametric Tests, Legacy Dialogs, 2 Independent Samples..., move applic into the Test Variable List, and move group into the Grouping Variable window. Click Define Groups…, assign the value 0 to Group 1 and the value 1 to Group 2, and click Continue. Select the Test Type to be Mann-Whitney U, under Options select Descriptive, then click Continue, OK.

Ranks
                          Training Group    N    Mean Rank   Sum of Ranks
Number of applications    Control          16      10.72         171.50
                          Training         12      19.54         234.50
                          Total            28

Test Statistics(b)
                                   Number of applications
Mann-Whitney U                            35.500
Wilcoxon W                               171.500
Z                                         -2.828
Asymp. Sig. (2-tailed)                      .005
Exact Sig. [2*(1-tailed Sig.)]              .004(a)
a. Not corrected for ties.
b. Grouping Variable: Training Group

The table of Ranks shows us that the average rank for cases in the Control group is 10.72, while the average rank for cases in the Training group is 19.54. This tells us that the typical number of contacts is greater in the Training group than in the Control group. The tests of statistical significance indicate that the two populations are significantly different. We can conclude that a randomly chosen person from the Training population probably has more contacts than a randomly chosen person from the Control population.

Red flag alert!!! Notice that the sum of ranks is larger for the smaller group (Ws' = 234.5). This tells us that if we reversed the direction of the ranking, we would find a smaller value for Ws. With reversed rankings, Ws = N1*(N1+N2+1) – Ws' = 12*(12+16+1) – 234.5 = 113.5. This is the correct Ws. SPSS has the correct value for U = 35.5, but if you reported the W provided by SPSS you would be wrong. We can use the formula Ws = (N1)(N1+1)/2 + U. This gives Ws = (12)(12+1)/2 + 35.5 = 78.0 + 35.5 = 113.5. This is the correct value for Ws, though SPSS reports W = 171.5. The Probability of Superiority (PS) is 1 – U/(N1*N2) = 1 – 35.5/(12*16) = 1 – 35.5/192 = 1 - .185 = .815, a large effect.

Median Test

The median test is generally less powerful than Wilcoxon because it does not use as much information. For the median test, we first pool the scores from both samples and find the overall median. We classify each case as above or below this overall median. Then we use a 2x2 chi-square test of independence to test whether the two populations differ in the proportion of cases greater than the pooled median. In our example, we can find the overall median from the crosstab table that we created earlier. Note that there are 28 cases, so 14 cases are below the median.
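The arithmetic linking Ws, U, and PS in the red-flag paragraph of the Wilcoxon-Mann-Whitney section can be verified with a short script (illustrative Python, not part of the handout; the numbers are those reported by SPSS for this example):

```python
# Verify the Wilcoxon Ws / Mann-Whitney U conversions from the text.

def u_from_ws(ws, n1):
    # U = Ws - N1*(N1 + 1)/2, where N1 is the size of the smaller group
    return ws - n1 * (n1 + 1) / 2

n1, n2 = 12, 16        # Training (smaller group) and Control sample sizes
ws_prime = 234.5       # sum of ranks SPSS reported for the smaller group

# Reverse the direction of ranking to get the correct Ws
ws = n1 * (n1 + n2 + 1) - ws_prime    # 113.5

u = u_from_ws(ws, n1)                 # 35.5, matching the SPSS U
ps = 1 - u / (n1 * n2)                # Probability of Superiority

print(ws, u, round(ps, 3))
```

The script reproduces the corrected Ws = 113.5, U = 35.5, and PS = .815 worked out above.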
By counting from either end, we see that the score for the median case is between 4 and 5. We can create a new dichotomous variable to indicate whether a case has more or fewer than 4.5 contacts and conduct a chi-square test of its independence from group membership. The chi-square test is appropriate only if we have enough data so that expected values are greater than 5 in each cell. Here is code that you can type into the Syntax window and execute, or find by point-and-click.

recode applic (0 thru 4.5=0)(4.5 thru hi=1) into c2.
CROSSTABS
  /TABLES=group BY c2
  /FORMAT= AVALUE TABLES
  /STATISTIC=CHISQ
  /CELLS= COUNT .

Chi-Square Tests
                            Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                           (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square          9.333(b)   1      .002
Continuity Correction(a)    7.146      1      .008
Likelihood Ratio           10.008      1      .002
Fisher's Exact Test                                         .006         .003
Linear-by-Linear
Association                 9.000      1      .003
N of Valid Cases              28
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 6.00.

By writing simple syntax, we can also conduct this test with SPSS nonparametrics. Note: this command is not available through point-and-click.

NPAR TESTS /median(4.5)=applic by group(0 1).

Median Test Frequencies(a)
                                       Training Group
                                     Control   Training
Number of applications   > Median        4        10
                         <= Median      12         2
a. Median specified as 4.5.

Wilcoxon T for paired data: D13
Dale Berger, CGU

This is a detailed example of an application of the nonparametric Wilcoxon T test for paired data, including a discussion of the logic, demonstration of hand calculations, and of SPSS analysis.

The problem: We wish to evaluate the impact of a program intended to increase recycling of newspapers. A random selection of 14 homes in a city is visited by Boy Scouts who deliver a brochure that describes why it is good to recycle newspapers.
Researchers weigh the newspapers recycled by each household during the week before the visit and during the week following the visit. These data are represented in Table 1 in the columns labeled Before and After. The change in the amount of recycling is reported in the column labeled Diff. [Research design note: In practice you should include a control group and much larger samples, if you can.]

Consider methods of analysis. If the program had no effect, we would expect the average Diff score to be zero, with positive and negative differences approximately balancing each other. A dependent (i.e., paired) t-test would be a good choice if the assumptions of that test are satisfied. The null hypothesis of the t-test is that the mean Diff in the population is zero. An assumption for the t-test is that the sampling distribution for mean Diff scores based on samples of N=14 is approximately normal. To judge the validity of this assumption, we plot our sample Diff scores to see if that plot is close enough to normal so that it is reasonable to assume that the sampling distribution of the mean Diff is normal.

As we scan down the Diff column, the number 55 stands out. With such an extreme outlier and such a small sample, we are not willing to assume normality. We examine Home 5 to make sure this is not a coding error. If we discovered that Home 5 was the only apartment house in our sample, we might choose to drop it from this analysis and plan a second study focused on apartment houses. However, if Home 5 looks like a legitimate case, we might choose to use the nonparametric Wilcoxon T, which is based on ranks.
Table 1: Pounds of newspapers recycled before and after Boy Scout visit
(ranks are from low to high on |Diff|)

ID (home)   Before   After   Diff   Rank    +T1    -T1
    1          0        0      0
    2          0        6      6      5       5
    3          0        0      0
    4         14       18      4      3       3
    5         10       65     55     10      10
    6          0       17     17      9       9
    7          0       10     10      8       8
    8          5       12      7    6.5     6.5
    9          0        0      0
   10          6       10      4      3       3
   11         17       10     -7    6.5             6.5
   12         12       12      0
   13         15       14     -1      1               1
   14         12       16      4      3       3
Sum of ranks:                              47.5     7.5

Wilcoxon T for matched-pairs: hand calculations

The Wilcoxon test is based on ranks of the Diff scores, where Diff scores are ranked according to size from smallest to largest, ignoring the direction of the difference, and ignoring cases where Diff scores are zero. The null hypothesis for the Wilcoxon test is that the sum of the ranks of positive Diff scores is equal to the sum of the ranks of negative Diff scores in the population represented by our sample.

Our first step is to rank the Diff scores. We ignore cases where Diff = 0. Diff scores are ranked according to their absolute values (ignoring sign) from low to high. The smallest Diff score (ignoring sign) is -1, so we assign it a rank of 1. The next smallest Diff score is 4, and there are three of them. If they weren't tied, they would be given ranks of 2, 3, and 4. Because they are tied, they are assigned their average rank, which is 3 (i.e., [2+3+4]/3). The fifth smallest Diff is 6, which is assigned Rank = 5. The sixth and seventh smallest are the two Diff scores at 7 (or -7), so they are assigned their average rank of 6.5. The next three Diff scores of 10, 17, and 55 are assigned ranks of 8, 9, and 10, respectively.

If the null hypothesis is true, the sum of ranks associated with positive Diff scores should be about the same as the sum of ranks for negative Diff scores. In our example, the sum of the ranks for the positive Diff scores, column +T1, is 47.5, and the sum for the negative Diff scores (-T1) is 7.5. We can check for a computation error: the sum of ranks 1 through N is (N)(N+1)/2.
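The ranking procedure just described can be reproduced with a short script (illustrative Python, not part of the handout; the Before and After values are those in Table 1):

```python
# Reproduce the Wilcoxon T ranking from Table 1, stdlib only.
# Diff scores of zero are dropped; remaining |Diff| values are ranked
# low to high, with tied scores sharing their average rank.

before = [0, 0, 0, 14, 10, 0, 0, 5, 0, 6, 17, 12, 15, 12]
after  = [0, 6, 0, 18, 65, 17, 10, 12, 0, 10, 10, 12, 14, 16]
diffs = [a - b for a, b in zip(after, before) if a != b]   # drop Diff = 0

order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
ranks = [0.0] * len(diffs)
pos = 0
while pos < len(order):
    # Find the run of cases tied on |Diff| starting at this position
    tie_end = pos
    while (tie_end + 1 < len(order)
           and abs(diffs[order[tie_end + 1]]) == abs(diffs[order[pos]])):
        tie_end += 1
    # Average of ranks (pos+1) .. (tie_end+1), assigned to every tied case
    avg_rank = (pos + tie_end + 2) / 2
    for i in order[pos:tie_end + 1]:
        ranks[i] = avg_rank
    pos = tie_end + 1

t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)    # +T1
t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)   # -T1
n = len(diffs)
assert t_plus + t_minus == n * (n + 1) / 2                # the (N)(N+1)/2 check

print(t_plus, t_minus)
```

The script recovers the sums of ranks computed by hand in Table 1, 47.5 and 7.5, and applies the (N)(N+1)/2 check automatically.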
If we had three numbers, the sum of ranks would be 1+2+3 = 6, and (N)(N+1)/2 = (3)(4)/2 = 12/2 = 6. In our example, N=10 for non-zero Diff scores, so the sum of ranks 1 through 10 is (10)(11)/2 = 55. The sum of +T1 and –T1 = 47.5 + 7.5 = 55. Check!

Which is more indicative of a difference, a small T or a large T? If there was no difference in the matched pairs in the population, we would expect about half of the ranks to be positive and half to be negative. The sum of –T and +T = (N)(N+1)/2 = 55, so if the null hypothesis is true the expected value of T is half of the sum, or (N)(N+1)/4 = 27.5. The most extreme difference between groups would be if all differences were in the same direction, so the sum of ranks for the opposite direction would be zero, giving T=0. We found T=7.5. How surprising is this?

We can test the statistical significance of our finding with Table T in Howell. Thus, when we use the table, we keep in mind that a smaller T value gives a smaller p value. Table T shows us the one-tailed p value for various T values. When N=10, we find the following values:

        p = 0.05        p = 0.025       p = 0.01        p = 0.005
N       T     alpha     T     alpha     T     alpha     T     alpha
10     10    0.0420     8    0.0244     5    0.0098     3    0.0049
       11    0.0527     9    0.0322     6    0.0137     4    0.0068

This is a somewhat unusual table in that it gives exact one-tailed p values for the two outcomes that are on either side of the critical p value shown at the head of the table. For example, if we wished to conduct a two-tailed test with alpha = .05, we would use the T values found in the column headed by p = .025. To attain statistical significance with one-tailed p < .025, we would need to find T less than or equal to 8, because the probability of observing T less than or equal to 8 is .0244. For T=9, the probability is .0322. If we have tied scores, we may have a fractional T value, such as T=10.5. We can use linear interpolation to estimate the p value associated with T=10.5 as halfway between the p values for T=10 and T=11.
From the table, we find this p value to be (.0420 + .0527)/2 = .0474, which is still statistically significant with p<.05 one-tailed.

From the table, one-tailed p=.0137 for T=6 and p=.0244 for T=8. We estimate the p value for our T=7.5 to be about ¾ of the distance between .0137 and .0244. This calculation is (.0244 - .0137)(3/4) + .0137 = .008 + .0137 = .0217. Thus, our one-tailed p is not significant at the alpha = .01 level. However, it is significant at the .05 level, both one-tailed and two-tailed (for the two-tailed p, we double the one-tailed p, giving us a two-tailed p of about .0434).

Howell's table goes up to N=50. When we have more than 50 cases, the T statistic is approximately normally distributed, and we can use a normal approximation. Under the null hypothesis, the expected value of T is E(T) = (N)(N+1)/4, and the SD of T is the square root of (N)(N+1)(2N+1)/24. In our example where N=10, E(T) = 27.5 as calculated earlier, and SD = square root of 96.25 = 9.81. Applying this test to our data gives

   z = [T - E(T)] / SD_T = (7.5 - 27.5) / 9.81 = -2.04.

Consulting a z table gives p = .021 one-tailed. In our case where N=10, this is not a very reliable test. The normal approximation is reasonably accurate when N>50, but the tabled values are exactly correct. SPSS gives an approximate p-value based on a normal approximation.

Wilcoxon T for Matched Pairs: SPSS Nonparametric Application

If we wish to use SPSS to apply the Wilcoxon test, our first task is to enter the data into SPSS. We need to tell SPSS the 'before' measure and the 'after' measure of recycling for each home. In the opening SPSS for Windows window, select Type in data and click OK. You now see the Untitled – SPSS Data Editor, a spreadsheet. Before we enter data, let us define our variables. Click the tab at the bottom labeled Variable View. In the column headed by Name, in Row 1 enter ID for an identification code.
Move to the second row under Name and you will notice that SPSS changed the name to all lower case, id. We know we are not measuring id and ego, so it is OK. In Row 2 under Name, enter before, and in Row 3 enter after. All three of these variables are numeric with no decimal places. Under Decimals, we can change the number from the default value of 2 to our preferred value of zero (0). When you click on a cell under Decimals, a tiny control bar opens, allowing you to increase or decrease the number in the cell, or you can simply type the number you want. If you exit the cell, you can come back and copy it and paste the value into other cells. In the first three rows under Label, we can enter Identification Number, Before Treatment, and After Treatment.

Enter data into SPSS. Click the tab at the bottom labeled Data View. In the first 14 rows under the id column, enter the numbers 1 through 14. In the columns labeled before and after, enter the appropriate observed numbers. When you are finished, the spreadsheet should look like the first three columns in Table 1.

Analyze the data. Ordinarily, we would begin with descriptive analyses to help us understand our data set and to check for errors. In our example, we noted an extreme score (Diff = 55 for id=5) and we decided that the parametric dependent t-test is not appropriate, and that the nonparametric Wilcoxon T is a better choice. In SPSS, click Analyze in the upper menu bar, click Nonparametric Tests, 2 Related Samples…. In the Two-Related-Samples Tests window, highlight both before and after, and click the little black triangle to move both of these variables into the Test Pair(s) List window. Under Test Type, select Wilcoxon, and under Options, select Descriptives, and Continue. Click Paste. In the syntax window of SPSS, click Run.

SPSS Output. Below are the syntax and the output for our analysis.
NPAR TEST
  /WILCOXON=before WITH after (PAIRED)
  /STATISTICS DESCRIPTIVES
  /MISSING ANALYSIS.

NPar Tests

Descriptive Statistics
            N    Mean    Std. Deviation   Minimum   Maximum
BEFORE     14    6.50        6.607            0        17
AFTER      14   13.57       16.018            0        65

Wilcoxon Signed Ranks Test

Ranks
                                   N     Mean Rank   Sum of Ranks
AFTER - BEFORE   Negative Ranks    2(a)     3.75          7.50
                 Positive Ranks    8(b)     5.94         47.50
                 Ties              4(c)
                 Total            14
a. AFTER < BEFORE
b. AFTER > BEFORE
c. BEFORE = AFTER

Test Statistics(b)
                          AFTER - BEFORE
Z                             -2.045(a)
Asymp. Sig. (2-tailed)           .041
a. Based on negative ranks.
b. Wilcoxon Signed Ranks Test

We can compare the SPSS output to the results of our hand calculations. The z-test is not as accurate as the exact values shown in Howell's tables, because our sample is small. However, in our example, the conclusions are the same. According to SPSS, the observed sum of negative ranks is 7.5, which agrees with our hand calculation of T=7.5. A value of T this small or smaller is unlikely. SPSS reports a two-tailed probability of .041 based on a normal approximation, which gives a one-tailed probability of about .021. This is consistent with our hand calculation of the test using the normal approximation. Given our small sample, we should be suspicious of the normal approximation for testing T. Our calculations using Howell's tables show that the one-tailed probability for T=7.5 with N=10 is about .0217. The two-tailed probability is twice that, or about .043. In our example, we could make a good argument for using a one-tailed test, because we probably are not interested in the treatment if it reduces recycling.

Categorical Data Analysis: SPSS CROSSTABS D14a
Statistics for 2x2 Contingency Tables
Dale Berger, CGU

When we are interested in the relationship between two categorical variables, SPSS CROSSTABS provides many useful descriptive summaries and statistical tests. In this handout we will ask SPSS to show us everything for a 2x2 table, and we will examine each analysis.
Note that in practice many of these statistics will not be relevant for any specific research application.

Example: We are interested in the relationship between education level and compensation level in a large organization. We have data from a random sample of 100 employees, including whether the employee has a BA degree or not and whether the employee is on hourly compensation or on salary. We found 51 with a BA on salary, 9 with a BA on hourly, 24 with no BA on salary, and 16 with no BA on hourly.

            No BA      BA
Hourly     a  16     b   9      25
Salary     c  24     d  51      75
              40        60     100

We can enter this data set into SPSS in several ways. When we are given a frequency table, perhaps the easiest method is to enter one line of data for each cell showing the levels of the factors and the cell frequency. Then we can weight the analysis by the cell frequency. Call up SPSS and select the Variable View tab. Under Name, enter three variable names: pay, educ, and freq. Decimals can be set to zero for each of these variables. Under Values, for pay set 0=Hourly and 1=Salary, and for educ set 0="No BA" and 1=BA. Then select the Data View tab. For the first cell, enter 0 for pay, 0 for educ, and 16 for freq. Similarly, enter the information for the other four cells.

Now we are ready to run CROSSTABS. First we need to weight our cases by freq. Click Data, Weight Cases…, Weight cases by, select freq, and click OK. Click Analyze, Descriptive Statistics, Crosstabs…, and select pay for the row variable and educ for the column variable. Click Statistics, and for illustration select everything except the last one (Cochran's MH). Click Cells and select as shown. Again, this will produce more output than we generally would want to request.

First, check the Count and the labels to verify that we have specified the data correctly for SPSS.

pay * educ Crosstabulation
                                       educ
                                 0 No BA     1 BA      Total
pay  0 hourly  Count                 16          9        25
               Expected Count      10.0       15.0      25.0
               % within pay       64.0%      36.0%    100.0%
               % within educ      40.0%      15.0%     25.0%
               Residual             6.0       -6.0
               Std. Residual        1.9       -1.5
     1 salary  Count                 24         51        75
               Expected Count      30.0       45.0      75.0
               % within pay       32.0%      68.0%    100.0%
               % within educ      60.0%      85.0%     75.0%
               Residual            -6.0        6.0
               Std. Residual       -1.1         .9
Total          Count                 40         60       100
               Expected Count      40.0       60.0     100.0
               % within pay       40.0%      60.0%    100.0%
               % within educ     100.0%     100.0%    100.0%

The Expected Count is based on an independence model. If educ and pay are independent, then we can use the marginal division on one variable (e.g., pay is split 25% hourly and 75% salary) to predict how cases are split between those levels for each level of the other variable. Thus, for the 40 cases with No BA and the 60 cases with a BA, we predict 25% at each level would be hourly and 75% would be salaried if we have independence between pay and education. For the first cell, this gives (RowSum * ColumnSum / N) = (25*40/100) = 10 as the expected count.

The "% within pay" tells us that 36% of the hourly people have a BA degree, while 68% of the salary people have a BA. The "% within educ" tells us that 60% of those with no BA are salaried, while 85% of those with a BA are salaried.

The Residual is the difference between the observed Count and the Expected Count. The Std. Residual can be interpreted as a z-score to test the null hypothesis that the observed count is consistent with the expected count for any given cell.

Some of the statistics in the table of Chi-Square Tests will not be useful for most applications, but they can be very useful for special applications.

Chi-Square Tests
                            Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                           (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square          8.000(b)   1      .005
Continuity Correction(a)    6.722      1      .010
Likelihood Ratio            7.901      1      .005
Fisher's Exact Test                                         .009         .005
Linear-by-Linear
Association                 7.920      1      .005
McNemar Test                                                .014(c)
N of Valid Cases             100
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 10.00.
c. Binomial distribution used.

Pearson Chi-Square tests the null hypothesis of independence between the row and column variables in the population:

   chi-square = SUM[ (fo_ij - fe_ij)^2 / fe_ij ],

where the observed and expected frequencies in Cell ij are fo_ij and fe_ij, respectively. This is an approximate goodness-of-fit test that may not be accurate if the expected value in any cell is extremely small (e.g., less than 5). The statistic is distributed approximately as chi-square with df = (number of rows – 1) * (number of columns – 1).

Continuity Correction should be used only for a 2x2 table where both marginals are fixed (i.e., known before the data are tested, as with a median test for two groups).

The Likelihood Ratio = G^2 is an alternate approximate chi-square test of independence:

   G^2 = 2 * SUM[ fo_ij * ln(fo_ij / fe_ij) ],

distributed approximately as chi-square with df = (number of rows – 1) * (number of columns – 1).

Fisher's Exact Test is an exact nonparametric test that can be used with very small samples, including cases where expected values of some cells are less than five.

Linear-by-Linear Association is most useful when both variables are on an interval scale. r^2(N-1) is distributed approximately as chi-square with df=1.

McNemar's test is useful for 2x2 'before-after' designs (e.g., one variable is Pass vs. Fail before training and the second variable is Pass vs. Fail after training). The null hypothesis is that the two marginal distributions are equal in the population (i.e., there was no change in passing rate). A binomial test is conducted on only the two cells that show a change, under the null hypothesis that in the population there were as many changes in one direction as in the other. Look for one set of N cases classified into two groups on two occasions, giving a 2x2 table. The null hypothesis is that the split is the same on the two margins, not that the two marginal variables are statistically independent.
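The Pearson chi-square, expected counts, and standardized residuals described above can be checked by hand (illustrative Python, not part of the handout; the counts are those of the pay * educ table):

```python
import math

# Pearson chi-square for the pay * educ table, computed from the formulas
# above: expected count = RowSum * ColumnSum / N, and
# chi-square = sum over cells of (observed - expected)^2 / expected.

observed = [[16, 9],     # hourly: No BA, BA
            [24, 51]]    # salary: No BA, BA

n = sum(sum(row) for row in observed)                    # 100
row_sums = [sum(row) for row in observed]                # 25, 75
col_sums = [sum(col) for col in zip(*observed)]          # 40, 60

expected = [[r * c / n for c in col_sums] for r in row_sums]

chi_square = sum((o - e) ** 2 / e
                 for o_row, e_row in zip(observed, expected)
                 for o, e in zip(o_row, e_row))

# Standardized residual for each cell: (observed - expected) / sqrt(expected)
std_resid = [[(o - e) / math.sqrt(e) for o, e in zip(o_row, e_row)]
             for o_row, e_row in zip(observed, expected)]

df = (len(observed) - 1) * (len(observed[0]) - 1)        # (2-1)*(2-1) = 1

print(round(chi_square, 3), df)
```

The result, chi-square = 8.000 with df = 1, matches the Pearson Chi-Square row of the SPSS output, and the standardized residuals (1.9, -1.5, -1.1, .9) match the Std. Residual entries of the crosstabulation.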
Directional Measures
                                                Value   Asymp. Std.   Approx.   Approx.
                                                        Error(a)      T(b)      Sig.
Nominal by   Lambda          Symmetric           .108     .071         1.414     .157
Nominal                      pay Dependent       .000     .000          .(c)      .(c)
                             educ Dependent      .175     .114         1.414     .157
             Goodman and     pay Dependent       .080     .056                   .005(d)
             Kruskal tau     educ Dependent      .080     .055                   .005(d)
             Uncertainty     Symmetric           .064     .045         1.422     .005(e)
             Coefficient     pay Dependent       .070     .049         1.422     .005(e)
                             educ Dependent      .059     .041         1.422     .005(e)
Ordinal by   Somers' d       Symmetric           .281     .098         2.756     .006
Ordinal                      pay Dependent       .250     .090         2.756     .006
                             educ Dependent      .320     .110         2.756     .006
Nominal by   Eta             pay Dependent       .283
Interval                     educ Dependent      .283
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Cannot be computed because the asymptotic standard error equals zero.
d. Based on chi-square approximation.
e. Likelihood ratio chi-square probability.

If we are interested in predicting one variable from the other, we may use a Directional Measure. Lambda gives an indication of how helpful information on one variable is for predicting the level of the other variable.

                   B
              No BA      BA
A  Hourly    a  16     b   9      25
   Salary    c  24     d  51      75
                40        60     100

CA = number of cases classified correctly on A using the marginal on A
CA|B = number of cases classified correctly on A given information on B

   Lambda = (CA|B - CA) / (N - CA)

In our example, if we wish to predict pay level with no knowledge of education level, we can be correct on 75 cases by predicting 'salary' for everyone, so CA = 75. Knowledge of education level doesn't help us predict pay level because the majority of people are on salary at both levels of education. Thus we will predict 'salary' for everyone, and we will be correct on 24 cases with No BA and on 51 cases with a BA, for a total of CA|B = 24+51 = 75. Thus, lambda(pay) = 0.

If we wish to predict education level, we will be correct on 60 cases if we predict 'BA' for everyone, giving CA = 60. Knowledge of pay level improves prediction.
If someone is hourly, we predict 'No BA' and we are correct on 16 cases. If someone is salaried, we predict 'BA' and we are correct on 51 cases, giving us CA|B = 16+51 = 67. Thus, lambda(educ) = (67-60)/(100-60) = 7/40 = .175. Information on pay level accounts for 17.5% of the errors that are made in predicting education level without knowledge of pay level.

Symmetric Measures
                                            Value   Asymp. Std.   Approx.   Approx.
                                                    Error(a)      T(b)      Sig.
Nominal by Nominal    Phi                    .283                            .005
                      Cramer's V             .283                            .005
                      Contingency
                      Coefficient            .272                            .005
Ordinal by Ordinal    Kendall's tau-b        .283     .098         2.756     .006
                      Kendall's tau-c        .240     .087         2.756     .006
                      Gamma                  .581     .160         2.756     .006
                      Spearman Correlation   .283     .098         2.919     .004(c)
Interval by Interval  Pearson's R            .283     .098         2.919     .004(c)
Measure of Agreement  Kappa                  .267     .095         2.828     .005
N of Valid Cases                              100
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Based on normal approximation.

In a 2x2 table, Phi = Pearson r:

   phi = sqrt(chi-square / N)

   chi-square = N(ad - bc)^2 / [(a+b)(c+d)(a+c)(b+d)]

   phi = r = (ad - bc) / sqrt[(a+b)(c+d)(a+c)(b+d)]

            No BA      BA
Hourly     a  16     b   9      25
Salary     c  24     d  51      75
              40        60     100

Cramer's V adjusts for the size of the table by dividing phi by the square root of (L-1), where L is the lesser of the number of rows and the number of columns:

   Cramer's V = sqrt[chi-square / (N(L-1))]

The Contingency Coefficient ranges between 0 and 1:

   Contingency Coefficient = sqrt[chi-square / (N + chi-square)]

Kendall's tau-b adjusts for ties in ranks. Kendall's tau-c also adjusts for table size for tables larger than 2x2.

Gamma provides a good test of ordinal by ordinal relationships. Gamma ignores ties and can be used if there are many ties. A pair of cases from cells a and d is 'concordant' because the case with the greater value on A has the greater value on B. P = the number of concordant pairs = a*d = 16*51 = 816. A pair of cases from cells b and c is 'discordant' because the case with the greater value on A has the lesser value on B.
Q = the number of discordant pairs = b*c = 9*24 = 216. Gamma = (P-Q)/(P+Q) = (816-216)/(816+216) = 600/1032 = .581. With large N, gamma (G) can be tested with Z = G * square root of [(P+Q) / (N(1 – G*G))].

Spearman Correlation is the Pearson Correlation conducted on the ranks of the scores. This is useful for an ordinal by ordinal relationship with few ties.

Cohen's Kappa is a measure of agreement adjusted for chance. Po is the observed percent agreement and Pc is the chance agreement. Kappa = (Po – Pc) / (1 – Pc). Thus, kappa can be interpreted as a proportion improvement over chance.

Risk Estimate
                                     Value     95% Confidence Interval
                                               Lower       Upper
Odds Ratio for pay
(0 hourly / 1 salary)                3.778     1.461       9.767
For cohort educ = 0                  2.000     1.286       3.111
For cohort educ = 1                   .529      .307        .913
N of Valid Cases                      100

            No BA      BA
Hourly     a  16     b   9      25
Salary     c  24     d  51      75
              40        60     100

   Odds Ratio = (a/c) / (b/d) = (16/24) / (9/51) = 3.778

   For cohort educ = 0:  [a/(a+b)] / [c/(c+d)] = (16/25) / (24/75) = 2.000

   For cohort educ = 1:  [b/(a+b)] / [d/(c+d)] = (9/25) / (51/75) = .529

Odds ratios are not the same as 'risk ratios.' For example, the odds that someone with a BA is at salary is 51/9 = 5.667, while the odds that someone with No BA is at salary is 24/16 = 1.500. Thus, the odds ratio is 5.667/1.500 = 3.778. Psychologists are more familiar with a description in terms of proportions or 'risk ratios.' We might say that 85% of employees with a BA are on salary (i.e., 51/60), while only 60% (i.e., 24/40) of employees without a BA are on salary. Thus, the probability of being on salary is 25 percentage points greater for employees with a BA. We could also say that the probability of being on salary is 42% greater for those with a BA, because (85% - 60%) / 60% = .42. These differences can be confusing unless one is very explicit about how percentages are calculated! Implication: We need to be very clear about what we are reporting.
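The gamma, odds-ratio, and cohort ('risk') ratio arithmetic above can be checked with a few lines (illustrative Python, not part of the handout; a, b, c, d are the cell counts of the 2x2 table):

```python
a, b, c, d = 16, 9, 24, 51   # hourly/No BA, hourly/BA, salary/No BA, salary/BA

# Gamma from concordant (P) and discordant (Q) pairs
P, Q = a * d, b * c                        # 816 and 216
gamma = (P - Q) / (P + Q)                  # .581

# Odds ratio: (a/c) / (b/d), equivalently ad/bc
odds_ratio = (a / c) / (b / d)             # 3.778

# Cohort ('risk') ratios, comparing row proportions within each column
rr_no_ba = (a / (a + b)) / (c / (c + d))   # 2.000
rr_ba = (b / (a + b)) / (d / (c + d))      # .529

print(round(gamma, 3), round(odds_ratio, 3),
      round(rr_no_ba, 3), round(rr_ba, 3))
```

Note that the odds ratio (3.778) and the two cohort ratios (2.000 and .529) are three different numbers computed from the same four cells, which is exactly the source of the confusion the text warns about.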
A limitation of ‘risk ratios’ is that it may be arbitrary which way we compute the ratio, and it isn’t possible to convert from one version to the other without additional information. For example, we might compute the risk of being on hourly wages as 9/60 = 15% for those with a BA and 16/40 = 40% for those with no BA. The probability of being on hourly is more than twice as great for those with no BA, 40% / 15% = 2.67. Thus, the ‘risk ratio’ of being on hourly for those with no BA vs. those with a BA is 2.67. However, the probability of being on salary for those with a BA is 85% compared to 60% for those with no BA, so the ‘risk ratio’ of being on salary is only 85% / 60% = 1.42. The odds ratio is something quite different again: 3.778.

BIG Caution: A common error is to interpret an odds ratio as a risk ratio. It is not correct to interpret the odds ratio of 3.778 as indicating that people with a BA were 3.778 times as likely to be on salary compared to someone without a BA.

Lesson: Treat an odds ratio as a big red flag when interpreting results; be sure you know which ratio is being reported.

Categorical Data Analysis Dale Berger
SPSS CROSSTABS D14b Statistics for larger contingency tables

With large crosstab tables, relatively sophisticated analyses may be needed. For illustration, we will use a hypothetical example adapted from Franke et al. (2012). Researchers wish to compare the effectiveness of three different methods of serving families with children at risk for abuse or neglect. A sample of 731 cases was randomly assigned to one of three treatments: (1) parenting education, (2) community services, or (3) wraparound that included both of the first two services plus case management. Outcomes after one year were classified into four categories: (1) no further contact with Child Protective Services (CPS), (2) a referral to CPS, (3) substantiated allegations of abuse or neglect, or (4) child removed from the home. Below is syntax and output for the example.
CROSSTABS /TABLES=Outcome BY Treatment /FORMAT=AVALUE TABLES /STATISTICS=CHISQ GAMMA /CELLS=COUNT COLUMN SRESID PROP /COUNT ROUND CELL.

The Pearson chi-square test for independence = 36.771 with df = 6, p < .001. We reject the null hypothesis and conclude that there is a relationship between the two variables. If the null hypothesis were true, then in the population the proportion of cases with any given outcome would be the same for every treatment. In the example, if the null hypothesis were true, then in the population the proportion of cases where the child is removed would be the same for every treatment (say 7%), the proportion of cases referred to CPS would be the same for every treatment (say 18%), etc. Of course, because of sampling variability, these proportions would not be exactly the same in our sample even if the null hypothesis were true. However, statistical significance tells us that the differences in proportions that we observed in this hypothetical sample are highly unlikely if the null hypothesis is true.

So, where are the differences? A chi-square test with more than one degree of freedom is a ‘blob’ test. It tells us that there are differences between treatments in the proportion of cases in the various outcomes. However, it does not tell us where the differences are. A common error is to draw conclusions about specific differences based on a visual examination of the crosstab table. For example, we might be impressed that the Wraparound treatment had the highest proportion of cases with no new CPS contact (72.3%), compared to 52.2% for the Education treatment and 63.8% for the Community treatment. Similarly, we see that those who received only Education had the highest proportion of Child Removed (14.5%) compared to only 4.3% and 4.2% for the other two treatment conditions. However, we cannot use the overall chi-square test to conclude that those specific differences are statistically significant.
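As a quick check on the output, the overall Pearson chi-square can be reproduced directly from the observed frequencies used in this example (a Python sketch, not part of the original handout):

```python
# Observed counts for the example: rows = Outcome 1-4, columns = Treatment 1-3.
obs = [[97, 120, 258],
       [38,  42,  49],
       [24,  18,  35],
       [27,   8,  15]]
row_tot = [sum(row) for row in obs]
col_tot = [sum(col) for col in zip(*obs)]
n = sum(row_tot)   # 731 cases in total

# Pearson chi-square: sum of (Observed - Expected)^2 / Expected over all cells,
# with Expected = (row total)(column total)/N under independence.
chi2 = sum((obs[i][j] - row_tot[i] * col_tot[j] / n) ** 2
           / (row_tot[i] * col_tot[j] / n)
           for i in range(len(obs)) for j in range(len(obs[0])))
df = (len(obs) - 1) * (len(obs[0]) - 1)   # (4-1)(3-1) = 6
print(round(chi2, 2), df)   # about 36.77 with df = 6
```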
Special, more focused tests are needed.

Standardized Residual tests

A simple, but very limited, test is provided by the Standardized Residuals for each individual cell. This statistic, distributed as standardized Z, is computed as the square root of the contribution of a cell to the Pearson chi-square statistic: (Observed – Expected) / SQRT(Expected). For example, the observed frequency of Child Removed in the Education group (Row 4, Column 1) is 27. The expected frequency for this cell if the treatment and outcome were independent is computed as (Row Total)*(Column Total) / N = (50)*(186) / 731 = 12.72. The standardized residual is (27 – 12.72) / SQRT(12.72) = 14.28 / 3.57 = 4.00 (see Std. Residual in the table). Because this statistic exceeds 1.96, the two-tailed p-value is less than .05. The precise p-value is .00006.

This test may not be of much practical value because it deals with one cell at a time in comparison to the entire model. We can conclude that there are more cases of Child Removed in the Education condition than we would expect if treatment and outcome were independent. In practice, it may be more useful to compare specific conditions to each other. Also, we may be concerned with alpha inflation if we consider a separate test for each cell. In this example, we have 3x4 = 12 cells. Applying the Bonferroni logic, the critical p-value would be .05/12 = .0042. Because the observed p-value of .00006 is less than .0042, even this conservative test attains statistical significance.

Gamma test of ordinal by ordinal relationship

If the values for both the row and column variables can be ordered from low to high on some underlying concept, then the gamma statistic provides an index and test of the ordinal by ordinal relationship. A statistically significant positive gamma indicates that larger values on one variable are associated with larger values on the other variable.
A negative gamma indicates a relationship in the opposite direction. The observed frequencies (rows = Outcome 1-4, columns = Treatment 1-3) are:

                Treatment
Outcome        1      2      3    Total
   1          97    120    258     475
   2          38     42     49     129
   3          24     18     35      77
   4          27      8     15      50
Total        186    188    357     731

As noted earlier in the discussion of 2x2 tables, gamma is based on the number of ‘concordant’ vs. ‘discordant’ pairs of cases. For the gamma calculation we consider each pair of cases where the cases do not share a row or a column. For example, any one of the 97 cases from Cell[1,1] could be paired with any one of the 42 cases from Cell[2,2]. Any pair like this would be considered concordant because the case that has a larger value on the row variable also has a larger value on the column variable. For each case in Cell[1,1], a concordant pair can be formed with any case from any cell that is larger in both row and column, displayed to the right and down from Cell[1,1] in the current example. There are 42 + 49 + 18 + 35 + 8 + 15 = 167 such cases, so the number of concordant pairs where one case is in Cell[1,1] is 97 * 167 = 16,199. Similarly, we can compute the number of concordant pairs where the first case is in Cell[1,2] as 120 * (49 + 35 + 15) = 11,880. No concordant pairs can be formed with a case from Cell[1,1] and any of the other cases in Row 1 or Column 1. For Cell[2,1] with 38 cases, there are 38 * (18+35+8+15) = 2,888 concordant pairs. Continuing, with cells [2,2], [3,1], and [3,2], we find 2,100, 552, and 270 more concordant pairs, respectively. Adding the numbers of concordant pairs gives a total of P = 33,889.

For discordant pairs, the case with a larger value on the row variable has a smaller value on the column variable (ties are ignored). For each of the 258 cases in Cell[1,3], a discordant pair can be formed with any case taken from any of the cells to the left and downward. There are 157 cases in those cells, giving 258 * 157 = 40,506 discordant pairs involving Cell[1,3].
Cells [1,2], [2,2], [2,3], [3,2], and [3,3] produce 10,680, 2,142, 3,773, 486, and 1,225 discordant pairs with cases in cells to their left and downward, giving a total of Q = 58,812 discordant pairs.

Gamma = (P-Q) / (P+Q) = (33,889 - 58,812) / (33,889 + 58,812) = -24,923 / 92,701 = -.269.

With large N, gamma (G) can be tested with standardized Z = G * square root of [(P+Q) / N(1 – G*G)].
Z = -.269 * √[92,701 / (731*(1 – (-.269)²))] = -.269 * √(92,701 / 678.16) = -.269 * √136.70 = -3.14; p < .01.

Because gamma = -.269 is negative, we know that larger values on the treatment variable tend to be associated with smaller values on the outcome variable. From the coding, we see that larger values on the Treatment variable indicate a more intense treatment, while larger values on the outcome variable indicate a worse outcome. Thus, the negative gamma indicates that more intense treatment is associated with lower values on the index of negative outcomes. It may be easier to describe this finding by reversing the coding on the outcome variable, so a larger number indicates a more favorable outcome. Then, the sign would be reversed on gamma. We don’t actually need to re-run the analysis for gamma with the reversed scale because we know what it would be: gamma = .269. The interpretation is “More intense treatments are associated with better outcomes.”

Contrasts to test specific hypotheses

Suppose we have a specific hypothesis that families that receive the Wraparound treatment are less likely to have a referral to Child Protective Services (CPS) than families that receive either of the other two treatments. The overall test of independence was χ2 (6, N = 731) = 36.771, p < .001, indicating that there is a significant relationship between treatment and outcome. However, the overall test is a ‘blob’ test that does not allow us to draw conclusions about specific comparisons. As with ANOVA, we can construct contrasts to test specific hypotheses.
The test is a Z test, computed as the value of a contrast divided by the standard error of the contrast (Goodman, 1963). A contrast comparing group proportions is computed as Ψ̂ = Σ wi(pi), where pi is a group proportion and wi is the weight assigned to that proportion. The weights are defined so that the sum of the weights is zero; thus, if the null hypothesis is true, the expected value of the contrast is zero.

We begin by identifying the proportions we wish to compare, and the appropriate weights. The proportions of cases Referred to CPS in the Education and Community groups are 38/186 = .2043 = p1 and 42/188 = .2234 = p2, respectively. The total number of cases in these two groups is (186 + 188) = 374, and the relative weights assigned to these two groups correspond to their share of the N for these groups pooled: 186/374 = .4973 and 188/374 = .5027. Note that the sum of weights for these two groups is +1.0000. The proportion of cases Referred to CPS in the Wraparound treatment group is 49/357 = .1373 = p3. We assign weight -1 to the Wraparound group.

The value of the contrast is Ψ̂ = Σ wi(pi) = (.4973)*(.2043) + (.5027)*(.2234) + (-1.000)*(.1373) = .1016 + .1123 + (-.1373) = .0766.

The standard error squared for each sample proportion is (pi * qi) / Ni, where pi is a specific proportion of interest, qi = 1 – pi, and Ni is the number of cases upon which the proportion pi is based. The squared standard error for a contrast is SE²Ψ = Σ wi²(pi*qi / Ni).

Treatment Group   Ni     pi       wi       wi(pi)    qi      pi*qi/Ni    wi²(pi*qi/Ni)
Education         186   .2043    .4973     .1016    .7957    .0008740      .0002161
Community         188   .2234    .5027     .1123    .7766    .0009228      .0002332
Wraparound        357   .1373  -1.0000    -.1373    .8627    .0003318      .0003318
Sum                             0.0000     .0766 = Ψ̂                       .0007811 = SE²Ψ
                                                                           .02795 = SEΨ

Z = Ψ̂ / SEΨ = .0766 / .02795 = 2.74, p = .006 two-tailed.
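The contrast computation above can be sketched in a few lines of Python (not part of the original packet; the counts and weights are those from the worked example):

```python
import math

# (cases referred to CPS, group N, contrast weight) for each treatment group.
groups = [(38, 186, 186 / 374),    # Education
          (42, 188, 188 / 374),    # Community
          (49, 357, -1.0)]         # Wraparound

# Contrast value: weighted sum of group proportions (weights sum to zero).
psi = sum(w * (x / n) for x, n, w in groups)
# Standard error of the contrast: sqrt of sum of w^2 * p*q/N over groups.
se = math.sqrt(sum(w**2 * (x / n) * (1 - x / n) / n for x, n, w in groups))
z = psi / se
print(round(psi, 4), round(se, 5), round(z, 2))   # 0.0766 0.02795 2.74
```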
If this were an a priori hypothesis and we did not wish to make any alpha adjustments for possible multiple tests, we could conclude that we have statistically significant evidence (p < .01) of fewer referrals to Child Protective Services for families that received the Wraparound treatment compared to families that received the other two treatments.

Holm’s test: If multiple tests are considered, then it may be appropriate to make adjustments. With a small set of a priori orthogonal contrasts, no adjustments may be needed. With a set of nonorthogonal a priori hypotheses, Holm’s test (Holm, 1979) is a good choice. Suppose you wish to test k contrasts (e.g., k = 5) controlling family-wise alpha error at α, such that if there is no effect of treatments at all, the probability of even one false significant finding for the set of k tests is α (e.g., α = .01). The test procedure uses a different critical value for each contrast. Compute the p-value for each contrast and order them from smallest observed p-value to largest. Then test the smallest p-value against the critical value of α / k (e.g., .01/5 = .0020), test the next smallest p-value against α / (k – 1) (e.g., .01/4 = .0025), the next against α / (k – 2) (e.g., .01/3 = .0033), and so on until the last p-value is tested against α / 1 (e.g., .01). An important rule is to stop at any point where the observed p-value exceeds its critical value.

Scheffe’s test: If the contrasts are selected after looking at the data, the much more conservative Scheffe’s test is appropriate. Compute the Z values in the usual way, but compare them to the square root of the critical value for the original chi-square test for the full table. If we wished to use alpha = .01 for a Scheffe’s test on our example with a 3x4 table, we find the critical χ2 (6, α = .01) = 16.81, and take the square root of this value to find 4.10.
A calculated Z-value must exceed 4.10 to be considered statistically significant using Scheffe’s test with alpha = .01 on a 3x4 table.

Suppose we notice that 258 of the 357 families that received Wraparound treatment had no further involvement with CPS (72.27%), which appears greater than the 97 of 186 families that received Education only (52.15%). The contrast is .7227 - .5215 = .2012. The SE for the contrast is .04362, giving a Z score of .2012 / .04362 = 4.61. To apply the Scheffe test, we compare the 4.61 to 4.10. Our conclusion is that this difference is statistically significant, even taking into account the fact that we looked at the data to find this large effect.

Treatment Group   Ni     pi      wi     wi(pi)    qi      pi*qi/Ni    wi²(pi*qi/Ni)
Education         186   .5215    -1     -.5215   .4785    .0013420      .0013420
Wraparound        357   .7227    +1      .7227   .2773    .0005614      .0005614
Sum                            0.0000    .2012 = Ψ̂                      .0019034 = SE²Ψ
                                                                        .04362 = SEΨ

Caution: For the Z test to be accurate, the sample sizes should be large. A rule of thumb is that the product Ni * pi * qi for any cell involved in a contrast should exceed 5. In our example, the smallest observed Npq is for Child Removed in the Community Treatment condition, where only 8 out of 188 were removed. This pi = 8/188 = .0425, so qi = .9575. The product Npq = 7.66. Thus, for this example, all cells have enough data for us to be comfortable with the Z test, which assumes normal distributions. With smaller samples, it may not be appropriate to use the Z test procedure to compare proportions in individual cells that have few observations.

Franke, T. M., Ho, T., & Christie, C. A. (2012). The chi-square test: Often used and more often misinterpreted. American Journal of Evaluation, 33, 448-458.
Goodman, L. (1963). Simultaneous confidence intervals for contrasts among multinomial populations. The Annals of Mathematical Statistics, 35, 716-725.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65-70.
McNemar’s test of related proportions: D15                    Dale Berger, CGU

McNemar’s test is applied to a 2x2 table to determine whether the row and column marginal proportions are different from each other in the population from which the data were sampled. For example, if one set of N cases is classified into two categories (+ or –) at two different times (A and B) or by two different raters (A and B), we can test whether the +/– split is different on the two occasions or for the two raters.

We wish to know whether children are more likely to attain a rating of ‘excellent’ after they complete a training program. In a sample of 40 children, only 14 (i.e., 14/40 = 35%) were rated ‘excellent’ before training, but after training 20 (i.e., 50%) were rated ‘excellent.’ Do we have statistically significant evidence of improvement? The null hypothesis is that in the population represented by this sample the proportion rated ‘excellent’ is the same before and after training.

General layout (A = rows, B = columns):

               B
           +       –     totals
A    +     a       b      a+b
     –     c       d      c+d
totals    a+c     b+d      N

Example:

              before
           +       –     totals
after +   13       7       20
      –    1      19       20
totals    14      26       40

If the null hypothesis is true, we expect the number of + responses to be the same before and after training. That is, we expect the totals a+b = a+c, or b = c. We can say “If the null hypothesis of no effect is true, then we expect to see as many cases that switch from + to – as cases that switch from – to +.” In our example we see that a total of b + c = 7+1 = 8 cases switched classification, and of these, 7 switched from – to + while only 1 switched from + to –. Is this evidence of significant improvement?

The relevant model is the binomial distribution. The null hypothesis is that the b+c cases that switched are from a binomial distribution where we expect cases to be split into the b and c cells with p = 1/2 for each. How surprising is it to find a split as extreme as 1 in 8? This is just as surprising as it would be to toss a fair coin 8 times and observe 1 or fewer heads.
We can use a computer program like StatWISE to find the exact probability of observing 0 or 1 heads out of 8 coin tosses. We use N=8, X=1, P=.50 and we find p(x<=1) = .0352. If we can justify a one-tailed test a priori, then we conclude that we have statistically significant evidence that a higher proportion of children attain a rating of ‘excellent’ after training than before training (p=.0352). If we would like to be able to detect a change in either direction, then we should apply a two-tailed test by simply multiplying the observed p value by two, giving p = 2*.0352 = .0704. In this case, we do not attain statistical significance at the .05 level. If A and B were two raters and we were interested in testing whether there is a difference in how liberal these raters are in assigning ‘+’ vs. ‘–’, then we should apply the two-tailed test.

In general, we should use two-tailed tests unless we know, before looking at the data, that we are willing to ignore a difference in one of the two directions. In our example, perhaps our decision is whether to implement the new training program. If training makes things worse or doesn’t help, we won’t implement it, so we are interested only in whether we have evidence that the training improves performance. In this case we could conduct a one-tailed test with a pure heart.

The binomial distribution gives the exact probability and so it is the preferred test. McNemar’s test is often presented as a chi-square test with df=1, but we should be aware that this is an approximation that may not be accurate with small samples. Note: The vertical lines around b-c indicate ‘absolute value,’ which is always treated as a positive value. Thus |1-7| would become +6. In our example, we find

χ² (df = 1) = (|b – c| – 1)² / (b + c) = (|7 – 1| – 1)² / (7 + 1) = (6 – 1)² / 8 = 25/8 = 3.125, giving p = .0771.
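Both the exact binomial probability and the continuity-corrected chi-square can be verified with a few lines of Python (a sketch, not part of the original packet):

```python
from math import comb

b, c = 7, 1   # cases switching from - to + and from + to -
n = b + c     # 8 switchers in total

# Exact one-tailed binomial p: P(X <= 1) for X ~ Binomial(8, 1/2).
p_one_tailed = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n

# McNemar chi-square approximation with continuity correction.
chi2 = (abs(b - c) - 1) ** 2 / (b + c)

print(round(p_one_tailed, 4), chi2)   # 0.0352 3.125
```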
Although the p value comes from the upper tail of the chi-square distribution, this is a two-tailed test of the marginal probabilities because we would get a large value of chi-square if a large difference was observed in either direction.

Important notes: The McNemar test does not test independence. The focus is on only the cases where there is disagreement in the ratings. The cases where there is agreement are ignored and have no impact on the McNemar test. Thus, the two tables below give identical results for McNemar tests. It is interesting to note that in the first table below, the proportion of + responses changes from 49.3% to 50.7%, while in the second table the proportion of + changes from 12.5% to 50.0%, yet the test results are identical because the focus of the test is on only those cases that change. In many practical applications, we should include information on agreement as well as disagreement to provide a context for the McNemar test that focuses on disagreements only.

In the examples below, we should use the binomial distribution rather than the McNemar chi-square approximation, because the number of cases with disagreement is very small. The binomial is always correct, even with very small n, while the chi-square approximation may be quite far off with small n.

              before                                 before
           +       –    totals                   +       –    totals
after +   200      7     207         after  +    1       7       8
      –     1    200     201                –    1       7       8
totals    201    207     408            totals   2      14      16

Be careful with labeling to make sure that you focus on the cells with cases that change. If the + and – columns were switched, we would focus on cells a and d rather than cells b and c.

Spearman r and SPSS: D16                    Dale Berger

The p values reported by SPSS for the Spearman correlation are wrong. They apparently are based on the parametric t-test rather than on the exact probabilities. Suppose we would like to measure and test agreement between two raters.
If we have a distribution that is reasonably close to bivariate normal, Pearson correlation is the index of choice, and we can use a t-test to test the null hypothesis that the correlation is zero in the population. However, if we have an outlier, especially in a small sample, the correlation can be affected greatly and that test of statistical significance would not be very accurate. Here is an example. Two professors assigned the following ratings to four essays:

Professor A: 86 81 72 20 (A_rate)
Professor B: 92 95 85 12 (B_rate)

What is the probability of such close agreement if their ratings are independent in the population of all possible essays represented by this sample? The Pearson correlation might not be a very good statistic here because of the very small sample size and the outlier (one essay in this small sample apparently was extremely weak). The Spearman correlation analyzes ranks rather than the raw data. Thus, we analyze the following ranked data:

Professor A: 1 2 3 4 (A_rank)
Professor B: 2 1 3 4 (B_rank)

Spearman correlation (rs) is based on the agreement between ranks. For each case (each essay in this example), we compute the difference between ranks (di for case i). Then we square each di and find the sum. The minimum possible value for the sum of squared di values is 0, in the case where all ranks agree perfectly. The maximum possible value for this sum of squared di values is N(N²-1)/3, where N is the number of cases that are ranked. In our example, this maximum is 4(16-1)/3 = 20. You can check this out. If the two professors ordered the four essays in perfectly opposite order, the sum of the squared differences in ranks would be (1-4)² + (2-3)² + (3-2)² + (4-1)² = 9+1+1+9 = 20. If there is no relationship, we expect the sum of squared di to be about half way between zero and N(N²-1)/3, i.e., N(N²-1)/6. In our example, this would be 4(16-1)/6 = 10.
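These three facts (minimum 0, maximum N(N²-1)/3 = 20, and null expectation N(N²-1)/6 = 10) can be checked by brute force for N = 4 (a Python sketch, not part of the original packet):

```python
from itertools import permutations

N = 4
base = list(range(1, N + 1))   # ranks from the first rater: 1, 2, 3, 4

# Sum of squared rank differences for every possible ordering by the second rater.
sums = [sum((x - y) ** 2 for x, y in zip(base, p)) for p in permutations(base)]

print(min(sums), max(sums), sum(sums) / len(sums))   # 0 20 10.0
```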
Spearman r can be computed as rs = 1 – 6Σdi² / (N(N²-1)).

In our example, the sum of di squared is (1-2)² + (2-1)² + (3-3)² + (4-4)² = 1+1+0+0 = 2. This gives us rs = 1 – (6*2)/(4*15) = 1 – 12/60 = 1 - .200 = .800. Spearman r is simply Pearson r computed on ranked scores. The computation formula for Spearman r is a shortcut formula for Pearson r in the special case when we have ranked data.

How surprising is such a large value for Spearman correlation? Can we apply the usual t-test? Clearly we have not satisfied the assumptions for the parametric t-test. The distribution of ranks is not normal and residuals from the regression line are not normally distributed. What to do? We can compute exactly how likely an observed Spearman r is by considering all possible outcomes that we might observe when we have two ratings of the same N objects. If we order the outcomes from the first rater as 1, 2, 3, …, N, then we can consider all of the possible different orders that we might observe for the second rater. How many ways can we order N objects? From our counting rules, we know that is N*(N-1)*(N-2)*…*(3)*(2)*(1) = N factorial = N! In our example, there are 4! = 4*3*2*1 = 24 ways to order 4 distinct objects. If there is no relationship between the two raters, then each of these 24 possible pairings is equally likely. Here are the 24 possible rankings for Rater B, the sum of di squared, the rs value, and the probabilities of observing a value that large or larger if there is no relationship in the population.
Rank_B      Σ(di²)     rs      Cumulative p
1 2 3 4        0     1.000      1/24 = .042
1 2 4 3        2      .800
1 3 2 4        2      .800
2 1 3 4        2      .800      4/24 = .167
2 1 4 3        4      .600      5/24 = .208
1 3 4 2        6      .400
1 4 2 3        6      .400
2 3 1 4        6      .400
3 1 2 4        6      .400      9/24 = .375
1 4 3 2        8      .200
3 2 1 4        8      .200     11/24 = .458
2 4 1 3       10      .000
3 1 4 2       10      .000     13/24 = .542
2 3 4 1       12     -.200
4 1 2 3       12     -.200     15/24 = .625
2 4 3 1       14     -.400
3 2 4 1       14     -.400
4 1 3 2       14     -.400
4 2 1 3       14     -.400     19/24 = .792
3 4 1 2       16     -.600     20/24 = .833
3 4 2 1       18     -.800
4 2 3 1       18     -.800
4 3 1 2       18     -.800     23/24 = .958
4 3 2 1       20    -1.000     24/24 = 1.000

There is only one order out of 24 that gives perfect agreement (1, 2, 3, 4), so the probability by chance of observing perfect agreement with rs = 1.000 is 1/24 = .042. There are three ways to obtain rs = .800, so the probability of observing an rs value of .800 or higher is 4/24 = .167.

Let’s see what SPSS tells us. For illustration, I used the original rating data from the two professors, A_rate and B_rate, as well as the ranking data, A_rank and B_rank. SPSS can compute the Spearman r in the bivariate correlation analysis that produces Pearson r. I entered these four variables into SPSS and asked for Pearson correlation, Spearman correlation, and Kendall’s tau_b. Kendall’s tau_b is another rank-order statistic that should give us p values equivalent to Spearman (according to Siegel, 1956, p. 219). To run the SPSS analysis, click Analyze, Correlate, Bivariate…, to open the Bivariate Correlations window. Select all four variables for analysis, check the boxes for Pearson, Kendall’s tau_b, and Spearman. Select the One-tailed test of significance. This generates the following syntax.

CORRELATIONS /VARIABLES=A_rate B_rate A_rank B_rank /PRINT=ONETAIL NOSIG /STATISTICS DESCRIPTIVES /MISSING=PAIRWISE .
NONPAR CORR /VARIABLES=A_rate B_rate A_rank B_rank /PRINT=BOTH ONETAIL NOSIG /MISSING=PAIRWISE .

First, we see the Pearson correlations. The correlation between the two ratings is .992.
This is inflated because of the outlier. A scatterplot shows how the outlier inflates the correlation.

Correlations (Pearson, with one-tailed Sig. in parentheses; N = 4 for every pair)

           A_rate          B_rate          A_rank          B_rank
A_rate     1               .992** (.004)   -.879 (.060)    -.837 (.082)
B_rate     .992** (.004)   1               -.816 (.092)    -.836 (.082)
A_rank     -.879 (.060)    -.816 (.092)    1                .800 (.100)
B_rank     -.837 (.082)    -.836 (.082)     .800 (.100)    1

**. Correlation is significant at the 0.01 level (1-tailed).

Also, we see that the correlation between the two ranked variables is .800. This demonstrates that Pearson’s correlation computed on ranks is exactly equal to the Spearman rho value. If we had normal residuals, this test of statistical significance would be correct:

t = r√(N-2) / √(1-r²) = .800√(4-2) / √(1-.800²) = .8(1.4142) / .6000 = 1.886; one-tailed p = .100.

However, from our hand calculations, we know that the correct one-tailed p value for a Spearman r = .800 with N=4 is p = .167.

Now check what SPSS does when it reports Spearman’s rho. We see the same value for the correlation between the ratings as we see for the correlation between ranks. Both are .800. That is correct, because the Spearman correlations for both the ratings and the rankings are based on rankings. SPSS calculated the value for Spearman’s rho correctly. However, SPSS did not compute the p value correctly. The reported p value is .100, the same value reported for the Pearson correlation on ranks. That cannot be correct. Shame on SPSS!

Correlations (one-tailed Sig. in parentheses; N = 4 for every pair)

Kendall's tau_b:
           A_rate           B_rate           A_rank           B_rank
A_rate     1.000            .667 (.087)      -1.000* (.021)   -.667 (.087)
B_rate     .667 (.087)      1.000            -.667 (.087)     -1.000* (.021)
A_rank     -1.000* (.021)   -.667 (.087)     1.000             .667 (.087)
B_rank     -.667 (.087)     -1.000* (.021)    .667 (.087)     1.000

Spearman's rho:
           A_rate            B_rate            A_rank            B_rank
A_rate     1.000             .800 (.100)       -1.000** (.000)   -.800 (.100)
B_rate     .800 (.100)       1.000             -.800 (.100)      -1.000** (.000)
A_rank     -1.000** (.000)   -.800 (.100)      1.000              .800 (.100)
B_rank     -.800 (.100)      -1.000** (.000)    .800 (.100)      1.000

*. Correlation is significant at the 0.05 level (1-tailed).
**. Correlation is significant at the 0.01 level (1-tailed).

Further note that the Spearman rho between A_rate and A_rank is -1.000. That is correct because the smallest rank was assigned to the largest rating, etc. But there are only 24 possible orders for Spearman, so even the most extreme outcome of 1.000 has a p value of 1/24 = .042. Yet SPSS reports Sig. (1-tailed) = .000. Also, the p values for Kendall’s tau_b and Spearman’s rho should be the same (Siegel, 1956, p. 219), but not in this SPSS analysis. Double shame!

Take Home Lesson: Don’t trust a computer program for an important decision unless you can verify that the program is working correctly.

Bumble’s axiom: To err is human; to really screw up it takes a computer.
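In that spirit, the exact one-tailed p values derived by hand earlier are easy to verify by enumerating all 24 orderings (a Python sketch, not part of the original packet):

```python
from itertools import permutations

N = 4
base = list(range(1, N + 1))   # fixed ranking from Rater A: 1, 2, 3, 4

def spearman_rs(order):
    """Spearman r between the fixed ranking and one ordering by Rater B."""
    d2 = sum((x - y) ** 2 for x, y in zip(base, order))
    return 1 - 6 * d2 / (N * (N**2 - 1))

rs = [spearman_rs(p) for p in permutations(base)]
# A small tolerance guards against floating-point representation of .800.
p_80 = sum(r >= 0.8 - 1e-9 for r in rs) / len(rs)    # 4/24 = .167
p_100 = sum(r >= 1.0 - 1e-9 for r in rs) / len(rs)   # 1/24 = .042
print(round(p_80, 3), round(p_100, 3))   # 0.167 0.042
```

Neither value agrees with the .100 and .000 printed by SPSS, confirming the hand calculation.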