Selected Nonparametric Statistics
Categorical Data Analysis
Packet CD05
Dale Berger,
Claremont Graduate University
([email protected])
Statistics website: http://wise.cgu.edu
Counting Rules ........ 2
Binomial Distribution ........ 5
D11: Wilcoxon Ws and Mann-Whitney U ........ 8
D12: Comparing two groups with SPSS (t, Wilcoxon Ws, Mann-Whitney U, Median) ........ 10
D13: Wilcoxon T for paired data ........ 18
D14a: SPSS CROSSTABS Statistics for 2x2 Contingency Tables ........ 22
D14b: SPSS CROSSTABS analyses for larger contingency tables ........ 28
D15: McNemar's test of related proportions ........ 33
D16: Spearman r and SPSS ........ 35
Table D (Binomial) and Table P (Spearman r) from Siegel (1956) Nonparametric Statistics ........ 39
Critical values for Spearman r from Ramsey (1989) Journal of Educational Statistics ........ 40
Mann-Whitney U Table from Kirk (1978) Introductory Statistics ........ 41
Wilcoxon T Table from Kirk (1978) Introductory Statistics ........ 42
Table F (Runs tests – too few or too many) from Siegel (1956) Nonparametric Statistics ........ 43
CD05 Nonparametric Statistics
1
Berger, CGU
Counting Rules
Rule 1: If any one of k mutually exclusive and exhaustive events can occur on each of n
trials, then there are k^n different sequences that may result from a set of trials.
Example: Toss a coin 4 times. How many possible outcomes are there? One possible outcome may
be represented by HTTH. The total number of possible outcomes can be illustrated by a branching
diagram, as shown below. There are two possibilities for the first coin. For each of these
possibilities there are 2 possibilities for the second coin, giving a total of 4 distinct two-coin
sequences (HH, HT, TH, and TT). For each of these two-coin sequences, there are 2 possible
outcomes for the third coin, giving 4x2 or 8 possible three-coin outcomes. Similarly, for each of
the 8 three-coin sequences, there are 2 possible outcomes for the fourth coin, giving 8x2 = 16
distinct four-coin sequences. If we apply Rule 1 with k=2 (i.e., heads or tails) and n=4 (four coin
tosses) we obtain k^n = 2^4 = 2 x 2 x 2 x 2 = 16.
Suppose each outcome is equally likely to occur. Then the probability of any particular sequence is
1/k^n. What is the probability of four heads on four coin tosses? (1/k^n = 1/2^4 = 1/16 = .0625.)
Example: How many distinct ways are there to answer a 10-item multiple choice test with four
alternatives on each item? (Answer: k=4 and n=10, k^n = 4^10 = 1,048,576.)
What is the probability that someone who is purely guessing will score all 10 correct on this test?
(Only one sequence is totally correct, so the probability is 1/1048576 = .00000095367.)
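Rule 1 is easy to check directly. Here is a minimal Python sketch (the helper names are mine, not the handout's):

```python
# Counting Rule 1: k mutually exclusive events over n independent trials
# give k**n distinct sequences.

def sequence_count(k: int, n: int) -> int:
    """Number of distinct sequences of n trials with k possible outcomes each."""
    return k ** n

def p_specific_sequence(k: int, n: int) -> float:
    """Probability of one particular sequence when all outcomes are equally likely."""
    return 1 / sequence_count(k, n)

# Four coin tosses: k=2, n=4
print(sequence_count(2, 4))        # 16 distinct sequences
print(p_specific_sequence(2, 4))   # 0.0625, the chance of e.g. HHHH

# Ten 4-choice items: k=4, n=10
print(sequence_count(4, 10))       # 1048576 ways to answer
```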
Rule 2: If we have n trials where the number of different events that can occur on
trials 1, 2, 3, …, n is k1, k2, k3, …, kn respectively, then the number of distinct
outcomes from the n trials is (k1)(k2)(k3)…(kn).
Example: Suppose we have a task with a sequence of three choice points. At the first point we
have two choices, at the second point we have three choices, and at the third point we have four
choices. How many different ways might we complete the task? [Answer: (2)(3)(4) = 24.]
What is the probability of any one specific sequence if all choices are equally likely at each step?
[Answer: 1/24 = .0417.]
Note that Rule 1 is a special case of Rule 2, where k1 = k2 = k3 = …. = kn = k.
Rule 3: The number of different ways that n distinct objects may be arranged in order is
(n)(n-1)(n-2)…(3)(2)(1). This product is called n-factorial, symbolized by n!. 0! is
defined to be equal to 1. Any particular arrangement of n objects is called a
permutation. Thus, the total number of permutations for n objects is n!.
Example: You are the judge for a pie baking contest, and it is your task to rank the three finalists:
Apple, Banana, and Cream. How many distinct orders are possible? Applying the formula, n! = 3!
= 3x2x1 = 6. The six possible orders are ABC, ACB, BAC, BCA, CAB, and CBA. There are three
possible ways to fill the first place. After first place is assigned, there are two pies left, or two ways
to fill the second place. Thus, for each of the three possible ways to fill first place, there are two
ways to fill second place, giving 3x2 ways to fill the first two places. For each choice of first and
second place, there is only one pie left for third place; the total number of ways to rank the three
pies is 3x2x1 = 3! = 6.
Rule 4: The number of ways of selecting and arranging r objects from N distinct objects is
NPr = N! / (N-r)! = “Permutations of N objects r at a time.”
Example: Given a set of 5 different cards, how many ways can you and I each choose one card?
The first card chosen can be any one of the 5 cards. After I have chosen my card, there are 4 cards
left for you to choose from, so there is a total of 5x4 = 20 ways in which we can choose two cards.
We have selected and arranged r=2 objects from a set of N=5 objects, so we can calculate
NPr = 5! / (5-2)! = 5! / 3! = (5x4x3x2x1) / (3x2x1) = 5x4 = 20.
Example: How many ways can we select 3 students from a class of 10 to fill the offices of
President, Vice-President, and Secretary? The first office can be filled by any one of 10 people.
After this office is filled, we must choose from among the remaining 9 people to fill the second
office. Finally, the third office can be filled by any one of 8 people. This gives a total of 10x9x8 =
720 ways to fill the three offices. Notice that the descending product is the first part of 10!, with
the last part (7!) missing. The expression 10x9x8 can be written as 10! / 7!, which is 10P3.
Rule 5: The number of ways of selecting a sample of r objects from a set of N distinct
objects is NCr = N! / [r!(N-r)!] = “Combinations of N objects r at a time.”
Example: How many ways can you select two cards from a deck of 5 distinct cards, with no
concern for order? Let us look again at the situation described in Rule 4 where you and I each
selected a card from a deck of 5 distinct cards. There are 5!/3! = 20 ordered pairs. If the Ace and
King were drawn, we counted AK and KA as two separate outcomes because we were concerned
with order. Thus, each pair of cards was counted twice. If we want only the number of possible
pairs with no concern for order, we must divide the number of ordered pairs by 2!, the number of
different orders for the two cards. This gives us 20/2 = 10 distinct pairs. If we apply Rule 5 we get
5C2 = 5! / [(2!)(3!)] = 120 / [(2)(6)] = 10.
Example: Given a group of 10 people, how many ways can we choose 3 to form a committee with
no regard for order of selection? We have already found the number of ordered groups of 3 people.
By applying Rule 4 we found 10!/7! = 720 ordered groups of 3. The number of unordered
committees must be less than this number because for every unordered group of 3 people, there are
3! = 6 orders. In general, because a group of r objects can be ordered in r! ways, there are r! times
as many ordered groupings as there are different groups not considering order. If we divide the
number of ordered groups given to us by Rule 4 by r!, we have Rule 5, the number of groups of
size r not considering order, N!/[(N-r)!r!].
Note that the number of groups of r objects selected from N objects is equal to the number of
groups of N-r objects. This is because for each group of size r, the remaining objects form a group
of size N-r. Thus, 100C3 = 100C97 = 100! / [97! x 3!].
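Rules 4 and 5 map directly onto Python's built-in math.perm and math.comb (available in Python 3.8+). A short sketch reproducing the examples above:

```python
import math

# Rule 4 (permutations): select and arrange r objects from N
print(math.perm(5, 2))    # 20 ways for two people to each pick a card from 5
print(math.perm(10, 3))   # 720 ways to fill three distinct offices from 10 people

# Rule 5 (combinations): same selections, order ignored
print(math.comb(5, 2))    # 10 distinct pairs of cards
print(math.comb(10, 3))   # 120 distinct three-person committees

# Symmetry: choosing r objects is the same as leaving out N-r
print(math.comb(100, 3) == math.comb(100, 97))   # True
```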
Rule 6: The number of distinct orders of N objects consisting of k groups of N1, N2, …, Nk
indistinguishable objects is N! / [N1! N2! … Nk!].
Example: How many ways can the letters A,B,B,C,C,C be arranged if we can’t distinguish among
like letters? If all of the 6 letters were distinct, there would be 6! distinct arrangements. But for
any specific arrangement of the letters, such as ACBCBC, the Bs can be arranged in 2! ways and
the Cs can be arranged in 3! ways. Each of the now indistinguishable orders is counted when we
use 6!, so we must divide by both 2! and 3!, giving 6!/[2!3!] = 720/[2x6] = 60.
Example: How many distinct ways can you rearrange the letters in MISSISSIPPI?
N=11; N(M)=1; N(I)=4; N(S)=4; N(P)=2
11!/[1!4!4!2!] = 34,650
Example: How many distinct ways could you have 4 items correct on a 10-item test? That is, how
many distinct arrangements are there of 4 Cs and 6 Ws?
10! / (6!4!) = (10x9x8x7x6x5x4x3x2x1) / [(6x5x4x3x2x1) x (4x3x2x1)] = 3628800 / (720x24) = 210
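Rule 6 can be checked with a small helper built on math.factorial. This sketch (the function name is mine) reproduces the three examples above:

```python
import math

def distinct_orders(*group_sizes: int) -> int:
    """Rule 6: N! / (N1! N2! ... Nk!) distinct orders of N objects
    made up of k groups of indistinguishable objects."""
    n = sum(group_sizes)
    result = math.factorial(n)
    for size in group_sizes:
        result //= math.factorial(size)  # divides evenly by construction
    return result

print(distinct_orders(1, 2, 3))     # A,B,B,C,C,C -> 60 arrangements
print(distinct_orders(1, 4, 4, 2))  # MISSISSIPPI -> 34650 arrangements
print(distinct_orders(6, 4))        # 6 Ws and 4 Cs -> 210 arrangements
```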
Binomial Distribution
Dale Berger, CGU
When we have a dichotomous event (two mutually exclusive possibilities) such as male or female,
success or failure, Republican or Democrat, open or closed mind, etc. and we have multiple
independent observations, we may be able to make use of the Binomial Distribution
(bi = two; nominal = names).
Consider a multiple-choice examination where each item has 4 choices, only one of which is
correct. If all choices are equally likely, then we would expect a student who knows absolutely
nothing about the subject to be correct on about one item in four simply by guessing. In general,
we let p be the probability of a success and 1-p = q be the probability of a failure. If the test has
four choices on each item, is in Russian, and a student taking the test knows no Russian, we would
expect p=1/4 and q=3/4.
(1) For a test with a single item (n=1), there are two possible outcomes:
He is correct (C) with probability p (here p = 1/4)
He is wrong (W) with probability q (here q = 3/4)
(2) If each item is independent of all others, then for a test with n=2 items, we have four
possible outcomes:

Items 1 2   Probability
C C         p x q = p^2 = 1/4 x 1/4 = 1/16
C W         p x q = pq  = 1/4 x 3/4 = 3/16
W C         q x p = pq  = 3/4 x 1/4 = 3/16
W W         q x q = q^2 = 3/4 x 3/4 = 9/16
Because these four outcomes are mutually exclusive and exhaustive, the sum of their
probabilities is equal to 1.000.
p^2 + 2pq + q^2 = (p + q)^2 = 1 because (p + q) = 1.
Also, 1/16 + 3/16 + 3/16 + 9/16 = 16/16 = 1.00
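These two-item probabilities can be verified by brute force, multiplying p or q along every C/W sequence. A sketch using exact fractions (illustrative code, not part of the original handout):

```python
from itertools import product
from fractions import Fraction

p = Fraction(1, 4)   # P(correct) on one item
q = 1 - p            # P(wrong) = 3/4

# Multiply p or q along each of the four C/W sequences
probs = {}
for outcome in product("CW", repeat=2):
    prob = Fraction(1)
    for letter in outcome:
        prob *= p if letter == "C" else q
    probs["".join(outcome)] = prob

print(probs["CC"], probs["CW"], probs["WC"], probs["WW"])   # 1/16 3/16 3/16 9/16
print(sum(probs.values()))                                  # 1
```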
Suppose we are concerned with the number of items a student might have correct on this two-item
test. Let's call this number X. Then X is a random variable that may take on the values
0, 1, or 2, and there are probabilities associated with each of these values. We could construct a
sampling distribution for X as follows:
Number of Successes x   P(X=x)   This example
0                       q^2      9/16 = .5625
1                       2pq      6/16 = .3750
2                       p^2      1/16 = .0625
Question: What is the probability that a person who is guessing on this two-item test will have
exactly one item correct?
Answer: p(X=1) = 2pq = 6/16 = .3750 in this example. Note that this probability comes from
summing the probabilities associated with the outcomes CW and WC, the two ways of
having exactly one success.
Let's look at the 8 possible outcomes from a three-item test (n=3).
Items 1 2 3   Sequence probability
C C C         p x p x p = p^3
C C W         p x p x q = p^2 q
C W C         p x q x p = p^2 q
W C C         q x p x p = p^2 q
C W W         p x q x q = p q^2
W C W         q x p x q = p q^2
W W C         q x q x p = p q^2
W W W         q x q x q = q^3

Number correct   Number of ways   Prob.    This example
3                1                p^3      .0156
2                3                3p^2 q   .1406
1                3                3p q^2   .4219
0                1                q^3      .4219
The probabilities for these mutually exclusive events can be added to show that they sum to 1:
p^3 + 3p^2 q + 3p q^2 + q^3 = (p + q)^3 = 1 because (p + q) = 1.
The first term in the above sum gives the probability of observing three successes. The second
term has two parts; the p^2 q is the probability of observing any particular sequence with two
successes and one failure, and the coefficient of 3 represents the number of different sequences
consisting of two successes and one failure. This coefficient could also be obtained by calculating
the number of ways in which we could choose x=2 items to be correct out of the n=3 items. This
can be expressed as the number of combinations of 2 chosen from a group of 3, or 3C2 = 3!/(2!1!) =
6/(2x1) = 3. Similarly, the coefficient for the third term in the sum represents the number of ways
1 item from a group of 3 can be correct, or 3!/(1!2!) = 3. To continue the pattern, the coefficient for
the last term is 1, which is the number of ways of getting 0 correct on a 3-item test, or 3!/(0!3!) =
6/(1x6) = 1. Thus, each term in the sum gives the probability of getting x successes and can be
expressed in the general form 3Cx p^x q^(3-x).
We could thus write our sum:
3C0 p^0 q^3 + 3C1 p^1 q^2 + 3C2 p^2 q^1 + 3C3 p^3 q^0 = q^3 + 3p q^2 + 3p^2 q + p^3.
In general, the exact probability associated with exactly x successes out of n trials in any situation
where trials are independent and p remains the same on each trial can be calculated with the
following expression:
p(X=x) = nCx p^x q^(n-x)    [** This is the very useful Binomial Formula]
where nCx is n!/[x!(n-x)!], the “combinations of n objects taken x at a time” (Counting Rule 5)
Example: Find the probability that a person guesses correctly on every item in a five-item test
where there are four choices on each item.
With n=5, x=5, and p=1/4, we find that
nCx p^x q^(n-x) = 5C5 (1/4)^5 (3/4)^0 = (1/4)^5 = 1/1024 = .0010
Example: Find the probability that our person is correct exactly four times and wrong only once.
Applying the formula with n=5, x=4, and p=1/4:
nCx p^x q^(n-x) = 5C4 (1/4)^4 (3/4)^1 = 5 (1/4)^4 (3/4)^1 = 15/1024 = .0146
Note that for any one particular sequence of 4 correct and 1 wrong, the probability is p^4 q^1, and
there are 5 ways to get such a sequence: (CCCCW, CCCWC, CCWCC, CWCCC, WCCCC).
If we let X represent the number correct on the five-item test, then X is a random variable. We
have just calculated the probabilities associated with two of the possible values of X, p(X=5) =
.0010 and p(X=4) = .0146. To complete the sampling distribution for X when n=5, we must
calculate the probabilities that X = 3, 2, 1, or 0.
Number of     Probability    Number    Binomial Dist.   Probability if
Successes x   of Sequence    of Ways   p(X=x)           p=1/4, q=3/4
5             p^5            1         p^5              .0010
4             p^4 q^1        5         5p^4 q           .0146
3             p^3 q^2        10        10p^3 q^2        .0879
2             p^2 q^3        10        10p^2 q^3        .2637
1             p^1 q^4        5         5p q^4           .3955
0             q^5            1         q^5              .2373
                                       Total:          1.0000
This table can be used to answer questions regarding the likelihood of getting any particular
number of items right by chance. For example, what is the probability that a person will get
exactly three items correct on this test if she is guessing? Answer: p(X=3) = .0879.
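The table for n=5 can be reproduced with a few lines of Python implementing the Binomial Formula nCx p^x q^(n-x) (the function name here is mine):

```python
from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) = nCx * p^x * q^(n-x) for a binomial random variable."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

n, p = 5, 0.25
dist = {x: binom_pmf(x, n, p) for x in range(n + 1)}
for x in range(n, -1, -1):
    print(x, round(dist[x], 4))
# 5 0.001 / 4 0.0146 / 3 0.0879 / 2 0.2637 / 1 0.3955 / 0 0.2373
print(sum(dist.values()))   # 1.0, as the probabilities must
```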
This table can also be helpful in making inferences about performance on the test. For example,
suppose a job applicant says she can read Russian. You give her this 5-item Russian test, and she
gets all five correct. Do you think she can read Russian? (How likely is it that she would get such
a good score if she were guessing?) [Answer: The probability that someone would get all five
items correct by chance is only .0010. It seems likely that she can read Russian.]
Dale Berger, CGU
D11: Wilcoxon Ws and Mann-Whitney U
Sometimes we wish to compare performance of two groups but our data do not satisfy the normality
assumptions of the parametric t-test, and we have a small sample size so the sampling distribution of means
may not be close to normal. A nonparametric test may do the job for us.
Suppose we wish to compare an Experimental group (E) with a Control group (C) on the number of
downloads from a research site in the past week. We have three randomly sampled observations from
E (7, 12, 86) and four randomly selected observations from C (0, 4, 6, 10). Can we conduct a t-test for
independent groups? Sure, the computer won’t care. SPSS gives us the mean for sample E = 35.0 with
SD = 44.2, and the mean for sample C = 5.0 with SD = 4.16, and t(5) = 1.395, p = .222.
Have we satisfied the mathematical assumptions of the t-test? If we use the SPSS estimate for t with
variances not assumed equal we get t (df = 2.027) = 1.171, p = .361. Is this test valid? The plot below may
be helpful to assess assumptions. Do we believe that the sampling distribution of the difference between
means is reasonably normal? Obviously, that is not likely because of the outlier and small sample sizes.
[Dot plot of the seven scores on a scale from 0 to 80+: the four C scores (0, 4, 6, 10) cluster at the low end, the E scores 7 and 12 fall among them, and E = 86 lies far to the right as an extreme outlier.]
The Mann-Whitney U test is based on the rank order of the observations, not on the scale values. The null
hypothesis is that if a score is randomly chosen from each population, p (E > C) = 1/2. That is, the two
randomly chosen scores are as likely to be ordered E > C as C > E.
We observed the order C C C E C E E. What is the probability that the seven scores would be ordered with
E having as much or more of an advantage over C than this if the null hypothesis is true,
i.e., p (E > C) = 1/2? Let’s calculate this by hand.
How many distinct ways can we order three Es and four Cs? Counting Rule 6, described earlier in this
handout, shows why this is (N1 + N2)! / (N1!*N2!) = 7!/(3!4!) = (7*6*5*4!)/(3!*4!) = 35.
What is the probability that the seven scores would be ordered C C C C E E E ? 1/35 = .0286
What is the probability that the seven scores would be ordered C C C E C E E ? 1/35 = .0286
This gives us a one-tailed probability of .0286+.0286 = .0572 and two-tailed p=.1144.
To compute the Mann-Whitney U statistic, we count the number of C scores that exceed each E score, and
total them. Here we find U=1. This can also be seen as the number of reversals of adjacent pairs needed to
reach perfect separation, CCCCEEE. Alternatively, if we counted the number of E values that exceed each
C, we would find UE=3+3+3+2=11. UC = (N1*N2) – UE = (3*4)-11 = 12-11=1. Only one C exceeds one E.
Mann-Whitney U is the smaller of UE and UC. There are tables for U in many books. For example, when
N1=3 and N2=4, Siegel (1956) gives p=.057 for U=1, and p=.028 for U=0. These are one-tailed p values.
A descriptive effect size is ‘Probability of Superiority’ = 1 – U/(N1N2). PS = 1 – 1/(3*4) = 1 - 1/12 = .917.
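The hand calculation above, counting C-beats-E pairs for U and enumerating all 35 orderings for the exact p, can be sketched in Python (illustrative; variable names are mine):

```python
from itertools import combinations
from fractions import Fraction

E = [7, 12, 86]     # experimental group scores
C = [0, 4, 6, 10]   # control group scores

# U for each group: count pairs where one group's score exceeds the other's
U_C = sum(1 for c in C for e in E if c > e)   # C scores beating E scores
U_E = sum(1 for e in E for c in C if e > c)   # E scores beating C scores
U = min(U_C, U_E)
print(U_C, U_E, U)   # 1 11 1

# Exact one-tailed p: among all C(7,3) placements of the E ranks,
# how many give C as little advantage as observed (U_C <= 1)?
n = len(E) + len(C)
total = count = 0
for e_ranks in combinations(range(n), len(E)):
    c_ranks = [r for r in range(n) if r not in e_ranks]
    u_c = sum(1 for c in c_ranks for e in e_ranks if c > e)
    total += 1
    if u_c <= U_C:
        count += 1
print(total, Fraction(count, total))   # 35 2/35
```

The exact one-tailed p of 2/35 = .0571 matches the hand computation (.0286 + .0286).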
Wilcoxon Ws is an alternate and equivalent statistical test based on ranks. We simply order all of the scores
and find the sum of ranks for scores in the smaller group. A source of confusion is that ranking can be done
from either end, with the largest value given the rank of 1 or the smallest value given the rank of 1. Ws is
defined as the smaller of these two values. If we made the wrong choice we can apply a formula to our sum
to find Ws. Important note: SPSS doesn’t necessarily report the smallest Wilcoxon Ws value but it reports
Mann Whitney U correctly. SPSS gives a wrong Ws when the sum of ranks is smaller for the larger group.
Score   Group   Rank1   R1C   R1E   Rank2   R2C   R2E
0       C       1       1     -     7       7     -
4       C       2       2     -     6       6     -
6       C       3       3     -     5       5     -
7       E       4       -     4     4       -     4
10      C       5       5     -     3       3     -
12      E       6       -     6     2       -     2
86      E       7       -     7     1       -     1
Sum of ranks:   28      11    17    28      21    7
We have four possible values for the sum of ranks: 11, 17, 21, and 7. We define Ws as the smallest sum, so
Ws = 7. This will be the smaller of the two possible sums of ranks for the smaller of the two groups.
SPSS erroneously reports Ws=11, computed as the smaller sum of ranks for the two groups (R1C), where
the smallest number is ranked 1 (see Rank1, with corresponding R1C and R1E). If the smaller group has
larger ranks, as we have here, we can find the correct Ws value by reversing the ranking, assigning a rank of
1 to the largest number (see Rank2). Then the smaller sum of ranks is R2E = 7, the correct value for Ws.
Conventionally, we call the smaller sample size N1 and the larger sample size N2. Here N1=3 and N2=4, and
the sum of ranks for the smaller group = R1E = 17 = Ws'. We can convert Ws' to the correct Ws value with
the formula Ws = 2W - Ws', where 2W = N1*(N1+N2+1).
Thus, Ws = N1*(N1+N2+1) – Ws' = 3*(3+4+1)-17 = 24-17 = 7.
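The Ws bookkeeping can likewise be scripted: rank all scores (no ties in this example), sum the ranks of the smaller group, and take the smaller of that sum and its reflection. A sketch:

```python
E = [7, 12, 86]    # smaller group, N1 = 3
C = [0, 4, 6, 10]  # larger group, N2 = 4
N1, N2 = len(E), len(C)

# Rank all scores, smallest = 1 (this simple mapping assumes no tied scores)
rank = {s: i + 1 for i, s in enumerate(sorted(E + C))}

Ws_prime = sum(rank[s] for s in E)   # sum of ranks for the smaller group
print(Ws_prime)                      # 17

# Ws is the smaller of Ws' and its reflection N1*(N1+N2+1) - Ws'
Ws = min(Ws_prime, N1 * (N1 + N2 + 1) - Ws_prime)
print(Ws)                            # 7, the correct Wilcoxon Ws

# Cross-check from U (here U = 1): Ws = N1*(N1+1)/2 + U
print(N1 * (N1 + 1) // 2 + 1)        # 7
```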
Here is Table Ws from Howell for N1=3. The table shows the critical values for Ws. Smaller
values for Ws have smaller p values. Our Ws of 7 with N2=4 gives one-tailed p<.10 but not
p<.05. The minimum Ws value when N1=3 is 6 (i.e., 1+2+3).

          One-tailed p values
N2    .010   .025   .05   .10    2W
3      -      -      6     7     21
4      -      -      6     7     24
5      -      6      7     8     27
With large samples (e.g., N1 > 25), Ws approaches a normal distribution with
Mean(Ws) = N1(N1 + N2 + 1) / 2   and   SD(Ws) = sqrt[ N1 N2 (N1 + N2 + 1) / 12 ].
In our example, Mean(Ws) = 3(3 + 4 + 1) / 2 = 12 and
SD(Ws) = sqrt[ 3 x 4 x (3 + 4 + 1) / 12 ] = sqrt(8) = 2.828.
However, with N1=3, which is much smaller than 25, the normal approximation is not valid. Here, this
invalid computation yields Z = (7 – 12) / 2.828 = -1.768, p = .077 two-tailed. This is what SPSS reports as
the ‘asymptotic significance.’ The correct value, which we computed earlier,
is p = .0572, one-tailed, which is p = .1144 two-tailed. We could also find
the correct value from a U table, which shows p = .057 one-tailed.
SPSS gives the correct value for U, so we can use the following formula to
compute the correct value for Ws: Ws = (N1)(N1+1)/2 + U. In our example,
U=1. This gives Ws = (3)(3+1)/2 + 1 = 6 + 1 = 7. This is the correct value for
Ws, though SPSS reports W=11. SPSS also reports the Exact Sig. [2*(1-tailed
Sig.)] = .114, which is correct.
Lesson 1a: Be careful reporting Wilcoxon W from SPSS.
Lesson 1b: Even major computer programs can be wrong.
SPSS Nonparametrics
Dale Berger, CGU
Comparing two groups with SPSS: D12
Parametric t, nonparametric Wilcoxon and Mann-Whitney U, Median test
One goal of an employment skills training program is to increase the number of applications to
potential employers. Data on the number of applications submitted in the past two weeks are
available from 12 graduates of the program and 16 comparable control cases who have not yet
taken the training.
Here are the data on number of applications:
Control group: 0, 0, 0, 0, 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 6, 134
Training group: 0, 2, 5, 7, 8, 9, 10, 10, 10, 10, 11, 11
Our task is to enter the data into SPSS and conduct appropriate analyses (and maybe some
inappropriate analyses for comparison).
First, let us enter the data into SPSS. Call up SPSS. Check the circle that says we will Type in
data. A spreadsheet opens and we are ready to begin.
Following the usual SPSS protocol, we enter data into the spreadsheet so that each case is on a
separate line, and the columns are the variables. In our example, we will define three variables.
The first is an ID code, so that we can locate and refer to any specific case easily. The second
variable is the group code, indicating which of the two groups the case is in. The third variable is
the dependent measure, the number of applications.
On the bottom of the spreadsheet are two tabs. One is labeled Data View and the other Variable
View. Click the Variable View tab. Let us begin by entering a name for each of our three variables.
Under the column headed Name, enter id in the first row, group in the second row, and applic in
the third row (a limitation of SPSS 12 and earlier versions is that variable names can be no longer
than 8 characters, giving rise to arcane variable names known as SPSSese). Next we supply
information about each of the variables. SPSS supplies some default information, but we must
make sure the information is correct for our application. The default is that each variable is
‘numeric,’ 8 characters wide, with 2 decimals. We don’t need any decimal places, so we can set
the decimals to zero. Click on a box under Decimals and enter 0. You can also use the little up and
down arrows to increase or decrease the number of decimals.
We will provide a label for each variable. Under the Label heading, click a box to open. Enter the
label ID number in the first row, Training Group in the second row, and Number of
applications in the third row.
Value labels are useful for categorical variables, such as Training Group. Click the box in the
second row under Values, and click on the little gray box with three dots that appears. This will
open the Value Labels window. Let us use 0 to indicate the control group and 1 to indicate the
training group. The cursor begins in the Value window. Enter 0 and press Tab or the down arrow
key to move the cursor to the Value Label window. Don’t press Enter or click OK before you are
ready to leave this window. Enter Control and click the Add button. Enter 1, press Tab, enter
Training, click Add, and then click OK.
Now we are ready to enter our data. Click the Data View tab at the bottom of the work sheet.
Under ID, enter 1, press Enter, enter 2, etc. sequentially down to 28 in row 28.
Next we enter the group code for each case. The first 16 cases are in the control group, so they
each have the value of 0. Under the group column, in row 1 enter a 0. You can enter a 0 in each
row down to row 16. An easier method is to click on another cell, then right-click on the 0, select
Copy, highlight the column down to row 16, right-click on the highlight, and select Paste. Enter 1
in row 17 and copy it into rows 18 through 28.
Finally, we enter the number of applications for each case in turn. Important: be sure to check all
of your data to make sure they are correct before you go on.
Now we are ready to analyze the data and work the magic of SPSS. First, let’s do a Bumble. Our
friend Bumble did this analysis in his sleep. (He often does analyses when less than fully awake,
and sometimes he gets it right.) There are two independent groups, and we wish to compare the
number of applications submitted – Bumble always does a t-test with data like this.
t-test for independent groups
From the menu bar at the top, click Analyze, select Compare Means, select Independent-Samples
T Test…, to open a new window. Here we specify the variables we will use in the t-test. In the left
window, click on Number of applications to highlight the variable. Then click the arrow to move
the variable into the Test variable(s): window. Next click on Training Group, and move it into the
bottom Grouping Variable: box. Click the Define Groups… button to open a new window. Enter 0
for Group 1 and enter 1 for Group 2, and click Continue.
Click OK, and watch SPSS do its thing, automatically opening the Output - SPSS Viewer
window and displaying your output.
Group Statistics (Number of applications)

Training Group   N    Mean    Std. Deviation   Std. Error Mean
Control          16   10.63   32.956           8.239
Training         12    7.75    3.621           1.045
The first table provides summary statistics. We see that the number of cases is correct for each
group. The average number of applications is actually larger for the control group (10.63) than for
the training group (7.75) in our sample, and we see that the standard deviation also is larger in the
control group.
Independent Samples Test (Number of applications)
Levene's Test for Equality of Variances: F = 2.25, Sig. = .145

t-test for Equality of Means:
                               t      df       Sig.        Mean      Std. Error   95% CI of the Difference
                                               (2-tailed)  Diff.     Diff.        Lower       Upper
Equal variances assumed       .299    26       .767        2.88      9.602        -16.861     22.611
Equal variances not assumed   .346    15.481   .734        2.88      8.305        -14.779     20.529
The second table gives the results for Levene's Test for Equality of Variances and two different
t-tests. Levene's test is not statistically significant (p=.145), so the assumption of equal variance
cannot be rejected statistically. [Note: This test is of limited value because it is most sensitive with
very large samples, when violation of the assumption of equal variance is less important.
Furthermore, SPSS provides an adjusted t-test for which we do not assume equal variance.]
The first t-test is the standard t-test which assumes that the population variances are equal in the
two populations represented by the two samples. The standard t-test has df=26, and the resulting t
is not statistically significant, t(26) = .299, p=.767. The difference in means is 10.63-7.75 = 2.88,
with a standard error of 9.602. This gives the t-value of 2.88/9.602 = .299. The 95% confidence
interval ranges from -16.861 to +22.611. The adjusted t-test not assuming equal variances gives
similar results, with df=15.481. Given the large difference in variance, we should use the test that
does not assume equal variance.
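Both t-tests can be re-derived from the raw data in pure Python, which is a useful sanity check on any package output. A sketch (function and variable names are mine):

```python
from math import sqrt

control = [0, 0, 0, 0, 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 6, 134]
training = [0, 2, 5, 7, 8, 9, 10, 10, 10, 10, 11, 11]

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    """Unbiased sample variance (divide by n - 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

n1, n2 = len(control), len(training)
m1, m2 = mean(control), mean(training)
v1, v2 = var(control), var(training)

# Standard t (equal variances assumed), df = n1 + n2 - 2 = 26
pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_pooled = (m1 - m2) / sqrt(pooled * (1 / n1 + 1 / n2))
print(round(t_pooled, 3))   # 0.299, matching the SPSS output

# Welch t (equal variances not assumed)
t_welch = (m1 - m2) / sqrt(v1 / n1 + v2 / n2)
print(round(t_welch, 3))    # 0.346
```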
Bumble reported that there is no significant effect of training, t(26) = .299, p=.767. Bumble
concluded that although the sample mean was larger for the control group, we can’t be confident
that the population mean for the control group is larger than the population mean for the training
group, because the confidence interval for the difference in population means includes both
positive and negative values.
Q1: Is Bumble’s conclusion correct?
Q2: What advice do you have for Bumble?
(Hint: What is the first thing you would do as a data analyst?)
Ans Q1: If all assumptions of the t-test are satisfied, then Bumble’s conclusion is correct. Although
the training group produced a lower mean in this sample, the groups do not differ sufficiently for
us to be confident that training leads to poorer performance. It could well be that training actually
leads to better performance in the population. However, if assumptions are violated, then the
statistical test is suspect, and the results may be misleading. Because Bumble did not check the
validity of assumptions, his reasoning is invalid (although his conclusion may be correct).
Ans Q2: The first thing Bumble should do is look at the data! A fundamental principle of data
analysis is that your statistical models should be appropriate for the data. We need to look at the
data carefully to make sure that our models are appropriate.
Simple graphs and summary statistics provide useful diagnostics. SPSS Frequencies is an
especially useful tool. On the top menu bar, click Analyze, select Descriptive Statistics, select
Frequencies… In the new window, click on applic, click the arrow to move it into the Variable(s)
window. Click the Statistics… button to open a new window. Let’s select Mean, Median,
Skewness, Kurtosis, Minimum, Maximum, and Std. Deviation and click Continue. Now click
Charts…, select Histograms, click Continue. Now click Format… and select Suppress tables with
many categories and click Continue. If there are many possible values for a variable (e.g., >10),
we may wish to suppress the long frequency list and instead focus on the histogram. Click OK.
To judge whether the population distributions are reasonably normal, we can apply an ‘intra-ocular
trauma’ test to the histogram.
Statistics: Number of applications
N Valid                      28
N Missing                     0
Mean                       9.39
Median                     4.50
Std. Deviation           24.715
Skewness                  5.092
Std. Error of Skewness     .441
Kurtosis                 26.538
Std. Error of Kurtosis     .858
Minimum                       0
Maximum                     134
OUCH! Our eyes are traumatized by this histogram. The distribution of the Number of
Applications is clearly far from normal. The summary statistics show that the maximum value is
134, while the other scores are below 20. Bumble’s t-test is quite suspect because of the gross
violation of the assumption of normality of the sampling distribution of means. What should we
do? First, let’s find a better way to look at the data.
The automatic scaling in SPSS obscures details of the shape of data. There is one very large score
that causes SPSS to form large bin intervals for the plot, and we lose the shape for cases that are
closer together. The cases at the lower end of the distribution all fall into one interval. To get a
better look at the shape of the distribution of cases with values less than 100, we can select Data,
Select Cases…, select If condition, If…, and define a selection rule. Click on applic, click the
arrow to move applic into the window on the right, <, 100, Continue, and then OK.
It is useful to generate separate histograms for the two groups. In the SPSS Data Editor window,
go to Data in the menu at the top, click Split file…, Compare groups, click group and move it into
the Groups Based on: box, click Sort the file…, OK. Now run Frequencies again as described
earlier, and SPSS will produce a separate histogram for each group. However, notice that the
scaling is not the same, which makes the histograms hard to compare. SPSS automatically scales
so that the distributions cover the complete X axis. For the control group the X axis ranges from -2
to 8 while for the training group the range is -5 to 15.
CD05 Nonparametric Statistics
13
So, by default, SPSS graphs may differ in range and interval size. We can modify the histograms
from the SPSS defaults to make them more comparable. To override the defaults, double-click on
the graph for the Training group in the SPSS output window to open the Chart Editor. Double-click the numbers on the X axis to open a Properties window. Select Scale, uncheck Auto for
Minimum and set the Custom minimum value to 0. Similarly, uncheck Auto and set the Maximum
to 12 and the Major Increment to 1. Click Apply, select Scale, set minimum to -1 and maximum to
12. Click Apply, Close. Click on a bin to open a Properties window where you can select Binning.
Click Custom, Interval width, set to 1; Apply; Close. Do the same for the Control group histogram.
What do we see in these histograms? The control group has many cases with zero applications, and
no case with more than six (other than the one excluded extreme case). The treatment group has a
wider range, with a lump at ten. Maybe participants were trained to submit one application every
work day over the two week period? The more we know about the program and the control group,
the more focused and useful our analyses can be.
It may be helpful to look at a cross tabulation of the data, if there are not too many categories.
First, we should remove the split file and the selection of cases <100. In the Data Editor window,
click Data in the top menu, Split File…, Analyze all cases…, OK. Then click Data, Select
cases…, All cases…, OK. Now we will create a crosstabulation table. Click Analyze, Descriptive
Statistics, Crosstabs…, select group for Rows and applic for Columns, click OK.
Group * Number of applications Crosstabulation
Count
                           Number of applications
Group        0   1   2   3   4   5   6   7   8   9  10  11  134   Total
  Control    4   1   3   3   1   2   1                        1      16
  Training   1       1           1       1   1   1   4   2           12
Total        5   1   4   3   1   3   1   1   1   1   4   2    1      28
This gives us the exact number of cases with each value, and we can see that the outlier in the
Control group has a value of 134.
The outlier. How should we deal with the outlier in the control group? There are many options to
consider, including the following:
1) track down the outlier to learn more about the case;
2) omit the case from further analysis;
3) Winsorize, by setting the value equal to the next most extreme case;
4) transform with a log or square root;
5) use an alternate analysis that is less sensitive to extreme scores.
By all means, begin with option 1). If the recorded value is an error, find the correct value,
determine an estimate for the value, or drop the case. If the value is valid, the outlier may be the
most interesting case in the sample. Although we have only one case in that range, we should try to
understand what is special about it. Maybe this person submitted poor applications to many
inappropriate places – this might cue us to set some criteria on what we count as an application.
We may not be able to use this one case to generalize to the population that includes few cases in
that range, but that isn’t a reason to ignore the extreme case. Thus, Option 2) should not be used
mechanically, although omitting the case may be best.
Option 3) has the appeal that a case with an extremely large value is retained in the sample, still
with a relatively large value (tied with the largest value). With the smaller value, the case is less
likely to be unduly influential in the analysis. A disadvantage is that the observed value is changed
and information about parameters such as the population mean is lost. Theoretically, one should
Winsorize an equal number of cases from each end of the distribution. In practice, if there is an
outlier at only one end, the impact of Winsorizing at the other end is likely to be negligible.
Option 4) is often useful, but it is important to check the shape of the distribution to make sure that
a transformation will be helpful, and that the best transformation is chosen. In Bumble’s case, the
distributions are not far from normal with the exception of the single extreme value. A log
transform would bring the extreme case in and make it less of an outlier, but the transformation
also would distort the rest of the data, which currently do not look bad. The extreme outlier would
still be an outlier. If there is negative skew with an outlier at the lower end of the distribution, a log
or square root transformation will make the distribution worse, increasing skew and kurtosis. In
that case, one can reverse the scale prior to transforming.
Option 5) provides many choices, including nonparametric tests and resampling tests. In general,
tests that require fewer assumptions also provide less power, but there are exceptions. Also,
alternate tests do not test the same hypothesis as the t-test.
Wilcoxon W and Mann-Whitney U:
WS = [N1*(N1+1)/2] + U
The Wilcoxon W and Mann-Whitney U tests are often mentioned together as Wilcoxon-Mann-Whitney because they are really the same test. The formula to translate from W to U and vice versa
is U = WS - [N1*(N1+1)/2] where N1 is the smaller of N1 and N2 and WS is the smaller sum of
ranks for the smaller group. U is the smaller of U and U' where U + U' = N1*N2.
The null hypothesis for Wilcoxon W and Mann-Whitney U tests asserts that if we pool
observations from the two populations and rank them, the average rank of scores is the same for
each population. This does not imply that the means are hypothesized to be equal, though if the
two populations have the same shape and dispersion, then W gives a test of the equality of
medians. If both populations are also symmetric, then W gives a test of the equality of the means.
Thus, a test of W is generally interpreted as a test of equality of central tendency.
Because these tests are based on ranks rather than means, outliers have much less influence. The
largest score receives the same rank whether it is 10 or 10,000,000. These tests are nearly as powerful as the
parametric t-test when the assumptions of the t-test are satisfied, and they often are more powerful
when the assumptions of the t-test are violated.
To run these tests, click Analyze, Nonparametric Tests, Legacy Dialogs, 2 Independent Samples...,
move applic into the Test Variable List, and move group into the Grouping Variable window.
Click Define Groups…, assign the value 0 to Group 1 and the value 1 to Group 2, click Continue.
Select Test Type to be Mann-Whitney U, under Options select Descriptive, click Continue, OK.
Ranks
                        Training Group    N    Mean Rank   Sum of Ranks
Number of applications  Control          16      10.72         171.50
                        Training         12      19.54         234.50
                        Total            28

Test Statistics(b)
                                  Number of applications
Mann-Whitney U                            35.500
Wilcoxon W                               171.500
Z                                         -2.828
Asymp. Sig. (2-tailed)                      .005
Exact Sig. [2*(1-tailed Sig.)]              .004(a)
a. Not corrected for ties.
b. Grouping Variable: Training Group

The table of Ranks shows us that the average rank for cases in the Control group is 10.72, while
the average rank for cases in the Training group is 19.54. This tells us that the typical number of
contacts is greater in the Training group than in the Control group. The tests of statistical
significance indicate that the two populations are significantly different. We can conclude that a
randomly chosen person from the Training population probably has more contacts than a
randomly chosen person from the Control population.
Red flag alert!!! Notice that the sum of ranks is larger for the smaller group (Ws' = 234.5). This
tells us that if we reversed direction of the ranking, we would find a smaller value for Ws. With
reversed rankings, Ws = N1*(N1+N2+1) – Ws' = 12*(12+16+1) – 234.5 = 113.5. This is the correct Ws.
SPSS has the correct value for U = 35.5, but if you reported the W provided by SPSS you would be wrong.
We can use the formula Ws = (N1)(N1+1)/2 + U. This gives Ws = (12)(12+1)/2 + 35.5 = 78.0 + 35.5 = 113.5.
This is the correct value for Ws, though SPSS reports W = 171.5. The Probability of Superiority (PS) is
1 – U/(N1*N2) = 1 – 35.5/(12*16) = 1 – 35.5/192 = 1 - .185 = .815, a large effect.
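These conversions are simple arithmetic, so they are easy to verify outside SPSS. Here is a quick Python cross-check (illustrative only, using N1 = 12, N2 = 16, and the U and Ws' values reported above):

```python
# Cross-check of the W/U conversions, using values from the output above:
# N1 = 12 (smaller group, Training), N2 = 16 (Control), U = 35.5, Ws' = 234.5.
n1, n2, u = 12, 16, 35.5

ws_from_u = n1 * (n1 + 1) / 2 + u             # 78.0 + 35.5 = 113.5
ws_reversed = n1 * (n1 + n2 + 1) - 234.5      # 12*29 - 234.5 = 113.5

# Probability of Superiority
ps = 1 - u / (n1 * n2)                        # 1 - 35.5/192

print(ws_from_u, ws_reversed, round(ps, 3))   # 113.5 113.5 0.815
```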
Median Test
The median test is generally less powerful than Wilcoxon because it does not use as much
information. For the median test, we first pool the scores from both samples and find the overall
median. We classify each case as above or below this overall median. Then we use a 2x2 chi-square test of independence to test whether the two populations differ in the proportion of cases
greater than the pooled median. In our example, we can find the overall median from the crosstab
table that we created earlier. Note that there are 28 cases, so 14 cases are below the median. By
counting from either end, we see that the score for the median case is between 4 and 5. We can
create a new dichotomous variable to indicate whether a case has more or fewer than 4.5 contacts
and conduct a chi-square test of independence from group membership. The chi-square test is
appropriate only if we have enough data so that expected values are greater than 5 in each cell.
Here is code that you can type into the Syntax window and execute, or generate by point-and-click.
recode applic (0 thru 4.5=0)(4.5 thru hi=1) into c2.
CROSSTABS /TABLES=group BY c2
/FORMAT= AVALUE TABLES
/STATISTIC=CHISQ
/CELLS= COUNT .
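The chi-square for the median split can also be verified by hand from the counts the split produces (Control: 12 at or below 4.5 and 4 above; Training: 2 at or below and 10 above). A short Python sketch of the expected-count arithmetic, outside SPSS:

```python
# Hand computation of the Pearson chi-square for the median split.
# Rows are the groups; columns are (<= 4.5, > 4.5) applications.
observed = [[12, 4],   # Control: 12 at or below the pooled median, 4 above
            [2, 10]]   # Training: 2 at or below, 10 above

n = sum(sum(row) for row in observed)              # 28
row_totals = [sum(row) for row in observed]        # [16, 12]
col_totals = [sum(col) for col in zip(*observed)]  # [14, 14]

# Expected count for each cell is RowSum * ColumnSum / N
chi_square = 0.0
for i in range(2):
    for j in range(2):
        fe = row_totals[i] * col_totals[j] / n
        chi_square += (observed[i][j] - fe) ** 2 / fe

print(round(chi_square, 3))  # 9.333, matching the Pearson Chi-Square
```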
Chi-Square Tests
                            Value    df   Asymp. Sig.  Exact Sig.  Exact Sig.
                                           (2-sided)    (2-sided)   (1-sided)
Pearson Chi-Square          9.333(b)  1      .002
Continuity Correction(a)    7.146     1      .008
Likelihood Ratio           10.008     1      .002
Fisher's Exact Test                                       .006        .003
Linear-by-Linear            9.000     1      .003
  Association
N of Valid Cases               28
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 6.00.
By writing simple syntax, we can also conduct this test with SPSS nonparametrics.
Note: this command is not available through point-and-click.
NPAR TESTS
/median(4.5)=applic by group(0 1).
Median Test
Frequencies
                                    Training Group
                                  Control   Training
Number of        > Median            4         10
applications(a)  <= Median          12          2
a. Median specified as 4.5.
Wilcoxon T for paired data: D13
Dale Berger, CGU
This is a detailed example of an application of the nonparametric Wilcoxon T test for paired data,
including a discussion of the logic, a demonstration of hand calculations, and the SPSS analysis.
The problem: We wish to evaluate the impact of a program intended to increase recycling of
newspapers. A random selection of 14 homes in a city is visited by Boy Scouts who deliver a
brochure that describes why it is good to recycle newspapers. Researchers weigh the newspapers
recycled by each household during the week before the visit and during the week following the
visit. These data are represented in Table 1 in the columns labeled Before and After. The change in
the amount of recycling is reported in the column labeled Diff. [Research design note: In practice
you should include a control group and much larger samples, if you can.]
Consider methods of analysis. If the program had no effect, we would expect the average Diff
score to be zero, with positive and negative differences approximately balancing each other. A
dependent (i.e., paired) t-test would be a good choice if the assumptions of that test are satisfied.
The null hypothesis of the t-test is that the mean Diff in the population is zero. An assumption for
the t-test is that the sampling distribution for mean Diff scores based on samples of N=14 is
approximately normal. To judge the validity of this assumption, we plot our sample Diff scores to
see if that plot is close enough to normal so that it is reasonable to assume that the sampling
distribution of the mean Diff is normal. As we scan down the Diff column, the number 55 stands
out. With such an extreme outlier and such a small sample, we are not willing to assume normality.
We examine Home 5 to make sure this is not a coding error. If we discovered that Home 5 was the
only apartment house in our sample, we might choose to drop it from this analysis and plan a second
study focused on apartment houses. However, if Home 5 looks like a legitimate case, we might
choose to use the nonparametric Wilcoxon T, which is based on ranks.
Table 1: Pounds of newspapers recycled before and after Boy Scout visit

                                         Rank (Low to High Ranking)
ID (home)   Before   After   Diff           +T1      -T1
    1          0       0       0
    2          0       6       6              5
    3          0       0       0
    4         14      18       4              3
    5         10      65      55             10
    6          0      17      17              9
    7          0      10      10              8
    8          5      12       7              6.5
    9          0       0       0
   10          6      10       4              3
   11         17      10      -7                       6.5
   12         12      12       0
   13         15      14      -1                       1
   14         12      16       4              3
Sum of ranks:                                47.5      7.5
Wilcoxon T for matched-pairs: hand calculations. The Wilcoxon test is based on ranks of the
Diff scores, where Diff scores are ranked according to size from smallest to largest, ignoring the
direction of the difference, and ignoring cases where Diff scores are zero. The null hypothesis for
the Wilcoxon test is that the sum of the ranks of positive Diff scores is equal to the sum of the
ranks of negative Diff scores in the population represented by our sample.
Our first step is to rank the Diff scores. We ignore cases where Diff = 0. Diff scores are ranked
according to their absolute values (ignoring sign) from low to high. The smallest Diff score
(ignoring sign) is -1, so we assign it a rank of 1. The next smallest Diff score is 4, and there are
three of them. If they weren’t tied, they would be given ranks of 2, 3, and 4. Because they are tied,
they are assigned their average rank, which is 3 (i.e., [2+3+4]/3). The fifth smallest Diff is 6,
which is assigned Rank=5. The sixth and seventh smallest are the two Diff scores of 7 (or -7), so they
are assigned their average rank of 6.5. The next three Diff scores of 10, 17, and 55 are assigned
ranks of 8, 9, and 10, respectively.
If the null hypothesis is true, the sum of ranks associated with positive Diff scores should be about
the same as the sum of ranks for negative Diff scores. In our example, the sum of the ranks for the
positive Diff scores, column +T1, is 47.5 and the sum of negative Diff (-T1) scores is 7.5. We can
check for a computation error: the sum of ranks 1 through N is (N)(N+1)/2. If we had three
numbers, the sum of ranks would be 1+2+3=6 and (N)(N+1)/2 = (3)(4)/2 = 12/2 = 6. In our
example, N=10 for non-zero Diff scores, so the sum of ranks 1 through 10 is (10)(11)/2 = 55. The
sum of +T1 and –T1 = 47.5 + 7.5 = 55. Check!
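The ranking and summing steps above can be sketched in a few lines of Python (outside SPSS, as a check on the hand calculation):

```python
# Sketch of the Wilcoxon T hand calculation: rank nonzero Diff scores by
# absolute value (average ranks for ties), then sum the ranks by sign.
diffs = [0, 6, 0, 4, 55, 17, 10, 7, 0, 4, -7, 0, -1, 4]   # from Table 1

nonzero = [d for d in diffs if d != 0]
ordered = sorted(nonzero, key=abs)

ranks = {}                       # absolute value -> average rank
i = 0
while i < len(ordered):
    j = i
    while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
        j += 1
    ranks[abs(ordered[i])] = (i + 1 + j) / 2   # average of ranks i+1 .. j
    i = j

pos_T = sum(ranks[abs(d)] for d in nonzero if d > 0)
neg_T = sum(ranks[abs(d)] for d in nonzero if d < 0)
print(pos_T, neg_T, pos_T + neg_T)   # 47.5 7.5 55.0  (sum = N(N+1)/2)
```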
Which is more indicative of a difference, a small T or a large T? If there was no difference in the
matched pairs in the population, we would expect about half of the ranks to be positive and half to
be negative. The sum of –T and +T = (N)(N+1)/2 = 55, so if the null hypothesis is true the
expected value of T is half of the sum, or (N)(N+1)/4 = 27.5. The most extreme difference between
groups would be if all differences were in the same direction, so the sum of ranks for the opposite
direction would be zero, giving T=0. We found T=7.5. How surprising is this?
We can test the statistical significance of our finding with Table T in Howell. Thus, when we use
the table, we keep in mind that a smaller T value gives a smaller p value. Table T shows us the
one-tailed p value for various T values. When N=10, we find the following values:
        p = 0.05        p = 0.025       p = 0.01        p = 0.005
N       T      α        T      α        T      α        T      α
10     10   0.0420      8   0.0244      5   0.0098      3   0.0049
       11   0.0527      9   0.0322      6   0.0137      4   0.0068
This is a somewhat unusual table in that it gives exact one-tailed p values for the two outcomes
that are on either side of the critical p value shown at the head of the table. For example, if we
wished to conduct a two-tailed test with alpha = .05, we would use the T values found in the
column headed by p=.025. To attain statistical significance with one-tailed p<.025, we would need
to find T less than or equal to 8, because the probability of observing T less than or equal to 8 is
.0244. For T=9, the probability is .0322. If we have tied scores, we may have a fractional T value,
such as T=10.5. We can use linear interpolation to estimate the p value associated with T=10.5 as
half way between the p values for T=10 and T=11. From the table, we find this p value to be
(.0420 + .0527)/2 = .0474, which is still statistically significant with p<.05 one-tailed.
From the table one-tailed p=.0137 for T=6 and p=.0244 for T=8. We estimate the p value for our
T=7.5 to be about ¾ of the distance between .0137 and .0244. This calculation is
(.0244 - .0137)(3/4) + .0137 = .008 + .0137 = .0217. Thus, our one-tailed p is not significant at the
alpha = .01 level. However, it is significant at the .05 level, both one-tailed and two-tailed (for
two-tailed p, we double the one-tailed p, giving us a two-tailed p of about .0434).
Howell’s table goes up to N=50. When we have more than 50 cases, the T statistic is
approximately normally distributed, and we can use a normal approximation. Under the null
hypothesis, the expected value of T = E(T) = (N)(N+1)/4 and the SD of T is the square root of
(N)(N+1)(2N+1)/24. In our example where N=10, E(T) = 27.5 as calculated earlier, and SD =
square root of 96.25 = 9.81. Applying this test to our data gives
z = [T - E(T)] / SD_T = (7.5 - 27.5) / 9.81 = -2.04. Consulting a z table gives p=.021 one-tailed.
In our case where N=10, this is not a very reliable test. The normal approximation is reasonably
accurate when N>50, but the tabled values are exactly correct. SPSS gives an approximate p-value
based on a normal approximation.
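A quick Python check of this normal-approximation arithmetic:

```python
import math

# Normal approximation for Wilcoxon T (for N = 10 the exact table is
# preferred, as noted above; this just reproduces the arithmetic).
N, T = 10, 7.5
expected_T = N * (N + 1) / 4                       # 27.5
sd_T = math.sqrt(N * (N + 1) * (2 * N + 1) / 24)   # sqrt(96.25) = 9.81
z = (T - expected_T) / sd_T
print(round(z, 2))                                 # -2.04
```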
Wilcoxon T for Matched Pairs: SPSS Nonparametric Application. If we wish to use SPSS to
apply the Wilcoxon test, our first task is to enter the data into SPSS. We need to tell SPSS the
‘before’ measure and the ‘after’ measure of recycling for each home. In the opening SPSS for
Windows window, select Type in data, click OK. You now see the Untitled – SPSS Data Editor,
a spread sheet. Before we enter data, let us define our variables. Click the tab at the bottom labeled
Variable View. In the column headed by Name, in Row 1 enter ID for an identification code.
Move to the second row under Name and you will notice that SPSS changed the name to all lower
case, id. We know we are not measuring id and ego, so it is OK. In Row 2 under Name, enter
before, and in Row 3 enter after. All three of these variables are numeric with no decimal places.
Under Decimals, we can change the number from the default value of 2 to our preferred value of
zero (0). When you click on a cell under Decimals, a tiny control bar opens, allowing you to
increase or decrease the number in the cell, or you can simply type the number you want. If you
exit the cell, you can come back and copy it and paste the value into other cells. In the first three
rows under Label, we can enter Identification Number, Before Treatment, and After
Treatment.
Enter data into SPSS. Click the tab at the bottom labeled Data View. In the first 14 rows under
the id column, enter the numbers 1 through 14. In the columns labeled before and after, enter the
appropriate observed numbers. When you are finished, the spreadsheet should look like the first
three columns in Table 1.
Analyze the data. Ordinarily, we would begin with descriptive analyses to help us understand our
data set and to check for errors. In our example, we noted an extreme score (Diff = 55 for id=5)
and we decided that the parametric dependent t-test is not appropriate, and that the nonparametric
Wilcoxon T is a better choice. In SPSS, click Analyze in the upper menu bar, click Nonparametric
Tests, 2 Related Samples…. In the Two-Related-Samples Tests window, highlight both before and
after, and click the little black triangle to move both of these variables into the Test Pair(s) List
window. Under Test Type, select Wilcoxon, and under Options, select Descriptives, and Continue.
Click Paste. In the syntax window of SPSS, click Run.
SPSS Output. Below are the syntax and the output for our analysis.
NPAR TEST
/WILCOXON=before WITH after (PAIRED)
/STATISTICS DESCRIPTIVES
/MISSING ANALYSIS.
NPar Tests

Descriptive Statistics
           N    Mean    Std. Deviation   Minimum   Maximum
BEFORE    14    6.50         6.607           0        17
AFTER     14   13.57        16.018           0        65

Wilcoxon Signed Ranks Test

Ranks
                                   N     Mean Rank   Sum of Ranks
AFTER - BEFORE   Negative Ranks   2(a)      3.75          7.50
                 Positive Ranks   8(b)      5.94         47.50
                 Ties             4(c)
                 Total           14
a. AFTER < BEFORE
b. AFTER > BEFORE
c. BEFORE = AFTER

Test Statistics(b)
                          AFTER - BEFORE
Z                             -2.045(a)
Asymp. Sig. (2-tailed)          .041
a. Based on negative ranks.
b. Wilcoxon Signed Ranks Test
We can compare the SPSS output to the results of our hand calculations. The z-test is not as
accurate as the exact values shown in Howell’s tables, because our sample is small. However, in
our example, the conclusions are the same. According to SPSS, the observed sum of negative
ranks is 7.5 which agrees with our hand calculation of T=7.5. A value of T this small or smaller is
unlikely. SPSS reports a two-tailed probability of .041 based on a normal approximation, which
gives one-tailed probability of about .021. This is consistent with our hand calculation of the test
using the normal approximation. Given our small sample, we should be suspicious of the normal
approximation for testing T. Our calculations using Howell’s tables show that the one-tailed
probability for T=7.5 with N=10 is about .0217. The two-tailed probability is twice that, or about
.043. In our example, we could make a good argument for using a one-tailed test, because we
probably are not interested in the treatment if it reduces recycling.
Categorical Data Analysis
Dale Berger
SPSS CROSSTABS D14a
Statistics for 2x2 Contingency Tables
When we are interested in the relationship between two categorical variables, SPSS CROSSTABS
provides many useful descriptive summaries and statistical tests. In this handout we will ask SPSS
to show us everything for a 2x2 table, and we will examine each analysis. Note that in practice
many of these statistics will not be relevant for any specific research application.
Example: We are interested in the relationship between education level and compensation level in
a large organization. We have data from a random sample of 100 employees, including whether the
employee has a BA degree or not and whether the employee is on hourly compensation or on
salary. We found 51 with a BA on salary, 9 with a BA on hourly, 24 with no BA on salary, and
16 with no BA on hourly.

             No BA     BA
Hourly     a   16    b   9      25
Salary     c   24    d  51      75
               40       60     100
We can enter this data set into SPSS in several ways. When we are given a frequency table,
perhaps the easiest method is to enter one line of data for each cell showing the levels of the
factors and the cell frequency. Then we can weight the analysis by the cell frequency.
Call up SPSS and select the Variable View tab. Under Name, enter three variable names: pay,
educ, and freq. Decimals can be set to zero for each of
these variables. Under Values, for pay set 0=Hourly and
1=Salary, and for educ set 0=”No BA” and 1=BA. Then
select the Data View tab. For the first cell, enter 0 for
pay, 0 for educ, and 16 for freq. Similarly, enter the
information for the other four cells.
Now we are ready to run CROSSTABS. First we need to weight our cases by freq. Click Data,
Weight Cases…, Weight cases by, select freq, click OK. Click Analyze, Descriptive statistics…,
Crosstabs, select pay for the row variable and educ for the column variable. Click Statistics, and
for illustration select everything except the last one (Cochran’s MH). Click Cells and select as
shown. Again, this will produce more output than we generally would want to request.
First, check the Count and the labels to verify that we have specified the data correctly for SPSS.
pay * educ Crosstabulation
                                          educ
                                0 No BA    1 BA      Total
pay  0 hourly  Count               16         9         25
               Expected Count      10.0      15.0       25.0
               % within pay        64.0%     36.0%     100.0%
               % within educ       40.0%     15.0%      25.0%
               Residual             6.0      -6.0
               Std. Residual        1.9      -1.5
     1 salary  Count               24        51         75
               Expected Count      30.0      45.0       75.0
               % within pay        32.0%     68.0%     100.0%
               % within educ       60.0%     85.0%      75.0%
               Residual            -6.0       6.0
               Std. Residual       -1.1        .9
Total          Count               40        60        100
               Expected Count      40.0      60.0      100.0
               % within pay        40.0%     60.0%     100.0%
               % within educ      100.0%    100.0%    100.0%
The Expected Count is based on an independence model. If educ and pay are independent, then
we can use the marginal division on one variable (e.g., pay is split 25% hourly and 75% salary) to
predict how cases are split between those levels for each level of the other variable. Thus, for the
40 cases with No BA and the 60 cases with a BA, we predict 25% at each level would be hourly
and 75% would be salaried if we have independence between pay and education. For the first cell,
this gives (RowSum * ColumnSum / N) = (25*40/100) = 10 as the expected count.
The “% within pay” tells us that 36% of the hourly people have a BA degree, while 68% of the
salary people have a BA.
The “% within educ” tells us that 60% of those with no BA are salaried while 85% of those with a
BA are salaried.
The Residual is the difference between the observed Count and the Expected Count.
The Std. Residual can be interpreted as a z-score to test the null hypothesis that the observed
count is consistent with the expected count for any given cell.
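The expected counts, residuals, and standardized residuals in the table can be reproduced with a few lines of Python (a cross-check outside SPSS):

```python
import math

# Expected counts, residuals, and standardized residuals for the
# pay x educ table (rows: hourly, salary; columns: No BA, BA).
observed = [[16, 9],
            [24, 51]]
n = 100
row_totals = [25, 75]
col_totals = [40, 60]

results = []
for i in range(2):
    for j in range(2):
        fe = row_totals[i] * col_totals[j] / n     # RowSum * ColumnSum / N
        residual = observed[i][j] - fe
        std_residual = residual / math.sqrt(fe)    # interpretable as a z-score
        results.append((fe, residual, round(std_residual, 1)))

print(results)  # [(10.0, 6.0, 1.9), (15.0, -6.0, -1.5), (30.0, -6.0, -1.1), (45.0, 6.0, 0.9)]
```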
Some of the statistics in the table of Chi-Square Tests will not be useful for most applications, but
they can be very useful for special applications.
Chi-Square Tests
                            Value    df   Asymp. Sig.  Exact Sig.  Exact Sig.
                                           (2-sided)    (2-sided)   (1-sided)
Pearson Chi-Square          8.000(b)  1      .005
Continuity Correction(a)    6.722     1      .010
Likelihood Ratio            7.901     1      .005
Fisher's Exact Test                                       .009        .005
Linear-by-Linear            7.920     1      .005
  Association
McNemar Test                                              .014(c)
N of Valid Cases              100
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 10.00.
c. Binomial distribution used.
Pearson Chi-Square tests the null hypothesis of independence between the row and column
variables in the population:

   χ² = Σ (fo_ij - fe_ij)² / fe_ij

This is an approximate goodness-of-fit test that may not be accurate if the expected value in any
cell is extremely small (e.g., less than 5). The observed and expected frequencies in Cell ij are
fo_ij and fe_ij, respectively. Distributed approximately as chi-square with
df = (number of rows – 1) * (number of columns – 1).
Continuity Correction should be used only for a 2x2 table where both marginals are fixed (i.e.,
known before the data are tested, as with a median test for two groups).
The Likelihood Ratio = G² is an alternate approximate chi-square test of independence:

   G² = 2 Σ fo_ij ln(fo_ij / fe_ij)

Distributed approximately as chi-square with df = (# of rows – 1) * (# of columns – 1).
Fisher’s Exact Test is a multinomial nonparametric test that can be used with very small samples,
including cases where expected values of some cells are less than five.
Linear-by-Linear Association is most useful when both variables are on an interval scale.
r² (N-1) is distributed approximately as chi-square with df=1.
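The Pearson chi-square, likelihood-ratio G², and linear-by-linear values described above can be reproduced from the observed and expected counts (a Python cross-check, not part of the SPSS output):

```python
import math

# Pearson chi-square, likelihood-ratio G2, and the linear-by-linear
# statistic for the pay x educ table, from the counts shown earlier.
observed = [16, 9, 24, 51]
expected = [10.0, 15.0, 30.0, 45.0]
n = 100

pearson = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
g2 = 2 * sum(fo * math.log(fo / fe) for fo, fe in zip(observed, expected))

# Linear-by-linear association: r^2 (N - 1); in a 2x2 table r^2 = chi^2 / N
linear = (pearson / n) * (n - 1)

print(round(pearson, 3), round(g2, 3), round(linear, 3))  # 8.0 7.901 7.92
```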
McNemar’s test is useful for 2x2 ‘before-after’ designs (e.g., one variable is Pass vs. Fail before
training and the second variable is Pass vs. Fail after training). The null hypothesis is that the two
marginal distributions are equal in the population (i.e., there was no change in passing rate). A
binomial test is conducted on only the two cells that show a change under the null hypothesis that
in the population there were as many changes in one direction as in the other.
Look for one set of N cases classified into two groups on two occasions, giving a 2x2 table. The
null hypothesis is that the split is the same on the two margins, not that the two marginal variables
are statistically independent.
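The binomial arithmetic behind the McNemar value of .014 in the table above can be illustrated in Python. Here the two off-diagonal cells are b = 9 and c = 24; in this example the test is shown only to explain the computation, since pay and education are not a before-after pair:

```python
from math import comb

# Binomial arithmetic behind the McNemar exact p value (.014 above).
# The two off-diagonal cells are b = 9 and c = 24: 33 'changes' in all.
b, c = 9, 24
n_changes = b + c                    # 33
k = min(b, c)                        # 9

# P(X <= 9) for X ~ Binomial(33, 1/2), doubled for a two-tailed test
p_one_tail = sum(comb(n_changes, i) for i in range(k + 1)) / 2 ** n_changes
p_two_tail = 2 * p_one_tail
print(round(p_two_tail, 3))          # .014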
Directional Measures
                                             Value   Asymp.        Approx.  Approx.
                                                     Std. Error(a)   T(b)     Sig.
Nominal by  Lambda        Symmetric           .108     .071         1.414    .157
Nominal                   pay Dependent       .000     .000           .(c)    .(c)
                          educ Dependent      .175     .114         1.414    .157
            Goodman and   pay Dependent       .080     .056                  .005(d)
            Kruskal tau   educ Dependent      .080     .055                  .005(d)
            Uncertainty   Symmetric           .064     .045         1.422    .005(e)
            Coefficient   pay Dependent       .070     .049         1.422    .005(e)
                          educ Dependent      .059     .041         1.422    .005(e)
Ordinal by  Somers' d     Symmetric           .281     .098         2.756    .006
Ordinal                   pay Dependent       .250     .090         2.756    .006
                          educ Dependent      .320     .110         2.756    .006
Nominal by  Eta           pay Dependent       .283
Interval                  educ Dependent      .283
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Cannot be computed because the asymptotic standard error equals zero.
d. Based on chi-square approximation
e. Likelihood ratio chi-square probability.
If we are interested in predicting one variable from the other, we may use a Directional Measure.

Lambda gives an indication of how helpful information on one variable (B) is for predicting the
level of the other variable (A).

             No BA     BA
Hourly     a   16    b   9      25
Salary     c   24    d  51      75
               40       60     100

CA = Number of cases classified correctly on A using the marginal on A
CA|B = Number of cases classified correctly on A given information on B

Lambda = (CA|B - CA) / (N - CA)
In our example, if we wish to predict pay level with no knowledge of education level, we can be
correct on 75 cases by predicting ‘salary’ for everyone, so CA = 75. Knowledge of education level
doesn’t help us predict pay level because the majority of people are on salary at both levels of
education. Thus we will predict ‘salary’ for everyone and we will be correct on 24 cases with No
BA and on 51 cases with a BA, for a total of CA|B = 24+51 = 75. Thus, lambda(pay) = 0.
If we wish to predict education level, we will be correct on 60 cases if we predict ‘BA’ for
everyone, giving CA = 60. Knowledge of pay level improves prediction. If someone is hourly, we
predict ‘No BA’ and we are correct on 16 cases. If someone is salaried, we predict ‘BA’ and we
are correct on 51 cases, giving us CA|B = 16+51 = 67. Thus, lambda(educ) = (67-60)/(100-60) = 7/40
= .175. Information on pay level accounts for 17.5% of the errors that are made in predicting
education level without knowledge of pay level.
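The lambda computations above can be checked with a few lines of Python (illustrative arithmetic only):

```python
# Lambda from the counting argument above (cells: hourly 16/9, salary 24/51).
n = 100

# Predicting education: without pay information, predict the mode (BA)
c_a = 60
# With pay information, predict the mode within each pay level
c_a_given_b = max(16, 9) + max(24, 51)          # 16 + 51 = 67
lambda_educ = (c_a_given_b - c_a) / (n - c_a)   # 7/40 = .175

# Predicting pay: education gives no improvement over the marginal mode
c_pay = 75
c_pay_given_educ = max(16, 24) + max(9, 51)     # 24 + 51 = 75
lambda_pay = (c_pay_given_educ - c_pay) / (n - c_pay)

print(round(lambda_educ, 3), lambda_pay)        # 0.175 0.0
```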
Symmetric Measures
                                         Value   Asymp.        Approx.   Approx.
                                                 Std. Error(a)   T(b)      Sig.
Nominal by Nominal    Phi                 .283                             .005
                      Cramer's V          .283                             .005
                      Contingency         .272                             .005
                        Coefficient
Ordinal by Ordinal    Kendall's tau-b     .283     .098         2.756      .006
                      Kendall's tau-c     .240     .087         2.756      .006
                      Gamma               .581     .160         2.756      .006
                      Spearman            .283     .098         2.919      .004
                        Correlation
Interval by Interval  Pearson's R         .283     .098         2.919      .004(c)
Measure of Agreement  Kappa               .267     .095         2.828      .005
N of Valid Cases       100
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Based on normal approximation.
Phi = √(χ²/N). In a 2x2 table, Phi = Pearson r. With cells labeled a, b, c, d (variable A in
columns, B in rows):

                  A
                0 No BA   1 BA
B  0 Hourly    a   16    b   9      25
   1 Salary    c   24    d  51      75
                   40       60     100

   χ² = (ad - bc)² N / [(a+b)(c+d)(a+c)(b+d)]

   Phi = r = (ad - bc) / √[(a+b)(c+d)(a+c)(b+d)]

Cramer's V adjusts for the size of the table by dividing phi by the square root of (L-1), where L
is the lesser of the number of rows and the number of columns:

   Cramer's V = √[χ² / (N(L - 1))]

The Contingency Coefficient ranges between 0 and 1:

   Contingency Coefficient = √[χ² / (N + χ²)]
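A Python cross-check of phi, Cramer's V, and the Contingency Coefficient for this table (plain arithmetic, outside SPSS):

```python
import math

# Phi, Cramer's V, and the Contingency Coefficient for the 2x2 table.
a, b, c, d = 16, 9, 24, 51
n = a + b + c + d                    # 100

chi_square = (a * d - b * c) ** 2 * n / ((a + b) * (c + d) * (a + c) * (b + d))
phi = math.sqrt(chi_square / n)                    # equals Pearson r in a 2x2
L = 2                                              # lesser of rows and columns
cramers_v = math.sqrt(chi_square / (n * (L - 1)))  # same as phi for a 2x2
cc = math.sqrt(chi_square / (n + chi_square))

print(round(phi, 3), round(cramers_v, 3), round(cc, 3))  # 0.283 0.283 0.272
```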
Kendall’s tau-b adjusts for ties in ranks.
Kendall’s tau-c also adjusts for table size for tables larger than 2x2.
Gamma provides a good test of ordinal by ordinal relationships. Gamma ignores ties and can be
used if there are many ties. A pair of cases from cells a and d is ‘concordant’ because the case with
the greater value on A has the greater value on B. P = the number of concordant pairs = a*d =
16*51 = 816. A pair of cases from cells b and c are ‘discordant’ because the case with the greater
value on A has the lesser value on B. Q = the number of discordant pairs = b*c = 9*24 = 216.
Gamma = (P-Q)/(P+Q) = (816-216)/(816+216) = 600/1032 = .581. With large N, gamma (G) can
be tested with Z = G * square root of [(P+Q) / N(1 – G*G)].
Spearman Correlation is the Pearson Correlation conducted on the ranks of the scores. This is
useful for an ordinal by ordinal relationship with few ties.
Cohen’s Kappa is a measure of agreement adjusted for chance. Po is the observed percent
agreement and Pc is the chance agreement. Kappa = (Po – Pc) / (1 – Pc). Thus, kappa can be
interpreted as a proportion improvement over chance.
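Several of these measures can be reproduced directly from the four cell counts. A minimal Python sketch (my code, not the SPSS algorithm):

```python
import math

a, b, c, d = 16, 9, 24, 51   # 2x2 counts from the example
N = a + b + c + d

# Phi (equals Pearson r in a 2x2 table)
phi = (a*d - b*c) / math.sqrt((a+b) * (c+d) * (a+c) * (b+d))

# Gamma from concordant (P) and discordant (Q) pairs
P = a * d      # 816 concordant pairs
Q = b * c      # 216 discordant pairs
gamma = (P - Q) / (P + Q)

print(round(phi, 3), round(gamma, 3))  # 0.283 0.581
```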
Risk Estimate

                                                     95% Confidence Interval
                                           Value       Lower        Upper
Odds Ratio for pay (0 hourly / 1 salary)   3.778       1.461        9.767
For cohort educ = 0                        2.000       1.286        3.111
For cohort educ = 1                         .529        .307         .913
N of Valid Cases                             100

              No BA      BA
Hourly       a = 16    b =  9      25
Salary       c = 24    d = 51      75
               40        60       100

Odds ratio = (a/c) / (b/d) = (16/24) / (9/51) = 3.778

Risk of No BA, hourly vs. salary:  [a/(a+b)] / [c/(c+d)] = (16/25) / (24/75) = 2.000

Risk of BA, hourly vs. salary:     [b/(a+b)] / [d/(c+d)] = (9/25) / (51/75) = .529
Odds ratios are not the same as ‘risk ratios.’ For example, the odds that someone with a BA is at
salary is 51/9 = 5.667 while the odds that someone with No BA is at salary is 24/16 = 1.500. Thus,
the odds ratio is 5.667/1.500 = 3.778.
Psychologists are more familiar with a description in terms of proportions or ‘risk ratios.’ We
might say that 85% of employees with a BA are on salary (i.e., 51/60) while only 60% (i.e., 24/40)
of employees without a BA are on salary. Thus, the probability of being on salary is 25 percentage
points greater for employees with a BA. We could also say that the probability of being on salary
is 42% greater for those with a B.A. because (85% - 60%) / 60% = .42. These differences can be
confusing unless one is very explicit about how percentages are calculated! Implication: We need
to be very clear about what we are reporting.
A limitation of ‘risk ratios’ is that it may be arbitrary which way we compute the ratio, and it isn’t
possible to convert from one to the other without additional information. For example, we might
compute the risk of being on hourly wages as 9/60 = 15% for those with a BA and 16/40 = 40%
for those with no BA. The probability of being on hourly is more than twice as great for those with
no BA, 40% / 15% = 2.67. Thus, the ‘risk ratio’ of being on hourly for those with no BA vs. those
with a BA is 2.67. However, the probability of being on salary for those with a BA is 85%
compared to 60% for those with no BA, so the ‘risk ratio’ of being on salary is only 85% / 60% =
1.42. However, the odds ratio is something quite different, 3.778.
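The three ratios can be computed directly from the cell counts; a small Python sketch (names are mine):

```python
a, b, c, d = 16, 9, 24, 51   # rows: hourly, salary; columns: No BA, BA

# Odds ratio: odds of No BA for hourly vs. salary workers
odds_ratio = (a / b) / (c / d)            # (16/9) / (24/51)

# Risk ratios: probability of No BA (or BA) for hourly vs. salary workers
rr_noBA = (a / (a + b)) / (c / (c + d))   # (16/25) / (24/75)
rr_BA   = (b / (a + b)) / (d / (c + d))   # ( 9/25) / (51/75)

print(round(odds_ratio, 3), round(rr_noBA, 3), round(rr_BA, 3))  # 3.778 2.0 0.529
```

Note that the single odds ratio of 3.778 matches neither risk ratio, which is exactly the caution raised above.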
BIG Caution: A common error is to interpret an odds ratio as a risk ratio. It is not correct to
interpret the odds ratio of 3.778 as indicating that people with a BA were 3.778 times as likely to
be on salary compared to someone without a BA.
Lesson: Odds ratio → big red flag when interpreting.
Categorical Data Analysis
Dale Berger
SPSS CROSSTABS D14b
Statistics for larger contingency tables
With large crosstab tables, relatively sophisticated analyses may be needed. For illustration, we
will use a hypothetical example adapted from Franke et al. (2012). Researchers wish to compare
the effectiveness of three different methods of serving families with children at risk for abuse or
neglect. A sample of 731 cases was randomly assigned to one of three treatments: (1) parenting
education, (2) community services, or (3) wraparound that included both of the first two services
plus case management. Outcomes after one year were classified into four categories: (1) no further
contact with Child Protective Services (CPS), (2) a referral to CPS, (3) substantiated allegations of
abuse or neglect, or (4) child removed from the home. Below is syntax and output for the example.
CROSSTABS
/TABLES=Outcome BY Treatment
/FORMAT=AVALUE TABLES
/STATISTICS=CHISQ GAMMA
/CELLS=COUNT COLUMN SRESID PROP
/COUNT ROUND CELL.
The Pearson chi-square test for independence = 36.771 with df = 6, p < .001. We reject the null
hypothesis, and conclude that there is a relationship between the two variables.
If the null hypothesis were true, then in the population the proportion of cases with any given
outcome is the same for every treatment. In the example, if the null hypothesis is true, then in the
population the proportion of cases where the child is removed is the same for every treatment (say
7%), and the proportion of cases referred to CPS is the same for every treatment (say 18%), etc. Of
course, because of sampling variability, these proportions would not be exactly the same in our
sample even if the null hypothesis is true. However, statistical significance tells us that the
differences in proportions that we observed in this hypothetical sample are highly unlikely if the
null hypothesis is true.
So, where are the differences? A chi-square test with more than one degree of freedom is a ‘blob’
test. It tells us that there are differences between treatments in the proportion of cases in the
various outcomes. However, it does not tell us where the differences are.
A common error is to draw conclusions about specific differences based on a visual examination of
the crosstab table. For example, we might be impressed that the Wraparound treatment had the
highest proportion of cases with no new CPS contact (72.3%), compared to 52.2% for the
Education treatment and 63.8% for the Community treatment. Similarly, we see that those who
received only Education had the highest proportion of Child Removed (14.5%) compared to only
4.3% and 4.2% for the other two treatment conditions. However, we cannot use the overall chi-square test to conclude that those specific differences are statistically significant. Special, more
focused tests are needed.
Standardized Residual tests
A simple, but very limited test is provided by the Standardized Residuals for each individual cell.
This statistic, distributed as standardized Z, is computed as the square root of the contribution of a
cell to the Pearson chi-square statistic: (Observed – Expected) / SQRT(Expected). For example,
the observed frequency of Child Removed in the Education group (Row 4, Column 1) is 27. The
expected frequency for this cell if the treatment and outcome were independent is computed as
(Row Total)*(Column Total) / N = (50)*(186) / 731 = 12.72. The standardized residual is (27 –
12.72) / SQRT(12.72) = 14.28 / 3.57 = 4.00 (see Std. Residual in the table). Because this statistic
exceeds 1.96, the two-tailed p-value is less than .05. The precise p-value is .00006.
This test may not be of much practical value because it deals with one cell at a time in comparison
to the entire model. We can conclude that there are more cases of Child Removed in the Education
condition than we would expect if treatment and outcome were independent. In practice, it may be
more useful to compare specific conditions to each other. Also, we may be concerned with alpha
inflation if we consider a separate test for each cell. In this example, we have 3x4 = 12 cells.
Applying the Bonferroni logic, the critical p-value would be .05/12 = .0042. Because the observed p-value of .00006 is less than .0042, even this conservative test attains statistical significance.
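The standardized residual calculation generalizes to every cell. A Python sketch of the computation described above (observed counts are from the example; rows are outcomes, columns are treatments; the code is mine, not SPSS):

```python
import math

# Observed counts: rows = Outcome (1-4), columns = Treatment (Educ, Comm, Wrap)
obs = [[ 97, 120, 258],
       [ 38,  42,  49],
       [ 24,  18,  35],
       [ 27,   8,  15]]

row_tot = [sum(r) for r in obs]
col_tot = [sum(col) for col in zip(*obs)]
N = sum(row_tot)  # 731

# Standardized residual for each cell: (Observed - Expected) / sqrt(Expected),
# where Expected = (row total)(column total)/N
sresid = [[(obs[i][j] - row_tot[i] * col_tot[j] / N)
           / math.sqrt(row_tot[i] * col_tot[j] / N)
           for j in range(3)] for i in range(4)]

print(round(sresid[3][0], 2))  # 4.0  (Child Removed, Education)
```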
Gamma test of ordinal by ordinal relationship
If the values for both the row and column variables can be ordered from low to high on some
underlying concept, then the gamma statistic provides an index and test of the ordinal by ordinal
relationship. A statistically significant positive gamma indicates that larger values on one variable
CD05 Nonparametric Statistics
29
are associated with larger values on the other variable.
A negative gamma indicates a relationship in opposite
direction.
Outcome by Treatment crosstabulation (counts)

                               Treatment
Outcome                   1 (Educ)  2 (Comm)  3 (Wrap)  Total
1  No further contact         97       120       258      475
2  Referred to CPS            38        42        49      129
3  Substantiated              24        18        35       77
4  Child removed              27         8        15       50
Total                        186       188       357      731
As noted earlier in the discussion of 2x2 tables, gamma
is based on the number of ‘concordant’ vs. ‘discordant’
pairs of cases. For the gamma calculation we consider
each pair of cases where the cases do not share a row or a column. For example, any one of the 97
cases from the Cell[1,1] could be paired with any one of the 42 cases from Cell[2,2]. Any pair like
this would be considered concordant because the case that has a larger value on the row variable
also has a larger value on the column variable.
For each case in Cell[1,1], a concordant pair can be formed with any case from any cell that is
larger in both row and column, displayed to the right and down from Cell[1,1] in the current
example. There are 42 + 49 + 18 + 35 + 8 + 15 = 167 such cases, so the number of concordant
pairs where one case is in Cell[1,1] is 97 * 167 = 16,199. Similarly, we can compute the number of
concordant pairs where the first case is in Cell[1,2] as 120 * (49 + 35 + 15) = 11,880. No
concordant pairs can be formed with a case from Cell[1,1] and any of the other cases in Row 1 or
Column 1. For Cell[2,1] with 38 cases, there are 38 * (18+35+8+15) = 2,888 concordant pairs.
Continuing, with cells [2,2], [3,1], and [3,2], we find 2,100, 552, and 270 more concordant pairs,
respectively. Adding the numbers of concordant pairs gives a total of 33,889 = P.
For discordant pairs, the case with a larger value on the row variable has a smaller value on the
column variable (ties are ignored). For each of the 258 cases in Cell[1,3], a discordant pair can be
formed with any case taken from any of the cells to the left and downward. There are 157 cases in
those cells, giving 258 * 157 = 40,506 discordant pairs involving Cell[1,3]. Cells [1,2], [2,2],
[2,3], [3,2], and [3,3] produce 10680, 2142, 3773, 486, and 1225 discordant pairs with cases in
cells to their left and downward, giving a total of Q = 58,812 discordant pairs.
Gamma = (P-Q) / (P+Q) = (33,889 - 58,812) / (33,889 + 58,812) = -24,923 / 92,701 = -.269. With
large N, gamma (G) can be tested with standardized Z = G * square root of [(P+Q) / N(1 – G*G)].
Z = -.269 * √[92,701/(731*(1 – .269²))] = -.269 * √136.7 = -3.14; p < .01.
Because gamma = -.269 is negative, we know that larger values on the treatment variable tend to
be associated with smaller values on the outcome variable. From the coding, we see that larger
values on the Treatment variable indicate a more intense treatment, while larger values on the
outcome variable indicate a worse outcome. Thus, the negative gamma indicates that more intense
treatment is associated with lower values on the index of negative outcomes. It may be easier to
describe this finding by reversing the coding on the outcome variable, so a larger number indicates
a more favorable outcome. Then, the sign would be reversed on gamma. We don’t actually need to
re-run the analysis for gamma with the reversed scale because we know what it would be: gamma
= .269. The interpretation is “More intense treatments are associated with better outcomes.”
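The pair counting above can be automated. A Python sketch that loops over all pairs of cells (code and names are mine):

```python
import math

# Concordant (P) and discordant (Q) pairs for the 4x3 table, then gamma
# and its large-sample Z, following the hand calculation above.
obs = [[ 97, 120, 258],
       [ 38,  42,  49],
       [ 24,  18,  35],
       [ 27,   8,  15]]
R, C = 4, 3
N = sum(map(sum, obs))  # 731

P = Q = 0
for i in range(R):
    for j in range(C):
        # concordant partners: cells below and to the right
        P += obs[i][j] * sum(obs[k][l] for k in range(i + 1, R)
                                        for l in range(j + 1, C))
        # discordant partners: cells below and to the left
        Q += obs[i][j] * sum(obs[k][l] for k in range(i + 1, R)
                                        for l in range(j))

gamma = (P - Q) / (P + Q)
z = gamma * math.sqrt((P + Q) / (N * (1 - gamma**2)))
print(P, Q, round(gamma, 3))  # 33889 58812 -0.269
```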
Contrasts to test specific hypotheses
Suppose we have a specific hypothesis that families that receive the Wraparound treatment are less
likely have a referral to Child Protective Services (CPS) than families that receive either of the
other two treatments. The overall test of independence was χ2 (6, N = 731) = 36.771, p < .001,
indicating that there is a significant relationship between treatment and outcome. However, the
overall test is a ‘blob’ test that does not allow us to draw conclusions about specific comparisons.
As with ANOVA, we can construct contrasts to test specific hypotheses. The test is a Z test,
computed as the value of a contrast divided by the standard error of the contrast (Goodman, 1963).
A contrast comparing group proportions is computed as Ψ̂ = Σ wi(pi), where pi is a group proportion and wi is the weight assigned to that proportion. The weights are defined so that the
sum of the weights is zero; thus, if the null hypothesis is true, the expected value of the contrast is
zero. We begin by identifying the proportions we wish to compare, and the appropriate weights.
The proportions of cases Referred to CPS in the Education and Community groups are 38/186 =
.2043 = p1 and 42/188 = .2234 = p2, respectively. The total number of cases in these two groups is
(186 + 188) = 374, and the relative weights assigned to these two groups correspond to their share
of the N for these groups pooled: 186/374 = .4973 and 188/374 = .5027. Note that the sum of
weights for these two groups is +1.0000.
The proportion of cases Referred to CPS in the Wraparound treatment group is 49/357 = .1373 =
p3. We assign weight -1 to the Wraparound group.
The value of the contrast is Ψ̂ = Σ wi(pi) = (.4973)*(.2043) + (.5027)*(.2234) + (-1.000)*(.1373)
= .1016 + .1123 + (-.1373) = .0766.
The standard error squared for each sample proportion is (pi * qi) / Ni, where pi is a specific proportion of interest, qi = 1 – pi, and Ni is the number of cases upon which the proportion pi is based. The squared standard error for a contrast is SE²Ψ = Σ wi²(pi qi / Ni).

Treatment
Group         Ni     pi       wi       wi(pi)    qi       pi qi/Ni     wi²(pi qi/Ni)
Education    186    .2043    .4973      .1016   .7957     .0008740      .0002161
Community    188    .2234    .5027      .1123   .7766     .0009228      .0002332
Wraparound   357    .1373  -1.0000     -.1373   .8627     .0003318      .0003318
Sum                         0.0000      .0766 = Ψ̂                      .0007811 = SE²Ψ
                                                                        .02795  = SEΨ

Z = Ψ̂ / SEΨ = .0766 / .02795 = 2.74, p = .006 two-tailed.
If this was an a priori hypothesis and we did not wish to make any alpha adjustments for possible
multiple tests, we could conclude that we have statistically significant evidence (p < .01) of fewer
referrals to Child Protective Services for families that received the Wraparound treatment
compared to families that received the other two treatments.
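The contrast computation can be scripted. A Python sketch of the hand calculation above (code and names are mine):

```python
import math

# Contrast comparing the Wraparound referral rate with the weighted
# average of the Education and Community referral rates.
n = [186, 188, 357]            # group sizes
p = [38/186, 42/188, 49/357]   # proportions referred to CPS
w = [186/374, 188/374, -1.0]   # weights sum to zero

psi = sum(wi * pi for wi, pi in zip(w, p))          # value of the contrast
se = math.sqrt(sum(wi**2 * pi * (1 - pi) / ni
                   for wi, pi, ni in zip(w, p, n)))  # SE of the contrast
z = psi / se
print(round(z, 2))  # 2.74
```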
Holm’s test: If multiple tests are considered, then it may be appropriate to make adjustments. With
a small set of a priori orthogonal contrasts, no adjustments may be needed. With a set of
nonorthogonal a priori hypotheses, Holm’s test (Holm, 1979) is a good choice. Suppose you wish
to test k contrasts (e.g., k = 5) controlling family-wise alpha error at α, such that if there is no
effect of treatments at all, the probability of even one false significant finding for the set of k tests
is α (e.g., α = .01). The test procedure uses a different critical value for each contrast.
Compute the p-value for each contrast and order them from smallest observed p-value to largest.
Then test the smallest p-value against the critical value of α / k (e.g., .01/5 = .0020), test the next
smallest p-value against α / (k – 1) (e.g., .01/4 = .0025), the next against α / (k – 2) (e.g., .01/3 =
.0033), and so on until the last p-value is tested against α / 1 (e.g., .01). An important rule is to stop at any point where the observed p-value exceeds its critical value.
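Holm's procedure is easy to script. A Python sketch (the p-values in the example call are hypothetical, chosen only to illustrate the stopping rule):

```python
# Holm's sequentially rejective procedure: test the sorted p-values against
# alpha/k, alpha/(k-1), ..., and stop at the first one that fails.
def holm(p_values, alpha=0.01):
    k = len(p_values)
    rejected = []
    ordered = sorted(enumerate(p_values), key=lambda pair: pair[1])
    for step, (i, p) in enumerate(ordered):
        if p <= alpha / (k - step):
            rejected.append(i)   # record which hypothesis was rejected
        else:
            break                # stop: this and all larger p-values fail
    return rejected

print(holm([0.0005, 0.0150, 0.0030, 0.0022, 0.4000]))  # [0, 3, 2]
```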
Scheffe’s test: If the contrasts are selected after looking at the data, the much more conservative
Scheffe’s test is appropriate. Compute the Z values in the usual way, but compare to the square
root of the critical value for the original chi-square test for the full table. If we wished to use alpha
= .01 for a Scheffe's test on our example with a 3x4 table, we find the critical χ2 (6, α = .01) = 16.81, and take the square root of this value to find 4.10. A calculated Z-value must exceed 4.10 to be considered statistically significant using Scheffe's test with alpha = .01 on a 3x4 table.
Suppose we notice that 258 of the 357 families that received Wraparound treatment had no further
involvement with CPS (72.27%), which appears greater than the 97 of 186 families that received
Education only (52.15%). The contrast is .7227 - .5215 = .2012. The SE for the contrast is
.04362, giving a Z score of .2012 / .04362 = 4.61. To apply the Scheffe test, we compare the 4.61
to 4.10. Our conclusion is that this difference is statistically significant, even taking into account
the fact that we looked at the data to find this large effect.
Treatment
Group         Ni     pi      wi     wi(pi)    qi       pi qi/Ni     wi²(pi qi/Ni)
Education    186    .5215    -1     -.5215   .4785     .001342       .001342
Wraparound   357    .7227    +1      .7227   .2773     .0005614      .0005614
Sum                        0.0000    .2012 = Ψ̂                      .001903 = SE²Ψ
                                                                     .04362  = SEΨ
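A Python sketch of this Scheffe-style check (the critical value 16.81 is taken from a chi-square table, as above; the code and names are mine):

```python
import math

# Scheffe-style check for the post hoc contrast: compare the contrast Z
# to the square root of the chi-square critical value for the full table.
critical_z = math.sqrt(16.81)   # sqrt of chi2(df=6, alpha=.01); ~4.10

p1, n1 = 97/186, 186    # Education: proportion with no further CPS contact
p2, n2 = 258/357, 357   # Wraparound: proportion with no further CPS contact

psi = p2 - p1                                          # ~.2012
se = math.sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2)        # ~.0436
z = psi / se                                           # ~4.61
print(z > critical_z)  # True: significant even after data snooping
```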
Caution: For the Z test to be accurate, the sample sizes should be large. A rule of thumb is that the
product Ni * pi * qi for any cell involved in a contrast should exceed 5. In our example, the
smallest observed Npq is for the Child Removed in the Community Treatment condition, where
only 8 out of 188 were removed. This pi = 8/188 = .0425, so qi = .9575. The product Npq = 7.66.
Thus, for this example, all cells have enough data for us to be comfortable with the Z test, which
assumes normal distributions. With smaller samples, it may not be appropriate to use the Z test
procedure to compare proportions in individual cells that have few observations.
Franke, T. M., Ho, T., & Christie, C. A. (2012). The chi-square test: Often used and more often
misinterpreted. American Journal of Evaluation, 33, 448-458.
Goodman, L. (1963). Simultaneous confidence intervals for contrasts among multinomial populations. The
Annals of Mathematical Statistics, 35, 716-725.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics,
6, 65-70.
McNemar’s test of related proportions: D15
Dale Berger, CGU
McNemar’s test is applied to a 2x2 table to determine whether the row and column marginal
proportions are different from each other in the population from which the data were sampled.
For example, if one set of N cases is classified into two categories (+ or –) at two different times
(A and B) or by two different raters (A and B), we can test whether the +/– split is different on the
two occasions or for the two raters.
We wish to know whether children are more likely to attain a rating of ‘excellent’ after they
complete a training program. In a sample of 40 children, only 14 (i.e., 14/40 = 35%) were rated
‘excellent’ before training, but after training 20 (i.e., 50%) were rated ‘excellent.’ Do we have
statistically significant evidence of improvement? The null hypothesis is that in the population
represented by this sample the proportion rated ‘excellent’ is the same before and after training.
               B
            +     −    totals
A     +     a     b     a+b
      −     c     d     c+d
  totals   a+c   b+d      N

                after
             +     −    totals
before  +  a 13  b  7     20
        −  c  1  d 19     20
   totals    14    26     40
If the null hypothesis is true, we expect the number of + responses to be the same before and after
training. That is, we expect the totals a+b = a+c, or b=c. We can say “If the null hypothesis of no
effect is true, then we expect to see as many cases that switch from + to – as cases that switch from
– to +.” In our example we see that a total of b + c = 7 + 1 = 8 cases switched classification, and of
these, 7 switched from – to + while only 1 switched from + to –. Is this evidence of significant
improvement?
The relevant model is the binomial distribution. The null hypothesis is that the b+c cases that
switched are from a binomial distribution where we expect cases to be split into the b and c cells
with p=1/2 for each. How surprising is it to find a split as extreme as 1 in 8? This is just as
surprising as it would be to toss a fair coin 8 times and observe 1 or fewer heads.
We can use a computer program like StatWISE to find the exact probability of observing 0 or 1
heads out of 8 coin tosses. We use N=8, X=1, P=.50 and we find p(x<=1) = .0352. If we can
justify a one-tailed test a priori, then we conclude that we have statistically significant evidence
that a higher proportion of children attain a rating of ‘excellent’ after training than before training
(p=.0352). If we would like to be able to detect a change in either direction, then we should apply a
two-tailed test by simply multiplying the observed p value by two, giving p = 2*.0352 = .0704.
In this case, we do not attain statistical significance at the .05 level. If A and B were two raters and we were interested in testing whether there is a difference in how liberal these raters are in assigning ‘+’ vs. ‘–’, then we should apply the two-tailed test.
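The exact binomial p-value can also be computed in a few lines of Python rather than looked up (a sketch; the names are mine):

```python
from math import comb

# Exact one-tailed p for McNemar via the binomial: probability of observing
# 1 or fewer 'heads' in b + c = 8 tosses of a fair coin.
b, c = 1, 7          # cases that switched + to - and - to +
n = b + c
p_one_tail = sum(comb(n, x) for x in range(min(b, c) + 1)) / 2**n
print(p_one_tail)                 # 9/256 = 0.03515625

# Two-tailed version: double the one-tailed probability.
print(min(2 * p_one_tail, 1.0))   # 0.0703125
```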
In general, we should use two-tailed tests unless we know, before looking at the data, that we are
willing to ignore a difference in one of the two directions. In our example, perhaps our decision is
whether to implement the new training program. If training makes things worse or doesn’t help, we
won’t implement it, so we are interested only in whether we have evidence that the training
improves performance. In this case we could conduct a one-tailed test with a pure heart.
The binomial distribution gives the exact probability and so it is the preferred test. McNemar’s test
is often presented as a chi-square test with df=1, but we should be aware that this is an
approximation that may not be accurate with small samples.
McNemar's chi-square approximation (with a correction for continuity) is

    χ²(df=1) = (|b − c| − 1)² / (b + c)

Note: The vertical lines around b − c indicate ‘absolute value,’ which is always treated as a positive value. Thus |1 − 7| becomes +6.

In our example, we find χ²(df=1) = (|7 − 1| − 1)² / (7 + 1) = (6 − 1)² / 8 = 25/8 = 3.125, giving p = .0771.
Although the p value comes from the upper tail of the chi-square distribution, this is a two-tailed
test of the marginal probabilities because we would get a large value of chi-square if a large
difference was observed in either direction.
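A Python sketch of the continuity-corrected chi-square, using the identity P(χ²₁ > x) = erfc(√(x/2)) for the df = 1 p-value (the code and names are mine):

```python
import math

# McNemar chi-square with continuity correction; the df = 1 p-value comes
# from the normal tail, since a chi-square with df = 1 is a squared Z.
def mcnemar_chi2(b, c):
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    p = math.erfc(math.sqrt(chi2 / 2))   # two-tailed p for df = 1
    return chi2, p

chi2, p = mcnemar_chi2(1, 7)
print(round(chi2, 3), round(p, 4))  # 3.125 0.0771
```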
Important notes:
The McNemar test does not test independence. The focus is on the cases only where there is
disagreement in the ratings. The cases where there is agreement are ignored and have no impact on
the McNemar test. Thus, the two tables below give identical results for McNemar tests. It is
interesting to note that in the first table below, the proportion of + responses changes from 49.3%
to 50.7%, while in the second table the proportion of + changes from 12.5% to 50.0%, yet the test
results are identical because the focus of the test is on only those cases that change. In many
practical applications, we should include information on agreement as well as disagreement to
provide a context for the McNemar test that focuses on disagreements only. In the examples
below, we should use the binomial distribution rather than the McNemar chi-square
approximation, because the number of cases with disagreement is very small. The binomial is
always correct, even with very small n, while the chi-square approximation may be quite far off
with small n.
                before
             +     −    totals
after  +   200     7     207
       −     1   200     201
   totals  201   207     408

                before
             +     −    totals
after  +     1     7       8
       −     1     7       8
   totals    2    14      16
Be careful with labeling to make sure that you focus on the cells with cases that change. If the +
and – columns were switched, we would focus on cells a and d rather than cells b and c.
Dale Berger
Spearman r and SPSS: D16
The p values reported by SPSS for the Spearman correlation are wrong. They apparently are based
on the parametric t-test rather than on the exact probabilities.
Suppose we would like to measure and test agreement between two raters. If we have a reasonably
bivariate normal distribution, Pearson correlation is the index of choice, and we can use a t-test to
test the null hypothesis that the correlation is zero in the population. However, if we have an
outlier, especially in a small sample, the correlation can be affected greatly and that test of
statistical significance would not be very accurate. Here is an example.
Two professors assigned the following ratings to four essays:
Professor A: 86 81 72 20 (A_rate)
Professor B: 92 95 85 12 (B_rate)
What is the probability of such close agreement if their ratings are independent in the population of
all possible essays represented by this sample? The Pearson correlation might not be a very good
statistic here because of the very small sample size and the outlier (one essay in this small sample
apparently was extremely weak).
The Spearman correlation analyzes ranks rather than the raw data. Thus, we analyze the following
ranked data:
Professor A: 1 2 3 4 (A_rank)
Professor B: 2 1 3 4 (B_rank)
Spearman correlation (rs) is based on the agreement between ranks. For each case (each essay in
this example), we compute the difference between ranks (di for case i). Then we square each di and
find the sum. The minimum possible value for the sum of squared di values is 0 in the case where
all ranks agree perfectly. The maximum possible value for this sum of squared di values is N(N² − 1)/3, where N is the number of cases that are ranked. In our example, this maximum is 4(16 − 1)/3 = 20. You can check this out. If the two professors ordered the four essays in perfectly opposite order, the sum of the squared differences in ranks would be (1−4)² + (2−3)² + (3−2)² + (4−1)² = 9+1+1+9 = 20. If there is no relationship, we expect the sum of squared di to be about half way between zero and N(N² − 1)/3, i.e., N(N² − 1)/6. In our example, this would be 4(16 − 1)/6 = 10.
Spearman r can be computed as rs = 1 − 6Σdi² / [N(N² − 1)].

In our example, the sum of di squared is (1−2)² + (2−1)² + (3−3)² + (4−4)² = 1+1+0+0 = 2.
This gives us rs = 1 − (6*2)/(4*15) = 1 − 12/60 = 1 − .200 = .800.
Spearman r is simply Pearson r computed on ranked scores. The computation formula for
Spearman r is a shortcut formula for Pearson r in the special case when we have ranked data.
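As a check, the shortcut formula can be run in a few lines of Python (a sketch; the lists hold the ranks given above):

```python
# Spearman r from the shortcut formula applied to the ranked essay scores.
A_rank = [1, 2, 3, 4]
B_rank = [2, 1, 3, 4]

N = len(A_rank)
sum_d2 = sum((a - b) ** 2 for a, b in zip(A_rank, B_rank))   # 2
r_s = 1 - 6 * sum_d2 / (N * (N**2 - 1))
print(r_s)  # 0.8
```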
How surprising is such a large value for Spearman correlation? Can we apply the usual t-test?
Clearly we have not satisfied the assumptions for the parametric t-test. The distribution of ranks is
not normal and residuals from the regression line are not normally distributed. What to do?
We can compute exactly how likely an observed Spearman r is by considering all possible
outcomes that we might observe when we have two ratings of the same N objects. If we order the
outcomes from the first rater as 1, 2, 3, …, N, then we can consider all of the possible different
orders that we might observe for the second rater. How many ways can we order N objects? From
our counting rules, we know that is N*(N-1)*(N-2)*…*(3)*(2)*(1) = N factorial = N!
In our example, there are 4! = 4*3*2*1 = 24 ways to order 4 distinct objects. If there is no
relationship between the two raters, then each of these 24 possible pairings is equally likely.
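This enumeration is small enough to run exhaustively. A Python sketch that generates all 24 orders and finds the exact one-tailed p for the observed rs = .800 (the code and names are mine):

```python
from itertools import permutations

# Exact permutation distribution of Spearman r for N = 4: hold rater A's
# ranks fixed at 1..4 and enumerate all 24 possible orders for rater B.
N = 4
A = range(1, N + 1)

r_values = []
for B in permutations(A):
    sum_d2 = sum((a - b) ** 2 for a, b in zip(A, B))
    r_values.append(1 - 6 * sum_d2 / (N * (N**2 - 1)))

# Exact one-tailed p: proportion of orders with r at least as large as .800
p = sum(r >= 0.8 for r in r_values) / len(r_values)
print(round(p, 3))  # 0.167  (i.e., 4/24)
```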
Here are the 24 possible rankings for Rater B, the sum of di squared, the rs value, and the
probabilities of observing a value that large or larger if there is no relationship in the population.
Rank_B       Σ(di²)      rs      Cumulative p
1 2 3 4         0      1.000      1/24 = .042
1 2 4 3         2       .800      4/24 = .167
1 3 2 4         2       .800      4/24 = .167
2 1 3 4         2       .800      4/24 = .167
2 1 4 3         4       .600      5/24 = .208
1 3 4 2         6       .400      9/24 = .375
1 4 2 3         6       .400      9/24 = .375
2 3 1 4         6       .400      9/24 = .375
3 1 2 4         6       .400      9/24 = .375
1 4 3 2         8       .200     11/24 = .458
3 2 1 4         8       .200     11/24 = .458
2 4 1 3        10       .000     13/24 = .542
3 1 4 2        10       .000     13/24 = .542
2 3 4 1        12      -.200     15/24 = .625
4 1 2 3        12      -.200     15/24 = .625
2 4 3 1        14      -.400     19/24 = .792
3 2 4 1        14      -.400     19/24 = .792
4 1 3 2        14      -.400     19/24 = .792
4 2 1 3        14      -.400     19/24 = .792
3 4 1 2        16      -.600     20/24 = .833
3 4 2 1        18      -.800     23/24 = .958
4 2 3 1        18      -.800     23/24 = .958
4 3 1 2        18      -.800     23/24 = .958
4 3 2 1        20     -1.000     24/24 = 1.000
There is only one order out of 24 that gives perfect agreement (1, 2, 3, 4), so the probability by
chance of observing perfect agreement with rs = 1.000 is 1/24 = .042. There are three ways to
obtain rs = .800, so the probability of observing an rs value of .800 or higher is 4/24 = .167.
Let’s see what SPSS tells us. For illustration, I used the original rating data from the two
professors, A_rate and B_rate, as well as the ranking data, A_rank and B_rank. SPSS can compute
the Spearman r in the bivariate correlation analysis that produces Pearson r.
I entered these four variables into SPSS and asked for Pearson correlation, Spearman correlation,
and Kendall’s tau_b. Kendall’s tau_b is another rank-order statistic that should give us p values
equivalent to Spearman (according to Siegel, 1956, p. 219).
To run the SPSS analysis, click Analyze, Correlate, Bivariate…, to open the Bivariate
Correlations window. Select all four variables for analysis, check the boxes for Pearson, Kendall’s
tau_b, and Spearman. Select the One-tailed test of significance. This generates the following
syntax.
CORRELATIONS
/VARIABLES=A_rate B_rate A_rank B_rank
/PRINT=ONETAIL NOSIG
/STATISTICS DESCRIPTIVES
/MISSING=PAIRWISE .
NONPAR CORR
/VARIABLES=A_rate B_rate A_rank B_rank
/PRINT=BOTH ONETAIL NOSIG
/MISSING=PAIRWISE .
First, we see the Pearson correlations. The correlation between the two ratings is .992. This is
inflated because of the outlier. A scatterplot shows how the outlier inflates the correlation.
Correlations (Pearson r, with one-tailed Sig. in parentheses; N = 4 for every pair)

A_rate with B_rate:    .992**  (.004)
A_rate with A_rank:   -.879    (.060)
A_rate with B_rank:   -.837    (.082)
B_rate with A_rank:   -.816    (.092)
B_rate with B_rank:   -.836    (.082)
A_rank with B_rank:    .800    (.100)

**. Correlation is significant at the 0.01 level (1-tailed).
Also, we see that the correlation between the two ranked variables is .800. This demonstrates that
Pearson’s correlation computed on ranks is exactly equal to the Spearman rho value. If we had
normal residuals, this test of statistical significance would be correct:
t = r√(N−2) / √(1−r²) = .800√(4−2) / √(1−.800²) = .8(1.4142)/.6000 = 1.886; one-tailed p = .100.
However, from our hand calculations, we know that the correct one-tailed p value for a Spearman r
= .800 with N=4 is p = .167.
Now check what SPSS does when it reports Spearman’s rho. We see the same value for the
correlation between the ratings as we see for the correlation between ranks. Both are .800. That is
correct, because the Spearman correlation for both the ratings and the rankings are based on
rankings. SPSS calculated the value for Spearman’s rho correctly. However, SPSS did not
compute the p value correctly. The reported p value is .100, the same value reported for the
Pearson correlation on ranks. That cannot be correct. Shame on SPSS!
Correlations (one-tailed Sig. in parentheses; N = 4 for every pair)

Kendall's tau_b:
A_rate with B_rate:     .667    (.087)
A_rate with A_rank:   -1.000*   (.021)
A_rate with B_rank:    -.667    (.087)
B_rate with A_rank:    -.667    (.087)
B_rate with B_rank:   -1.000*   (.021)
A_rank with B_rank:     .667    (.087)

Spearman's rho:
A_rate with B_rate:     .800    (.100)
A_rate with A_rank:   -1.000**  (.000)
A_rate with B_rank:    -.800    (.100)
B_rate with A_rank:    -.800    (.100)
B_rate with B_rank:   -1.000**  (.000)
A_rank with B_rank:     .800    (.100)

*. Correlation is significant at the 0.05 level (1-tailed).
**. Correlation is significant at the 0.01 level (1-tailed).
Further note that the Spearman rho between A_rate and A_rank is -1.000. That is correct because
the smallest rank was assigned to the largest rating, etc. But there are only 24 possible orders for Spearman with N = 4, so even the most extreme outcome of ±1.000 has a p value of 1/24 = .042. Yet SPSS
reports Sig. (1-tailed) = .000. Also, the p values for Kendall’s tau_b and Spearman’s rho should be
the same (Siegel, 1956, p. 219), but not in this SPSS analysis. Double shame!
Take Home Lesson: Don’t trust a computer program for an important decision unless you can
verify that the program is working correctly.
Bumble’s axiom: To err is human; to really screw up it takes a computer.