2. Two-way contingency tables
2.1 Probability structure for contingency tables
Setup:
• Let X be a categorical variable with i = 1,…,I levels.
• Let Y be a categorical variable with j = 1,…,J levels.
• There are I×J different possible combinations of X and Y together.
• Frequency counts of these combinations can be summarized in an I×J "contingency table".
• Often called "two-way" tables since there are two variables of interest.
Example: Larry Bird (data source: Wardrop, American Statistician, 1995)

Free throws are typically shot in pairs. Below is a contingency table summarizing Larry Bird's first and second free throw attempts during the 1980-1 and 1981-2 NBA seasons. Let X = First attempt and Y = Second attempt.

                      Second
                 Made   Missed   Total
First   Made      251       34     285
        Missed     48        5      53
        Total     299       39     338
Interpreting the table:
• 251 pairs had both the first and second free throw attempts made.
• 34 pairs had the first free throw attempt made and the second missed.
• 48 pairs had the first free throw attempt missed and the second made.
• 5 pairs had both the first and second free throw attempts missed.
• 285 first free throws were made regardless of what happened on the second attempt.
• 299 second free throws were made regardless of what happened on the first attempt.
• 338 free throw pairs were shot during these seasons.
What types of questions would be of interest for this
data?
Example: Field goals
Below is a two-way table summarizing field goals from
the 1995 NFL season (Bilder and Loughin, Chance,
1998). The data can be considered a representative
sample from the population. The two categorical
variables in the table are stadium type (dome or
outdoors) and field goal result (success or failure).
 2010 Christopher R. Bilder
2.3
Field goal result
Success Failure
335
52
Stadium Dome
type
Outdoors
927
111
Total
1262
163
Total
387
1038
1425
What types of questions would be of interest for this
data?
Example: Salk vaccine clinical trials
From p. 186 of the S-Plus 6 Guide to Statistics Volume I
In the Salk vaccine trials, two large groups were involved
in the placebo-control phase of the study. The first
group, which received the vaccination, consisted of
200,745 individuals. The second group, which received a
placebo, consisted of 201,229 individuals. There were
57 cases of polio in the first group and 142 cases of
polio in the second group.
                 Polio   Polio free     Total
Vaccine             57      200,688   200,745
Placebo            142      201,087   201,229
Total              199      401,775   401,974
What types of questions would be of interest for this
data?
Contingency tables do not have to be 2×2!
Example: #7.24
Subjects were asked whether methods of birth control
should be available to teenagers between the ages of 14
and 16.
                             Teenage birth control
Religious attendance      strongly agree   agree   disagree   strongly disagree
Never                                 49      49         19                   9
<1 per year                           31      27         11                  11
1-2 per year                          46      55         25                   8
several times per year                34      37         19                   7
1 per month                           21      22         14                  16
2-3 per month                         26      36         16                  16
nearly every week                      8      16         15                  11
every week                            32      65         57                  61
several times per week                 4      17         16                  20
Notice the “total” column and row are not necessary to
include with a contingency table. Also, notice that both
categorical variables are ordered.
 2010 Christopher R. Bilder
2.5
What types of questions would be of interest for this
data?
In the previous examples, subjects were allowed to fall in
only one cell of the contingency table. There are times when
subjects may fall in more than one cell!
Example: Education and SOV
Loughin and Scherer (Biometrics, 1998) examine a sample of 262 Kansas livestock farmers who are asked, "What are your primary sources of veterinary information?" Farmers may pick as many sources as apply from (A) professional consultant, (B) veterinarian, (C) state or local extension service, (D) magazines, and (E) feed companies and representatives. Since respondents may pick any number out of the possible categorical responses, Coombs (1964) refers to this type of variable as a "pick any/c" variable ("pick any/c" is read as "pick any out of c", and c is the number of categorical responses). Farmers are also asked many demographic questions, including their highest attained level of education. Note that individual farmers may be represented more than once in the table since they may pick all sources that apply.
 2010 Christopher R. Bilder
2.6
Information source
Education
A
Total
Total
B
C
D
E
High school
19 38
29
47
40
173
88
Vocational school
2
6
8
8
4
28
16
2-year college
1
13
10
17
14
55
31
4-year college
19 29
40
53
29
170
113
Other
3
8
6
6
27
14
Total responses
4
44 90
Responses Farmers
95 131 93
262
Higgins (An Introduction to Modern Nonparametric
Statistics, 2003) also discusses data in this format.
The data is given in a multinomial format in Agresti
(2002, p. 484-6).
What types of questions would be of interest for this
data?
Notes:
• Unless otherwise mentioned, all of the contingency tables in this course will have subjects (or items) who fall in only one cell.
• There are many other examples of contingency tables from marketing, psychology, …
• The contingency tables presented here are called "two-way" since there are only two categorical variables. Later, we will discuss "three-way" contingency tables when there are three categorical variables. Future chapters will discuss four-way, five-way, ….
 2010 Christopher R. Bilder
2.7
Probability distributions for contingency tables
Let ij = P(X=i, Y=j); i.e., the probability that category i of
X and category j of Y is chosen. These probabilities can
be put into a contingency table format. If I=2 and J=2,
then the following table is produced:
Y
1
X
2
1 11 12
2 21 22
Notes:
• π11, π12, π21, and π22 form the "joint" probability distribution for X and Y (joint since two random variables).
• Notice the row number goes first in the subscript for π and the column number goes second.
• π11 + π12 + π21 + π22 = 1; thus, every item falls in one of the cells.

Suppose that only the probability distribution for Y is examined. This is called the "marginal" probability distribution for Y. It is denoted by

    P(Y=1) = π+1, P(Y=2) = π+2, and π+1 + π+2 = 1
 2010 Christopher R. Bilder
2.8
The “+” in the subscript denotes that all possible values
of X are being summed over. Thus,
+1 = 11 + 21 and +2 = 12 + 22
Equivalently, +1 = P(Y=1) = P(Y=1, X=1) + P(Y=1, X=2).
The marginal distribution of X, 1+ and 2+, can be found
in a similar manner. You will often see a  instead of +
used exactly the same way in other textbooks.
The contingency table of the probabilities can be
extended to include the marginal distribution of Y and X.
Notice how the “marginal” probability distribution is put in
the “margins” of the table.
Y
1
X
2
1 11 12 1+
2 21 22 2+
+1 +2 1
Each of these πij's is a population parameter. These parameters can be estimated by taking a sample. Counts from the sample are summarized in a contingency table as shown below in a general format.

           Y
        1     2
X   1  n11   n12   n1+
    2  n21   n22   n2+
       n+1   n+2     n

Thus, n11 denotes the table count for X=1 and Y=1. Also, n1+ = n11 + n12 denotes the table count for X=1 without regard to Y. Finally, n = n11 + n12 + n21 + n22 is the total sample size. This could also be denoted by n++.
Using the contingency table counts, the parameter estimates are found using pij = nij/n, pi+ = ni+/n, and p+j = n+j/n. Note that π̂ij could also be used as notation, but Agresti prefers to use a "p". The resulting contingency table with the "sample proportions" or "sample probabilities" or "estimated probabilities" … is:

           Y
        1     2
X   1  p11   p12   p1+
    2  p21   p22   p2+
       p+1   p+2     1
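As a small illustration of these estimates in R, the sketch below forms the sample proportions and their margins for a generic 2×2 table of counts. The object name count.table and the counts in it are hypothetical; only base R functions (rowSums(), colSums(), addmargins()) are used.

# Hypothetical 2x2 table of counts (not one of the data sets in these notes)
count.table <- array(data = c(20, 10, 15, 5), dim = c(2, 2),
                     dimnames = list(X = c("1", "2"), Y = c("1", "2")))

p.table <- count.table / sum(count.table)   # p_ij = n_ij / n
rowSums(p.table)                            # p_i+ (marginal distribution of X)
colSums(p.table)                            # p_+j (marginal distribution of Y)
addmargins(p.table)                         # table with p_i+ and p_+j in the margins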
2×2 contingency tables can be extended to I×J tables as shown below:

                    Y
         1     2    …     J
X   1   π11   π12   …   π1J   π1+
    2   π21   π22   …   π2J   π2+
    ⋮     ⋮     ⋮           ⋮     ⋮
    I   πI1   πI2   …   πIJ   πI+
        π+1   π+2   …   π+J     1

where πi+ = Σ_{j=1}^{J} πij for i=1,…,I and π+j = Σ_{i=1}^{I} πij for j=1,…,J
                    Y
         1     2    …     J
X   1   n11   n12   …   n1J   n1+
    2   n21   n22   …   n2J   n2+
    ⋮     ⋮     ⋮           ⋮     ⋮
    I   nI1   nI2   …   nIJ   nI+
        n+1   n+2   …   n+J     n

where ni+ = Σ_{j=1}^{J} nij for i=1,…,I and n+j = Σ_{i=1}^{I} nij for j=1,…,J
                    Y
         1     2    …     J
X   1   p11   p12   …   p1J   p1+
    2   p21   p22   …   p2J   p2+
    ⋮     ⋮     ⋮           ⋮     ⋮
    I   pI1   pI2   …   pIJ   pI+
        p+1   p+2   …   p+J     1

where pi+ = Σ_{j=1}^{J} pij for i=1,…,I and p+j = Σ_{i=1}^{I} pij for j=1,…,J

The contingency table could also be written in terms of the expected cell counts, μij, which is simply E(nij). Note that μij = nπij.
Example: Larry Bird (bird.R)
                          Second
                   Made       Missed      Total
First   Made    n11=251      n12=34    n1+=285
        Missed   n21=48       n22=5     n2+=53
        Total   n+1=299      n+2=39      n=338

                            Second
                     Made         Missed        Total
First   Made    p11=0.7426    p12=0.1006   p1+=0.8432
        Missed  p21=0.1420    p22=0.0148   p2+=0.1568
        Total   p+1=0.8846    p+2=0.1154            1

For example, p11 = 251/338 = 0.7426 and p1+ = 285/338 = 0.8432.
Make sure you can interpret the probabilities in the table!
How are the contingency tables entered into R?
Below is the code for one method.
> #Create contingency table - notice the data is entered by
>  # columns (the first dimension in dim = c(2, 2) is rows, the second is columns)
> n.table <- array(data = c(251, 48, 34, 5), dim = c(2, 2),
    dimnames = list(First = c("made", "missed"), Second = c("made", "missed")))
> n.table
        Second
First    made missed
  made    251     34
  missed   48      5
> n.table[1,1]
[1] 251

> #Find the estimated proportions - notice how the division is
>  # performed on each element of the table
> p.table <- n.table/sum(n.table)
> p.table
        Second
First         made    missed
  made   0.7426036 0.1005917
  missed 0.1420118 0.0147929
What if the data did not come in a contingency table
format?
Suppose the data is in its “raw” form:
> all.data2
     first second
1   missed missed
2   missed missed
3   missed missed
4   missed missed
5   missed missed
6   missed   made
7   missed   made
8   missed   made
⋮
336   made   made
337   made   made
338   made   made
The above data is stored in a data.frame (it is
constructed in bird.R). To find a contingency table for
the data, use the table() or xtabs() functions.
> #Find contingency table two different ways
> bird.table1 <- table(all.data2$first, all.data2$second)
> bird.table1
         made missed
  made    251     34
  missed   48      5
> bird.table1[1, 1]
[1] 251

> bird.table2 <- xtabs(formula = ~ first + second, data = all.data2)
> bird.table2
        second
first    made missed
  made    251     34
  missed   48      5
> bird.table2[1,1]
[1] 251
Note: For those of you with SAS experience, the
corresponding output is similar to the output produced
from PROC FREQ in SAS.
Conditional probability distributions
Often when one categorical variable is considered a
“response” or “dependent” variable and another
categorical variable is considered an “explanatory” or
“independent” variable, we would like to look at the
probability distribution for the response variable GIVEN
the level of the explanatory variable. These can be
examined through conditional probability distributions.
From STAT 218:
Suppose two events are denoted by A and B. The
conditional probability of A given B happens is
denoted by
 2010 Christopher R. Bilder
2.15
P(A | B) 
P(A and B)
, provided P(B)0
P(B)
For example, A = Bird’s 2nd free throw attempt
outcome and B = Bird’s 1st free throw attempt
outcome
For STAT 875, we can define conditional probabilities
the following way.
Suppose Y (columns) is the response variable and X
(rows) is the explanatory variable. Let
j|i = P(Y=j | X=i).
Note that j|i = ij/i+ = P(X=i and Y=j) / P(X=i).
The conditional probability distribution has
J
probabilities 1|i, 2|i, …, J|i and   j|i  1 for i=1,...,I.
j1
Thus, one can think of each row of the contingency
table as one conditional probability distribution.
Estimators for the conditional probabilities are
pj|i = pij/pi+ = (nij/n) / (ni+/n) = nij/ni+.
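A quick way to compute these estimated conditional distributions in R is prop.table() with margin = 1, which divides each row of a table of counts by its row total. The sketch below re-creates the Bird count table from earlier just to have something to work with.

# Estimated conditional distributions p_j|i = n_ij / n_i+ (one per row of X)
n.table <- array(data = c(251, 48, 34, 5), dim = c(2, 2),
                 dimnames = list(First = c("made", "missed"),
                                 Second = c("made", "missed")))
prop.table(n.table, margin = 1)  # each row sums to 1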
Example: Larry Bird
 2010 Christopher R. Bilder
2.16
First
Second
Made
Missed
Total
Made p11=0.7426 p12=0.1006 p1+=0.8432
Missed p21=0.1420 p22=0.0148 p2+=0.1568
Total p+1=0.8846 p+2=0.1154
1
Given that Larry Bird misses the first free throw, what is
the estimated probability that he will make the second?
P(2nd made | 1st missed) = 1|2
Be careful with the notation for this problem!
The corresponding estimator is p1|2 = p21/p2+ =
0.1420/0.1568 = 0.9057. You can also find this
using p1|2 = n21/n2+ = 48/53. Be careful with making
sure you know which variable level is represented
first and which variable level is represented second
in the subscript notation for p1|2.
Therefore it is still very likely that Larry Bird will make the
second free throw even if the first one is missed.
Question for basketball fans: Why would this
probability be important to know?
If the first free throw result is thought of as an
explanatory variable and the second free throw result is
 2010 Christopher R. Bilder
2.17
thought of as a response variable, we can find the
following table of conditional probabilities:
First
Second
Made
Missed
Made p1|1=0.8807 p2|1=0.1193
Missed p1|2=0.9057 p2|2=0.0943
Total
1
1
Notice the estimated probability of making the second
free throw is larger after (given) the first free throw is
missed!
Independence
Suppose Y is a response variable and X is an
explanatory variable. Also, suppose Y is independent of
X. What is πj|i equal to?

Remember that πj|i = P(Y=j | X=i). Independence means that the probability of Y=j does not depend on the level of X. Therefore, the probability is the same for all levels of X; i.e.,

    P(Y=j | X=i) = P(Y=j) for i=1,…,I and j=1,…,J
    ⇒ πj|i = π+j for i=1,…,I and j=1,…,J

This can be rewritten as

    πj|1 = πj|2 = … = πj|I for j=1,…,J

Thus, there is equality across rows for the conditional probability distributions. When both categorical variables can be thought of as response variables, independence can be written without the use of conditional probability distributions. Statistical independence occurs if

    πij = πi+π+j for i=1,…,I and j=1,…,J.

Thus, πij is equal to the product of the corresponding marginal probabilities.

The equivalence of the two ways to write independence can be shown as follows:

    πij = πi+π+j for i=1,…,I and j=1,…,J
    ⇔ πij/πi+ = πi+π+j/πi+ for i=1,…,I and j=1,…,J
    ⇔ πj|i = π+j for i=1,…,I and j=1,…,J
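To make the product form concrete, the sketch below compares a table of sample proportions to the product of its estimated margins. The table used is hypothetical (not one of the notes' data sets); outer() forms the matrix of products pi+ × p+j.

# Hypothetical table of sample proportions (rows are levels of X, columns of Y)
p.table <- matrix(c(0.3, 0.2,
                    0.3, 0.2), nrow = 2, byrow = TRUE)

# Matrix of p_i+ * p_+j; under independence this should be close to p.table
indep.table <- outer(rowSums(p.table), colSums(p.table))
p.table - indep.table  # differences near 0 are consistent with independence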
Example: Larry Bird
What does independence mean in this example?
Do you think independence occurs?
 2010 Christopher R. Bilder
2.19
Poisson, binomial, and multinomial sampling
How do counts in a contingency table come about with
respect to probability distributions? There are 4 ways
where 3 are discussed here:
1) We can often treat each cell of an IJ contingency
table as independent Poison random variables; i.e., nij
~ ind. Poisson(ij). Thus,
nijij eij
f(nij ) 
for nij = 0, 1, 2, …
nij !
When use this distribution, we have Poisson
sampling. The total sample size, n, is NOT fixed.
2) When n is fixed (or conditional on sample size),
multinomial sampling occurs over all of the cells of the
contingency table; i.e., (n11, n12, …, nIJ) ~
Multinomial(n, 11, 12, …, IJ). A random sample of
size n from one multinomial distribution is taken and
summarized by the sample counts in cells of the table.
Note (n11, n12, …, nIJ) ~ Multinomial(n, 11, 12, …, IJ)
could also be expressed as (n11, n12, …, n-ijnij) ~
Multinomial(n, 11, 12, …, 1-ijij) since nIJ = n-ijnij
and IJ = 1-ijij.
 2010 Christopher R. Bilder
2.20
3) Sometimes n1+, n2+,…, nI+ are fixed by the sampling
design. For example in a clinical trial, there may be
only 10 people available for the placebo group and 9
people available for the drug group. Also, suppose
there are only two possible outcomes for the trial –
cured and not-cured. In this case, we have binomial
sampling within each row of the contingency table.
This is often called “independent” binomial sampling
since random variables are independent across the
rows.
When more than two outcomes are possible, say
cured, partially cured, and not cured, then
“independent multinomial sampling” occurs within
each row of the contingency table.
(n11, n12, …, n1J) ~ Multinomial(n1+, 1|1, 2|1, …, J|1),
(n21, n22, …, n2J) ~ Multinomial(n2+, 1|2, 2|2, …, J|2),

(nI1, nI2, …, nIJ) ~ Multinomial(nI+, 1|I, 2|I, …, J|I)
Example: Independent binomial and multinomial
sampling and just multinomial sampling.
Suppose n1+=50 males and n2+=60 females are wanted for a study. These males and females are randomly selected from their individual populations. Suppose there are only 2 possible outcomes – cured and not cured. This is an example of independent binomial sampling.

                        Y
                  Cured   Not cured
X   Male            n11         n12   n1+
    Female          n21         n22   n2+
                    n+1         n+2     n

Thus, n11 ~ Binomial(n1+, π1|1) and n21 ~ Binomial(n2+, π1|2), where n11 is independent of n21.
Suppose n1+=50 males and n2+=60 females are wanted for a study. These males and females are randomly selected from their individual populations. Suppose there are now 3 possible outcomes – cured, partially cured, and not cured. This is an example of independent multinomial sampling.

                        Y
                  Cured   Partially cured   Not cured
X   Male            n11               n12         n13   n1+
    Female          n21               n22         n23   n2+
                    n+1               n+2         n+3     n

Thus, (n11, n12, n13) ~ Multinomial(n1+, π1|1, π2|1, π3|1) and (n21, n22, n23) ~ Multinomial(n2+, π1|2, π2|2, π3|2), where the n1j's are independent of the n2j's.
Suppose n=110 subjects are wanted for a study. Males and females are randomly selected from the one population. This is an example of multinomial sampling. The n1+ and n2+ are not fixed for this study.

                        Y
                  Cured   Partially cured   Not cured
X   Male            n11               n12         n13   n1+
    Female          n21               n22         n23   n2+
                    n+1               n+2         n+3     n

Thus, (n11, n12, n13, n21, n22, n23) ~ Multinomial(n, π11, π12, π13, π21, π22, π23).
Instead of Male and Female, we could have drug and placebo groups. Typically, the number of subjects receiving the drug and the number receiving the placebo will be fixed. Thus, independent binomial or multinomial sampling will be used. You can think of this as a completely randomized design used in ANOVA, where you fix the number of people receiving each treatment.
 2010 Christopher R. Bilder
2.23
What about Poisson sampling? Perhaps this could
occur if the study allowed anyone who volunteered
(with no upper limit) to participate in it.
Notes:
 Although Poisson sampling may occur, n or ni+ are
often conditioned upon.
 For the analyses to be examined in this book, we will
usually get the same results no matter what types of
sampling methods are used.
 You should think about how one can simulate
observations in order to form a contingency table.
 See the p. 40-41 of Agresti (2002) for an additional
example.
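Here is a minimal sketch of how such a simulation might look in R for a 2×2 table, one version for each of the three sampling schemes. The probabilities, sample sizes, and object names are made up for illustration; only rpois(), rmultinom(), and rbinom() are used.

set.seed(8871)

# Poisson sampling: each cell is an independent Poisson count (n not fixed)
mu <- matrix(c(50, 10, 30, 20), nrow = 2)            # hypothetical means mu_ij
n.pois <- matrix(rpois(n = 4, lambda = mu), nrow = 2)

# Multinomial sampling: n fixed, counts spread over all IJ cells
pi.all <- c(0.45, 0.10, 0.25, 0.20)                  # hypothetical pi_ij (sum to 1)
n.mult <- matrix(rmultinom(n = 1, size = 110, prob = pi.all), nrow = 2)

# Independent binomial sampling: row totals fixed, one success probability per row
n.row1 <- rbinom(n = 1, size = 50, prob = 0.8)       # hypothetical pi_1|1 = 0.8
n.row2 <- rbinom(n = 1, size = 60, prob = 0.6)       # hypothetical pi_1|2 = 0.6
n.bin <- rbind(c(n.row1, 50 - n.row1), c(n.row2, 60 - n.row2))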
 2010 Christopher R. Bilder
2.24
 2010 Christopher R. Bilder
2.25
2.2 Comparing proportions in 2×2 contingency tables

Difference of proportions or differences of probabilities

Suppose we have the following 2×2 table

           Y
        1     2
X   1  n11   n12   n1+
    2  n21   n22   n2+
       n+1   n+2     n

where n1+ and n2+ are FIXED. Thus, we have independent binomial sampling. Suppose Y=1 equates to a success and Y=2 equates to a failure.

We can then write the table in terms of the conditional probability distributions.

                  Y
        1=success   2=failure
X   1        π1|1        π2|1   1
    2        π1|2        π2|2   1

The sample proportions or probabilities can also be written in this format.

Note that Agresti writes the table as

                  Y
        1=success   2=failure
X   1          π1        1−π1   1
    2          π2        1−π2   1
Example: Larry Bird
                            Second
                      Made          Missed   Total
First   Made    p1|1=0.8807    p2|1=0.1193       1
        Missed  p1|2=0.9057    p2|2=0.0943       1

Often of interest is determining if the probability of success is the same across the two levels of X. If the probabilities are equal, then π1|1 − π1|2 = 0. A confidence interval can be found to examine the difference of the proportions (or probabilities).

Remember from Chapter 1 that the estimated proportion, p, can be treated as an approximate normal random variable with mean π and variance π(1 − π)/n for a large sample. Using the notation in this chapter, this means that

    p1|1 ~ N(π1|1, π1|1(1 − π1|1)/n1+) and
    p1|2 ~ N(π1|2, π1|2(1 − π1|2)/n2+) approximately
 2010 Christopher R. Bilder
2.27
for large n1+ and n2+. Note that p1|1 and p1|2 are treated
as random variables here, not the observed values in the
last example.
The statistic that estimates 1|1 - 1|2 is p1|1 - p1|2. The
distribution can be approximated by
N(1|1-1|2, 1|1(1-1|1)/n1+ + 1|2(1-1|2)/n2+)
for large n1+ and n2+.
Note: Var(p1|1 - p1|2) = Var(p1|1) + Var(p1|2) since p1|1
and p1|2 are independent random variables. Some
of you may have seen the following: Let X and Y be
independent random variables and let a and b be
constants. Then Var(aX+bY) = a2Var(X) + b2Var(Y).
Thus, an approximate (1-)100% confidence interval for
1|1-1|2 is
Estimator  (distributional value)(standard deviation of estimator)
 p1|1-p1|2Z1-/2
p1|1(1  p1|1)
n1

p1|2 (1  p1|2 )
n2
Notice how p1|1 and p1|2 replace 1|1 and 1|2 in the
standard deviation of the estimator. This is another
example of a Wald confidence interval
 2010 Christopher R. Bilder
2.28
Do you remember the problems with the Wald
confidence interval in Chapter 1? Similar problems
happen here.
Agresti and Caffo (2000) recommend using the “add two
successes and two failures” methods for an interval of
ANY level of confidence.
Let p1|2 
n21  1
n 1
and p1|1  11
.
n2   2
n1  2
The confidence interval is
p1|1  p1|2  Z1 / 2
p1|1(1  p1|1) p1|2 (1  p1|2 )

n1  2
n2   2
Again, Agresti and Caffo do not change the adjustment
for different confidence levels!
Below are two plots from the paper comparing the
Agresti and Caffo interval to the Wald interval (similar to
p. 1.45). The solid line denotes the Agresti and Caffo
interval. The y-axis shows the true confidence level
(coverage) of the confidence intervals. The x-axis
shows various values of 1|1 where 1|2 is fixed at 0.3.
 2010 Christopher R. Bilder
2.29
To find the estimated true confidence level, 10,000
samples from a binomial probability distribution with
1|2=0.3 and 10,000 samples from a binomial
probability distribution with 1|1=x-axis value. The
sample size is given on the bottom of the plot. For
each of the 10,000 samples from binomial #1 and
binomial #2, the confidence interval is calculated.
The proportion of time that 1|1-0.3 is inside the
interval is calculated as the “true confidence level”.
In the plots, p1 represents our 1|1, and p2
represents our 1|2.
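The sketch below shows one way such a coverage study could be coded in R for the Wald interval; it is a simplified illustration of the procedure just described, with the sample sizes, probabilities, and number of simulated data sets chosen arbitrarily.

set.seed(1218)
pi.1.1 <- 0.5;  pi.1.2 <- 0.3         # true conditional probabilities
n.1 <- 20;  n.2 <- 20                 # fixed row sample sizes
alpha <- 0.05
num.sim <- 10000

w.1 <- rbinom(n = num.sim, size = n.1, prob = pi.1.1)   # successes in row 1
w.2 <- rbinom(n = num.sim, size = n.2, prob = pi.1.2)   # successes in row 2
p.1 <- w.1/n.1;  p.2 <- w.2/n.2

half <- qnorm(1 - alpha/2) * sqrt(p.1*(1 - p.1)/n.1 + p.2*(1 - p.2)/n.2)
lower <- p.1 - p.2 - half
upper <- p.1 - p.2 + half

# Estimated true confidence level (coverage) of the Wald interval
mean(lower <= pi.1.1 - pi.1.2 & upper >= pi.1.1 - pi.1.2)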
 2010 Christopher R. Bilder
2.30
 2010 Christopher R. Bilder
2.31
For the plots below, the value of 1|1 was no longer fixed.
The Agresti and Caffo interval tends to be much better
than the Wald interval.
 2010 Christopher R. Bilder
2.32
Note that other confidence intervals can be used. Agresti and Caffo's (2000) objective was to present an interval "better" than the Wald interval which could be used in elementary statistics courses. See Newcombe (Statistics in Medicine, 1998, p. 857-872) for other intervals.
Example: Larry Bird (bird.R)
Find a (1−α)100% confidence interval for π1|1 − π1|2; i.e., P(2nd made | 1st made) – P(2nd made | 1st missed).

95% Wald confidence interval:
    -0.1122 ≤ π1|1 − π1|2 ≤ 0.0623

95% Agresti-Caffo confidence interval:
    -0.1035 ≤ π1|1 − π1|2 ≤ 0.0778

There is not sufficient evidence to indicate a difference in the proportions. What does this mean in terms of the original problem?
R code and output:

> #Confidence interval for difference of proportions
> alpha <- 0.05
> p.1.1 <- p.table[1, 1]/sum(p.table[1, ])
> p.1.2 <- p.table[2, 1]/sum(p.table[2, ])
> p.1.1
[1] 0.8807018
> p.1.2
[1] 0.9056604

> #Wald
> lower <- p.1.1 - p.1.2 - qnorm(1 - alpha/2) *
    sqrt((p.1.1*(1-p.1.1))/sum(n.table[1,]) + (p.1.2*(1-p.1.2))/sum(n.table[2,]))
> upper <- p.1.1 - p.1.2 + qnorm(1 - alpha/2) *
    sqrt((p.1.1*(1-p.1.1))/sum(n.table[1,]) + (p.1.2*(1-p.1.2))/sum(n.table[2,]))
> cat("The Wald C.I. is:", round(lower, 4), "<= pi.1.1-pi.1.2 <=", round(upper, 4))
The Wald C.I. is: -0.1122 <= pi.1.1-pi.1.2 <= 0.0623

> #Agresti-Caffo
> p.1.1 <- (n.table[1,1]+1)/(sum(n.table[1,])+2)
> p.1.2 <- (n.table[2,1]+1)/(sum(n.table[2,])+2)
> lower <- p.1.1 - p.1.2 - qnorm(1-alpha/2) *
    sqrt(p.1.1*(1-p.1.1)/(sum(n.table[1,])+2) +
         p.1.2*(1-p.1.2)/(sum(n.table[2,])+2))
> upper <- p.1.1 - p.1.2 + qnorm(1-alpha/2) *
    sqrt(p.1.1*(1-p.1.1)/(sum(n.table[1,])+2) +
         p.1.2*(1-p.1.2)/(sum(n.table[2,])+2))
> cat("The Agresti-Caffo interval is:", round(lower,4), "<= pi.1.1-pi.1.2 <=", round(upper,4))
The Agresti-Caffo interval is: -0.1035 <= pi.1.1-pi.1.2 <= 0.0778
Agresti provides code for these and a few other intervals for
the difference of two proportions and other measures at
www.stat.ufl.edu/~aa/cda/R/two_sample/R2/index.html
Relative risk
Suppose there is independent binomial sampling.
 2010 Christopher R. Bilder
2.34
The ratio of two probabilities may be more meaningful
than their difference when the proportions are close to 0
or 1 than 0.5. Consider two cases examining the
probabilities of people who experience adverse reactions
to a drug (1) or a placebo (2):
Adverse reactions
Yes
No
Total
Drug
1|1=0.510 2|1=0.490
1
Placebo 1|2=0.501 2|2=0.499
1
1|1 - 1|2 = 0.510 – 0.501 = 0.009
Adverse reactions
Yes
No
Total
Drug 1|1=0.010 2|1=0.990
1
Placebo 1|2=0.001 2|2=0.999
1
1|1 - 1|2 = 0.010 – 0.001 = 0.009
In both cases, the difference in proportions is the same.
However in the second case, it is 10 times more likely to
experience an adverse reaction by taking the drug!
The relative risk is the ratio of two probabilities. In the
above example (2nd case), it is 1|1/1|2=0.010/0.001 =
10.
Consider the table below.
 2010 Christopher R. Bilder
2.35
Y
1=success 2=failure
X
1
1|1
2|1
1
2
1|2
2|2
1
General interpretation: A Y=1 (success) is 1|1/1|2
times more likely when X=1 rather than when X=2.
Typically, it is easier to interpret this quantity when
the relative risk is greater than 1. Thus, you may
want to invert the ratio. Of course, “invert” your
interpretation as well!!!
The sample version of the relative risk is the ratio of two
sample conditional probabilities.
Questions:
 What does a relative risk of 1 mean?
 What is the range of the relative risk?
One version of an approximate (1−α)100% confidence interval is

    exp{ log(p1|1/p1|2) ± Z1−α/2 √[ (1 − p1|1)/(n1+ p1|1) + (1 − p1|2)/(n2+ p1|2) ] }

for large n1+ and n2+ (see #2.15). This is a Wald confidence interval. The estimated standard deviation used in the formula is derived using the "delta method" (see Chapter 14 of Agresti (2002) for a nice introduction).
Example: Larry Bird (bird.R)
                            Second
                      Made          Missed   Total
First   Made    p1|1=0.8807    p2|1=0.1193       1
        Missed  p1|2=0.9057    p2|2=0.0943       1

    p1|1/p1|2 = 0.8807/0.9057 = 0.9724

If the relative risk is inverted: p1|2/p1|1 = 0.9057/0.8807 = 1.0284. Thus, a successful second free throw is estimated to be 1.0284 times more likely to occur when the first free throw is missed rather than made.
R code and output:

> ####################################################
> #Relative risk
> p.1.1 <- p.table[1,1]/sum(p.table[1,])
> n.1 <- sum(n.table[1,])
> p.1.2 <- p.table[2,1]/sum(p.table[2,])
> n.2 <- sum(n.table[2,])
> cat("The sample relative risk is", p.1.1/p.1.2, "\n \n")
The sample relative risk is 0.9724415

> alpha <- 0.05
> lower <- exp(log(p.1.1/p.1.2) - qnorm(1 - alpha/2) *
    sqrt((1-p.1.1)/(n.1*p.1.1) + (1-p.1.2)/(n.2*p.1.2)))
> upper <- exp(log(p.1.1/p.1.2) + qnorm(1 - alpha/2) *
    sqrt((1-p.1.1)/(n.1*p.1.1) + (1-p.1.2)/(n.2*p.1.2)))
> cat("The Wald interval for RR is:", round(lower, 4), "<= pi.1.1/pi.1.2 <=", round(upper, 4))
The Wald interval for RR is: 0.8827 <= pi.1.1/pi.1.2 <= 1.0713

> #Invert
> cat("The Wald interval for RR is:", round(1/upper, 4), "<= pi.1.2/pi.1.1 <=", round(1/lower, 4))
The Wald interval for RR is: 0.9334 <= pi.1.2/pi.1.1 <= 1.1329

Standard interpretation: I am approximately 95% confident that a second FT success is between 0.9334 and 1.1329 times more likely when the first FT is missed rather than made.
What else could be said here if one wanted to do a hypothesis test of Ho: π1|1/π1|2 = 1 vs. Ha: π1|1/π1|2 ≠ 1?

What if the interval was 2 ≤ π1|1/π1|2 ≤ 4?
2.3 The odds ratio (OR)
Suppose there is independent binomial sampling with the following set of conditional probabilities:

                  Y
        1=success   2=failure
X   1        π1|1        π2|1   1
    2        π1|2        π2|2   1

• For row 1, the "odds of a success" are odds1 = π1|1/(1 − π1|1) = π1|1/π2|1.
• For row 2, the "odds of a success" are odds2 = π1|2/(1 − π1|2) = π1|2/π2|2.

In general, the odds of a success are P(success)/P(failure). Notice that the odds are just a rescaling of the P(success)! For example, if P(success) = 0.75, then the odds are 3, or "3 to 1 odds". The odds of a success are three times larger than for a failure.
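As a small aside, the rescaling between a probability and the corresponding odds is easy to verify numerically; the sketch below uses made-up probabilities and plain arithmetic in R.

p <- c(0.75, 0.5, 0.9)     # hypothetical success probabilities
odds <- p/(1 - p)          # odds of a success: 3, 1, 9
odds/(1 + odds)            # converting back recovers the probabilities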
The estimated odds are:
odds1 
p1|1
p2|1
n
odds2  21
n22

p11 / p1 p11 n11 / n n11
and



p12 / p1 p12 n12 / n n12
 2010 Christopher R. Bilder
2.39
Notice what cells these correspond to in the contingency
table.
Y
1 2
X
1 n11 n12 n1+
2 n21 n22 n2+
n+1 n+2
n
Questions:
 What is the range of an odds?
 What does it mean for an odds to be 1?
To incorporate information from both rows 1 and 2 into a
single number, the ratio of the two odds is found. This is
called an “odds ratio”. Formally, it is defined as:
1|1 / 2|1 1|12|2
odds1 1|1 /(1  1|1)




odds2 1|2 /(1  1|2 ) 1|2 / 2|2 1|2 2|1
“Odds ratio” is often abbreviated by “OR”. ORs are
VERY useful in categorical data analysis and will be
used throughout this book!
ORs measure how much greater the odds of success
are for one level of X than for another level of X.
 2010 Christopher R. Bilder
Questions:
• What is the range of an OR?
• What does it mean for an OR to be 1?
• What does it mean for an OR > 1?
• What does it mean for an OR < 1?

The OR can be estimated by

    θ̂ = oddŝ1/oddŝ2 = [p1|1/(1 − p1|1)] / [p1|2/(1 − p1|2)] = (p1|1 p2|2)/(p1|2 p2|1) = (n11 n22)/(n21 n12)

This is the maximum likelihood estimate of θ ("invariance property" of maximum likelihood estimators).

Notice how the OR is not dependent on a particular variable being called a "response" variable. If the roles of Y and X were switched, we would get the same OR! This is not true for relative risk (try it yourself).

If there was multinomial sampling for the entire table, one could just condition on the rows to obtain the same OR. Also, note that

    (π11 π22)/(π12 π21) = [(π11/π1+)(π22/π2+)] / [(π12/π1+)(π21/π2+)] = (π1|1 π2|2)/(π1|2 π2|1)
 2010 Christopher R. Bilder
2.41
which is same OR as before. Also,
p11p22
p12p21
n11 n22
n n
 n n  11 22
n12 n21 n12n21
n n
is the same estimated odds ratio as before.
Interpretation of the OR:
• The odds of Y=1 (success) are θ times larger when X=1 than when X=2.
• The odds of X=1 are θ times larger when Y=1 than when Y=2.

When θ < 1, we will often want to invert the OR. Below is how the interpretations could change:
• The odds of Y=1 (success) are 1/θ times larger when X=2 than when X=1 since

      1/θ = odds2/odds1 = [π1|2/(1 − π1|2)] / [π1|1/(1 − π1|1)] = (π1|2 π2|1)/(π1|1 π2|2)

• The odds of X=1 are 1/θ times larger when Y=2 than when Y=1.

Also, the interpretations could change to:
• The odds of Y=2 are 1/θ times larger when X=1 than when X=2 since

      (Odds of failure for row #1)/(Odds of failure for row #2) = (π2|1/π1|1) / (π2|2/π1|2) = (π2|1 π1|2)/(π2|2 π1|1) = 1/θ

• The odds of X=2 are 1/θ times larger when Y=1 than when Y=2.
The table below is used a lot for the rearrangement of terms above.

                  Y
        1=success   2=failure
X   1        π1|1        π2|1   1
    2        π1|2        π2|2   1

Work through these on your own to make sure you can show these relationships. You will need to become very comfortable with inverting an OR!
Confidence interval for θ

Since θ̂ is a maximum likelihood estimate, we can use the "usual" properties of MLEs to find the confidence interval. However, using log(θ̂) often works better (i.e., its distribution is closer to being a normal distribution). It can be shown that:
• log(θ̂) has an approximate normal distribution with mean log(θ) for large n.
• The "asymptotic" (for large n) standard deviation of log(θ̂) is √(1/n11 + 1/n12 + 1/n21 + 1/n22). This is derived using the "delta method" (see Chapter 14 of Agresti (2002) for a nice introduction).

The approximate (1−α)100% confidence interval for log(θ) is

    log(θ̂) ± Z1−α/2 √(1/n11 + 1/n12 + 1/n21 + 1/n22)

The approximate (1−α)100% confidence interval for θ is

    exp{ log(θ̂) ± Z1−α/2 √(1/n11 + 1/n12 + 1/n21 + 1/n22) }

Lui and Lin (Biometrical Journal, 2003, p. 231) show this interval is conservative. What does "conservative" mean?
Problems with small cell counts

What happens to θ̂ = (n11 n22)/(n21 n12) if nij = 0 for some i, j?

When there is a 0 or small cell count, the OR estimator is changed a little to help prevent problems. The OR estimator is

    θ̃ = [(n11 + 0.5)(n22 + 0.5)] / [(n21 + 0.5)(n12 + 0.5)]

Thus, 0.5 is added to each cell count. The "asymptotic" standard deviation of log(θ̃) is then

    √[ 1/(n11 + 0.5) + 1/(n12 + 0.5) + 1/(n21 + 0.5) + 1/(n22 + 0.5) ]

and the confidence interval for θ can be found. Sometimes, a small number is just added to a cell with a 0 count instead.
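A minimal R sketch of this adjusted estimator and its Wald interval is given below; the 2×2 counts are hypothetical (one cell is 0 to show why the adjustment is needed), and the variable names are made up.

# Hypothetical 2x2 counts with a zero cell
n.tab <- matrix(c(12, 0,
                   5, 8), nrow = 2, byrow = TRUE)
alpha <- 0.05

# 0.5 added to every cell before forming the OR and its standard deviation
theta.tilde <- ((n.tab[1,1]+0.5) * (n.tab[2,2]+0.5)) /
               ((n.tab[2,1]+0.5) * (n.tab[1,2]+0.5))
sd.log <- sqrt(sum(1/(n.tab + 0.5)))
exp(log(theta.tilde) + c(-1, 1) * qnorm(1 - alpha/2) * sd.log)  # interval for theta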
Example: Larry Bird (bird.R)
 2010 Christopher R. Bilder
2.45
Second
Made Missed Total
Made n11=251 n12=34 n1+=285
First
Missed n21=48 n22=5 n2+=53
Total n+1=299 n+2=39 n=338
ˆ  n11n22  251 5  0.7690 .
n21n12 48  34
Interpretation:
• The estimated odds of a made second free throw attempt are 0.7690 times larger when the first free throw is made than when the first free throw is missed.
• The estimated odds of a made first free throw attempt are 0.7690 times larger when the second free throw is made than when the second free throw is missed. Note that this does not necessarily make sense to examine for this problem.

Often when the OR < 1, the OR is inverted and the interpretation is changed. Therefore, the estimated odds of a made second free throw attempt are 1/0.7690 = 1.3004 times larger when the first free throw is missed than when the first free throw is made.
The approximate 95% confidence interval for θ is 0.2862 ≤ θ ≤ 2.0659. If the interval is inverted, the approximate 95% confidence interval for 1/θ is 0.4841 ≤ 1/θ ≤ 3.4935.
 2010 Christopher R. Bilder
2.46
The interpretation can be extended to be:
With approximately 95% confidence, the odds of a
made second free throw attempt are between
0.4841 and 3.4935 times larger when the first free
throw is missed than when the first free throw is
made.
Since 1 is in the interval, there is not sufficient evidence
to indicate that the first free throw result has an effect on
the second free throw result.
R code and output:

> ####################################################
> #OR
> theta.hat <- (n.table[1,1] * n.table[2,2]) / (n.table[1,2] * n.table[2,1])
> theta.hat
[1] 0.7689951
> 1/theta.hat
[1] 1.300398

> alpha <- 0.05
> lower <- exp(log(theta.hat) - qnorm(1 - alpha/2) *
    sqrt(1/n.table[1,1] + 1/n.table[2,2] + 1/n.table[1,2] + 1/n.table[2,1]))
> upper <- exp(log(theta.hat) + qnorm(1 - alpha/2) *
    sqrt(1/n.table[1,1] + 1/n.table[2,2] + 1/n.table[1,2] + 1/n.table[2,1]))
> cat("The Wald interval for OR is:", round(lower, 4), "<= theta <=", round(upper, 4))
The Wald interval for OR is: 0.2862 <= theta <= 2.0659

> #Invert
> cat("The Wald interval for OR is:", round(1/upper, 4), "<= 1/theta <=", round(1/lower, 4))
The Wald interval for OR is: 0.4841 <= 1/theta <= 3.4935

Be careful with the inverted OR. I could have put "the Wald interval for 1/OR is: …".
Please note that it is incorrect to replace the word "odds" with "probability". Also incorrect is a statement such as "it is 1.3 times more likely the second free throw is made when the first free throw is missed rather than made." The word "likely" means probabilities are being compared.
Example: Salk vaccine clinical trials (polio.R)
                 Polio   Polio free     Total
Vaccine             57      200,688   200,745
Placebo            142      201,087   201,229
R code and output:

> n.table <- array(data = c(57, 142, 200688, 201087), dim = c(2,2),
    dimnames = list(Trt = c("vaccine", "placebo"), Result = c("polio", "polio free")))
> n.table
         Result
Trt       polio polio free
  vaccine    57     200688
  placebo   142     201087
> theta.hat <- (n.table[1,1] * n.table[2,2]) / (n.table[1,2] * n.table[2,1])
> theta.hat
[1] 0.4022065
> 1/theta.hat
[1] 2.486285

> alpha <- 0.05
> lower <- exp(log(theta.hat) - qnorm(1 - alpha/2) *
    sqrt(1/n.table[1,1] + 1/n.table[2,2] + 1/n.table[1,2] + 1/n.table[2,1]))
> upper <- exp(log(theta.hat) + qnorm(1 - alpha/2) *
    sqrt(1/n.table[1,1] + 1/n.table[2,2] + 1/n.table[1,2] + 1/n.table[2,1]))
> cat("The Wald interval for OR is:", round(lower, 4), "<= theta <=", round(upper, 4))
The Wald interval for OR is: 0.2958 <= theta <= 0.5469

> #Invert
> cat("The Wald interval for 1/OR is:", round(1/upper, 4), "<= 1/theta <=", round(1/lower, 4))
The Wald interval for 1/OR is: 1.8283 <= 1/theta <= 3.381
The estimated odds of getting polio are 0.4022 times
higher when the vaccine is given instead of a placebo. If
this OR is inverted, a more meaningful interpretation
results:
The estimated odds of getting polio are 2.4863 times
higher when the placebo is given instead of the vaccine.
With approximately 95% confidence, the odds of getting
polio are between 1.8283 and 3.3810 times higher when
the placebo is given instead of the vaccine.
 2010 Christopher R. Bilder
2.49
The odds ratio interpretation could also be written as:
The estimated odds of not getting polio are 2.4863 times
higher when the vaccine is given instead of the placebo.
Would you want to receive the vaccine?
ORs can be calculated for larger contingency tables. For example, suppose the table below is of interest.

                Y
         1     2     3
X   1   n11   n12   n13   n1+
    2   n21   n22   n23   n2+
    3   n31   n32   n33   n3+
        n+1   n+2   n+3     n
Many ORs could be calculated here. For example,
• The estimated odds of Y=1 vs. Y=2 are θ̂ = (n11 n22)/(n21 n12) times larger when X=1 than when X=2. Also, the estimated odds of X=1 vs. X=2 are θ̂ = (n11 n22)/(n21 n12) times larger when Y=1 than when Y=2.
• The estimated odds of Y=1 vs. Y=2 are θ̂ = (n11 n32)/(n31 n12) times larger when X=1 than when X=3.
• The estimated odds of Y=1 vs. Y=3 are θ̂ = (n11 n33)/(n31 n13) times larger when X=1 than when X=3.
• The estimated odds of Y=2 vs. Y=3 are θ̂ = (n12 n33)/(n32 n13) times larger when X=1 than when X=3.

Notice how each sentence has something like "Y=1 vs. Y=2". This is needed since we need to know which levels are being compared. Before, when there were just two levels, we could just say "Y=1" since this implies it is being compared to the only other level.
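The sketch below computes a couple of these ORs in R from a hypothetical 3×3 table of counts, just to show that each one only uses the four cells being compared; the counts and object name are made up.

# Hypothetical 3x3 table of counts
n.tab <- matrix(c(20, 15, 10,
                  12, 25,  8,
                   5, 10, 30), nrow = 3, byrow = TRUE)

# Odds of Y=1 vs. Y=2 compared for X=1 vs. X=2
(n.tab[1,1] * n.tab[2,2]) / (n.tab[2,1] * n.tab[1,2])

# Odds of Y=1 vs. Y=3 compared for X=1 vs. X=3
(n.tab[1,1] * n.tab[3,3]) / (n.tab[3,1] * n.tab[1,3])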
Notes:
• One could write the odds ratio in terms of the expected cell counts, μij, as θ = (μ11 μ22)/(μ12 μ21) for a 2×2 table.
• Read on your own Section 2.3.4 (Relationship between the OR and the relative risk), Section 2.3.5 (The odds ratio applies in case-control studies), and Section 2.3.6 (Types of observational studies).
• The Chapter 2 extra notes contain an old test problem (responsible for) and other measures of association in a contingency table (not responsible for).
2.4 Chi-squared tests of independence
We will be doing a variety of different hypothesis tests involving contingency tables. In order to do these hypothesis tests, we will need to find the expected cell counts under a hypothesis. These expected cell counts are denoted by μij.

    Agresti's (2007) notation here is not necessarily the best to use for all situations. It may be more appropriate to use something like μij(o) to denote the expected value under a null hypothesis (Ho).

For example, the observed cell count for row i and column j of a contingency table is nij. Remember that nij is a random variable. The expected value of nij under a particular hypothesis is E(nij) = μij. Note that μij is estimated by nij if there are no restrictions upon what μij can be.

Suppose we assume multinomial sampling (n is fixed). A common hypothesis test is a test for independence:

    Ho: πij = πi+π+j for i=1,…,I and j=1,…,J
    Ha: Not all equal

Under the null hypothesis independence restriction, E(nij) = μij = nπi+π+j. Under Ho or Ha (no restriction), E(nij) = nπij, which is estimated by nij.

Make sure you understand why μij = nπi+π+j under Ho!
Pearson statistic

The Pearson chi-squared statistic is

    X² = Σ_{i=1}^{I} Σ_{j=1}^{J} (nij − μij)²/μij

Notes:
• The numerator measures how far apart the expected value under Ho and the observed cell count are from each other. Think of this as a squared residual.
• The denominator helps account for the scale of the cell count.
• The larger (nij − μij)²/μij is, the more evidence there is that the null hypothesis is incorrect.
• Large values of X² indicate the null hypothesis is incorrect.
• For large n, X² has an approximate χ² distribution with a particular number of degrees of freedom. The degrees of freedom are dependent on the hypotheses being tested. This is a right tail test.
• Typical recommendations for a "large n" involve μij ≥ 5 (or nij ≥ 5).
• Remember that with nij ~ Poisson(μij), (nij − μij)/√μij is an approximate standard normal value. Thus, (nij − μij)²/μij is an approximate χ²₁ value.
• See Section 24 of Ferguson (1996) for general uses of the Pearson statistic.
Suppose we assume multinomial sampling (n is fixed). When a test for independence is done, the hypotheses are:

    Ho: πij = πi+π+j for i=1,…,I and j=1,…,J
    Ha: Not all equal

The Pearson statistic has nπi+π+j substituted for μij:

    X² = Σ_{i=1}^{I} Σ_{j=1}^{J} (nij − nπi+π+j)²/(nπi+π+j).

Problem: Notice the parameter values are in the statistic! Thus, this statistic is difficult to calculate.

To solve the problem, the corresponding estimators replace the parameters. The expected cell count under independence is estimated by

    μ̂ij = n pi+ p+j = n (ni+/n)(n+j/n) = ni+ n+j / n.
The statistic becomes

    X² = Σ_{i=1}^{I} Σ_{j=1}^{J} (nij − μ̂ij)²/μ̂ij = Σ_{i=1}^{I} Σ_{j=1}^{J} (nij − ni+ n+j/n)²/(ni+ n+j/n).

For large n, this statistic has an approximate χ² distribution with (I−1)(J−1) degrees of freedom under Ho. The distribution can be denoted symbolically as χ²_(I−1)(J−1).
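As a sketch of what this formula computes, the R code below forms the estimated expected counts with outer() and then sums the squared Pearson contributions for a hypothetical 2×3 table of counts; chisq.test(..., correct = FALSE) gives the same statistic value.

# Hypothetical 2x3 table of counts
n.tab <- matrix(c(30, 20, 10,
                  15, 25, 20), nrow = 2, byrow = TRUE)

mu.hat <- outer(rowSums(n.tab), colSums(n.tab)) / sum(n.tab)   # n_i+ n_+j / n
X.sq <- sum((n.tab - mu.hat)^2 / mu.hat)                        # Pearson statistic
df <- (nrow(n.tab) - 1) * (ncol(n.tab) - 1)
1 - pchisq(q = X.sq, df = df)                                   # p-value
chisq.test(n.tab, correct = FALSE)$statistic                    # matches X.sq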
Where does the (I−1)(J−1) degrees of freedom come from?

In general, the degrees of freedom can be calculated as:

    [# of parameters under Ha − # of restrictions under Ha] −
    [# of parameters under Ho − # of restrictions under Ho]
    = [# of free parameters under Ha] − [# of free parameters under Ho]

For a test of independence, the number of free parameters under Ha is IJ − 1.

    Reason: There are IJ πij parameters. There is one restriction since Σ_i Σ_j πij = 1.

For a test of independence, the number of free parameters under Ho is I + J − 2.

    Reason: There are I πi+ parameters and J π+j parameters. There are two restrictions since Σ_i πi+ = 1 and Σ_j π+j = 1.

Thus, [IJ − 1] − [I + J − 2] = IJ − I − J + 1 = (I−1)(J−1).
Example: Larry Bird (bird.R)
                          Second
                   Made       Missed      Total
First   Made    n11=251      n12=34    n1+=285
        Missed   n21=48       n22=5     n2+=53
        Total   n+1=299      n+2=39      n=338

                                Second
                         Made                 Missed
First   Made    μ̂11 = 285×299/338    μ̂12 = 285×39/338
                       = 252.11              = 32.88
        Missed   μ̂21 = 53×299/338     μ̂22 = 53×39/338
                        = 46.88               = 6.11
    X² = Σ_{i=1}^{2} Σ_{j=1}^{2} (nij − μ̂ij)²/μ̂ij
       = (n11 − μ̂11)²/μ̂11 + (n12 − μ̂12)²/μ̂12 + (n21 − μ̂21)²/μ̂21 + (n22 − μ̂22)²/μ̂22
       = (251 − 252.11)²/252.11 + (34 − 32.88)²/32.88 + (48 − 46.88)²/46.88 + (5 − 6.11)²/6.11
       = 0.0049 + 0.0382 + 0.0268 + 0.2017
       = 0.2716

The critical value at α = 0.05 is χ²_{0.95,(2−1)(2−1)} = χ²_{0.95,1} = 3.84. The p-value for the test is 0.6015. Thus, there is not sufficient evidence to reject independence. Of course, this does not mean that the first and second attempts ARE independent!
 2010 Christopher R. Bilder
2.57
1.0
0.5
Chi-square f(x)
1.5
2
1
0
1
2
3
4
5
x
par(xaxs = "i", yaxs = "i")  #Removes extra space on x- and y-axis
curve(expr = dchisq(x, df = 1), col = "red", xlim = c(0, 5),
      ylab = "Chi-square f(x)", main = expression(chi[1]^2))
Note that executing demo(plotmath) at the command
prompt shows more of what you can do for plotting
mathematical symbols.
Below is the R code and output.

> ind.test <- chisq.test(n.table, correct = F)
> names(ind.test)
[1] "statistic" "parameter" "p.value"   "method"    "data.name" "observed"
[7] "expected"  "residuals"
> ind.test

        Pearson's Chi-squared test

data:  n.table
X-squared = 0.2727, df = 1, p-value = 0.6015

> #just p-value
> ind.test$p.value
[1] 0.6015021
> ind.test$expected
        Second
First         made    missed
  made   252.11538 32.884615
  missed  46.88462  6.115385

> #Another way using the raw data
> chisq.test(x = all.data2$first, y = all.data2$second, correct = F)

        Pearson's Chi-squared test

data:  all.data2$first and all.data2$second
X-squared = 0.2727, df = 1, p-value = 0.6015

> #critical value
> qchisq(p = 0.95, df = 1)
[1] 3.841459
> 1 - pchisq(q = ind.test$statistic, df = 1)
X-squared
0.6015021

> #Two more ways!
> bird.table2 <- xtabs(formula = ~ first + second, data = all.data2)
> summary(bird.table2)
Call: xtabs(formula = ~first + second, data = all.data2)
Number of cases in table: 338
Number of factors: 2
Test for independence of all factors:
        Chisq = 0.27274, df = 1, p-value = 0.6015

> bird.table3 <- table(all.data2$first, all.data2$second)
> summary(bird.table3)
Number of cases in table: 338
Number of factors: 2
Test for independence of all factors:
        Chisq = 0.27274, df = 1, p-value = 0.6015
Notes:
• When the sample size is small, a χ² approximation to the distribution of X² may not do a good job. The Yates continuity correction can be used to allow for a better approximation. With the correction, the Pearson statistic becomes:

      X² = Σ_{i=1}^{I} Σ_{j=1}^{J} ( |nij − ni+ n+j/n| − 0.5 )² / (ni+ n+j/n)

  You can produce this statistic with the chisq.test() function by using the correct = TRUE option. We will discuss other alternatives later for when the sample size is small. Here is a quote from Agresti (1996, p. 43) regarding the use of the correction:

      There is no longer any reason to use this approximation, however, since modern software makes it possible to conduct Fisher's exact test for fairly large samples…
• The Pearson statistic can also be derived from the point of view of having independent multinomial sampling (ni+ fixed – each row of the contingency table represents a population). Instead of testing for independence as stated previously, equality of the πj|i across the rows for each j=1,…,J is tested. Stated formally, the hypotheses are

      Ho: πj|1 = … = πj|I for j=1,…,J vs. Ha: Not all equal

  The hypotheses here are equivalent to the independence hypotheses (see the earlier discussion of independence). The Pearson test statistic and its asymptotic distribution are also the same. Some books go into detail explaining the differences and how they end up being equivalent. See Chapter 2 of Christensen (1990) if you are interested.
Likelihood ratio test (LRT) statistic
From Chapter 1 notes:
The LRT statistic, Λ, is the ratio of two likelihood functions. The numerator is the likelihood function maximized over the parameter space restricted under the null hypothesis. The denominator is the likelihood function maximized over the unrestricted parameter space. The test statistic is written as:

    Λ = (Max. lik. when parameters satisfy Ho) / (Max. lik. when parameters satisfy Ho or Ha)

Note that the ratio is between 0 and 1 since the numerator cannot exceed the denominator.

Questions:
• Why can't the numerator exceed the denominator?
• What does it mean when the ratio is close to 1?
• What does it mean when the ratio is close to 0?

The actual test statistic used for a LRT is −2log(Λ). The reason is that this statistic has an approximate χ² distribution for large n. The degrees of freedom are found the same way as for the Pearson statistic.

Assuming multinomial sampling, −2log(Λ) becomes

    G² = 2 Σ_{i=1}^{I} Σ_{j=1}^{J} nij log(nij/μij)

where μij is restricted under the null hypothesis. Note that the estimate of μij under Ho or Ha ends up being just nij. The G² notation is used throughout this book and by many other authors to denote this statistic.

Questions:
• What happens if nij = μij for every cell?
• What could produce a large value of G²?
 2010 Christopher R. Bilder
2.62
The Pearson and G2 will often yield the same
conclusions, but rarely the exact same statistic values.
Each will always have the same large sample
(asymptotic) distribution under the null hypothesis.
Suppose we assume multinomial sampling (n is fixed).
When a test for independence is done, the hypotheses
are:
Ho: ij=i++j for i=1,…,I and j=1,…,J
Ha: Not all equal
G2 has ni++j substituted for ij:
I J
 nij 
2
G  2   nij log 


n


i1j1
 i  j 
Problems:
1) What if nij=0? Often, 0.5 or some other small constant
is added to the cell.
2) Notice the parameter values in G2! Thus, this statistic
is difficult to calculate.
To solve the problem, the corresponding estimators
replace the parameters. The expected cell count
under independence is estimated by
ni n j ni n j
ˆ ij  npi p j  n

.
n n
n
The statistic becomes

    G² = 2 Σ_{i=1}^{I} Σ_{j=1}^{J} nij log(nij/μ̂ij) = 2 Σ_{i=1}^{I} Σ_{j=1}^{J} nij log(nij/(ni+ n+j/n))

For large n, this statistic has an approximate χ²_(I−1)(J−1) distribution.
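A direct computation of G² in R is only a couple of lines, shown below for a hypothetical table of counts with no zero cells; the value can be checked against assocstats() from the vcd package, which is used later in these notes.

# Hypothetical 2x3 table of counts (all cells > 0)
n.tab <- matrix(c(30, 20, 10,
                  15, 25, 20), nrow = 2, byrow = TRUE)

mu.hat <- outer(rowSums(n.tab), colSums(n.tab)) / sum(n.tab)   # expected counts
G.sq <- 2 * sum(n.tab * log(n.tab / mu.hat))                    # LRT statistic
df <- (nrow(n.tab) - 1) * (ncol(n.tab) - 1)
1 - pchisq(q = G.sq, df = df)                                   # p-value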
Example: Larry Bird (bird.R)
From the last example,

                          Second
                   Made       Missed      Total
First   Made    n11=251      n12=34    n1+=285
        Missed   n21=48       n22=5     n2+=53
        Total   n+1=299      n+2=39      n=338

                            Second
                        Made          Missed
First   Made    μ̂11 = 252.11    μ̂12 = 32.88
        Missed   μ̂21 = 46.88     μ̂22 = 6.11

    G² = 2 Σ_{i=1}^{2} Σ_{j=1}^{2} nij log(nij/μ̂ij)
       = 2[ 251 log(251/252.11) + 34 log(34/32.88) + 48 log(48/46.88) + 5 log(5/6.11) ]
       = 0.2858

The p-value is 0.5930. Thus, there is not sufficient evidence to reject independence. Remember the p-value from using the Pearson statistic was 0.6015.
For a small contingency table like this, you may have to do
the calculations by hand on a test. Below is how the test
can be done a few different ways in R.
> library(vcd)
Loading required package: MASS
Attaching package 'vcd':

The following object(s) are masked from package:graphics :
 barplot.default fourfoldplot mosaicplot
The following object(s) are masked from package:base :
 print.summary.table summary.table

> assocstats(n.table)
                     X^2 df P(> X^2)
Likelihood Ratio 0.28575  1  0.59296
Pearson          0.27274  1  0.60150

Phi-Coefficient   : 0.028
Contingency Coeff.: 0.028
Cramer's V        : 0.028

The package vcd contains a function assocstats() which can calculate the LRT statistic and p-value. This package is not installed by default with R. You can install the package by selecting PACKAGES > INSTALL PACKAGE(S) FROM CRAN. Select the vcd package from the list and select OK.

R may ask if you want to delete the installation files. You can type "Y" for deletion. In order to load the package (make it ready for use) in any R session, use the library(vcd) code. This must be done before using any functions within the package.

See the Chapter 2 additional notes for how you can program the statistic itself into R.
Large n

The χ²_(I−1)(J−1) distributional approximations for X² and G² both rely on a "large n" for them to work. Below is a quote from Agresti (1990, p. 49) that describes the approximation in more detail:

    It is not simple to describe the sample size needed for the chi-squared distribution to approximate well the exact distribution of X² and G². For a fixed number of cells, X² usually converges more quickly than G². The chi-squared approximation is usually poor for G² when n/IJ < 5. When I or J is large, it can be decent for X² for n/IJ as small as 1, if the table does not contain both very small and moderately large expected frequencies.

P. 395-6 of Agresti (2002) contains similar information.
Example: Salk vaccine clinical trials (polio.R)
                 Polio   Polio free     Total
Vaccine             57      200,688   200,745
Placebo            142      201,087   201,229

> # Test for independence - Pearson chi-square
> ind.test <- chisq.test(n.table, correct = F)
> ind.test

        Pearson's chi-square test without Yates' continuity correction

data:  n.table
X-square = 36.1201, df = 1, p-value = 0

> #critical value
> qchisq(p = 0.95, df = 1)
[1] 3.841459
> 1 - pchisq(q = ind.test$statistic, df = 1)
     X-square
1.855266e-009
> ind.test$expected
         Result
Trt         polio polio free
  vaccine 99.3802   200645.6
  placebo 99.6198   201129.4

> #####################################################
> # Test for independence - LRT
> library(vcd)
> assocstats(n.table)
                    X^2 df   P(> X^2)
Likelihood Ratio 37.313  1 1.0059e-09
Pearson          36.120  1 1.8553e-09

Phi-Coefficient   : 0.009
Contingency Coeff.: 0.009
Cramer's V        : 0.009

There is evidence against the independence of the treatment and polio result.
Suppose subjects can pick more than one X and one Y response. Below is an example of where this can happen:

[Table of swine waste storage methods by sources of veterinary information appears here.]

In this case, farmers can choose more than one type of swine waste storage method and more than one type of source of veterinary information. The previous methods for testing independence assume a subject (farmer here) is represented only once in the table. Therefore, they cannot be used. As part of my research, I have derived a few different testing approaches for this. See Bilder and Loughin (Biometrics, 2004) for more information.
Residuals
Suppose the hypothesis of independence is rejected.
The next step would be to determine WHY it was
rejected. Summary measures like an OR can help
determine what type of dependence exists. Cell
residuals can also help determine where independence
is a bad “fit”.
 2010 Christopher R. Bilder
2.69
 Cell deviations: nij- ̂ij - hard to interpret because of the
size of the counts
2
(n


)
ˆ
ij
ij
 Cell 2:
- can be “roughly” treated as 12
ˆ ij
(nij  ˆ ij )
 Pearson residual:
- this is just the square root
ˆ ij
of the cell 2; it can be treated “roughly” as a N(0,1);
use 2 or 3 as “general” guidelines to help determine
what cells are “outlying” or indicate evidence against
independence
(nij  ˆ ij )
 Standardized residual:
for a test of
ˆ ij (1  pi )(1  p j )
independence. Note that the denominator is
Var(nij  ˆ ij ) . For large n, this can be treated as a
approximate N(0,1) random variable. Use 2 or 3 as
guidelines to help determine what cells are “outlying”
or indicate evidence against independence.
Questions:
• For the Pearson residual, why does it make sense to divide by √μ̂ij?
• The standardized residual will change if a different hypothesis is tested.
• The Pearson residual and the standardized residual are the equivalent of semistudentized residuals and studentized residuals typically discussed in a regression analysis course similar to STAT 870. See Section 10.2 of my STAT 870 lecture notes at www.chrisbilder.com/stat870/schedule.htm for more information.
Example: Larry Bird (bird.R)
From the last example,

                          Second
                   Made       Missed      Total
First   Made    n11=251      n12=34    n1+=285
        Missed   n21=48       n22=5     n2+=53
        Total   n+1=299      n+2=39      n=338

                            Second
                        Made          Missed
First   Made    μ̂11 = 252.11    μ̂12 = 32.88
        Missed   μ̂21 = 46.88     μ̂22 = 6.11
Pay close attention to how elementwise subtraction and division are being done even though matrices are being used!

> #General way
> mu.hat <- ind.test$expected
> cell.dev <- n.table - mu.hat
> cell.dev
              second made second missed
first made      -1.115385      1.115385
first missed     1.115385     -1.115385
> pearson.res <- cell.dev/sqrt(mu.hat)
> pearson.res
              second made second missed
first made    -0.07024655     0.1945039
first missed   0.16289564    -0.4510376
> ind.test$residuals  #Pearson residuals easier way
        Second
First           made     missed
  made   -0.07024655  0.1945039
  missed  0.16289564 -0.4510376

> #find standardized residuals; the two terms in the sqrt use p_i+ and p_+j
> stand.res <- matrix(NA, 2, 2)
> for(i in 1:2) {
    for(j in 1:2) {
      stand.res[i, j] <- pearson.res[i,j] /
        sqrt((1-sum(n.table[i,])/n) * (1-sum(n.table[,j])/n))
    }
  }
> stand.res
           [,1]       [,2]
[1,] -0.5222416  0.5222416
[2,]  0.5222416 -0.5222416

> #Note that the Pearson residuals can also be found with:
> ind.test <- chisq.test(n.table, correct=F)
> ind.test$residuals
              second made second missed
first made    -0.07024655     0.1945039
first missed   0.16289564    -0.4510376

Notice that none of the residuals indicate that independence provides a bad fit to the contingency table. Why does this make sense?
 2010 Christopher R. Bilder
2.72
Example: Salk vaccine clinical trials (polio.R)
Vaccine
Placebo
Polio
57
142
Polio
free
200,688
201,087
Total
200,745
201,229
> n.table
polio polio free
vaccine
57
200688
placebo
142
201087
> pearson.res<-ind.test$residuals
> pearson.res
Result
Trt
polio polio free
vaccine -4.251215 0.09461241
placebo 4.246099 -0.09449856
> stand.res <- matrix(data = NA, nrow = 2, ncol = 2) #find
standardized residuals
> for(i in 1:2) {
for(j in 1:2) {
stand.res[i, j] <- pearson.res[i, j]/sqrt((1 –
sum(n.table[i, ]/n)) * (1 - sum(n.table[, j]/n)))
}
}
pi+
p+j
> stand.res
[,1]
[,2]
[1,] -6.009997 6.009997
[2,] 6.009997 -6.009997
Notice that the residuals are indicating all cells contribute
to the dependence.
Example: #7.13 (birth_control.R)
This example shows what happens when a table larger
than 22 is used. Note that it may be difficult to
summarize all of the dependence with ORs since the
table is 94 in size!
Subjects were asked whether methods of birth control
should be available to teenagers between the ages of 14
and 16. Notice the ordered categorical variables!
                            Teenage birth control
Religious attendance      strongly agree  agree  disagree  strongly disagree
Never                           49          49       19            9
<1 per year                     31          27       11           11
1-2 per year                    46          55       25            8
several times per year          34          37       19            7
1 per month                     21          22       14           16
2-3 per month                   26          36       16           16
nearly every week                8          16       15           11
every week                      32          65       57           61
several times per week           4          17       16           20
Below is the R code and output.
n.table<-array(c(49, 31, 46, 34, 21, 26, 8, 32, 4,
                 49, 27, 55, 37, 22, 36, 16, 65, 17,
                 19, 11, 25, 19, 14, 16, 15, 57, 16,
                  9, 11,  8,  7, 16, 16, 11, 61, 20),
    dim = c(9,4), dimnames = list(
      Religous.attendance = c("Never", "<1 per year", "1-2 per year",
        "several times per year", "1 per month", "2-3 per month",
        "nearly every week", "every week", "several times per week"),
      Teenage.birth.control = c("strongly agree", "agree",
        "disagree", "strongly disagree")))
> n.table
                        Teenage.birth.control
Religous.attendance      strongly agree agree disagree strongly disagree
  Never                              49    49       19                 9
  <1 per year                        31    27       11                11
  1-2 per year                       46    55       25                 8
  several times per year             34    37       19                 7
  1 per month                        21    22       14                16
  2-3 per month                      26    36       16                16
  nearly every week                   8    16       15                11
  every week                         32    65       57                61
  several times per week              4    17       16                20
######################################################
# Test for independence - Pearson
> ind.test <- chisq.test(n.table, correct = F)
> ind.test
Pearson's chi-square test without Yates' continuity correction
data: n.table
X-square = 106.1941, df = 24, p-value = 0
> mu.hat<-ind.test$expected
> mu.hat
                        Teenage.birth.control
Religous.attendance      strongly agree    agree disagree strongly disagree
  Never                        34.15335 44.08639 26.12527         21.634989
  <1 per year                  21.68467 27.99136 16.58747         13.736501
  1-2 per year                 36.32181 46.88553 27.78402         23.008639
  several times per year       26.29266 33.93952 20.11231         16.655508
  1 per month                  19.78726 25.54212 15.13607         12.534557
  2-3 per month                25.47948 32.88985 19.49028         16.140389
  nearly every week            13.55292 17.49460 10.36717          8.585313
  every week                   58.27754 75.22678 44.57883         36.916847
  several times per week       15.45032 19.94384 11.81857          9.787257
######################################################
# Test for independence - LRT
> #easiest way
> library(vcd)
> assocstats(n.table)
                     X^2 df   P(> X^2)
Likelihood Ratio 112.54 24 2.0284e-13
Pearson          106.19 24 2.5890e-12

Phi-Coefficient   : 0.339
Contingency Coeff.: 0.321
Cramer's V        : 0.196
######################################################
# Find residuals
> pearson.res<-ind.test$residuals
> pearson.res
                        strongly agree      agree   disagree strongly disagree
Never                        2.5404573  0.7400280 -1.3940262       -2.71641759
<1 per year                  2.0004242 -0.1873785 -1.3719091       -0.73834198
1-2 per year                 1.6058693  1.1850612 -0.5281708       -3.12893004
several times per year       1.5030986  0.5253346 -0.2480249       -2.36589885
1 per month                  0.2726315 -0.7008651 -0.2920103        0.97882315
2-3 per month                0.1031195  0.5423137 -0.7905900       -0.03494422
nearly every week           -1.5083590 -0.3573330  1.4388522        0.82410537
every week                  -3.4421839 -1.1791057  1.8603644        3.96370252
several times per week      -2.9130570 -0.6591897  1.2163031        3.26446417
#find standardized residuals
> stand.res <- matrix(NA, 9, 4)
> for(i in 1:9) {
for(j in 1:4) {
stand.res[i, j]<-pearson.res[i,j] /
sqrt((1-sum(n.table[i,]/n)) * (1 - sum(n.table[, j]/n)))
}
}
> stand.res
           [,1]       [,2]       [,3]        [,4]
[1,] 3.2012973 0.9874517 -1.6845693 -3.21118144
[2,] 2.4512975 -0.2431349 -1.6121413 -0.84876153
[3,] 2.0337928 1.5892453 -0.6414677 -3.71746225
[4,] 1.8606698 0.6886070 -0.2944292 -2.74746527
[5,] 0.3327059 -0.9056755 -0.3417328 1.12058040
[6,] 0.1274202 0.7095804 -0.9368124 -0.04050671
[7,] -1.8164011 -0.4556525 1.6616023 0.93098781
[8,] -4.6010650 -1.6689014 2.3846584 4.97026514
[9,] -3.5220717 -0.8439433 1.4102460 3.70267268
There is strong evidence against independence. The
deviation from independence appears to occur in the
"corners" of the table. Notice the upper left and lower
right have positive values, and the lower left and upper
right have negative values. This could be due to the
ordinal nature of the categorical variables. Models which
take this into account will be discussed later.

The type of dependence here is called "positive"
dependence (not "negative" dependence). The positive
values in the upper left and lower right mean the (1,1),
(9,4),… cells are occurring more frequently than
expected under independence. Thus, low row and
column indices occur together and high row and high
column indices occur together. The negative values in
the lower left and upper right mean the (9,1), (1,4),…
cells are occurring less frequently than expected under
independence.
If this is hard to understand, think of the positive
relationship that typically occurs with high school
and college GPAs. See the data set in the R
Introduction notes.
Partitioning Chi-squared (p.32-3)
Read on your own
Comments on Chi-squared tests (p.33-34)
Read on your own
Note that X2 and G2 do not depend on the order of the
rows or columns. Thus, they do not change for any
ordering of the rows and columns. These tests assume
the categorical variables are nominal. If the categorical
variables are ordinal, the tests ignore the ordinal
information.
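
As a quick check of this invariance (a sketch only; it assumes
n.table is the 94 birth control table created in the earlier example):

# X^2 is unchanged when the rows or columns are reordered
chisq.test(n.table, correct = FALSE)$statistic          # original ordering
chisq.test(n.table[9:1, ], correct = FALSE)$statistic   # rows reversed - same value
chisq.test(n.table[, 4:1], correct = FALSE)$statistic   # columns reversed - same value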
2.5 Testing independence for ordinal data
The previous tests for independence assumed each
categorical variable was nominal. If at least one of the
variables was ordinal, useful information may be ignored
by using the previous tests!
Generally, tests which incorporate the ordinal
information will be more POWERFUL in detecting
dependence than tests which do not.
What does being more POWERFUL mean???
Linear trend alternative to independence
Suppose the row and column categorical variables are
ordinal. If either of the categorical variables are nominal
with only two categories, the test shown below can also
be used.
Tests using the ordinal information assign "scores" to
each level of the row and each level of the column
categorical variables.

Let u1  u2  …  uI denote the scores for the row
variable with at least one  replaced with a <.
Let v1  v2  …  vJ denote the scores for the column
variable with at least one  replaced with a <.
Example: #7.13 (birth_control.R)
                            Teenage birth control
Religious attendance      strongly agree  agree  disagree  strongly disagree
Never                           49          49       19            9
<1 per year                     31          27       11           11
1-2 per year                    46          55       25            8
several times per year          34          37       19            7
1 per month                     21          22       14           16
2-3 per month                   26          36       16           16
nearly every week                8          16       15           11
every week                      32          65       57           61
several times per week           4          17       16           20
Teenage birth control opinions could have scores of v1=1
(strongly agree), v2=2 (agree), v3=3 (disagree), and v4=4
(strongly disagree).
Religious attendance could have scores of u1=0 (never),
u2=1 (<1 per year),…, u9=8 (several times per week).
Changing the levels to a yearly scale could also produce the
following scores: u1=0, u2=1, u3=1.5, u4=(3+12)/2=7.5,
u5=12, u6=25, u7=(52+25)/2=38.5, u8=52, and
u9=522=104.
Notice there generally is more than one way of assigning
scores! One should try a few different ways to see if
inferences are affected.
Suppose each observation is replaced with its (ui, vj)
pair. In the last example, there are 49 observation pairs
of (u1,v1), …, and 20 observation pairs of (u9,v4). Using this
"new" data set, the Pearson product-moment correlation
(often denoted by r) can be calculated and interpreted in
its usual way!
Review from STAT 218 for a Pearson correlation:
Suppose X and Y are two variables. We observe
(x1, y1), …. , (xn, yn) pairs where n is the sample size.
The Pearson correlation is calculated as
    r = Σi (xi − x̄)(yi − ȳ) / √[Σi (xi − x̄)²  Σi (yi − ȳ)²]

      = (Σi xi yi − n x̄ ȳ) / √[(Σi xi² − n x̄²)(Σi yi² − n ȳ²)]

r is scaleless and -1  r  1.
Since there are a number of observations with the same
(ui, vj) pair, we can simplify the formula for the correlation
to be
    r = [Σi Σj ui vj nij − (Σi ui ni+)(Σj vj n+j)/n] /
        √{[Σi ui² ni+ − (Σi ui ni+)²/n]  [Σj vj² n+j − (Σj vj n+j)²/n]}
Compare this formula on your own to the formula for the
Pearson product-moment correlation.
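
For reference, here is a small sketch (not from birth_control.R)
showing how r could be computed directly from the table counts with
this simplified formula; it assumes n.table, u, and v hold the 94
table and the score vectors used in the example below.

# Minimal sketch: r from the table counts and scores without the "raw" data form
u <- 0:8
v <- 1:4
n <- sum(n.table)
sum.uv <- sum(outer(u, v) * n.table)      # sum_i sum_j u_i v_j n_ij
sum.u  <- sum(u * rowSums(n.table))       # sum_i u_i n_i+
sum.v  <- sum(v * colSums(n.table))       # sum_j v_j n_+j
r <- (sum.uv - sum.u*sum.v/n) /
     sqrt((sum(u^2 * rowSums(n.table)) - sum.u^2/n) *
          (sum(v^2 * colSums(n.table)) - sum.v^2/n))
r   # should agree with the 0.3101 found from cor() in the example below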
Notes:
 -1  r  1.
 Values close to -1 or 1 indicate strong negative or
positive dependence, respectively.
 Values close to 0 indicate independence or weak
dependence.
 To test
     Ho: Independence vs.
     Ha: Linear dependence,
use M2 = (n-1)r2 as the test statistic. This statistic has
an approximate chi-square distribution with 1 df for large n.
 Notice the null hypothesis is the same as previously
used for the "test of independence" with X2 and G2.
However, the alternative hypothesis is not the same.
This alternative hypothesis specifies the "type" of
dependence. Previously, any "type" of dependence
was allowed under the alternative hypothesis.
 The alternative hypothesis here is a subset of the
alternative hypothesis used with X2 and G2.
Example: #7.13 (birth_control.R)
Ho:Independence
Ha:Linear dependence
r = 0.31, M2 = (926-1)r2 = 88.96, p-value < 0.0001

There is evidence of positive linear dependence. Notice
the pattern of the residuals found earlier for this example.
This is indicative of a linear relationship! The "corner"
residuals are "large". When the u and v scores are both
small or both large, the residuals are positive. When the
u and v scores are opposite in their values (i.e., u small,
v large or vice versa), the residuals are negative.
Below is the R code and output. Notice how the data is
put into its “raw” form.
> #########################################################
> # ordinal measures

> #scores
> u <- 0:8
> #u<-c(0, 1, 1.5, 7.5, 12, 25, 38.5, 52, 104)
> v <- 1:4

> #Put data in "raw" form - combine the u and v scores
> all.data <- matrix(data = NA, nrow = 0, ncol = 2)
> for(i in 1:9) {
    for(j in 1:4) {
      all.data <- rbind(all.data, matrix(data = c(u[i],v[j]),
        nrow = n.table[i, j], ncol = 2, byrow=T))
    }      #nrow = number of observations with the same (u, v) pair
  }

> #find correlation
> r <- cor(all.data)
> r
          [,1]      [,2]
[1,] 1.0000000 0.3101243
[2,] 0.3101243 1.0000000

> M.sq <- (sum(n.table) - 1) * r[1, 2]^2
> M.sq
[1] 88.96382
> 1 - pchisq(M.sq, 1)
[1] 0
When the second set of u scores are used, r= 0.3067.
Notes:
 r and M2 do not change for different sets of equally spaced
scores. For example, scores of 1,2,3,4 and 0,1,2,3 give
the same results.
 See the example using the data in Table 2.7 of Agresti
(2007). The column variable is nominal, but one can still
find r since it has only two levels.
 See Agresti's (2007) use of "midranks" to find the scores.
Model based approaches for ordinal data will be discussed
later in Chapter 7. Chapter 9 of Agresti (2002) discusses
these in detail.
What if one of the variables is ordinal and the other
variable is nominal (with more than two categories)? One
can look at mean scores across the levels of the nominal
variable. For example, suppose X is nominal and Y is
ordinal. Find the mean scores for Y at each level of X.
See Chapter 9 of Agresti (2002) again.
2.6 Exact inference for small samples
X2 and G2 for a fixed n do NOT exactly have chi-square
distributions!!! We use a chi-square distribution when n is large
because the statistics "approximately" have this
distribution. What happens if the sample size is not
large???
A good overview of exact inference is:
Agresti, A. (1992). A survey of exact inference for
contingency tables. Statistical Science 7, 131-153.
Exact inference refers to the "exact" probability
distribution of the statistic being used. The Clopper-Pearson interval is an example of exact inference.
Here’s a quote from Agresti (1992) which quotes R. A.
Fisher’s Statistical Methods for Research Workers 1st
edition (1926) book:
… the traditional machinery of statistical processes
is wholly unsuited to the needs of practical research.
Not only does it take a cannon to shoot a sparrow,
but it misses the sparrow! The elaborate
mechanism built on the theory of infinitely large
samples is not accurate enough for simple
laboratory data. Only by systematically tackling
small sample problems on their merits does it seem
possible to apply accurate tests to practical data.
"Small samples" here do not just mean a small n. They
also mean having a mix of small and large cell counts.
Hypergeometric distribution
Here’s the classic set up for a random variable with a
hypergeometric probability distribution:
Suppose an urn has n balls with a of them being red
and b of them being blue. Thus a+b=n. Suppose
kn balls are drawn from the urn without
replacement. Let m be the number of red balls
drawn out.
The random variable m has a hypergeometric
distribution with density function of
 a  b 
 m  k  m 
 for m = 0, 1,…, k
P(m)   
n
k 
 
subject to m ≤ a and k – m ≤ b. Note that
 e
e!

C

 d  e d d!(e  d)! = “e choose d”. Also, notice that
 
 2010 Christopher R. Bilder
2.87
a, n, b, and k are FIXED values. The only random
variable is m!
Example: Let n=10, a=4, b=6, k=3, and m=2

    P(m = 2) = C(4,2) C(6,1) / C(10,3) = (66)/120 = 3/10
Example: Urns (tea_taster.R)
Suppose there are n=8 balls in an urn with a=4 of them
red and b=4 of them blue. Suppose k=4 balls are drawn
from the urn. What is the probability of getting m=3 red
balls?

    P(3) = C(a,m) C(b,k−m) / C(n,k)
         = C(4,3) C(4,1) / C(8,4)
         = (44) / [8!/(4!4!)]
         = 16/70 = 0.2286

The entire probability distribution is

    m    P(m)
    0    0.0143
    1    0.2286
    2    0.5143
    3    0.2286
    4    0.0143
Is it reasonable to observe m  3?
R code and output:
> #P(3)
> dhyper(3, 4, 4, 4)
[1] 0.2285714
> #P(0),...,P(4)
> dhyper(0:4, 4, 4, 4)
[1] 0.01428571 0.22857143 0.51428571 0.22857143 0.01428571
In general, the function is dhyper(m, a, b, k).
Fisher’s exact test
The hypergeometric distribution can be used with 22
tables to test for independence! Below is a 22 table.

                    Y
                1       2
      X    1   n11     n12     n1+
           2   n21     n22     n2+
               n+1     n+2      n

In the hypergeometric notation, n11 plays the role of m,
n1+ plays the role of a, n2+ plays the role of b, and n+1
plays the role of k.

Suppose n1+, n2+, n+1, n+2, and n are FIXED by the
sampling design. This means before the sample is
taken or the experiment is conducted, these values are
KNOWN. Given these known quantities, how many of
the 4 cell counts (n11, n12, n21, and n22) are needed
before all of the other cell counts are known?
Since only one of the four cell counts is needed to know
the rest of the table counts, n11 can be treated as the
only random variable! If you know n11, you know the rest
of the table!
Suppose X and Y are independent. The probabilities of
observing different n11 values (and thus different 22
tables) can be calculated using the hypergeometric
distribution:

    P(n11) = C(n1+, n11) C(n2+, n+1 − n11) / C(n, n+1)
           = C(n1+, n11) C(n2+, n21) / C(n, n+1).

The probabilities are calculated under the assumption of
independence between X and Y. A low probability
indicates that a particular n11 is not likely to be observed.
Thus, its corresponding 22 table is not likely under
independence.
Using the hypergeometric distribution in a test for
independence with 22 contingency tables is called
Fisher's exact test. Note that the hypergeometric is the
EXACT distribution for n11. Thus, this is where the name
exact inference comes from.
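
For example (a sketch only), the probabilities of every possible n11
for the fixed margins used in the tea tasting example below can be
found with dhyper():

# P(n11) under independence; dhyper(x, m, n, k) with x = n11, m = n1+, n = n2+, k = n+1
dhyper(x = 0:4, m = 4, n = 4, k = 4)
# [1] 0.01428571 0.22857143 0.51428571 0.22857143 0.01428571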
Tea taster experiment
This is a common example discussed often in statistics.
See p. 46 of Agresti (2007) for the set up or “The Lady
Tasting Tea: How Statistics Revolutionized Science in
the Twentieth Century” book by David Salsburg. Below
is a “possible” outcome of the observed data (the actual
data is unknown?).
                  Guess Pour First
                    Milk    Tea
  Poured   Milk      3       1       4
  First    Tea       1       3       4
                     4       4       8
Before the experiment, it was decided to have 4 cups
with milk poured first and 4 cups with tea poured first.
Thus, the row marginal totals are FIXED.
Since the taster was told before the experiment 4 cups
had milk poured first and 4 cups had tea poured first,
one would think the taster would guess 4 of each type.
Thus, the column totals are FIXED.
Questions:
 How likely is it to have an experiment with both row and
column totals fixed?
 Suppose that the taster really can not tell the difference.
What does this mean in terms of the problem?
 What is the probability that the taster would have
guessed correctly three of the milk poured first?
Under the assumption that the taster can not tell the
difference, the probability can be found with the
hypergeometric distribution: P(3)=0.2286.
 Does guessing 3 or more of the milk poured first
correctly seem reasonable under the assumption that
the taster can not tell the difference?
P(3)+P(4) = 0.2286 + 0.0143 = 0.2429
 What is the p-value of Ho: θ=1 (independence) vs. Ha: θ>1
(positive dependence)?
 Why is this test chosen instead of Ha: θ1 or Ha: θ<1?
Notice the only way to show there is some evidence that
the taster can tell the difference is when n11=4.
The small sample size here is the reason.
R code and output from tea_taster.R:
> n.table<-array(data = c(3, 1, 1, 3), dim = c(2,2),
dimnames=list(Actual = c("Pour Milk", "Pour Tea"),
Guess = c("Pour Milk", "Pour Tea")))
> n.table
           Guess
Actual      Pour Milk Pour Tea
  Pour Milk         3        1
  Pour Tea          1        3
> fisher.test(x = n.table)
Fisher's Exact Test for Count Data
data: n.table
p-value = 0.4857
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.2117329 621.9337505
sample estimates:
odds ratio
6.408309
> fisher.test(n.table, alternative = "greater")
Fisher's Exact Test for Count Data
data: n.table
p-value = 0.2429
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 0.3135693       Inf
sample estimates:
odds ratio
6.408309
The two-tail test p-value is given by fisher.test(). The
two-tail test adds all probabilities that are  P(n11); i.e.,
sum the table probabilities that are no more likely than
the observed. In this case, this includes the probabilities
of P(0), P(1), P(3), P(4).
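
A quick numerical check of this two-tail calculation (a sketch only):

# Sum the table probabilities that are no more likely than the observed table (n11 = 3)
p <- dhyper(0:4, 4, 4, 4)
sum(p[p <= dhyper(3, 4, 4, 4) * (1 + 1e-7)])   # P(0)+P(1)+P(3)+P(4) = 0.4857
# The small tolerance guards against floating point ties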
Larger than 22 tables

Fisher's exact test can be extended to tables larger than
22 by using the multiple hypergeometric distribution.
The probability of observing a particular set of cell counts
(only the counts outside of row I and column J are free to
vary once the margins are fixed) is:

    P = (Πi ni+!)(Πj n+j!) / [n!  Πi Πj nij!]
The marginal totals of the contingency table are again
assumed to be fixed. Below is the IJ table shown for
review:

                               Y
              1       2      ...    J-1      J
        1    n11     n12     ...   n1,J-1   n1J      n1+
        2    n21     n22     ...   n2,J-1   n2J      n2+
  X    ...
       I-1  nI-1,1  nI-1,2   ...  nI-1,J-1  nI-1,J   nI-1,+
        I    nI1     nI2     ...   nI,J-1   nIJ      nI+
             n+1     n+2     ...   n+,J-1   n+J       n

For 22 tables, the multiple hypergeometric simplifies to
the hypergeometric.
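
Below is a sketch of how this probability could be computed for any
observed table (the function name table.prob is made up; it is not
part of the course programs):

# Multiple hypergeometric probability of a table given its margins
table.prob <- function(tab) {
  log.p <- sum(lgamma(rowSums(tab) + 1)) + sum(lgamma(colSums(tab) + 1)) -
           lgamma(sum(tab) + 1) - sum(lgamma(tab + 1))   # lgamma(x+1) = log(x!)
  exp(log.p)
}
# For a 2x2 table this reduces to the hypergeometric, e.g. the tea tasting table:
table.prob(matrix(c(3, 1, 1, 3), nrow = 2))   # 0.2286, same as dhyper(3, 4, 4, 4)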
Example: Table 2.10 of Agresti (1996, p. 45) (tab2.10.R)
n.table <- array(data = c(0, 1, 0, 7, 1, 8, 0, 1, 0, 0, 1, 0,
0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0), dim = c(3,
9))
> n.table
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    0    7    0    0    0    0    0    1    1
[2,]    1    1    1    1    1    1    1    0    0
[3,]    0    8    0    0    0    0    0    0    0
> fisher.test(n.table)
Fisher's Exact Test for Count Data
data: n.table
p-value = 0.001505
alternative hypothesis: two.sided
> x.sq<-chisq.test(n.table, correct=F)
Warning message:
Chi-squared approximation may be incorrect in:
chisq.test(n.table, correct = F)
> x.sq
Pearson's Chi-squared test
data: n.table
X-squared = 22.2857, df = 16, p-value = 0.1342
Notice the difference in the p-values between the two
tests.
Permutation tests
Introduction to Modern Nonparametric Statistics by
James J. Higgins (2003) is a very good reference on
these types of tests.
Similar to Fisher’s exact test, it would be nice if we could
write out the exact probability distribution for statistics
like X2 or G2 and use these distributions to judge how
likely it is to observe the test statistic value under a null
hypothesis. In the tea tasting experiment, there are 5
unique 22 tables under independence that produce the
following probabilities:

   n11      Table       P(n11)     X2
    0       0  4
            4  0        0.0143      8
    1       1  3
            3  1        0.2286      2
    2       2  2
            2  2        0.5143      0
    3       3  1
            1  3        0.2286      2
    4       4  0
            0  4        0.0143      8

Notice that the X2 is the same for some tables. Taking
this into account, the exact probability distribution of X2
can be written as

    X2     P(X2)     CDF
     0     0.5143   0.5143
     2     0.4571   0.9714
     8     0.0286   1.0000
The CDF column represents the “cumulative distribution
function.” Remember that with a Pearson chi-square
test for independence, we would use a chi-square distribution with 1 df to
approximate this discrete distribution. Below are some
tables and plots showing how poor this approximation is:
    X2     P(X2)     CDF    CDF for chi-square(1 df)
     0     0.5143   0.5143         0.0000
     2     0.4571   0.9714         0.8427
     8     0.0286   1.0000         0.9953
[Figure: "CDFs" plot of the exact cumulative probability of X2 (labeled Exact) compared to the chi-square(1 df) CDF over X2 = 0 to 8]
My perm_test_motivate.R program does the new
calculations.
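
The calculations can also be reproduced with a few lines of R (a
sketch only, not the actual perm_test_motivate.R program); with all
margins fixed at 4, n11 determines the table, so loop over its
possible values:

# Exact distribution of X^2 for the tea tasting margins
prob <- dhyper(0:4, 4, 4, 4)
x.sq <- numeric(5)
for (n11 in 0:4) {
  tab <- matrix(c(n11, 4 - n11, 4 - n11, n11), nrow = 2)
  x.sq[n11 + 1] <- chisq.test(tab, correct = FALSE)$statistic  # warnings are expected
}
tapply(prob, x.sq, sum)   # P(X^2 = 0), P(X^2 = 2), P(X^2 = 8): 0.5143 0.4571 0.0286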
A more general way to see this same exact distribution
representation is to consider all possible “permutations”
of the row and column numbers. For example, we
observed the table:
                  Guess Pour First
                    Milk    Tea
  Poured   Milk      3       1       4
  First    Tea       1       3       4
                     4       4       8
There are 8 distinct observations the lady needs to
make. We could label these as z1, z2, …, z8. Suppose
we observed the following:
  Row    Column
   1     z1 = 1
   1     z2 = 1
   1     z3 = 1
   1     z4 = 2
   2     z5 = 1
   2     z6 = 2
   2     z7 = 2
   2     z8 = 2

which produces the table above and X2 = 2. Under
independence, these column numbers could have
appeared with any of the row numbers. For example,
we could have had

  Row    Column
   1     z2 = 1
   1     z1 = 1
   1     z3 = 1
   1     z4 = 2
   2     z5 = 1
   2     z6 = 2
   2     z7 = 2
   2     z8 = 2

resulting in the same 22 table, so that X2 = 2 again.
Also, we could have had

  Row    Column
   1     z1 = 1
   1     z2 = 1
   1     z7 = 2
   1     z4 = 2
   2     z5 = 1
   2     z6 = 2
   2     z3 = 1
   2     z8 = 2
resulting in a contingency table with all 2's in the cells
and X2 = 0. These last two examples are "permutations"
of the data, and there are 8! = 40,320 permutations in
total. Because of the independence assumption, each of
these is equally likely to occur – i.e., 1/40,320. If we
found all possible permutations, we could form a table as
follows:
    X2    # of permutations    Proportion
     0         20,736            0.5143
     2         18,432            0.4571
     8          1,152            0.0286
which is the same exact distribution that we saw before!
In fact, one could have found this with
> dhyper(0:4, 4, 4, 4)*factorial(8)
[1]   576  9216 20736  9216   576
in R.
In order to calculate a p-value, we can use this exact
distribution. With X2 = 2 observed, the p-value is
P(A  2) = 0.4571 + 0.0286 = 0.4857 where A is a random
variable with this exact distribution (in a more
mathematical statistics setting, one would write x2 = 2 is
observed and the p-value is P(X2  2)).
Frequently, the number of permutations is going to be so
large that we can not calculate every permutation.
Instead, we will randomly select a large number, say B,
and calculate the estimate of the exact distribution from
those. This estimate is often referred to as the
“permutation distribution.” Using this distribution to do a
hypothesis test is referred to as a “permutation test.”
Below is a description of a general way to find the
permutation distribution.

1) Randomly permute the column numbers. Put these
   back into a data set with the row numbers.
2) Calculate X2. Denote this statistic by X2* to avoid
   confusion with the observed X2.
3) Repeat 1) and 2) B times where B is a large number
   (1,000 or more).
4) Plot a histogram of the X2*'s. This serves as a visual
   estimate of the exact distribution of X2.

To calculate our p-value, we can obtain an initial
impression of whether it will be small or large by seeing
where X2 falls on this histogram. To calculate it formally,
we can use step 5.

5) The p-value is (1/B)  (# of X2*  X2). Small p-values
   indicate the observed X2 would be unusual to obtain if
   independence was true.

How can we do all of this in R? First, we will need to put
the data into its "raw form" (this is my own term), so that
every cell in the contingency table is represented by row
and column numbers like in the listing shown earlier. We
can then use the sample() function to find each permutation.
The example next shows the whole process.
Example: Table 2.10 of Agresti (1996) (tab2.10-v2.R)
> n.table<-array(data = c(0, 1, 0,
                          7, 1, 8,
                          0, 1, 0,
                          0, 1, 0,
                          0, 1, 0,
                          0, 1, 0,
                          0, 1, 0,
                          1, 0, 0,
                          1, 0, 0), dim=c(3,9))
> x.sq<-chisq.test(n.table, correct=F)
Warning message:
Chi-squared approximation may be incorrect in:
chisq.test(n.table, correct = F)
> x.sq
Pearson's Chi-squared test
data: n.table
X-squared = 22.2857, df = 16, p-value = 0.1342
Note that X2 = 22.29.
>##########################################################
> #Put data into raw form
> all.data<-matrix(data = NA, nrow = 0, ncol = 2)
>
> #Put data in "raw" form
> for (i in 1:nrow(n.table)) {
for (j in 1:ncol(n.table)) {
all.data<-rbind(all.data, matrix(data = c(i, j),
nrow = n.table[i,j], ncol = 2, byrow=T))
}
}
There were 16 warnings (use warnings() to see them)
> #Note that warning messages will be generated since
n.table[i,j]=0 sometimes
>
> all.data
      [,1] [,2]
 [1,]    1    2
 [2,]    1    2
 [3,]    1    2
 [4,]    1    2
 [5,]    1    2
 [6,]    1    2
 [7,]    1    2
 [8,]    1    8
 [9,]    1    9
[10,]    2    1
[11,]    2    2
[12,]    2    3
[13,]    2    4
[14,]    2    5
[15,]    2    6
[16,]    2    7
[17,]    3    2
[18,]    3    2
[19,]    3    2
[20,]    3    2
[21,]    3    2
[22,]    3    2
[23,]    3    2
[24,]    3    2
> save<-xtabs(~all.data[,1]+ all.data[,2])
> save
all.data[, 2]
all.data[, 1] 1 2 3 4 5 6 7 8 9
1 0 7 0 0 0 0 0 1 1
2 1 1 1 1 1 1 1 0 0
3 0 8 0 0 0 0 0 0 0
> rowSums(save)
1 2 3
9 7 8
> colSums(save)
 1  2  3  4  5  6  7  8  9
 1 16  1  1  1  1  1  1  1
This matches with the original contingency table so the
raw data part worked. Note what the row and column
marginal totals are!
Below is a further explanation of the code used to put
the data into raw form:
The "c(i,j)" creates a vector containing the row (i)
and column (j) index for the raw data format. The
"matrix( ... )" part tells R to create a matrix with
contents of "c(i,j)" and a number of rows of
"n.table[i,j]", number of columns of "2", and do this
by row (meaning, c(i,j) will be a 1x2 vector). Since
c(i,j) is only one vector, R duplicates as many times
as it is told to do by specifying "n.table[i,j]" as the
number of rows (R calls this recycling). The "rbind(
... )" tells R to combine everything in "all.data" and
"matrix( ... )" by row. Thus, everything that was in
"all.data" comes first and the "matrix( ... )" is put
below it. This is done for all rows and columns of
the data through using the two for loops.
> ########################################################
> #Do one permutation to illustrate – i.e., find one X^2*
> set.seed(4088)
> all.data.star<-cbind(all.data[,1],
    sample(all.data[,2], replace=F))
> all.data.star
      [,1] [,2]
 [1,]    1    2
 [2,]    1    2
 [3,]    1    9
 [4,]    1    2
 [5,]    1    2
 [6,]    1    2
 [7,]    1    2
 [8,]    1    4
 [9,]    1    2
[10,]    2    2
[11,]    2    8
[12,]    2    2
[13,]    2    2
[14,]    2    2
[15,]    2    2
[16,]    2    2
[17,]    3    6
[18,]    3    2
[19,]    3    1
[20,]    3    3
[21,]    3    2
[22,]    3    2
[23,]    3    5
[24,]    3    7
> calc.stat<-chisq.test(all.data.star[,1],
all.data.star[,2], correct=F)
Warning message:
Chi-squared approximation may be incorrect in:
chisq.test(all.data.star[, 1], all.data.star[, 2], correct
= F)
> calc.stat$statistic
X-squared
17.33036
> save.star<-xtabs(~all.data.star[,1] + all.data.star[,2])
> save.star
all.data.star[, 2]
all.data.star[, 1] 1 2 3 4 5 6 7 8 9
1 0 7 0 1 0 0 0 0 1
2 0 6 0 0 0 0 0 1 0
3 1 3 1 0 1 1 1 0 0
> rowSums(save.star)
1 2 3
9 7 8
> colSums(save.star)
 1  2  3  4  5  6  7  8  9
 1 16  1  1  1  1  1  1  1
>
Notes:
 To illustrate one possible permutation of the data, the
all.data.star data set is found. Notice how the column
numbers are permuted using the sample() function.
The row numbers are held fixed. The row and column
numbers are then put back together to form a matrix.
The xtabs(), rowSums(), and colSums() functions'
output shows the marginal totals are still the same as
with the observed data. The X2* statistic is 17.33 for
this permutation.
 What is the probability this one permutation would
occur?
 Suppose I did a different permutation. Let
set.seed(4089). For this seed, X2* = 16.46:
> save.star
                  all.data.star[, 2]
all.data.star[, 1] 1 2 3 4 5 6 7 8 9
                 1 0 6 0 0 1 1 1 0 0
                 2 1 4 1 0 0 0 0 0 1
                 3 0 6 0 1 0 0 0 1 0
> rowSums(save.star)
1 2 3
9 7 8
> colSums(save.star)
 1  2  3  4  5  6  7  8  9
 1 16  1  1  1  1  1  1  1
 Now, I would like to repeat this process B=1,000 times
to get 1,000 different X2*'s. These X2*'s will then
represent my permutation distribution.
> #########################################################
> # A simple function and for loop to find the permutation
  distribution.

> do.it<-function(data.set){
    all.data.star<-cbind(data.set[,1],
      sample(data.set[,2], replace=F))
    chisq.test(all.data.star[,1], all.data.star[,2],
      correct=F)$statistic
  }

> summarize<-function(result.set, statistic, df, B) {
    par(mfrow = c(1,2))

    #Histogram
    hist(x = result.set, main = expression(paste("Histogram of ",
      X^2, " perm. dist.")), col = "blue", freq = FALSE)
    curve(expr = dchisq(x = x, df = df), col = "red", add = TRUE)
    segments(x0 = statistic, y0 = -10, x1 = statistic, y1 = 10)

    #QQ-Plot
    chi.quant<-qchisq(p = seq(from = 1/(B+1), to = 1 - 1/(B+1),
      by = 1/(B+1)), df = df)
    plot(x = sort(result.set), y = chi.quant, main =
      expression(paste("QQ-Plot of ", X^2, " perm. dist.")))
    abline(a = 0, b = 1)
    par(mfrow = c(1,1))

    #p-value
    mean(result.set>=statistic)
  }
> #Example use of do.it function
> do.it(data.set = all.data)
X-squared
16.14286
Warning message:
Chi-squared approximation may be incorrect in:
chisq.test(all.data.star[, 1], all.data.star[, 2], correct
= F)
> B<-1000
> results<-matrix(data = NA, nrow = B, ncol = 1)
> set.seed(5333)
> for(i in 1:B) {
results[i,1]<-do.it(all.data)
}
There were 50 or more warnings (use warnings() to see the
first 50)
> summarize(results, x.sq$statistic, (nrow(n.table)-1)*(ncol(n.table)-1), B)
[1] 0.003
[Figure: "Histogram of X2 perm. dist." with the chi-square density overlaid and the observed X2 marked, and "QQ-Plot of X2 perm. dist." of sort(result.set) versus chi-square quantiles (chi.quant)]
Notes:
 do.it() is a user written function! I have put the
sampling part and the calculation of X2* inside of it.
Notice the syntax used with the function. Also, notice
the example where I used the function once with the
all.data data set. AND, notice the last line of the function
gives the X2* value. For all functions written in R, the
last line defines what is returned as a result of the
function. Notice here the X2* value was printed
without me asking for it to be printed!
 The for loop is used to repeat using the do.it() function
B=1,000 times. The results are then stored in a matrix
called results. The warning messages just indicate that
the chi-square approximation to the distribution of
each X2* is probably not appropriate.
 The set.seed() function is used before the for loop so
that the results produced here can be reproduced by
others. Notice that it only needs to be set once before
the loop.
 summarize() is another user written function to help
summarize the results in a histogram, QQ-plot, and to
find the p-value. Notice again that the last line finds the
p-value and this is returned as the result of the function.
 Remember that a chi-square distribution with 16 df is
used with X2 for a "regular" Pearson chi-square test for
independence. The QQ-plot plots the quantiles of this
chi-square distribution versus the X2* values. If the
values fall on a straight line at 45 degrees from the origin,
the X2* values would all be equal to the quantiles of the
chi-square distribution with 16 df. Thus, the distribution
for X2 could be approximated by this chi-square
distribution. As you can see, this does NOT happen
here! See qq_plot_chi.square.R for an example where
a simulated sample from a chi-square distribution is
used.
 There is strong evidence against independence since
the p-value is 0.003. Agresti (1996) found a p-value of
0.001.
 Below are the actual values of X2* obtained.
> table(round(results,2))

15.66  15.8 15.89  15.9 16.14 16.19 16.33  16.4 16.46 16.71 16.74 16.75 16.83
   67   100    30   129    61     6    12    40   123    60    47     7     2
 16.9 17.19 17.33 17.71 17.83 17.89 18.02 18.05 18.24 18.48 18.66 19.23 19.69
   99     7    12     9    31    53     1    32    11     9    16     2     8
 19.9    20 20.08 20.14 20.46 21.02 21.19 22.29 22.31
    2     4     7     5     1     1     3     1     2
Again, one can think of the permutation test as a way to
obtain an estimate of the probability distribution function
of the discrete random variable, X2, under Ho. Based
upon the above information, we obtain column 2 in the
table below.
                   Permutation dist.          Chi-square dist.
P(X2  15.66)     67/1000 = 0.067                  0.5231
P(X2  15.80)     (67+100)/1000 = 0.167            0.5330
P(X2  15.89)     0.197                            0.5393
      ...
P(X2  21.19)     0.997                            0.8287
P(X2  22.29)     0.998                            0.8659
P(X2  22.31)     1                                0.8665

The permutation distribution then replaces the chi-square
distribution approximation for X2. Below is P(X2 ≤ ___ )
using a chi-square distribution with 16 df.
> round(pchisq(q = as.numeric(names(table(round(results,2)))),
    df = (nrow(n.table)-1)*(ncol(n.table)-1)),4)
 [1] 0.5231 0.5330 0.5393 0.5400 0.5568 0.5602 0.5698 0.5746 0.5787 0.5954
[11] 0.5974 0.5980 0.6033 0.6079 0.6266 0.6354 0.6588 0.6661 0.6696 0.6773
[21] 0.6790 0.6900 0.7035 0.7133 0.7431 0.7655 0.7752 0.7798 0.7834 0.7860
[31] 0.7998 0.8223 0.8287 0.8659 0.8665
A plot of the cumulative distribution functions is shown
below (see program for code)
[Figure: "Compare CDFs" plot of the permutation distribution CDF (labeled Exact) versus the chi-square(16 df) CDF over X2 = 16 to 22]
Here’s a simpler way to get the p-value:
> set.seed(7709)
> chisq.test(n.table, correct = FALSE, simulate.p.value =
TRUE, B = 1000)
Pearson’s Chi-squared test with simulated p-value
(based on 1000 replicates)
data: n.table
X-squared = 22.2857, df = NA, p-value = 0.001
Why did I show the harder way first?
 It will help you understand what the chisq.test()
function is actually doing.
 You can not summarize the results from chisq.test()
with a histogram or QQ-plot.
 A permutation test is a very general approach for
inference. It can be used in many other settings which
are not already programmed into a function like
chisq.test()! A simple example: suppose you would
like to use G2 for the test of independence.
Permutation tests are closely related to bootstrap
hypothesis tests. See the additional Chapter 2 notes for
how one can use functions in the boot package to do
permutation tests.
Example: Larry Bird (bird_perm.R)
> #Create contingency table - notice the data is entered by
columns
> n.table<-array(c(251, 48, 34, 5), dim=c(2,2),
dimnames=list(First = c("made", "missed"),
Second = c("made", "missed")))
> n.table
        Second
First    made missed
  made    251     34
  missed   48      5

> x.sq<-chisq.test(n.table, correct=F)
> x.sq

        Pearson's Chi-squared test

data:  n.table
X-squared = 0.2727, df = 1, p-value = 0.6015
> #########################################################
> #Find raw data
> all.data<-matrix(data = NA, nrow = 0, ncol = 2)
> #Put data in "raw" form
> for (i in 1:nrow(n.table)) {
for (j in 1:ncol(n.table)) {
all.data<-rbind(all.data, matrix(data = c(i,j), nrow
= n.table[i,j], ncol = 2, byrow=T))
}
}
> #Check
> xtabs(~all.data[,1]+ all.data[,2])
             all.data[, 2]
all.data[, 1]   1  2
            1 251 34
            2  48  5
Here’s how the test can be done using the methods
demonstrated in the last example. When you do it
yourself, you should only use one of these methods unless
instructed to do otherwise.
Code for method #1: The same do.it() and summarize()
functions are used here so only partial results are given:
> summarize(result.set = results, statistic = x.sq$statistic,
    df = (nrow(n.table)-1)*(ncol(n.table)-1), B = B)
[1] 0.624
[Figure: "Histogram of X2 perm. dist." with the chi-square(1 df) density overlaid, and "QQ-Plot of X2 perm. dist." of sort(result.set) versus chi-square quantiles]
> #Shows the different X^2* values
> table(round(results,2))
    0  0.17  0.27  0.78  0.98  1.82  2.13  3.31  3.71  5.23  5.74  7.59   8.2
  190   186   179   101   110    76    66    45    22    13     5     4     2
10.39
    1
> #chi-square app.
> round(pchisq(q = as.numeric(names(table(round(results,2)))),
df = (nrow(n.table)-1)*(ncol(n.table)-1)),4)
[1] 0.0000 0.3199 0.3967 0.6229 0.6778 0.8227 0.8556 0.9311 0.9459 0.9778
[11] 0.9834 0.9941 0.9958 0.9987
Code and output for method #2:
#Method #2
> set.seed(8912)
> chisq.test(n.table, correct = FALSE, simulate.p.value =
TRUE, B = 1000)
Pearson's Chi-squared test with simulated p-value
(based on 1000 replicates)
data: n.table
X-squared = 0.2727, df = NA, p-value = 0.659
Notes:
 The p-value is 0.624 for method #1 and 0.6590 for
method #2 indicating there is not sufficient evidence
against independence.
 The Pearson chi-square test for independence had a
p-value of 0.6015. The reason for the general
agreement between this test and the permutation test
is the sample size is large enough for the "asymptotic"
distribution used (chi-square) to work as the
approximate distribution for X2. See the QQ-plot.
 Notice the “discreteness” of the permutation
distribution. Why do you think this is happening?
 Below is a plot comparing the cumulative distribution
functions (see program for code):
[Figure: "Compare CDFs" plot of the permutation distribution CDF (labeled Exact) versus the chi-square(1 df) CDF over X2 = 0 to 10]
If you are interested in using exact inference for other
problems outside of categorical data analysis, there is a nice
software package which helps to automate these tests even
more than in R. The software is made by the Cytel
Corporation and is called StatXact. Also, PROC FREQ in
SAS has an EXACT option that will do the test.
2.7 Association in three-way tables
More than two categorical variables may be of interest.
In this setting, one can construct contingency tables
summarizing the counts of these additional variables.
Tests for independence between all of the variables or
between some of them conditional on the other variables
can be constructed. However, it is often more beneficial
to look at these types of settings from a modeling point
of view. Therefore, the discussion of these settings will
mostly be postponed until we get to models that can
handle them. What is next is an introduction to what a
contingency table would look like for three categorical
variables and some important things to look out for in
this setting (e.g., Simpson’s paradox).
In addition to the categorical variables, X and Y, suppose
there is a third categorical variable, Z, with k=1,…,K levels.
Let nijk denote a cell count for the ith row, jth column, and
kth layer of a “three-way” contingency table. If X has I=2
levels and Y has J=2 levels, then the following is the
contingency table for the counts:
          Z=1                          Z=2                                   Z=K
             Y                            Y                                     Y
           1      2                     1      2                              1      2
  X  1   n111   n121   n1+1    X  1   n112   n122   n1+2    ...     X  1   n11K   n12K   n1+K
     2   n211   n221   n2+1       2   n212   n222   n2+2               2   n21K   n22K   n2+K
         n+11   n+21   n++1           n+12   n+22   n++2                   n+1K   n+2K   n++K
Notes:
 A third subscript is added to the n's to denote the Z
variable.
 There are other ways to display a "three-way"
contingency table. See Table 2.10 of Agresti (2007) for
an example.
 This table could easily be extended to an IJK table.
 The table could also have been written in terms of
P(X = i, Y = j, Z = k) = πijk or pijk = nijk/n as well.
 μijk = E(nijk); i.e., the expected frequency for the ith row,
jth column, and kth layer.
 Properties such as Σi Σj Σk πijk = 1 are extended to the
three-way table.
Z as the control variable
Z often plays the role of a “control” variable. In this case,
the purpose is still to understand the relationship between
X and Y while controlling for Z. In addition to Z being
called a “layer” variable, Z is often called a “stratification”
variable.
Think of this as the categorical equivalent of an
analysis for a randomized complete block design.
The levels of X are the treatments, Y is the
response, and the levels of Z are the blocks.
Example: Salk vaccine clinical trials
We had the following contingency table set up previously
for this example.
             Polio    Polio free
  Vaccine
  Placebo
X is the drug (vaccine, placebo) and Y is the polio result
(polio, polio free). Z could denote the clinical trial
centers where the clinical trial takes place. Thus, we
could have the following table:
  Omaha                           N.Y.                                 L.A.
            Polio  Polio free              Polio  Polio free                    Polio  Polio free
  Vaccine                        Vaccine                        ...    Vaccine
  Placebo                        Placebo                               Placebo
The table above is called a three-way table since three
variables are represented in a contingency table format.
Odds ratios can also be found for a particular level of Z.
Since there are three categorical variables, the variables
of interest are put in the subscript with the level of the
conditioning variable. For 22K tables,

    θXY|k = θXY(k) = μ11k μ22k / (μ12k μ21k)  and
    θ̂XY|k = θ̂XY(k) = n11k n22k / (n12k n21k)

One could also define P(X=i, Y=j, Z=k) / P(Z=k) =
P(X=i, Y=j | Z=k) = πij|k and set up the odds ratios as

    θXY|k = π11|k π22|k / (π12|k π21|k)
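
Here is a sketch of how the estimated conditional odds ratios could
be found in R from a 22K array (the counts and object names below
are made up for illustration only; they are not the actual clinical
trial data):

# theta.hat_XY(k) for each partial table of a 2 x 2 x K array
polio.data <- array(c(10, 25, 30000, 31000,    # hypothetical center 1: n11k, n21k, n12k, n22k
                       8, 20, 28000, 29000),   # hypothetical center 2
                    dim = c(2, 2, 2),
                    dimnames = list(Trt = c("vaccine", "placebo"),
                                    Result = c("polio", "polio free"),
                                    Center = c("center 1", "center 2")))
apply(polio.data, MARGIN = 3,
      FUN = function(tab) tab[1,1]*tab[2,2] / (tab[1,2]*tab[2,1]))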
Conditional and marginal associations
In the Salk vaccine clinical trial example, each individual
22 table that relates drug to polio result for a specific
clinical trial center is called a "partial table". This is
because each table represents "part" of the 22K table.
The 22 table examined earlier (before clinical
center was known) is called a "marginal table" since it
ignores clinical trial center.
Remember how the word “margins” was used earlier
to denote summing over a categorical variable.
The partial table associations (relationships) between X
and Y are also called "conditional associations" since
they are dependent on the level of Z. An example of a
conditional association measure is θXY|k. The marginal
table associations between X and Y can be called
"marginal associations". An example of a marginal
association is calculating the odds ratio in the 22
marginal table for the Salk vaccine clinical trial example:

    θXY = μ11 μ22 / (μ12 μ21)  and  θ̂XY = n11 n22 / (n12 n21)
It is important to distinguish between the two types of
association. The marginal association can be VERY
different from the conditional associations! “Simpson’s
paradox” occurs when this happens.
Example: Simpson’s paradox example
This example comes from Appleton et al. (American
Statistician, 1996, p. 340-341). There were 1,314
women in the UK who participated in a survey in 1972-4
and then followed up on twenty years later. Information
about their age (in 1972-4), smoking status, and survival
status was recorded. Below is a marginal table
summarizing the survival and smoking status.
                  Survival status
                   Dead    Alive
  Smoker   Yes      139      443
           No       230      502
The estimated OR, θ̂XY, is 0.68 and a 95% confidence
interval for the population OR is (0.54, 0.88). Therefore,
the odds of being dead are between 0.54 and 0.88 times
larger for smokers than for non-smokers with 95%
confidence. Alternatively, the odds of survival are
between 1.14 and 1.87 times larger for smokers than for
non-smokers with 95% confidence. Given this
information, which would you prefer to be, a smoker or a
non-smoker?
Now, let’s take age into account.
                    Smoker: Yes        Smoker: No
  Age group        Dead    Alive      Dead    Alive       OR
  18-24               2       53         1       61      2.30
  25-34               3      121         5      152      0.75
  35-44              14       95         7      114      2.40
  45-54              27      103        12       66      1.44
  55-64              51       64        40       81      1.61
  65-74              29        7       101       28      1.15
  75+                13        0        64        0      0.21
Notice that most of these odds ratios are greater than 1,
indicating the estimated odds of dying are larger for
those who smoke than for those who do not smoke. For
example, θ̂XY(18-24) = 2.30. This contradicts the results
from the marginal association!
The most important item to get out of this example is to
make sure you account for additional variables because
you could otherwise draw incorrect conclusions.
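
A quick numerical check of the paradox using the counts shown above
(a sketch only; the object names are made up):

# Marginal vs. one conditional odds ratio for the smoking data
marg <- matrix(c(139, 230, 443, 502), nrow = 2)   # marginal table: Dead, Alive columns
age1 <- matrix(c(  2,   1,  53,  61), nrow = 2)   # Age = 18-24 partial table
marg[1,1]*marg[2,2] / (marg[1,2]*marg[2,1])   # 0.68: marginally, smoking looks "protective"
age1[1,1]*age1[2,2] / (age1[1,2]*age1[2,1])   # 2.30: within this age group it does not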
Read Agresti’s (2007) death penalty example for another
illustration of Simpson’s paradox.
Conditional independence
X is independent of Y at EACH level of Z; i.e.,
independence in each partial table.
More formally, conditional independence can be written
as

    θXY(1) = θXY(2) = … = θXY(K) = 1 for a 22K table

or

    πij|k = πi+|k π+j|k for each i=1,…,I, j=1,…,J, and k=1,…,K

What is πi+|k? πi+|k = Σj πij|k

Marginal independence: θXY = 1
See Agresti’s (2007) example for another reason why to
not look at the marginal table. There are cases where
the marginal and conditional associations are the same.
These are discussed in Chapter 7 with respect to
loglinear models.
Homogeneous X-Y association

X and Y have the same level of association across all levels
of Z.

For a 22K table, this means the partial ORs are the
same but not necessarily equal to 1: θXY(1) = θXY(2) = … =
θXY(K). This will be important when discussing the
Cochran-Mantel-Haenszel test in Chapter 4.