Statistics using R (spring of 2017)
Computer lab: Analysis of categorical data
March 7, 2017
1 Introduction
This computer lab contains theoretical background and some exercises on the analysis of categorical (tabular) data. The main focus is on the usage of the R functions chisq.test, fisher.test
and mcnemar.test. Further information about these and related test functions for categorical
data may be found in (Dalgaard, 2008, Chapter 8).
2 Chi-square tests
Suppose you have a certain gene in the human genome for which there are two different variants, alleles A and B. The possible genotypes (the set of two gene copies that all humans have of genes on non-sex chromosomes) are then AA, AB or BB. Here, AA means that an individual has two alleles of variant A, AB means one A allele and one B allele, and BB means two B alleles. A common assumption for a large population is that the numbers of individuals in the three genotype categories, N(AA), N(AB) and N(BB), should be proportional to pA^2, 2pA(1 − pA) and (1 − pA)^2, respectively. Here, pA is the proportion of individual genes with allele A in the whole population, making (1 − pA) the proportion of allele B. It may be argued that such proportions follow from a combination of random mating patterns and an assumption of similar functionality of the A and B alleles. Such a structure for the proportions is called a Hardy–Weinberg equilibrium. It is quite common in genetics that one wishes to detect whether there are deviations from such an equilibrium for a particular gene in a particular population.
Suppose that we know in advance that pA = 0.36, that the total sample size is n = 300 and that we observe the group counts n(AA) = 67, n(AB) = 103 and n(BB) = 130 for this sample. In such a situation, we can test the null hypothesis that the multinomial distribution has the predicted probabilities,

H0 : P(AA) = 0.36^2,  P(AB) = 2 · 0.36 · (1 − 0.36),  P(BB) = (1 − 0.36)^2,
by applying Pearson’s Chi-squared test for count data. The statistic for this test is denoted by χ² and is in general defined as

χ² = Σ_{i=1}^{k} (n_i^o − n_i^e)² / n_i^e,    (1)

where k is the number of cells in the table of counts, n_i^o is the observed count in cell i and n_i^e is the expected count in cell i (under the assumption that H0 holds). Use formula (1) to implement a function chisq.test.stat which takes a vector of observed counts no and a vector of expected counts ne as inputs and computes χ². What is the value of the test statistic when applying this function to our example of three groups?
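One possible solution is sketched below; it simply translates formula (1) into vectorized R code and then evaluates it on the counts above, with expected counts obtained from pA = 0.36 and n = 300.

# One possible implementation of formula (1): vectorized sum over all cells
chisq.test.stat <- function(no, ne) {
  sum((no - ne)^2 / ne)
}

# Applying it to the example: expected counts under H0 with pA = 0.36, n = 300
no <- c(67, 103, 130)
pA <- 0.36
ne <- 300 * c(pA^2, 2 * pA * (1 - pA), (1 - pA)^2)
chisq.test.stat(no, ne)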
Of course, there is already a function in R called chisq.test that can be used to perform such
tests. Use it to verify your answer to the previous question. (Note: you will need to supply both
the count data and the proportions of the null hypothesis when calling the function. Otherwise,
chisq.test will just take equal proportions as the hypothesis). What is the resulting p-value
for the test?
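One way to make such a call is sketched below; the p argument supplies the null-hypothesis proportions.

# Built-in Pearson Chi-square test with the null proportions passed via p
chisq.test(x = c(67, 103, 130),
           p = c(0.36^2, 2 * 0.36 * (1 - 0.36), (1 - 0.36)^2))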
Next, suppose that we do not know pA. It must then be estimated before formula (1) can be applied. It turns out that a rational estimate for our particular example is the relative proportion of allele A among all sampled alleles, pA = (2n(AA) + n(AB))/(2n). When pA is known, the Chi-square statistic has 2 degrees of freedom (there are three cell probabilities, but these must sum to 1, so the true variability is in two dimensions, and 2 = 3 − 1 is the number of degrees of freedom). However, when pA is estimated we lose another degree of freedom and are left with only 1 degree of freedom. When the built-in chisq.test computes the p-value corresponding to the value of the test statistic χ², it does not take such estimation into account. Using your chisq.test.stat, implement a function that takes the observed counts no as input, uses these to estimate pA, computes χ² via formula (1), and finally computes the corresponding p-value using the function pchisq. What is the resulting p-value for this test?
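A sketch of such a function is given below; the name hw.chisq.test is just illustrative, and it reuses chisq.test.stat from above.

# Sketch: Chi-square test of Hardy-Weinberg equilibrium with pA estimated
# from the data, hence 1 degree of freedom instead of 2
hw.chisq.test <- function(no) {
  n <- sum(no)
  pA.hat <- (2 * no[1] + no[2]) / (2 * n)
  ne <- n * c(pA.hat^2, 2 * pA.hat * (1 - pA.hat), (1 - pA.hat)^2)
  stat <- chisq.test.stat(no, ne)
  c(statistic = stat, p.value = pchisq(stat, df = 1, lower.tail = FALSE))
}
hw.chisq.test(c(67, 103, 130))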
3 Contingency tables
Suppose we wish to investigate co-occurrence patterns of two categorical properties in a population. Our study objects are independently sampled individuals from a population in which we categorize two different attributes, A and B. As a simple example, consider a human population of working adults and let the two attributes be sex (A) and employment status (B). Sex has two categories, men and women, corresponding to categories A1 and A2. For B we consider three possibilities, B1 = self-employed, B2 = employed in the private sector and B3 = employed in the public sector. It is important for the analysis that every individual belongs to exactly one A-category and exactly one B-category.
We can now think of cross-classification categories Ai Bj. For example, an individual that belongs to category A1 B2 will be characterized as a man employed in the private sector. Without any assumptions about the relation between the two variables or the sizes of their categories, we may conclude that a natural model for the six cross-classified counts n_ij = n(Ai Bj), i = 1, 2, j = 1, 2, 3, is that they form a multinomial distribution with parameters

n = the number of sampled individuals,
p_ij = the proportion of the population in category Ai Bj.
A particularly common question in connection to such cross-classifications concerns dependence
between the two factors. This is usually formulated as an assumption that all the proportions
p_ij can be factorized as p_ij = p_i q_j, where p_i is the large-population proportion of sex Ai and q_j the large-population proportion of employment status Bj. Independence then means that the employment patterns among men and women are the same, but observe that it could still be the case that there is a difference from 0.5 in the proportion of sexes among all employed!

Let n_i^A = Σ_j n_ij and n_j^B = Σ_i n_ij (the marginal counts) be the numbers of sampled individuals in categories Ai and Bj, respectively. If we wish to test the hypothesis above, we can do that by applying a Chi-square test to the 2x3 contingency table. Such a test uses the natural estimates p_i = n_i^A/n and q_j = n_j^B/n under the null hypothesis, pretends that those are perfect, and uses combinations of them to predict expected counts using the formula

n_ij^e = p_i q_j n = (n_i^A/n)(n_j^B/n) n = n_i^A n_j^B / n.    (2)
The test statistic χ² is then formed using formula (1), giving

χ² = Σ_{i,j} (n_ij − n_i^A n_j^B/n)² / (n_i^A n_j^B/n).    (3)
If the sample size n is large, it can be shown that χ² in this particular case is approximately Chi-square distributed with 2 degrees of freedom. This can be seen as follows: in the full model we have 5 free parameters, since we know that the sum of all 6 parameters (the cell proportions) is always exactly one. Similarly, in the null hypothesis model there is 1 free parameter for sex and 2 free parameters for employment status. When we test the hypothesis we measure the deviation from a three-dimensional object in a five-dimensional space, i.e. in reality we use 2 dimensions to build the test statistic. For a general contingency table with r rows and c columns, a standard Chi-square statistic may be constructed in the same manner. The number of degrees of freedom will be (r − 1)(c − 1).
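As an illustration of formulas (2) and (3), a general r x c version could be sketched as follows (the function names expected.counts and chisq.table.stat are purely illustrative):

# Expected counts n_i^A * n_j^B / n for a general r x c table of counts `tab`
expected.counts <- function(tab) {
  outer(rowSums(tab), colSums(tab)) / sum(tab)
}

# Chi-square statistic of formula (3), built as in formula (1)
chisq.table.stat <- function(tab) {
  ne <- expected.counts(tab)
  sum((tab - ne)^2 / ne)
}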
Download the file “Occupation.txt” from the course home page. Each row in this file corresponds to a randomly sampled individual from the population, and the category an individual belongs to is indicated by a 1 in the corresponding group column. Read this file into a data frame object ct.data, count the number of individuals in each group and create a matrix ct holding the counts by running
ct.data <- read.table("Occupation.txt", header = TRUE)
ct <- matrix(apply(ct.data, 2, sum), nrow = 2, ncol = 3, byrow = TRUE)
The rows of ct correspond to sex (Male, Female) and the columns to employment status (Self,
Private, Public). Use chisq.test to determine if the null hypothesis that sex and employment
status are independent can be rejected at level 0.05.
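A sketch of the final step, once ct has been created as above:

# Pearson Chi-square test of independence for the 2x3 table;
# reject independence at level 0.05 if the reported p-value is below 0.05
chisq.test(ct)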
4 Contingency tables with one pre-specified margin
Consider the same question as above, but assume that we select two different samples of sizes n_1^A = 150 and n_2^A = 100 of employed men and women, and categorize the employment status as before. Now the counts come from two multinomial distributions with three parameters each, one for the men and one for the women. The natural null hypothesis becomes that these two distributions have the same parameters. It turns out that exactly the same Chi-square statistic can be used in this situation, and if both sample sizes are large we also get the same approximate Chi-square distribution under this sampling scheme. Similarly, we may specify three sample sizes among the three B-categories and test the hypothesis that the three binomial distributions (multinomial with two groups) are the same by using the original Chi-square test!
Download the file “Occupation2.txt” from the course home page. The format of this file is
the same as that for “Occupation.txt”. The only difference is that “Occupation2.txt” begins
with 150 randomly chosen men and ends with 100 randomly chosen women. Proceed as in the
previous question and use chisq.test to check if the null hypothesis can be rejected at level
0.05.
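A sketch of the analysis, assuming the same reading procedure works for the new file (the object names ct2.data and ct2 are just illustrative):

# Build the 2x3 table of counts from Occupation2.txt and run the same test
ct2.data <- read.table("Occupation2.txt", header = TRUE)
ct2 <- matrix(apply(ct2.data, 2, sum), nrow = 2, ncol = 3, byrow = TRUE)
chisq.test(ct2)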
5 Small number contingency tables
In 2x2 contingency tables with small sample sizes it is common to use alternative tests to the
Chi-square, simply because the approximation with Chi-square distributions is not valid. It
turns out that in all three sampling situations described above one may use the following result
to construct a test: under the null hypothesis, the conditional distribution of n11 becomes a
hypergeometric distribution with parameters
nA
1 = the number of sampled objects,
n = the number of objects sampled from,
nB
1 /n
= the proportion of counted objects having property B1 .
The test is called Fisher’s exact test and usually rejects for both kinds of deviations in the tails of this distribution.
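As a purely illustrative example of calling the test in R (the counts below are made up):

# Fisher's exact test on a small, made-up 2x2 table
small.tab <- matrix(c(3, 1, 1, 3), nrow = 2, ncol = 2)
fisher.test(small.tab)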
Consider the function
generate.2x2 <- function(n, p11, p21, p12, p22) {
s <- sample(1:4, size = n, replace = TRUE, prob = c(p11, p21, p12, p22))
matrix(c(sum(s == 1), sum(s == 2), sum(s == 3), sum(s == 4)),
nrow = 2, ncol = 2)
}
This function generates a random 2x2 table (as a matrix) with total count equal to n, in such a way that each individual has probability pij of ending up in the i-th row and j-th column. Generate 10000 such tables with equal cell probabilities and n = 100 by running the code
n <- 100
table.count <- 10000
tables <- lapply(1:table.count, function(x) generate.2x2(n, 0.25, 0.25, 0.25, 0.25))
tables is now a list holding the random 2x2 tables. Use fisher.test to compute the vector
of p-values obtained when applying Fisher’s test to the 2x2 tables. Further, using this p-value
vector, estimate the probability that the test declares the data to be significant at the 0.05 level.
Also, estimate the same probability using chisq.test, first with option simulate.p.value =
FALSE and then with simulate.p.value = TRUE. (Hint: the function sapply is useful when
applying the test functions to tables. Also note that, given a table x, we can find the p-values
for the tests using chisq.test(x)$p.value and fisher.test(x)$p.value).
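One way to carry out these estimates, sketched under the setup above (the vector names are just illustrative):

# Estimated probabilities of rejection at the 0.05 level for the three tests
p.fisher <- sapply(tables, function(x) fisher.test(x)$p.value)
p.chisq <- sapply(tables, function(x) chisq.test(x)$p.value)
p.chisq.sim <- sapply(tables, function(x) chisq.test(x, simulate.p.value = TRUE)$p.value)
mean(p.fisher < 0.05)
mean(p.chisq < 0.05)
mean(p.chisq.sim < 0.05)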
When one uses Fisher’s exact test, the probability of rejecting the null hypothesis when it is true is guaranteed to be at most the significance level. However, the discreteness of the hypergeometric distribution typically makes the actual rejection probability strictly smaller than the nominal level. Using the large-sample-motivated Chi-square test may lead to problems with the rejection threshold in a complex way, and the significance level is not really guaranteed. Therefore, simulation-based p-values using the conditional distribution given the marginals (the same distribution as in Fisher’s exact test) are implemented as an alternative in the function chisq.test. This becomes quite similar to Fisher’s exact test, but the balance between rejection probabilities in the two tails of the hypergeometric distribution is sometimes lost.
Next, generate a new sequence of random tables in which there is a dependence between
factors by running
n <- 100
table.count <- 10000
tables2 <- lapply(1:table.count, function(x) generate.2x2(n, 0.2, 0.3, 0.3, 0.2))
Using the same method as above, estimate the probabilities of rejecting the null hypothesis for the three different test types given the new cell probabilities. What you have just done is to estimate the power of these tests. Note that it is somewhat problematic to make a fair comparison of the power properties of these three different methods when the exact significance levels are unequal.
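The same approach as above gives the power estimates, for example:

# Estimated power at the 0.05 level under the dependent cell probabilities
mean(sapply(tables2, function(x) fisher.test(x)$p.value) < 0.05)
mean(sapply(tables2, function(x) chisq.test(x)$p.value) < 0.05)
mean(sapply(tables2, function(x) chisq.test(x, simulate.p.value = TRUE)$p.value) < 0.05)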
6 Paired categorical data
Imagine a situation where we match some study objects in similar pairs, and then treat one of them with method A and the other with method B. In simple cases the outcome is recorded with a 0-1 score, where 0 denotes failure and 1 denotes success. Such an experiment with n pairs results in n paired observations, each of which belongs to the set of possible pairs (0, 1), (1, 1), (1, 0) and (0, 0). Here we interpret an observation (0, 1) as indicating that treatment A failed but treatment B succeeded in that particular pair. It is tempting to record these as counts in a contingency table, with A and B as factors with two levels each. However, due to the dependence structure that is built into the design, ordinary Chi-square tests do not work. (In fact, such a test would test for independence within the pairing, which we should not have if we have chosen the pairs properly.)
A common test in this situation ignores all pairs in which the treatment effects are equal and bases a test statistic on the conditional distribution of n0,1 given the sum n0,1 + n1,0. Under the null hypothesis that the treatment effects are the same, this distribution should be binomial with success probability p = 0.5. If A works better than B, we would expect n1,0 to tend to be larger than n0,1. This test is called McNemar’s test.
Download the file “TwoTreatments.txt” from the course home page, read it into a data frame
object and attach it. You should now be able to access the two vectors A and B, which code the
treatment success results. Use these vectors to build a 2x2 table (as a matrix) of the counts in
the four different categories. Finally, apply mcnemar.test to the table. What p-value does the
test report? Hint: you can obtain the count for category (0, 0) by running sum(A == 0 & B == 0).
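A sketch of the whole analysis, assuming the file contains 0/1 columns named A and B (the object names tt and two.by.two are just illustrative):

# Read the paired 0/1 outcomes, build the 2x2 table of pair counts and test
tt <- read.table("TwoTreatments.txt", header = TRUE)
attach(tt)
two.by.two <- matrix(c(sum(A == 0 & B == 0), sum(A == 1 & B == 0),
                       sum(A == 0 & B == 1), sum(A == 1 & B == 1)),
                     nrow = 2, ncol = 2)
mcnemar.test(two.by.two)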
References
Dalgaard, P. (2008). Introductory Statistics with R. Springer, New York, USA, second edition.