Statistics using R (spring of 2017)
Computer lab: Analysis of categorical data
March 7, 2017

1 Introduction

This computer lab contains theoretical background and some exercises on the analysis of categorical (tabular) data. The main focus is on the usage of the R functions chisq.test, fisher.test and mcnemar.test. Further information about these and related test functions for categorical data may be found in (Dalgaard, 2008, Chapter 8).

2 Chi-square tests

Suppose there is a certain gene in the human genome for which there are two different variants, alleles A and B. The possible genotypes (the pair of gene copies that all humans have of genes on non-sex chromosomes) are then AA, AB or BB. Here, AA means that an individual has two copies of allele A, AB means one A allele and one B allele, and BB means two B alleles. A common assumption for a big population is that the numbers of individuals in the three genotype categories, N(AA), N(AB) and N(BB), should be proportional to p_A^2, 2 p_A (1 - p_A) and (1 - p_A)^2, respectively. Here, p_A is the proportion of individual gene copies with allele A in the whole population, making (1 - p_A) the proportion of allele B. It may be argued that such proportions follow from a combination of random mating patterns and an assumption of similar functionality of the A and B alleles. Such a structure for the proportions is called a Hardy-Weinberg equilibrium. It is quite common in genetics that one wishes to detect whether there are deviations from such an equilibrium for a particular gene in a particular population.

Suppose that we know in advance that p_A = 0.36, that the total sample size is n = 300, and that we observe group counts n(AA) = 67, n(AB) = 103 and n(BB) = 130 for this sample. In such a situation, we can test the null hypothesis that we have the predicted probabilities in the multinomial distribution,

    H0 : P(AA) = 0.36^2,  P(AB) = 2 * 0.36 * (1 - 0.36),  P(BB) = (1 - 0.36)^2,

by applying Pearson's Chi-squared test for count data.
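With a known p_A, the null probabilities can be passed to chisq.test through its p argument; the following sketch illustrates the call with the counts given above:

```r
## Observed genotype counts and the known allele proportion
counts <- c(AA = 67, AB = 103, BB = 130)
pA <- 0.36

## Null-hypothesis (Hardy-Weinberg) cell probabilities; they sum to 1
p0 <- c(pA^2, 2 * pA * (1 - pA), (1 - pA)^2)

## Pearson's Chi-square test of the counts against these proportions
chisq.test(counts, p = p0)
```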
The statistic for this test is denoted χ^2 and is in general defined as

    χ^2 = Σ_{i=1}^{k} (n_i^o - n_i^e)^2 / n_i^e,    (1)

where k is the number of cells in the table of counts, n_i^o is the observed count in cell i and n_i^e is the expected count in cell i (under the assumption that H0 holds). Use formula (1) to implement a function chisq.test.stat which takes a vector of observed counts no and a vector of expected counts ne as inputs and computes χ^2. What is the value of the test statistic when applying this function to our example of three groups?

Of course, there is already a function in R called chisq.test that can be used to perform such tests. Use it to verify your answer to the previous question. (Note: you will need to supply both the count data and the proportions of the null hypothesis when calling the function; otherwise, chisq.test will just take equal proportions as the hypothesis.) What is the resulting p-value for the test?

Next, suppose that we do not know p_A. It must then be estimated before formula (1) can be applied. It turns out that a natural estimate for our particular example is the relative proportion of allele A among all sampled alleles, p_A = (2 n(AA) + n(AB)) / (2n). When p_A is known, the Chi-square statistic has 2 degrees of freedom (there are three cell probabilities, but these must sum to 1, so the true variability is in two dimensions, and 2 = 3 - 1 is the number of degrees of freedom). However, when p_A is estimated we lose another degree of freedom and are left with only 1 degree of freedom. When the built-in chisq.test computes the p-value corresponding to the value of the test statistic χ^2, it does not take such estimation into account. Using your chisq.test.stat, implement a function that takes the observed counts no as input, uses these to estimate p_A, computes χ^2 via formula (1), and finally computes the corresponding p-value using the function pchisq. What is the resulting p-value for this test?
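A minimal sketch of the two functions described above, assuming the setup of the example (chisq.test.stat is the name prescribed by the text; the wrapper name hw.test is my own choice):

```r
## Chi-square statistic from observed and expected counts, formula (1)
chisq.test.stat <- function(no, ne) {
  sum((no - ne)^2 / ne)
}

## Hardy-Weinberg test with p_A estimated from the data;
## expects counts in the order (AA, AB, BB)
hw.test <- function(no) {
  n  <- sum(no)
  pA <- (2 * no[1] + no[2]) / (2 * n)                 # estimate of p_A
  ne <- n * c(pA^2, 2 * pA * (1 - pA), (1 - pA)^2)    # expected counts
  stat <- chisq.test.stat(no, ne)
  ## estimating p_A costs one degree of freedom, leaving df = 1
  pchisq(stat, df = 1, lower.tail = FALSE)
}
```

With p_A = 0.36 held fixed, chisq.test.stat applied to the example counts should reproduce the statistic reported by chisq.test.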
3 Contingency tables

Suppose we wish to investigate co-occurrence patterns of two categorical properties in a population. Our study objects are thought to be independently sampled individuals from a population in which we categorize two different attributes, A and B. As a simple example, consider a human population of working adults and let the two attributes be sex (A) and employment status (B). Sex has two categories, men and women, corresponding to categories A1 and A2. For B we consider three possibilities: B1 = self-employed, B2 = employed in the private sector and B3 = employed in the public sector. It is important for the analysis that every individual belongs to exactly one A-category and exactly one B-category.

We can now think of cross-classification categories Ai Bj. For example, an individual that belongs to category A1 B2 will be characterized as a man employed in the private sector. Without any assumptions about the relation between the two variables, we may conclude that a natural model for the six cross-classified counts n_ij = n(Ai Bj), i = 1, 2, j = 1, 2, 3, is that they form a multinomial distribution with parameters n = the number of sampled individuals and p_ij = the proportion of the population in category Ai Bj.

A particularly common question in connection with such cross-classifications concerns dependence between the two factors. This is usually formulated as an assumption that all the proportions p_ij can be factorized as p_ij = p_i q_j, where p_i is the population proportion of sex Ai and q_j the population proportion of employment status Bj. Independence then means that the employment patterns among men and women are the same, but observe that it could still be the case that the proportion of each sex among all employed differs from 0.5! Let n_i^A = Σ_j n_ij and n_j^B = Σ_i n_ij (the marginal counts) be the numbers of sampled individuals in categories Ai and Bj, respectively.
If we wish to test the hypothesis above, we can do so by applying a Chi-square test to a 2x3 contingency table. Such a test uses the natural estimates p_i = n_i^A / n and q_j = n_j^B / n under the null hypothesis, pretends that these are perfect, and uses combinations of them to predict the expected counts via the formula

    n_ij^e = p_i q_j n = (n_i^A / n)(n_j^B / n) n = n_i^A n_j^B / n.    (2)

The test statistic χ^2 is then formed using formula (1), giving

    χ^2 = Σ_{i,j} (n_ij - n_i^A n_j^B / n)^2 / (n_i^A n_j^B / n).    (3)

If the sample size n is large, it can be shown that χ^2 in this particular case is approximately Chi-square distributed with 2 degrees of freedom. This can be seen as follows: in the full model we have 5 free parameters, since we know that the sum of all 6 parameters (the cell proportions) is always exactly one. Similarly, in the null hypothesis model there is 1 free parameter for sex and 2 free parameters for employment status. When we test the hypothesis we measure the deviation from a three-dimensional object in a five-dimensional space, i.e. in reality we use 2 dimensions to build the test statistic. For a general contingency table with r rows and c columns, a standard Chi-square statistic may be constructed in the same manner. The degrees of freedom will be (r - 1)(c - 1).

Download the file "Occupation.txt" from the course home page. Each row in this file corresponds to a random individual sampled from the population. The category which an individual belongs to is coded by a 1 in the corresponding group column. Read this file into a data frame object ct.data, count the number of individuals in each group and create a matrix ct holding the counts by running

    ct.data <- read.table("Occupation.txt", header = TRUE)
    ct <- matrix(apply(ct.data, 2, sum), nrow = 2, ncol = 3, byrow = TRUE)

The rows of ct correspond to sex (Male, Female) and the columns to employment status (Self, Private, Public).
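Formulas (2) and (3) can be checked directly against chisq.test on any table; the 2x3 counts below are made up for illustration (the actual exercise uses the counts read from Occupation.txt):

```r
## Hypothetical 2x3 table (rows: sex, columns: employment status)
ct <- matrix(c(20, 40, 60,
               15, 50, 65), nrow = 2, ncol = 3, byrow = TRUE)
n  <- sum(ct)

## Expected counts under independence, formula (2): n_i^A * n_j^B / n
ne <- outer(rowSums(ct), colSums(ct)) / n

## Chi-square statistic, formula (3)
stat <- sum((ct - ne)^2 / ne)

## chisq.test reports the same statistic (no continuity correction
## is applied for tables larger than 2x2)
chisq.test(ct)$statistic
```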
Use chisq.test to determine whether the null hypothesis that sex and employment status are independent can be rejected at level 0.05.

4 Contingency tables with one pre-specified margin

Consider the same question as above, but assume that we select two separate samples of sizes n_1^A = 150 and n_2^A = 100 of employed men and women, respectively, and categorize the employment status as before. Now the counts come from two multinomial distributions with three parameters each, one for the men and one for the women. The natural null hypothesis becomes that these two distributions have the same parameters. It turns out that exactly the same Chi-square statistic can be used in this situation and, if both sample sizes are large, we also get the same approximate Chi-square distribution under this sampling scheme. Similarly, we may pre-specify three sample sizes for the three B-categories and test the hypothesis that the three binomial distributions (multinomial with two groups) are the same by using the original Chi-square test!

Download the file "Occupation2.txt" from the course home page. The format of this file is the same as that of "Occupation.txt". The only difference is that "Occupation2.txt" begins with 150 randomly chosen men and ends with 100 randomly chosen women. Proceed as in the previous question and use chisq.test to check whether the null hypothesis can be rejected at level 0.05.

5 Small number contingency tables

In 2x2 contingency tables with small sample sizes it is common to use alternatives to the Chi-square test, simply because the approximation by the Chi-square distribution is not valid. It turns out that in all three sampling situations described above, one may use the following result to construct a test: under the null hypothesis, the conditional distribution of n_11 given the marginals is a hypergeometric distribution with parameters n_1^A = the number of sampled objects, n = the number of objects sampled from, and n_1^B / n = the proportion of counted objects having property B1.
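The fixed-margin sampling scheme can be imitated with rmultinom; in this sketch the status probabilities are chosen arbitrarily and made equal for the two groups, so that the null hypothesis holds, and the point is simply that chisq.test is applied to the stacked table exactly as before:

```r
set.seed(1)                      # reproducibility of the sketch only
p <- c(0.2, 0.5, 0.3)            # common employment-status probabilities

men   <- rmultinom(1, size = 150, prob = p)   # fixed margin: 150 men
women <- rmultinom(1, size = 100, prob = p)   # fixed margin: 100 women

## Stack the two samples into a 2x3 table and test homogeneity
ct2 <- rbind(Male = as.vector(men), Female = as.vector(women))
chisq.test(ct2)
```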
The test is called Fisher's exact test and usually rejects for both kinds of deviations in the tails of this distribution. Consider the function

    generate.2x2 <- function(n, p11, p21, p12, p22) {
      s <- sample(1:4, size = n, replace = TRUE, prob = c(p11, p21, p12, p22))
      matrix(c(sum(s == 1), sum(s == 2), sum(s == 3), sum(s == 4)),
             nrow = 2, ncol = 2)
    }

This function generates a random 2x2 table (as a matrix) with total count equal to n, in such a way that each individual has probability pij of ending up in the i:th row and j:th column. Generate 10000 such tables with equal cell probabilities and n = 100 by running the code

    n <- 100
    table.count <- 10000
    tables <- lapply(1:table.count,
                     function(x) generate.2x2(n, 0.25, 0.25, 0.25, 0.25))

tables is now a list holding the random 2x2 tables. Use fisher.test to compute the vector of p-values obtained when applying Fisher's test to the 2x2 tables. Further, using this p-value vector, estimate the probability that the test declares the data significant at the 0.05 level. Also, estimate the same probability using chisq.test, first with the option simulate.p.value = FALSE and then with simulate.p.value = TRUE. (Hint: the function sapply is useful when applying the test functions to the tables. Also note that, given a table x, we can obtain the p-values of the tests as chisq.test(x)$p.value and fisher.test(x)$p.value.)

When one uses Fisher's exact test, the probability of rejecting the null hypothesis when it is true is guaranteed to be smaller than the significance level. However, the discreteness of the hypergeometric distribution typically makes the actual rejection probability smaller than the nominal level. Using the large-sample motivated Chi-square test may lead to problems with the threshold in a complex way, and the significance level is not really guaranteed.
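One possible way to carry out the estimation step looks as follows (with table.count reduced to 500 here so that the sketch runs quickly; the lab asks for 10000):

```r
generate.2x2 <- function(n, p11, p21, p12, p22) {
  s <- sample(1:4, size = n, replace = TRUE, prob = c(p11, p21, p12, p22))
  matrix(c(sum(s == 1), sum(s == 2), sum(s == 3), sum(s == 4)),
         nrow = 2, ncol = 2)
}

set.seed(42)
tables <- lapply(1:500, function(x) generate.2x2(100, 0.25, 0.25, 0.25, 0.25))

## Estimated rejection probability at the 0.05 level; for Fisher's
## exact test this typically stays below the nominal 0.05
p.fisher <- sapply(tables, function(x) fisher.test(x)$p.value)
mean(p.fisher < 0.05)
```

The same pattern with chisq.test(x)$p.value (and simulate.p.value = TRUE) gives the corresponding estimates for the other two tests.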
Thus, simulation-based p-values using the conditional distribution given the marginals, i.e. the same distribution as in Fisher's exact test, are implemented as an alternative also in the function chisq.test. This becomes quite similar to Fisher's exact test, but the balance between rejection probabilities in the two tails of the hypergeometric distribution is sometimes lost.

Next, generate a new sequence of random tables in which there is a dependence between the factors by running

    n <- 100
    table.count <- 10000
    tables2 <- lapply(1:table.count,
                      function(x) generate.2x2(n, 0.2, 0.3, 0.3, 0.2))

Using the same method as above, estimate the probabilities of rejecting the null hypothesis for the three different test types given the new cell probabilities. What you have just done is to compute the power of these tests. Note that it is somewhat problematic to make a fair comparison of the power properties of these three methods when their exact significance levels are unequal.

6 Paired categorical data

Imagine a situation where we match some study objects in similar pairs, then treat one member of each pair with method A and the other with method B. In simple cases the outcome is recorded with a 0-1 score, where 0 denotes failure and 1 denotes success. The outcome of such an experiment with n pairs will be n paired observations, each of which belongs to the set of possible pairs (0, 1), (1, 1), (1, 0) and (0, 0). Here we interpret an observation (0, 1) as indicating that treatment A failed but treatment B succeeded in that particular pair. It is tempting to record these as counts in a contingency table, with A and B as factors with two levels each. However, due to the dependence structure built into the design, ordinary Chi-square tests do not work. (In fact, such a test would be relevant for testing independence in the pairing, which we should not have if we have chosen the pairs properly.)
A common test in this situation is to ignore all pairs in which the treatment effects are equal and base a test statistic on the conditional distribution of n_{0,1} given the sum n_{0,1} + n_{1,0}. Under the null hypothesis that the treatment effects are the same, this distribution should be binomial with probability p = 0.5. If A works better than B, we would expect n_{1,0} to tend to be larger than n_{0,1}. This test is called McNemar's test.

Download the file "TwoTreatments.txt" from the course home page, read it into a data frame object and attach it. You should now be able to access the two vectors A and B, which code the treatment success results. Use these vectors to build a 2x2 table (as a matrix) of the counts in the four categories. Finally, apply mcnemar.test to the table. What p-value does the test report? (Hint: you can obtain the count for category (0, 0) by running sum(A == 0 & B == 0).)

References

Dalgaard, P. (2008). Introductory Statistics with R. Second edition. Springer, New York.
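A sketch of the table-building step, using simulated 0-1 vectors as stand-ins for the ones read from TwoTreatments.txt (the file itself is on the course home page):

```r
set.seed(7)
A <- rbinom(50, 1, 0.6)   # stand-in for the vector A from the file
B <- rbinom(50, 1, 0.6)   # stand-in for the vector B from the file

## 2x2 table of paired outcomes; rows index A (0/1), columns index B (0/1)
tab <- matrix(c(sum(A == 0 & B == 0), sum(A == 0 & B == 1),
                sum(A == 1 & B == 0), sum(A == 1 & B == 1)),
              nrow = 2, ncol = 2, byrow = TRUE)

mcnemar.test(tab)
```

An equivalent shortcut is table(A, B), which builds the same counts directly.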