2. Two-way contingency tables

2.1 Probability structure for contingency tables

Setup: Let X be a categorical variable with i = 1,...,I levels. Let Y be a categorical variable with j = 1,...,J levels. There are IJ different possible combinations of X and Y together. Frequency counts of these combinations can be summarized in an I×J "contingency table". These are often called "two-way" tables since there are two variables of interest.

Example: Larry Bird (data source: Wardrop, American Statistician, 1995)

Free throws are typically shot in pairs. Below is a contingency table summarizing Larry Bird's first and second free throw attempts during the 1980-1 and 1981-2 NBA seasons. Let X = first attempt and Y = second attempt.

                  Second
First        Made   Missed   Total
Made          251       34     285
Missed         48        5      53
Total         299       39     338

Interpreting the table:
251 first and second free throw attempts were both made.
34 first free throw attempts were made and the second ones were missed.
48 first free throw attempts were missed and the second ones were made.
5 first and second free throw attempts were both missed.
285 first free throws were made regardless of what happened on the second attempt.
299 second free throws were made regardless of what happened on the first attempt.
338 free throw pairs were shot during these seasons.

What types of questions would be of interest for this data?

Example: Field goals

Below is a two-way table summarizing field goals from the 1995 NFL season (Bilder and Loughin, Chance, 1998). The data can be considered a representative sample from the population. The two categorical variables in the table are stadium type (dome or outdoors) and field goal result (success or failure).

                    Field goal result
Stadium type     Success   Failure   Total
Dome                 335        52     387
Outdoors             927       111    1038
Total               1262       163    1425

What types of questions would be of interest for this data?

Example: Salk vaccine clinical trials

From p. 186 of the S-Plus 6 Guide to Statistics, Volume I: In the Salk vaccine trials, two large groups were involved in the placebo-control phase of the study. The first group, which received the vaccination, consisted of 200,745 individuals. The second group, which received a placebo, consisted of 201,229 individuals. There were 57 cases of polio in the first group and 142 cases of polio in the second group.

               Vaccine    Placebo      Total
Polio               57        142        199
Polio free     200,688    201,087    401,775
Total          200,745    201,229    401,974

What types of questions would be of interest for this data?

Contingency tables do not have to be 2×2!

Example: #7.24

Subjects were asked whether methods of birth control should be available to teenagers between the ages of 14 and 16.

                                        Teenage birth control
Religious attendance      strongly agree   agree   disagree   strongly disagree
Never                           49           49       19              9
<1 per year                     31           27       11             11
1-2 per year                    46           55       25              8
Several times per year          34           37       19              7
1 per month                     21           22       14             16
2-3 per month                   26           36       16             16
Nearly every week                8           16       15             11
Every week                      32           65       57             61
Several times per week           4           17       16             20

Notice the "total" column and row are not necessary to include with a contingency table. Also, notice that both categorical variables are ordered.

What types of questions would be of interest for this data?

In the previous examples, subjects were allowed to fall in only one cell of the contingency table. There are times when subjects may fall in more than one cell!
Example: Education and SOV

Loughin and Scherer (Biometrics, 1998) examine a sample of 262 Kansas livestock farmers who are asked, "What are your primary sources of veterinary information?" Farmers may pick as many sources as apply from (A) professional consultant, (B) veterinarian, (C) state or local extension service, (D) magazines, and (E) feed companies and representatives. Since respondents may pick any number of the possible categorical responses, Coombs (1964) refers to this type of variable as a "pick any/c" variable ("pick any/c" is read as "pick any out of c", where c is the number of categorical responses). Farmers are also asked many demographic questions, including their highest attained level of education. Note that individual farmers may be represented more than once in the table since they may pick all sources that apply.

                           Information source
Education             A     B     C     D     E   Total responses   Farmers
High school          19    38    29    47    40         173            88
Vocational school     2     6     8     8     4          28            16
2-year college        1    13    10    17    14          55            31
4-year college       19    29    40    53    29         170           113
Other                 3     4     8     6     6          27            14
Total                44    90    95   131    93         453           262

Higgins (An Introduction to Modern Nonparametric Statistics, 2003) also discusses data in this format. The data are given in a multinomial format in Agresti (2002, p. 484-6).

What types of questions would be of interest for this data?

Notes:
Unless otherwise mentioned, all of the contingency tables in this course will have subjects (or items) who fall in only one cell.
There are many other examples of contingency tables from marketing, psychology, ...
The contingency tables presented here are called "two-way" since there are only two categorical variables. Later, we will discuss "three-way" contingency tables when there are three categorical variables. Future chapters will discuss four-way, five-way, ... tables.

Probability distributions for contingency tables

Let πij = P(X=i, Y=j); i.e., the probability that category i of X and category j of Y is chosen. These probabilities can be put into a contingency table format. If I=2 and J=2, then the following table is produced:

             Y
X          1      2
1        π11    π12
2        π21    π22

Notes:
π11, π12, π21, and π22 form the "joint" probability distribution for X and Y (joint since there are two random variables). Notice the row number goes first in the subscript for π and the column number goes second.
π11 + π12 + π21 + π22 = 1; thus, every item falls in one of the cells.
Suppose that only the probability distribution for Y is examined. This is called the "marginal" probability distribution for Y. It is denoted by P(Y=1) = π+1 and P(Y=2) = π+2, with π+1 + π+2 = 1. The "+" in the subscript denotes that all possible values of X are being summed over. Thus, π+1 = π11 + π21 and π+2 = π12 + π22. Equivalently, π+1 = P(Y=1) = P(Y=1, X=1) + P(Y=1, X=2). The marginal distribution of X, π1+ and π2+, can be found in a similar manner. You will often see a "•" instead of a "+" used in exactly the same way in other textbooks.

The contingency table of the probabilities can be extended to include the marginal distributions of Y and X. Notice how the "marginal" probability distribution is put in the "margins" of the table.

             Y
X          1      2
1        π11    π12    π1+
2        π21    π22    π2+
         π+1    π+2     1

Each of these πij's is a population parameter. These parameters can be estimated by taking a sample. Counts from the sample are summarized in a contingency table as shown below in a general format.

             Y
X          1      2
1        n11    n12    n1+
2        n21    n22    n2+
         n+1    n+2     n

Thus, n11 denotes the table count for X=1 and Y=1.
Also, n1+= n11+ n12 denotes the table count for X=1 without regards Y. Finally, n = n11+n12+ n21+ n22 is the total sample size. This could also be denoted by n++. Using the contingency table counts, the parameter estimates are found using pij = nij/n, pi+ = ni+/n, and p+j = n+j/n. Note that ̂ij could also be used as notation, but Agresti prefers to use a “p”. The resulting contingency table with the “sample proportions” or “sample probabilities” or “estimated probabilities”… is: Y 1 X 2 1 p11 p12 p1+ 2 p21 p22 p2+ p+1 p+2 2010 Christopher R. Bilder 1 2.10 22 contingency tables can be extended to IJ tables as shown below: Y 1 X 2 J 1 11 12 1J 1+ 2 21 22 2J 2+ I I1 I2 IJ I+ +1 +2 +J 1 J I j1 i1 where i+ = ij for i=1,…,I and +j = ij for j=1,…,J Y X 1 2 J 1 n11 n12 n1J n1+ 2 n21 n22 n2J n2+ I nI1 nI2 nIJ nI+ n+1 n+2 n+J n J I j1 i1 where ni+ = nij for i=1,…,I and n+j = nij for j=1,…,J 2010 Christopher R. Bilder 2.11 Y X 1 2 J 1 p11 p12 p1J p1+ 2 p21 p22 p2J p2+ I pI1 pI2 pIJ pI+ p+1 p+2 p+J 1 J I j1 i1 where pi+ = pij for i=1,…,I and p+j = pij for j=1,…,J The contingency table could also be written in terms of the expected cell counts, ij, which is simply E(nij). Note that ij = nij. Example: Larry Bird (bird.R) Second Made Missed Total Made n11=251 n12=34 n1+=285 First Missed n21=48 n22=5 n2+=53 Total n+1=299 n+2=39 n=338 2010 Christopher R. Bilder 2.12 First Second Made Missed Total Made p11=0.7426 p12=0.1006 p1+=0.8432 Missed p21=0.1420 p22=0.0148 p2+=0.1568 Total p+1=0.8846 p+2=0.1154 1 For example, p11 = 251/338 = 0.7426 and p1+ = 285/338 = 0.8432. Make sure you can interpret the probabilities in the table! How are the contingency tables entered into R? Below is the code for one method. > #Create contingency table - notice the data is entered by # columns I, J > n.table <- array(data = c(251, 48, 34, 5), dim = c(2, 2), dimnames = list(First = c("made", "missed"), Second = c("made", "missed"))) > n.table Rows first Second First made missed made 251 34 missed 48 5 > n.table[1,1] [1] 251 > #Find the estimated proportions > p.table <- n.table/sum(n.table) > p.table Second First made missed made 0.7426036 0.1005917 2010 Christopher R. Bilder Notice how the division is performed on each element 2.13 missed 0.1420118 0.0147929 What if the data did not come in a contingency table format? Suppose the data is in its “raw” form: > all.data2 first second 1 missed missed 2 missed missed 3 missed missed 4 missed missed 5 missed missed 6 missed made 7 missed made 8 missed made 336 made made 337 made made 338 made made The above data is stored in a data.frame (it is constructed in bird.R). To find a contingency table for the data, use the table() or xtabs() functions. > #Find contingency table two different ways > bird.table1 <- table(all.data2$first, all.data2$second) > bird.table1 made missed made 251 34 missed 48 5 > bird.table1[1, 1] [1] 251 2010 Christopher R. Bilder 2.14 > bird.table2<-xtabs(formula = ~ first + second, data=all.data2) > bird.table2 second first made missed made 251 34 missed 48 5 > bird.table2[1,1] [1] 251 Note: For those of you with SAS experience, the corresponding output is similar to the output produced from PROC FREQ in SAS. Conditional probability distributions Often when one categorical variable is considered a “response” or “dependent” variable and another categorical variable is considered an “explanatory” or “independent” variable, we would like to look at the probability distribution for the response variable GIVEN the level of the explanatory variable. 
These can be examined through conditional probability distributions.

From STAT 218: Suppose two events are denoted by A and B. The conditional probability of A given that B happens is

P(A | B) = P(A and B) / P(B), provided P(B) > 0.

For example, A = Bird's 2nd free throw attempt outcome and B = Bird's 1st free throw attempt outcome.

For STAT 875, we can define conditional probabilities in the following way. Suppose Y (columns) is the response variable and X (rows) is the explanatory variable. Let πj|i = P(Y=j | X=i). Note that

πj|i = πij / πi+ = P(X=i and Y=j) / P(X=i).

The conditional probability distribution has J probabilities π1|i, π2|i, ..., πJ|i, with π1|i + π2|i + ... + πJ|i = 1 for i=1,...,I. Thus, one can think of each row of the contingency table as one conditional probability distribution.

Estimators for the conditional probabilities are pj|i = pij/pi+ = (nij/n) / (ni+/n) = nij/ni+.

Example: Larry Bird

                   Second
First         Made          Missed         Total
Made      p11=0.7426    p12=0.1006    p1+=0.8432
Missed    p21=0.1420    p22=0.0148    p2+=0.1568
Total     p+1=0.8846    p+2=0.1154         1

Given that Larry Bird misses the first free throw, what is the estimated probability that he will make the second?

P(2nd made | 1st missed) = π1|2

Be careful with the notation for this problem! The corresponding estimator is p1|2 = p21/p2+ = 0.1420/0.1568 = 0.9057. You can also find this using p1|2 = n21/n2+ = 48/53. Be careful to make sure you know which variable level is represented first and which is represented second in the subscript notation for p1|2.

Therefore, it is still very likely that Larry Bird will make the second free throw even if the first one is missed.

Question for basketball fans: Why would this probability be important to know?

If the first free throw result is thought of as an explanatory variable and the second free throw result is thought of as a response variable, we can find the following table of conditional probabilities:

                   Second
First          Made           Missed       Total
Made      p1|1=0.8807    p2|1=0.1193         1
Missed    p1|2=0.9057    p2|2=0.0943         1

Notice the estimated probability of making the second free throw is larger after (given) the first free throw is missed!

Independence

Suppose Y is a response variable and X is an explanatory variable. Also, suppose Y is independent of X. What is πj|i equal to?

Remember that πj|i = P(Y=j | X=i). Independence means that the probability of Y=j does not depend on the level of X. Therefore, the probability is the same for all levels of X; i.e.,

P(Y=j | X=i) = P(Y=j) for i=1,...,I and j=1,...,J
πj|i = π+j for i=1,...,I and j=1,...,J

This can be rewritten as

πj|1 = πj|2 = ... = πj|I for j=1,...,J

Thus, there is equality across rows for the conditional probability distributions.

When both categorical variables can be thought of as response variables, independence can be written without the use of conditional probability distributions. Statistical independence occurs if πij = πi+ π+j for i=1,...,I and j=1,...,J. Thus, πij is equal to the product of the corresponding marginal probabilities.

The equivalence of the two ways to write independence can be shown as follows:

πij = πi+ π+j            for i=1,...,I and j=1,...,J
πij/πi+ = πi+ π+j/πi+    for i=1,...,I and j=1,...,J
πj|i = π+j               for i=1,...,I and j=1,...,J

Example: Larry Bird

What does independence mean in this example? Do you think independence occurs?

Poisson, binomial, and multinomial sampling

How do counts in a contingency table come about with respect to probability distributions?
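Before taking up that question, here is an optional R sketch that recomputes the conditional distributions and informally checks the independence condition pij ≈ pi+ p+j for the Bird data. It assumes only the n.table object created earlier in bird.R; the other object names are made up for illustration and are not part of the course programs.

# Conditional distributions of Second given First: divide each row by its row total
row.cond <- prop.table(n.table, margin = 1)   # same as sweep(n.table, 1, rowSums(n.table), "/")
round(row.cond, 4)                            # should reproduce p1|1 = 0.8807 and p1|2 = 0.9057

# Joint sample proportions versus the products pi+ * p+j implied by independence
p.table <- n.table / sum(n.table)
p.indep <- outer(rowSums(p.table), colSums(p.table))   # pi+ * p+j for every cell
round(p.table - p.indep, 4)                   # small differences are consistent with independence

Using prop.table(n.table, margin = 2) instead would give the conditional distribution of the first attempt given the second.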
There are 4 ways where 3 are discussed here: 1) We can often treat each cell of an IJ contingency table as independent Poison random variables; i.e., nij ~ ind. Poisson(ij). Thus, nijij eij f(nij ) for nij = 0, 1, 2, … nij ! When use this distribution, we have Poisson sampling. The total sample size, n, is NOT fixed. 2) When n is fixed (or conditional on sample size), multinomial sampling occurs over all of the cells of the contingency table; i.e., (n11, n12, …, nIJ) ~ Multinomial(n, 11, 12, …, IJ). A random sample of size n from one multinomial distribution is taken and summarized by the sample counts in cells of the table. Note (n11, n12, …, nIJ) ~ Multinomial(n, 11, 12, …, IJ) could also be expressed as (n11, n12, …, n-ijnij) ~ Multinomial(n, 11, 12, …, 1-ijij) since nIJ = n-ijnij and IJ = 1-ijij. 2010 Christopher R. Bilder 2.20 3) Sometimes n1+, n2+,…, nI+ are fixed by the sampling design. For example in a clinical trial, there may be only 10 people available for the placebo group and 9 people available for the drug group. Also, suppose there are only two possible outcomes for the trial – cured and not-cured. In this case, we have binomial sampling within each row of the contingency table. This is often called “independent” binomial sampling since random variables are independent across the rows. When more than two outcomes are possible, say cured, partially cured, and not cured, then “independent multinomial sampling” occurs within each row of the contingency table. (n11, n12, …, n1J) ~ Multinomial(n1+, 1|1, 2|1, …, J|1), (n21, n22, …, n2J) ~ Multinomial(n2+, 1|2, 2|2, …, J|2), (nI1, nI2, …, nIJ) ~ Multinomial(nI+, 1|I, 2|I, …, J|I) Example: Independent binomial and multinomial sampling and just multinomial sampling. Suppose n1+=50 males and n2+=60 females are wanted for a study. These males and females are randomly selected from their individual populations. Suppose there are only 2 possible outcomes – cured 2010 Christopher R. Bilder 2.21 and not cured. This is an example of independent binomial sampling. Y Not Cured Cured X Male n11 n12 n1+ Female n21 n22 n2+ n+1 n+2 n Thus, n11~Binomial(n1+, 1|1) and n21~Binomial(n2+, 1|2) where n11 is independent of n21. Suppose n1+=50 males and n2+=60 females are wanted for a study. These males and females are randomly selected from their individual populations. Suppose there are now 3 possible outcomes – cured, partially cured, and not cured. This is an example of independent multinomial sampling. Y Partially Not Cured Cured Cured X Male n11 n12 n13 n1+ Female n21 n22 n23 n2+ n+1 n+2 n+3 n 2010 Christopher R. Bilder 2.22 Thus, (n11, n12, n13) ~ Multinomial(n1+, 1|1, 2|1, 3|1) and (n21, n22, n23) ~ Multinomial(n2+, 1|2, 2|2, 3|2) where the n1j’s are independent of the n2j’s. Suppose n=110 subjects are wanted for a study. Males and females are randomly selected from the one population. This is an example of multinomial sampling. The n1+ and n2+ are not fixed for this study. Y Partially Not Cured Cured Cured X Male n11 n12 n13 n1+ Female n21 n22 n23 n2+ n+1 n+2 n+3 n Thus, (n11, n12, n13, n21, n22, n23) ~ Multinomial(n, 11, 12, 13, 21, 22, 23) Instead of Male and Female, we could have drug and placebo groups. Typically, the number of subjects receiving the drug and the number receiving the placebo will be fixed. Thus, independent binomial or multinomial sampling will be used. You can kind of think of this as a Completely Randomized Design used in ANOVA where you fixed the number of people receiving each treatment. 2010 Christopher R. 
Bilder 2.23 What about Poisson sampling? Perhaps this could occur if the study allowed anyone who volunteered (with no upper limit) to participate in it. Notes: Although Poisson sampling may occur, n or ni+ are often conditioned upon. For the analyses to be examined in this book, we will usually get the same results no matter what types of sampling methods are used. You should think about how one can simulate observations in order to form a contingency table. See the p. 40-41 of Agresti (2002) for an additional example. 2010 Christopher R. Bilder 2.24 2010 Christopher R. Bilder 2.25 2.2 Comparing proportions in 22 contingency tables Difference of proportions or differences of probabilities Suppose we have the following 22 table Y 1 X 2 1 n11 n12 n1+ 2 n21 n22 n2+ n+1 n+2 n where n1+ and n2+ are FIXED. Thus, we have independent binomial sampling. Suppose Y=1 equates to a success and Y=2 equates to a failure. We can then write the table in terms of the conditional probability distributions. Y 1=success 2=failure X 1 1|1 2|1 1 2 1|2 2|2 1 The sample proportions or probabilities can also be written in this format. Note that Agresti writes the table as 2010 Christopher R. Bilder 2.26 Y 1=success 2=failure X 1 1 1-1 1 2 2 1-2 1 Example: Larry Bird Second Made Missed Made p1|1=0.8807 p2|1=0.1193 First Missed p1|2=0.9057 p2|2=0.0943 Total 1 1 Often of interest is determining if the probability of success is the same across the two levels of X. If the probabilities are equal, then 1|1-1|2=0. A confidence interval can be found to examine the differences of the proportions (or probabilities). Remember from Chapter 1 that the estimated proportion, p, can be treated as an approximate normal random variable with mean and variance (1 ) n for a large sample. Using the notation in this chapter, this means that p1|1 ~ N(1|1, 1|1(1-1|1)/n1+) and p1|2 ~ N(1|2, 1|2(1-1|2)/n2+) approximately 2010 Christopher R. Bilder 2.27 for large n1+ and n2+. Note that p1|1 and p1|2 are treated as random variables here, not the observed values in the last example. The statistic that estimates 1|1 - 1|2 is p1|1 - p1|2. The distribution can be approximated by N(1|1-1|2, 1|1(1-1|1)/n1+ + 1|2(1-1|2)/n2+) for large n1+ and n2+. Note: Var(p1|1 - p1|2) = Var(p1|1) + Var(p1|2) since p1|1 and p1|2 are independent random variables. Some of you may have seen the following: Let X and Y be independent random variables and let a and b be constants. Then Var(aX+bY) = a2Var(X) + b2Var(Y). Thus, an approximate (1-)100% confidence interval for 1|1-1|2 is Estimator (distributional value)(standard deviation of estimator) p1|1-p1|2Z1-/2 p1|1(1 p1|1) n1 p1|2 (1 p1|2 ) n2 Notice how p1|1 and p1|2 replace 1|1 and 1|2 in the standard deviation of the estimator. This is another example of a Wald confidence interval 2010 Christopher R. Bilder 2.28 Do you remember the problems with the Wald confidence interval in Chapter 1? Similar problems happen here. Agresti and Caffo (2000) recommend using the “add two successes and two failures” methods for an interval of ANY level of confidence. Let p1|2 n21 1 n 1 and p1|1 11 . n2 2 n1 2 The confidence interval is p1|1 p1|2 Z1 / 2 p1|1(1 p1|1) p1|2 (1 p1|2 ) n1 2 n2 2 Again, Agresti and Caffo do not change the adjustment for different confidence levels! Below are two plots from the paper comparing the Agresti and Caffo interval to the Wald interval (similar to p. 1.45). The solid line denotes the Agresti and Caffo interval. The y-axis shows the true confidence level (coverage) of the confidence intervals. 
The x-axis shows various values of 1|1 where 1|2 is fixed at 0.3. 2010 Christopher R. Bilder 2.29 To find the estimated true confidence level, 10,000 samples from a binomial probability distribution with 1|2=0.3 and 10,000 samples from a binomial probability distribution with 1|1=x-axis value. The sample size is given on the bottom of the plot. For each of the 10,000 samples from binomial #1 and binomial #2, the confidence interval is calculated. The proportion of time that 1|1-0.3 is inside the interval is calculated as the “true confidence level”. In the plots, p1 represents our 1|1, and p2 represents our 1|2. 2010 Christopher R. Bilder 2.30 2010 Christopher R. Bilder 2.31 For the plots below, the value of 1|1 was no longer fixed. The Agresti and Caffo interval tends to be much better than the Wald interval. 2010 Christopher R. Bilder 2.32 Note that other confidence intervals can be done. Agresti and Caffo’s (2000) objective was to present a “better” than the Wald interval which could be used in elementary statistics courses. See Newcombe (Statistics in Medicine, 1998, p. 857-872) for other intervals. Example: Larry Bird (bird.R) Find a (1-)100% confidence interval for 1|1-1|2; i.e., P(2nd made | 1st made) – P(2nd made | 1st missed). 95% Wald confidence interval: -0.1122 1|1 - 1|2 0.0623 95% Agresti-Caffo confidence interval: -0.1022 1|1 - 1|2 0.0764 There is not sufficient evidence to indicate a difference in the proportions. What does this mean in terms of the original problem? R code and output: > #Confidence interval for difference of proportions > alpha <- 0.05 > p.1.1 <- p.table[1, 1]/sum(p.table[1, ]) > p.1.2 <- p.table[2, 1]/sum(p.table[2, ]) > p.1.1 [1] 0.8807018 2010 Christopher R. Bilder 2.33 > p.1.2 [1] 0.9056604 > #Wald > lower <- p.1.1 - p.1.2 - qnorm(1 - alpha/2) * sqrt((p.1.1*(1-p.1.1))/sum(n.table[1,]) + (p.1.2*(1p.1.2))/sum(n.table[2,])) > upper <- p.1.1 - p.1.2 + qnorm(1 - alpha/2) * sqrt((p.1.1*(1-p.1.1))/sum(n.table[1,]) + (p.1.2*(1p.1.2))/sum(n.table[2,])) > cat("The Wald C.I. is:", round(lower, 4), "<= pi.1.1pi.1.2 <=", round(upper, 4)) The Wald C.I. is: -0.1122 <= pi.1.1-pi.1.2 <= 0.0623 > > > > #Agresti-Caffo p.1.1<-(n.table[1,1]+1)/(sum(n.table[1,])+2) p.1.2<-(n.table[2,1]+1)/(sum(n.table[2,])+2) lower<-p.1.1-p.1.2-qnorm(1-alpha/2)* sqrt(p.1.1*(1-p.1.1)/(sum(n.table[1,])+2) + p.1.2*(1-p.1.2)/(sum(n.table[2,])+2)) > upper<-p.1.1-p.1.2+qnorm(1-alpha/2)* sqrt(p.1.1*(1-p.1.1)/(sum(n.table[1,])+2) + p.1.2*(1-p.1.2)/(sum(n.table[2,])+2)) > cat("The Agresti-Caffo interval is:", round(lower,4) , "<= pi.1.1-pi.1.2 <=", round(upper,4)) The Agresti-Caffo interval is: -0.1035 <= pi.1.1-pi.1.2 <= 0.0778 Agresti provides code for these and a few other intervals for the difference of two proportions and other measures at www.stat.ufl.edu/~aa/cda/R/two_sample/R2/index.html Relative risk Suppose there is independent binomial sampling. 2010 Christopher R. Bilder 2.34 The ratio of two probabilities may be more meaningful than their difference when the proportions are close to 0 or 1 than 0.5. Consider two cases examining the probabilities of people who experience adverse reactions to a drug (1) or a placebo (2): Adverse reactions Yes No Total Drug 1|1=0.510 2|1=0.490 1 Placebo 1|2=0.501 2|2=0.499 1 1|1 - 1|2 = 0.510 – 0.501 = 0.009 Adverse reactions Yes No Total Drug 1|1=0.010 2|1=0.990 1 Placebo 1|2=0.001 2|2=0.999 1 1|1 - 1|2 = 0.010 – 0.001 = 0.009 In both cases, the difference in proportions is the same. 
However in the second case, it is 10 times more likely to experience an adverse reaction by taking the drug! The relative risk is the ratio of two probabilities. In the above example (2nd case), it is 1|1/1|2=0.010/0.001 = 10. Consider the table below. 2010 Christopher R. Bilder 2.35 Y 1=success 2=failure X 1 1|1 2|1 1 2 1|2 2|2 1 General interpretation: A Y=1 (success) is 1|1/1|2 times more likely when X=1 rather than when X=2. Typically, it is easier to interpret this quantity when the relative risk is greater than 1. Thus, you may want to invert the ratio. Of course, “invert” your interpretation as well!!! The sample version of the relative risk is the ratio of two sample conditional probabilities. Questions: What does a relative risk of 1 mean? What is the range of the relative risk? One version of an approximate (1-)100% confidence interval is p1|1 1 p1|1 1 p1|2 exp log Z1 / 2 p n p n p 1|2 1 1|1 2 1|2 for large n1+ and n2+ (see #2.15). This is a Wald confidence interval. The estimated standard deviation 2010 Christopher R. Bilder 2.36 used in the formula is derived using the “delta method” (see Chapter 14 of Agresti (2002) for a nice introduction). Example: Larry Bird (bird.R) First Second Made Missed Made p1|1=0.8807 p2|1=0.1193 Missed p1|2=0.9057 p2|2=0.0943 Total 1 1 p1|1/p1|2 = 0.8807/0.9057 = 0.9724 If the relative risk is inverted: p1|2/p1|1 = 0.9057/0.8807 = 1.0284. Thus, a successful second free throw is estimated to be 1.0284 times more likely to occur when the first free throw is missed rather than made. R code and output: > #################################################### #Relative risk > p.1.1 <- p.table[1,1]/sum(p.table[1,]) > n.1 <- sum(n.table[1,]) > p.1.2 <- p.table[2,1]/sum(p.table[2,]) > n.2 <- sum(n.table[2,]) > cat("The sample relative risk is", p.1.1/p.1.2, "\n \n") The sample relative risk is 0.9724415 > alpha <- 0.05 > lower <- exp(log(p.1.1/p.1.2) - qnorm(1 - alpha/2) * sqrt((1-p.1.1)/(n.1*p.1.1) + (1- p.1.2)/(n.2*p.1.2))) 2010 Christopher R. Bilder 2.37 > upper <- exp(log(p.1.1/p.1.2) + qnorm(1 - alpha/2) * sqrt((1-p.1.1)/(n.1*p.1.1) + (1- p.1.2)/(n.2*p.1.2))) > cat("The Wald interval for RR is:", round(lower, 4), "<= pi.1.1/pi.1.2 <=", round(upper, 4)) The Wald interval for RR is: 0.8827 <= pi.1.1/pi.1.2 <= 1.0713 > #Invert > cat("The Wald interval for RR is:", round(1/upper, 4), "<= pi.1.2/pi.1.1 <=", round(1/lower, 4)) The Wald interval for RR is: 0.9334 <= pi.1.2/pi.1.1 <= 1.1329 Standard interpretation: I am approximately 95% confident that a second FT success is between 0.9334 and 1.1329 times more likely when the first FT is missed rather than made. What else could be said here if one wanted to do a hypothesis of Ho: 1|1/1|2 = 1 vs. Ho: 1|1/1|2 ≠ 1 What if the interval was 21|1/1|24? 2010 Christopher R. Bilder 2.38 2.3 The odds ratio (OR) Suppose there is independent binomial sampling with the following set of conditional probabilities: Y 1=success 2=failure X 1 1|1 2|1 1 2 1|2 2|2 1 For row 1, the “odds of a success” is odds1 = 1|1/(1-1|1) = 1|1/2|1. For row 2, the “odds of a success” is odds2 =1|2/(1-1|2) = 1|2/2|2. In general, the odds of a success are P(success)/P(failure). Notice that the odds are just a rescaling of the P(success)! For example, if P(success) = 0.75, then the odds are 3 or “3 to 1 odds”. The odds of a success are three times larger than for a failure. The estimated odds are: odds1 p1|1 p2|1 n odds2 21 n22 p11 / p1 p11 n11 / n n11 and p12 / p1 p12 n12 / n n12 2010 Christopher R. 
Notice what cells these correspond to in the contingency table.

             Y
X          1      2
1        n11    n12    n1+
2        n21    n22    n2+
         n+1    n+2     n

Questions: What is the range of an odds? What does it mean for an odds to be 1?

To incorporate information from both rows 1 and 2 into a single number, the ratio of the two odds is found. This is called an "odds ratio". Formally, it is defined as

θ = odds1/odds2 = [π1|1/(1−π1|1)] / [π1|2/(1−π1|2)] = (π1|1/π2|1) / (π1|2/π2|2) = π1|1 π2|2 / (π1|2 π2|1)

"Odds ratio" is often abbreviated by "OR". ORs are VERY useful in categorical data analysis and will be used throughout this book! ORs measure how much greater the odds of success are for one level of X than for another level of X.

Questions: What is the range of an OR? What does it mean for an OR to be 1? What does it mean for an OR > 1? What does it mean for an OR < 1?

The OR can be estimated by

θ̂ = oddŝ1/oddŝ2 = [p1|1/(1−p1|1)] / [p1|2/(1−p1|2)] = p1|1 p2|2 / (p1|2 p2|1) = [(n11/n1+)(n22/n2+)] / [(n21/n2+)(n12/n1+)] = n11 n22 / (n21 n12)

This is the maximum likelihood estimate of θ ("invariance property" of maximum likelihood estimators).

Notice how the OR is not dependent on a particular variable being called a "response" variable. If the roles of Y and X were switched, we would get the same OR! This is not true for the relative risk (try it yourself). If there was multinomial sampling for the entire table, one could just condition on the rows to obtain the same OR. Also, note that

π11 π22 / (π12 π21) = [(π11/π1+)(π22/π2+)] / [(π12/π1+)(π21/π2+)] = π1|1 π2|2 / (π1|2 π2|1),

which is the same OR as before. Also,

p11 p22 / (p12 p21) = [(n11/n)(n22/n)] / [(n12/n)(n21/n)] = n11 n22 / (n12 n21)

is the same estimated odds ratio as before.

Interpretation of the OR:
The odds of Y=1 (success) are θ times larger when X=1 than when X=2.
The odds of X=1 are θ times larger when Y=1 than when Y=2.

When θ < 1, we will often want to invert the OR. Below is how the interpretations could change:
The odds of Y=1 (success) are 1/θ times larger when X=2 than when X=1, since
1/θ = odds2/odds1 = [π1|2/(1−π1|2)] / [π1|1/(1−π1|1)] = π1|2 π2|1 / (π1|1 π2|2).
The odds of X=1 are 1/θ times larger when Y=2 than when Y=1.

Also, the interpretations could change to:
The odds of Y=2 are 1/θ times larger when X=1 than when X=2, since
(odds of failure for row #1) / (odds of failure for row #2) = (π2|1/π1|1) / (π2|2/π1|2) = π2|1 π1|2 / (π2|2 π1|1) = 1/θ.
The odds of X=2 are 1/θ times larger when Y=1 than when Y=2.

The table below is used a lot for the rearrangement of terms above.

              Y
X        1=success   2=failure
1          π1|1        π2|1       1
2          π1|2        π2|2       1

Work through these on your own to make sure you can show these relationships. You will need to become very comfortable with inverting an OR!

Confidence interval for θ

Since θ̂ is a maximum likelihood estimate, we can use the "usual" properties of maximum likelihood estimators to find the confidence interval. However, using log(θ̂) often works better (i.e., its distribution is closer to being a normal distribution). It can be shown that:
log(θ̂) has an approximate normal distribution with mean log(θ) for large n.
The "asymptotic" (for large n) standard deviation of log(θ̂) is sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22). This is derived using the "delta method" (see Chapter 14 of Agresti (2002) for a nice introduction).

The approximate (1−α)100% confidence interval for log(θ) is

log(θ̂) ± Z1−α/2 sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)

The approximate (1−α)100% confidence interval for θ is

exp[ log(θ̂) ± Z1−α/2 sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22) ]

Lui and Lin (Biometrical Journal, 2003, p. 231) show this interval is conservative.
What does “conservative” mean? 2010 Christopher R. Bilder 2.44 Problems with small cell counts n n What happens to ˆ 11 22 if nij=0 for some i, j? n21n12 When there is a 0 or small cell count, the OR estimator is changed a little to help prevent problems. The OR estimator is (n11 0.5)(n22 0.5) (n21 0.5)(n12 0.5) Thus, 0.5 is added to each cell count. The “asymptotic” standard deviation of log( ) is then 1 1 1 1 n11 0.5 n12 0.5 n21 0.5 n22 0.5 and the confidence interval for can be found. Sometimes, a small number is just added to a cell with a 0 count instead. Example: Larry Bird (bird.R) 2010 Christopher R. Bilder 2.45 Second Made Missed Total Made n11=251 n12=34 n1+=285 First Missed n21=48 n22=5 n2+=53 Total n+1=299 n+2=39 n=338 ˆ n11n22 251 5 0.7690 . n21n12 48 34 Interpretation: The estimated odds of a made second free throw attempt are 0.7690 times larger when the first free throw is made than when the first free throw is missed. The estimated odds of a made first free throw attempt are 0.7690 times larger when the second free throw is made than when the second free throw is missed. Note that this does not necessarily make sense to examine for this problem. Often when the OR<1, the OR is inverted and the interpretation is changed. Therefore, the estimated odds of a made second free throw attempt are 1/0.7690=1.3004 times larger when the first free throw is missed than when the first free throw is made. The approximate 95% confidence interval for is 0.2862 2.0659. If the interval is inverted, the approximate 95% confidence interval for 1/ is 0.4841 1/ 3.4935. 2010 Christopher R. Bilder 2.46 The interpretation can be extended to be: With approximately 95% confidence, the odds of a made second free throw attempt are between 0.4841 and 3.4935 times larger when the first free throw is missed than when the first free throw is made. Since 1 is in the interval, there is not sufficient evidence to indicate that the first free throw result has an effect on the second free throw result. R code and output: > #################################################### > #OR > theta.hat <- (n.table[1,1] * n.table[2,2]) / (n.table[1,2] * n.table[2,1]) > theta.hat [1] 0.7689951 > 1/theta.hat [1] 1.300398 > alpha <- 0.05 > lower <- exp(log(theta.hat) - qnorm(1 - alpha/2) * sqrt(1/n.table[1,1] + 1/n.table[2,2] + 1/n.table[1,2] + 1/n.table[2,1]) > upper <- exp(log(theta.hat) + qnorm(1 - alpha/2) * sqrt(1/n.table[1,1] + 1/n.table[2,2] + 1/n.table[1,2] + 1/n.table[2,1])) > cat("The Wald interval for OR is:", round(lower, 4), "<= theta <=", round(upper, 4)) The Wald interval for OR is: 0.2862 <= theta <= 2.0659 > #Invert 2010 Christopher R. Bilder 2.47 > cat("The Wald interval for OR is:", round(1/upper, 4),“<= 1/theta <=", round(1/lower, 4)) The Wald interval for OR is: 0.4841 <= 1/theta <= 3.4935 Be careful with the inverted OR. I could have put “the Wald interval for 1/OR is:…”. Please note that it is incorrect to replace the word “odds” with “probability”. Also, a statement such as “it is 1.3 times more likely the second free throw is made when the first free throw is missed rather than made.” The word “likely” means probabilities are being compared. 
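Since the same OR arithmetic is repeated for the Salk vaccine data next, here is an optional sketch of a small helper function that bundles the estimate and Wald interval, with the 0.5 adjustment discussed above available as an option. The function name or.wald and its arguments are made up for illustration; they are not part of bird.R or polio.R.

# Hypothetical helper: Wald confidence interval for the OR of a 2x2 table of counts.
or.wald <- function(x, alpha = 0.05, add.half = FALSE) {
  if (add.half) { x <- x + 0.5 }                       # optional adjustment for zero/small cells
  theta.hat <- (x[1,1] * x[2,2]) / (x[1,2] * x[2,1])   # estimated odds ratio
  log.sd    <- sqrt(sum(1/x))                          # sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
  ci <- exp(log(theta.hat) + c(-1, 1) * qnorm(1 - alpha/2) * log.sd)
  list(theta.hat = theta.hat, lower = ci[1], upper = ci[2])
}

# or.wald(n.table) should reproduce the Bird results above: 0.7690 and (0.2862, 2.0659).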
Example: Salk vaccine clinical trials (polio.R) Vaccine Placebo Polio 57 142 Polio free 200,688 201,087 Total 200,745 201,229 R code and output: > n.table<-array(data = c(57, 142, 200688, 201087), dim = c(2,2), dimnames=list(Trt = c("vaccine", "placebo"), Result = c("polio", "polio free"))) > n.table Result Trt polio polio free vaccine 57 200688 placebo 142 201087 > theta.hat <- (n.table[1,1] * n.table[2,2]) / (n.table[1,2] * n.table[2,1]) 2010 Christopher R. Bilder 2.48 > theta.hat [1] 0.4022065 > 1/theta.hat [1] 2.486285 > alpha <- 0.05 > lower <- exp(log(theta.hat) - qnorm(1 - alpha/2) * sqrt(1/n.table[1,1] + 1/n.table[2,2] + 1/n.table[ 1,2] + 1/n.table[2,1])) > upper <- exp(log(theta.hat) + qnorm(1 - alpha/2) * sqrt(1/n.table[1,1] + 1/n.table[2,2] + 1/n.table[ 1,2] + 1/n.table[2,1])) > cat("The Wald interval for OR is:", round(lower, 4), "<= theta <=", round(upper, 4)) The Wald interval for OR is: 0.2958 <= theta <= 0.5469 > #Invert cat("The Wald interval for 1/OR is:", round(1/upper, 4), "<= 1/theta <=", round(1/lower, 4)) The Wald interval for OR is: 1.8283 <= 1/theta <= 3.381 The estimated odds of getting polio are 0.4022 times higher when the vaccine is given instead of a placebo. If this OR is inverted, a more meaningful interpretation results: The estimated odds of getting polio are 2.4863 times higher when the placebo is given instead of the vaccine. With approximately 95% confidence, the odds of getting polio are between 1.8283 and 3.3810 times higher when the placebo is given instead of the vaccine. 2010 Christopher R. Bilder 2.49 The odds ratio interpretation could also be written as: The estimated odds of not getting polio are 2.4863 times higher when the vaccine is given instead of the placebo. Would you want to receive the vaccine? ORs can be calculated for larger contingency tables. For example, suppose the table below is of interest. Y X 1 2 3 1 n11 n12 n13 n1+ 2 n21 n22 n23 n2+ 3 n31 n32 n33 n3+ n+1 n+2 n+3 n Many ORs could be calculated here. For example, n n The estimated odds of Y=1 vs. Y=2 are ˆ 11 22 times n21n12 larger when X=1 than when X=2. Also, the estimated n n odds of X=1 vs. X=2 are ˆ 11 22 times larger when n21n12 Y=1 than when Y=2. n11n32 ˆ The estimated odds of Y=1 vs. Y=2 are times n31n12 larger when X=1 than when X=3. 2010 Christopher R. Bilder 2.50 n n The estimated odds of Y=1 vs. Y=3 are ˆ 11 33 times n31n13 larger when X=1 than when X=3. n n The estimated odds of Y=2 vs. Y=3 are ˆ 12 33 times n32n13 larger when X=1 than when X=3. Notice how each sentence has something like “Y=1 vs. Y=2”. This is needed since we need to know which levels are being compared. Before when there was just two, we could just say “Y=1” since this implies it is being compared to the only other level. Notes: One could write the odds ratio in terms of the expected 1122 cell counts, ij, as for a 22 table. 1221 Read on your own Section 2.3.4 (Relationship between the OR and the relative risk), Section 2.3.5 (The odds ratio applies in case-control studies) and Section 2.3.5 (Types of observational studies). The Chapter 2 extra notes for the following contains an old test problem (responsible for) and other measures of association in a contingency table (not responsible for). 2010 Christopher R. Bilder 2.51 2.4 Chi-squared tests of independence We will be doing a variety of different hypothesis tests involving contingency tables. In order to do these hypothesis tests, we will need to find the expected cell counts under a hypothesis. These expected cell counts are denoted by ij. 
Agresti's (2007) notation here is not necessarily the best to use for all situations. It may be more appropriate to use something like μij(0) to denote the expected value under a null hypothesis (Ho).

For example, the observed cell count for row i and column j of a contingency table is nij. Remember that nij is a random variable. The expected value of nij under a particular hypothesis is E(nij) = μij. Note that μ̂ij = nij if there are no restrictions upon what μij can be.

Suppose we assume multinomial sampling (n is fixed). A common hypothesis test is a test for independence:

Ho: πij = πi+ π+j for i=1,...,I and j=1,...,J
Ha: Not all equal

Under the null hypothesis independence restriction, E(nij) = μij = n πi+ π+j. Under Ho or Ha (no restriction), E(nij) = n πij.

Make sure you understand why μij = n πi+ π+j under Ho!

Pearson statistic

The Pearson chi-squared statistic is

X² = Σ(i=1 to I) Σ(j=1 to J) (nij − μij)² / μij

Notes:
The numerator measures how far apart the expected value under Ho and the observed cell count are from each other. Think of this as a squared residual.
The denominator helps account for the scale of the cell count.
The larger (nij − μij)²/μij is, the more evidence there is that the null hypothesis is incorrect.
Large values of X² indicate the null hypothesis is incorrect. For large n, X² has an approximate χ² distribution with a particular number of degrees of freedom. The degrees of freedom depend on the hypotheses being tested. This is a right-tail test.
Typical recommendations for a "large n" involve μij ≥ 5 (or nij ≥ 5).
Remember that with nij ~ Poisson(μij), (nij − μij)/sqrt(μij) is an approximate standard normal value. Thus, (nij − μij)²/μij is an approximate χ²(1) value.
See Section 24 of Ferguson (1996) for general uses of the Pearson statistic.

Suppose we assume multinomial sampling (n is fixed). When a test for independence is done, the hypotheses are:

Ho: πij = πi+ π+j for i=1,...,I and j=1,...,J
Ha: Not all equal

The Pearson statistic has n πi+ π+j substituted for μij:

X² = Σi Σj (nij − n πi+ π+j)² / (n πi+ π+j)

Problem: Notice the parameter values are in the statistic! Thus, this statistic is difficult to calculate. To solve the problem, the corresponding estimators replace the parameters. The expected cell count under independence is estimated by

μ̂ij = n pi+ p+j = n (ni+/n)(n+j/n) = ni+ n+j / n.

The statistic becomes

X² = Σi Σj (nij − μ̂ij)² / μ̂ij = Σi Σj (nij − ni+ n+j/n)² / (ni+ n+j/n)

For large n, this statistic has an approximate χ² distribution with (I−1)(J−1) degrees of freedom under Ho. The distribution can be denoted symbolically as χ²((I−1)(J−1)).

Where does the (I−1)(J−1) degrees of freedom come from? In general, the degrees of freedom can be calculated as:

[# of parameters under Ha − # of restrictions under Ha] − [# of parameters under Ho − # of restrictions under Ho]
= [# of free parameters under Ha] − [# of free parameters under Ho]

For a test of independence, the number of free parameters under Ha is IJ − 1. Reason: There are IJ πij parameters. There is one restriction since Σi Σj πij = 1.

For a test of independence, the number of free parameters under Ho is I + J − 2. Reason: There are I πi+ parameters and J π+j parameters. There are two restrictions since Σi πi+ = 1 and Σj π+j = 1.

Thus, [IJ − 1] − [I + J − 2] = IJ − I − J + 1 = (I−1)(J−1).
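The Larry Bird example that follows carries out this calculation both by hand and with chisq.test(). As an optional supplement, here is a sketch of how μ̂ij and X² might be computed directly from any I×J table of counts; the function name pearson.X2 is made up for illustration and is not part of the course programs.

# Optional sketch: Pearson X^2 for an IxJ table of counts (e.g., tab <- n.table from bird.R).
pearson.X2 <- function(tab) {
  n      <- sum(tab)
  mu.hat <- outer(rowSums(tab), colSums(tab)) / n    # estimated expected counts n_i+ n_+j / n
  X2     <- sum((tab - mu.hat)^2 / mu.hat)           # Pearson statistic
  df     <- (nrow(tab) - 1) * (ncol(tab) - 1)        # (I-1)(J-1) degrees of freedom
  c(X2 = X2, df = df, p.value = 1 - pchisq(X2, df))
}

# pearson.X2(n.table) should agree with chisq.test(n.table, correct = FALSE),
# giving X^2 = 0.2727 for the Bird data.

Writing the statistic out this way makes the role of μ̂ij = ni+ n+j/n explicit.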
Example: Larry Bird (bird.R) Second Made Missed Total Made n11=251 n12=34 n1+=285 First Missed n21=48 n22=5 n2+=53 Total n+1=299 n+2=39 n=338 Second Made First Missed Made 285 299 ˆ 11 338 252.11 53 299 ˆ 21 338 46.88 2010 Christopher R. Bilder Missed 285 39 ˆ 12 338 32.88 53 39 ˆ 22 338 6.11 2.56 2 2 I J (n ) (n n n / n) ˆ ij ij ij i j X2 ˆ ij ni n j / n i1j1 i1j1 (n11 ˆ 11)2 (n12 ˆ 12 )2 (n21 ˆ 21)2 (n22 ˆ 22 )2 ˆ 11 ˆ 12 ˆ 21 ˆ 22 I J (251 252.11)2 (34 32.88)2 ( 48 46.88)2 (5 6.11)2 252.11 32.88 46.88 6.11 = 0.0049 + 0.0382 + 0.0268 + 0.2017 = 0.2716 2 2 The critical value at =0.05 is 0.95,(2 1)(21) = 0.95,1 = 3.84. The p-value for the test is 0.6015. Thus, there is not sufficient evidence to reject independence. Of course, this does not mean that the first and second attempts ARE independent! 2010 Christopher R. Bilder 2.57 1.0 0.5 Chi-square f(x) 1.5 2 1 0 1 2 3 4 5 x par(xaxs = "i", yaxs = "i") #Removes extra space on x and y-axis curve(expr = dchisq(x, df=1), col = "red", xlim = c(0,5), ylab = "Chi-square f(x)", main = expression(chi[1]^2)) Note that executing demo(plotmath) at the command prompt shows more of what you can do for plotting mathematical symbols. Below is the R code and output. > ind.test<-chisq.test(n.table, correct=F) > names(ind.test) 2010 Christopher R. Bilder 2.58 [1] "statistic" "parameter" "p.value" "data.name" "observed" [7] "expected" "residuals" > ind.test "method" Pearson's Chi-squared test data: n.table X-squared = 0.2727, df = 1, p-value = 0.6015 > #just p-value > ind.test$p.value [1] 0.6015021 > ind.test$expected Second First made missed made 252.11538 32.884615 missed 46.88462 6.115385 > #Another way using the raw data > chisq.test(x = all.data2$first, y = all.data2$second, correct=F) Pearson's Chi-squared test data: all.data2$first and all.data2$second X-squared = 0.2727, df = 1, p-value = 0.6015 > #critical value > qchisq(p = 0.95, df = 1) [1] 3.841459 > 1 - pchisq(q = ind.test$statistic, df = 1) X-squared 0.6015021 > #Two more ways! > bird.table2<-xtabs(formula = ~ first + second, data=all.data2) > summary(bird.table2) Call: xtabs(formula = ~first + second, data = all.data2) Number of cases in table: 338 Number of factors: 2 Test for independence of all factors: Chisq = 0.27274, df = 1, p-value = 0.6015 2010 Christopher R. Bilder 2.59 > bird.table3<-table(all.data2$first, all.data2$second) > summary(bird.table3) Number of cases in table: 338 Number of factors: 2 Test for independence of all factors: Chisq = 0.27274, df = 1, p-value = 0.6015 Notes: When the sample size is small, a 2 approximation to the distribution of X2 may not do a good job. The Yates’ continuity correction can be used to allow for a better approximation. With the correction, the Pearson statistic becomes: X 2 nij ni n j / n 0.5 2 I J i1 j1 ni n j / n You can produce this statistic with the chisq.test() function by using the correct=TRUE option. We will discuss other alternatives later for when the sample size is small. Here is a quote from Agresti (1996, p.43), regarding the use of the correction: There is no longer any reason to use this approximation, however, since modern software makes it possible to conduct Fisher’s exact test for fairly large samples… 2010 Christopher R. Bilder 2.60 The Pearson statistic can also be derived from the point of view of having independent multinomial sampling (ni+ fixed – each row of the contingency table represents a population). 
Instead of testing for independence as stated previously, equality of the j|i across the rows for each j=1,…,J is tested. Stated formally, the hypotheses are Ho:j|1=…=j|I for j=1,…,J vs. Ha: At least one The hypotheses here are equivalent to the independence hypotheses (see p. 2.17 – 2.18). The Pearson test statistic and its asymptotic distribution are also the same. Some books go into detail explaining the differences and how they end up being equivalent. See Chapter 2 of Christensen (1990) if you are interested. Likelihood ratio test (LRT) statistic From Chapter 1 notes: The LRT statistic, , is the ratio of two likelihood functions. The numerator is the likelihood function maximized over the parameter space restricted under the null hypothesis. The denominator is the likelihood function maximized over the unrestricted parameter space. The test statistic is written as: Max. lik. when parameters satisfy Ho Max. lik. when parameters satisfy Ho or Ha 2010 Christopher R. Bilder 2.61 Note that the ratio is between 0 and 1 since the numerator can not exceed the denominator. Questions: Why can’t the numerator exceed the denominator? What does it mean when the ratio is close to 1? What does it mean when the ratio is close to 0? The actual test statistic used for a LRT is –2log(). The reason is because this statistic has an approximate 2 distribution for large n. The degrees of freedom are found the same way as for the Pearson statistic. Assuming multinomial sampling, –2log() becomes nij G 2 nij log ij i1j1 2 I J where ij is restricted under the null hypothesis. Note that ij under Ho or Ha ends up being just nij. The G2 notation is used throughout this book and by many other authors to denote this statistic. Questions: What happens if nijij? What could produce a large value of G2? 2010 Christopher R. Bilder 2.62 The Pearson and G2 will often yield the same conclusions, but rarely the exact same statistic values. Each will always have the same large sample (asymptotic) distribution under the null hypothesis. Suppose we assume multinomial sampling (n is fixed). When a test for independence is done, the hypotheses are: Ho: ij=i++j for i=1,…,I and j=1,…,J Ha: Not all equal G2 has ni++j substituted for ij: I J nij 2 G 2 nij log n i1j1 i j Problems: 1) What if nij=0? Often, 0.5 or some other small constant is added to the cell. 2) Notice the parameter values in G2! Thus, this statistic is difficult to calculate. To solve the problem, the corresponding estimators replace the parameters. The expected cell count under independence is estimated by ni n j ni n j ˆ ij npi p j n . n n n The statistic becomes 2010 Christopher R. Bilder 2.63 nij G 2 nij log ˆ ij i1 j1 I J 2 nij 2 nij log n n / n i1 j1 i j I J For large n, this statistic has an approximate (I21)(J1) distribution. Example: Larry Bird (bird.R) From the last example, Second Made Missed Total Made n11=251 n12=34 n1+=285 First Missed n21=48 n22=5 n2+=53 Total n+1=299 n+2=39 n=338 Second Made Missed Made ̂11 252.11 ̂12 32.88 First Missed ̂21 46.88 ̂22 6.11 nij G 2 nij log ˆ ij i1j1 2 2 2 = 2 251log 251 34 48 5 34log 48log 5log 252.11 32.88 46.88 6.11 2010 Christopher R. Bilder 2.64 = 0.2858 The p-value is 0.5930. Thus, there is not sufficient evidence to reject independence. Remember the p-value from using the Pearson statistic was 0.6015. For a small contingency table like this, you may have to do the calculations by hand on a test. Below is how the test can be done a few different ways in R. 
> library(vcd) Loading required package: MASS Attaching package 'vcd': The following object(s) are masked from package:graphics : barplot.default fourfoldplot mosaicplot The following object(s) are masked from package:base : print.summary.table summary.table > assocstats(n.table) X^2 df P(> X^2) Likelihood Ratio 0.28575 1 0.59296 Pearson 0.27274 1 0.60150 Phi-Coefficient : 0.028 Contingency Coeff.: 0.028 Cramer's V : 0.028 The package, vcd, contains a function assoc.stats() which can calculate the LRT statistic and p-value. This package is not installed by default with R. You 2010 Christopher R. Bilder 2.65 can install the package by selecting PACKAGES > INSTALL PACKAGE(S) FROM CRAN. Select the vcd package from the list and select OK. R may ask if you want to delete the installation files. You can type “Y” for deletion. In order to load the package (make ready for use) in any R session, use the library(vcd) code. This must be done before using any functions within the package. See the Chapter 2 additional notes for how you can program the statistic itself into R. 2010 Christopher R. Bilder 2.66 Large n The 2 (I1)(J1) distributional approximations for X2 and G2 both rely on a “large n” for them to work. Below is a quote from Agresti (1990, p.49) that describes the approximation in more detail: It is not simple to describe the sample size needed for the chi-squared distribution to approximate well the exact distribution of X2 and G2. For a fixed number of cells, X2 usually converges more quickly than G2. The chi-squared approximation is usually poor for G2 when n/IJ<5. When I or J is large, it can be decent for X2 for n/IJ as small as 1, if the table does not contain both very small and moderately large expected frequencies. P. 395-6 of Agresti (2002) contains similar information. Example: Salk vaccine clinical trials (polio.R) Vaccine Placebo Polio 57 142 Polio free 200,688 201,087 Total 200,745 201,229 # Test for independence - Pearson chi-square 2010 Christopher R. Bilder 2.67 > ind.test <- chisq.test(n.table, correct = F) > ind.test Pearson's chi-square test without Yates' continuity correction data: n.table X-square = 36.1201, df = 1, p-value = 0 #critical value > qchisq(p = 0.95, df = 1) [1] 3.841459 > 1 - pchisq(q = ind.test$statistic, df = 1) X-square 1.855266e-009 > ind.test$expected Result Trt polio polio free vaccine 99.3802 200645.6 placebo 99.6198 201129.4 ##################################################### # Test for independence – LRT > library(vcd) > assocstats(n.table) X^2 df P(> X^2) Likelihood Ratio 37.313 1 1.0059e-09 Pearson 36.120 1 1.8553e-09 Phi-Coefficient : 0.009 Contingency Coeff.: 0.009 Cramer's V : 0.009 There is evidence against the independence of the treatment and polio result. 2010 Christopher R. Bilder 2.68 Suppose subjects can pick more than one X and Y response. Below is an example of where this can happen: In this case, farmers can choose more than one type of swine waste storage method and more than one type of source of veterinary information. The previous methods for testing independence assume a subject (farmer here) is represented only once in the table. Therefore, they can not be used. As part of my research, I have derived a few different testing approaches for this. See Bilder and Loughin (Biometrics, 2004) for more information. Residuals Suppose the hypothesis of independence is rejected. The next step would be to determine WHY it was rejected. Summary measures like an OR can help determine what type of dependence exists. 
Cell residuals can also help determine where independence is a bad “fit”. 2010 Christopher R. Bilder 2.69 Cell deviations: nij- ̂ij - hard to interpret because of the size of the counts 2 (n ) ˆ ij ij Cell 2: - can be “roughly” treated as 12 ˆ ij (nij ˆ ij ) Pearson residual: - this is just the square root ˆ ij of the cell 2; it can be treated “roughly” as a N(0,1); use 2 or 3 as “general” guidelines to help determine what cells are “outlying” or indicate evidence against independence (nij ˆ ij ) Standardized residual: for a test of ˆ ij (1 pi )(1 p j ) independence. Note that the denominator is Var(nij ˆ ij ) . For large n, this can be treated as a approximate N(0,1) random variable. Use 2 or 3 as guidelines to help determine what cells are “outlying” or indicate evidence against independence. Questions: For the Pearson residual, why does it make sense to divide by ̂ij ? The standardized residual will change if a different hypothesis is tested. The Pearson residual and the standardized residual are the equivalent of semistudentized residuals and 2010 Christopher R. Bilder 2.70 studentized residuals typically discussed in a regression analysis course similar to STAT 870. See Section 10.2 of my STAT 870 lecture notes at www.chrisbilder.com/stat870/schedule.htm for more information. Example: Larry Bird (bird.R) From the last example, Second Made Missed Made n11=251 n12=34 First Missed n21=48 n22=5 Total n+1=299 n+2=39 Total n1+=285 n2+=53 n=338 Second Made Missed Made ̂11 252.11 ̂12 32.88 First Missed ̂21 46.88 ̂22 6.11 Pay close attention to how elementwise subtraction and division are being done even though matrices are being used! #General way > mu.hat<-ind.test$expected > cell.dev <- n.table - mu.hat > cell.dev second made second missed first made -1.115385 1.115385 2010 Christopher R. Bilder 2.71 first missed 1.115385 -1.115385 > pearson.res <- cell.dev/sqrt(mu.hat) > pearson.res second made second missed first made -0.07024655 0.1945039 first missed 0.16289564 -0.4510376 > ind.test$residuals #Pearson residuals easier way Second First made missed made -0.07024655 0.1945039 missed 0.16289564 -0.4510376 > stand.res <- matrix(NA, 2, 2) > #find standardized residuals for(i in 1:2) { for(j in 1:2) { stand.res[i, j] <- pearson.res[i,j] / sqrt((1-sum(n.table[i,])/n) * (1-sum(n.table[,j])/n)) } pi+ } p+j > stand.res [,1] [,2] [1,] -0.5222416 0.5222416 [2,] 0.5222416 -0.5222416 #Note that the Pearson residuals can also be found with: > ind.test<-chisq.test(n.table, correct=F) > ind.test$residuals second made second missed first made -0.07024655 0.1945039 first missed 0.16289564 -0.4510376 Notice that none of the residuals are indicating that independence provides a bad fit to the contingency table. Why does this make sense? 2010 Christopher R. Bilder 2.72 Example: Salk vaccine clinical trials (polio.R) Vaccine Placebo Polio 57 142 Polio free 200,688 201,087 Total 200,745 201,229 > n.table polio polio free vaccine 57 200688 placebo 142 201087 > pearson.res<-ind.test$residuals > pearson.res Result Trt polio polio free vaccine -4.251215 0.09461241 placebo 4.246099 -0.09449856 > stand.res <- matrix(data = NA, nrow = 2, ncol = 2) #find standardized residuals > for(i in 1:2) { for(j in 1:2) { stand.res[i, j] <- pearson.res[i, j]/sqrt((1 – sum(n.table[i, ]/n)) * (1 - sum(n.table[, j]/n))) } } pi+ p+j > stand.res [,1] [,2] [1,] -6.009997 6.009997 [2,] 6.009997 -6.009997 Notice that the residuals are indicating all cells contribute to the dependence. Example: #7.13 (birth_control.R) 2010 Christopher R. 
Bilder 2.73 This example shows what happens when a table larger than 22 is used. Note that it may be difficult to summarize all of the dependence with ORs since the table is 94 in size! Subjects were asked whether methods of birth control should be available to teenagers between the ages of 14 and 16. Notice the ordered categorical variables! Religious attendance Teenage birth control strongly agree agree disagree strongly disagree Never 49 49 19 9 <1 per year 31 27 11 11 1-2 per year 46 55 25 8 several times per year 34 37 19 7 1 per month 21 22 14 16 2-3 per month 26 36 16 16 nearly every week 8 16 15 11 every week several times per week 32 65 57 61 4 17 16 20 Below is the R code and output. n.table<-array(c(49, 31, 46, 34, 21, 26, 8, 32, 4, 49, 27, 55, 37, 22, 36, 16, 65, 17, 19, 11, 25, 19, 14, 16, 15, 57, 16, 9, 11, 8, 7, 16, 16, 11, 61, 20), dim=c(9,4), dimnames=list( Religous.attendance = c("Never", "<1 per year", "1-2 per year", "several times per year", "1 per month", "2-3 per month", 2010 Christopher R. Bilder 2.74 "nearly every week", "every week", "several times per week"), Teenage.birth.control = c("strongly agree", "agree", "disagree", "strongly disagree"))) > n.table Religous.attendance Never <1 per year 1-2 per year several times per 1 per month 2-3 per month nearly every week every week several times per Teenage.birth.control strongly agree agree disagree strongly disagree 49 49 19 9 31 27 11 11 46 55 25 8 year 34 37 19 7 21 22 14 16 26 36 16 16 8 16 15 11 32 65 57 61 week 4 17 16 20 ###################################################### # Test for independence - Pearson > ind.test <- chisq.test(n.table, correct = F) > ind.test Pearson's chi-square test without Yates' continuity correction data: n.table X-square = 106.1941, df = 24, p-value = 0 > mu.hat<-ind.test$expected > mu.hat Teenage.birth.control Religous.attendance strongly agree agree Never 34.15335 44.08639 <1 per year 21.68467 27.99136 1-2 per year 36.32181 46.88553 several times per year 26.29266 33.93952 1 per month 19.78726 25.54212 2-3 per month 25.47948 32.88985 nearly every week 13.55292 17.49460 every week 58.27754 75.22678 several times per week 15.45032 19.94384 disagree strongly disagree 26.12527 21.634989 16.58747 13.736501 27.78402 23.008639 20.11231 16.655508 15.13607 12.534557 19.49028 16.140389 10.36717 8.585313 44.57883 36.916847 11.81857 9.787257 ###################################################### # Test for independence - LRT > #easiest way > library(vcd) 2010 Christopher R. 
######################################################
# Test for independence - LRT

> #easiest way
> library(vcd)
> assocstats(n.table)
                     X^2 df   P(> X^2)
Likelihood Ratio 112.54  24 2.0284e-13
Pearson          106.19  24 2.5890e-12

Phi-Coefficient   : 0.339
Contingency Coeff.: 0.321
Cramer's V        : 0.196

######################################################
# Find residuals

> pearson.res<-ind.test$residuals
> pearson.res
                        Teenage.birth.control
Religous.attendance      strongly agree      agree   disagree strongly disagree
  Never                       2.5404573  0.7400280 -1.3940262       -2.71641759
  <1 per year                 2.0004242 -0.1873785 -1.3719091       -0.73834198
  1-2 per year                1.6058693  1.1850612 -0.5281708       -3.12893004
  several times per year      1.5030986  0.5253346 -0.2480249       -2.36589885
  1 per month                 0.2726315 -0.7008651 -0.2920103        0.97882315
  2-3 per month               0.1031195  0.5423137 -0.7905900       -0.03494422
  nearly every week          -1.5083590 -0.3573330  1.4388522        0.82410537
  every week                 -3.4421839 -1.1791057  1.8603644        3.96370252
  several times per week     -2.9130570 -0.6591897  1.2163031        3.26446417

#find standardized residuals
> stand.res <- matrix(NA, 9, 4)
> for(i in 1:9) {
    for(j in 1:4) {
      stand.res[i, j]<-pearson.res[i,j] /
        sqrt((1-sum(n.table[i,]/n)) * (1 - sum(n.table[, j]/n)))
    }
  }
> stand.res
            [,1]       [,2]       [,3]        [,4]
 [1,]  3.2012973  0.9874517 -1.6845693 -3.21118144
 [2,]  2.4512975 -0.2431349 -1.6121413 -0.84876153
 [3,]  2.0337928  1.5892453 -0.6414677 -3.71746225
 [4,]  1.8606698  0.6886070 -0.2944292 -2.74746527
 [5,]  0.3327059 -0.9056755 -0.3417328  1.12058040
 [6,]  0.1274202  0.7095804 -0.9368124 -0.04050671
 [7,] -1.8164011 -0.4556525  1.6616023  0.93098781
 [8,] -4.6010650 -1.6689014  2.3846584  4.97026514
 [9,] -3.5220717 -0.8439433  1.4102460  3.70267268

There is strong evidence against independence. The deviation from independence appears to occur in the "corners" of the table. Notice the upper left and lower right have positive values, and the lower left and upper right have negative values. This could be due to the ordinal nature of the categorical variables. Models which take this into account will be discussed later.

The type of dependence here is called "positive" dependence (not "negative" dependence). The positive values in the upper left and lower right mean the (1,1), (9,4), ... cells occur more frequently than expected under independence; thus, low row and low column indices occur together, and high row and high column indices occur together. The negative values in the lower left and upper right mean the (9,1), (1,4), ... cells occur less frequently than expected under independence. If this is hard to understand, think of the positive relationship that typically occurs between high school and college GPAs. See the data set in the R Introduction notes.

Partitioning chi-squared (pp. 32-33)
Read on your own

Comments on chi-squared tests (pp. 33-34)
Read on your own

Note that X² and G² do not depend on the order of the rows or columns. Thus, they do not change for any ordering of the rows and columns. These tests assume the categorical variables are nominal. If the categorical variables are ordinal, the tests ignore the ordinal information.

2.5 Testing independence for ordinal data

The previous tests for independence assumed each categorical variable was nominal. If at least one of the variables is ordinal, useful information may be ignored by using the previous tests! Generally, tests which incorporate the ordinal information will be more POWERFUL in detecting dependence than tests which do not. What does being more POWERFUL mean???
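One way to make the idea of power concrete is a small simulation. The sketch below is not from the course programs; the table size, scores, sample size, and trend strength are all made up for illustration. It repeatedly generates tables whose true cell probabilities follow a mild linear-by-linear trend and compares how often X² and the ordinal statistic M² = (n-1)r², introduced in the next part of this section, reject independence at the 0.05 level.

set.seed(1212)
I <- 4; J <- 4
u <- 1:I; v <- 1:J                         #equally spaced scores
beta <- 0.1                                #made-up trend strength
pi.ij <- exp(beta * outer(u, v))           #linear-by-linear association
pi.ij <- pi.ij / sum(pi.ij)
n <- 200; B <- 1000
reject.x2 <- reject.m2 <- logical(B)
for (b in 1:B) {
  tab <- matrix(rmultinom(1, size = n, prob = as.vector(pi.ij)), nrow = I)
  reject.x2[b] <- suppressWarnings(chisq.test(tab, correct = FALSE))$p.value < 0.05
  #M^2 = (n-1)r^2 using the scores
  all.u <- rep(rep(u, times = J), times = as.vector(tab))
  all.v <- rep(rep(v, each = I),  times = as.vector(tab))
  M.sq <- (n - 1) * cor(all.u, all.v)^2
  reject.m2[b] <- (1 - pchisq(M.sq, df = 1)) < 0.05
}
mean(reject.x2)   #estimated power of X^2
mean(reject.m2)   #estimated power of M^2 - usually higher for a trend alternative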
Linear trend alternative to independence

Suppose the row and column categorical variables are ordinal. If either of the categorical variables is nominal with only two categories, the test shown below can also be used.

Tests using the ordinal information assign "scores" to each level of the row and each level of the column categorical variables. Let u₁ ≤ u₂ ≤ ... ≤ u_I denote the scores for the row variable, with at least one ≤ replaced by <. Let v₁ ≤ v₂ ≤ ... ≤ v_J denote the scores for the column variable, with at least one ≤ replaced by <.

Example: #7.13 (birth_control.R)

Recall the 9×4 religious attendance × teenage birth control table shown earlier. Teenage birth control opinions could have scores of v₁=1 (strongly agree), v₂=2 (agree), v₃=3 (disagree), and v₄=4 (strongly disagree). Religious attendance could have scores of u₁=0 (never), u₂=1 (<1 per year), ..., u₉=8 (several times per week). Converting the attendance levels to a times-per-year scale could produce the following scores instead: u₁=0, u₂=1, u₃=1.5, u₄=(3+12)/2=7.5, u₅=12, u₆=25, u₇=(52+25)/2=38.5, u₈=52, and u₉=52×2=104.

Notice there generally is more than one way of assigning scores! One should try a few different ways to see if inferences are affected.

Suppose each observation is replaced with its (u_i, v_j) pair. In the last example, there are 49 observation pairs of (u₁,v₁), ..., 20 observation pairs of (u₉,v₄). Using this "new" data set, the Pearson product-moment correlation (often denoted by r) can be calculated and interpreted in its usual way!

Review from STAT 218 for a Pearson correlation: Suppose X and Y are two variables. We observe (x₁, y₁), ..., (x_n, y_n) pairs where n is the sample size. The Pearson correlation is calculated as

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \sum_{i=1}^{n}(y_i-\bar{y})^2}} = \frac{\sum_{i=1}^{n}x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^{n}x_i^2 - n\bar{x}^2\right)\left(\sum_{i=1}^{n}y_i^2 - n\bar{y}^2\right)}}$$

r is scaleless and -1 ≤ r ≤ 1.

Since there are a number of observations with the same (u_i, v_j) pair, we can simplify the formula for the correlation to

$$r = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J}u_i v_j n_{ij} - \left(\sum_{i=1}^{I}u_i n_{i+}\right)\left(\sum_{j=1}^{J}v_j n_{+j}\right)\big/n}{\sqrt{\left[\sum_{i=1}^{I}u_i^2 n_{i+} - \left(\sum_{i=1}^{I}u_i n_{i+}\right)^2\big/n\right]\left[\sum_{j=1}^{J}v_j^2 n_{+j} - \left(\sum_{j=1}^{J}v_j n_{+j}\right)^2\big/n\right]}}$$

Compare this formula on your own to the formula for the Pearson product-moment correlation.

Notes:
-1 ≤ r ≤ 1.
Values close to -1 or 1 indicate strong negative or positive dependence, respectively. Values close to 0 indicate independence or weak dependence.
To test Ho: Independence vs. Ha: Linear dependence, use M² = (n-1)r² as the test statistic. This statistic has an approximate χ² distribution with 1 df for large n.
Notice the null hypothesis is the same as previously used for the "test of independence" with X² and G². However, the alternative hypothesis is not the same. This alternative hypothesis specifies the "type" of dependence; previously, any "type" of dependence was allowed in the alternative hypothesis. The alternative hypothesis here is a subset of the alternative hypothesis used with X² and G².

Example: #7.13 (birth_control.R)

Ho: Independence
Ha: Linear dependence

r = 0.3101, M² = (926-1)×0.3101² = 88.96, p-value < 0.0001. There is evidence of positive linear dependence.

Notice the pattern of the residuals computed earlier for this table. It is indicative of a linear relationship! The "corner" residuals are "large". When the u and v scores are both small or both large, the residuals are positive. When the u and v scores are opposite in their values (i.e., u small and v large, or vice versa), the residuals are negative.
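Before turning to the program, note that r can also be computed directly from the contingency table with the weighted-sums formula above, without expanding the data into raw form. This is a sketch (not part of birth_control.R) that assumes n.table is defined as earlier; it reproduces r = 0.3101 and M² = 88.96.

u <- 0:8; v <- 1:4                           #scores described above
n <- sum(n.table)
ni <- rowSums(n.table); nj <- colSums(n.table)
suv <- sum(outer(u, v) * n.table)            #sum_i sum_j u_i v_j n_ij
su  <- sum(u * ni);   sv  <- sum(v * nj)
su2 <- sum(u^2 * ni); sv2 <- sum(v^2 * nj)
r <- (suv - su*sv/n) / sqrt((su2 - su^2/n) * (sv2 - sv^2/n))
M.sq <- (n - 1) * r^2
c(r = r, M.sq = M.sq, p.value = 1 - pchisq(M.sq, 1))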
Below is the R code and output. Notice how the data is put into its "raw" form.

> #########################################################
> # ordinal measures

> #scores
> u <- 0:8
> #u <- c(0, 1, 1.5, 7.5, 12, 25, 38.5, 52, 104)   #second set of scores
> v <- 1:4

> all.data <- matrix(data = NA, nrow = 0, ncol = 2)
> #Put data in "raw" form - combine the u and v scores
> for(i in 1:9) {
    for(j in 1:4) {
      all.data <- rbind(all.data, matrix(data = c(u[i],v[j]), nrow = n.table[i, j],
                        ncol = 2, byrow=T))   #one row for each observation with this (u, v)
    }
  }

> #find correlation
> r <- cor(all.data)
> r
          [,1]      [,2]
[1,] 1.0000000 0.3101243
[2,] 0.3101243 1.0000000

> M.sq <- (sum(n.table) - 1) * r[1, 2]^2
> M.sq
[1] 88.96382
> 1 - pchisq(M.sq, 1)
[1] 0

When the second set of u scores is used, r = 0.3067.

Notes:
r and M² do not change for different sets of equally spaced scores. For example, scores of 1,2,3,4 and 0,1,2,3 give the same results.
See the example using the data in Table 2.7 of Agresti (2007). The column variable is nominal, but one can still find r since it has only two levels.
See Agresti's (2007) use of "midranks" to find the scores.
Model-based approaches for ordinal data will be discussed later in Chapter 7. Chapter 9 of Agresti (2002) discusses these in detail.
What if one of the variables is ordinal and the other variable is nominal (with more than two categories)? One can look at mean scores across the levels of the nominal variable. For example, suppose X is nominal and Y is ordinal. Find the mean scores for Y at each level of X. See Chapter 9 of Agresti (2002) again.

2.6 Exact inference for small samples

X² and G² for a fixed n do NOT exactly have χ² distributions!!! We use a χ² distribution when n is large because the statistics "approximately" have this distribution. What happens if the sample size is not large???

A good overview of exact inference is: Agresti, A. (1992). A survey of exact inference for contingency tables. Statistical Science 7, 131-153.

Exact inference refers to using the "exact" probability distribution of the statistic. The Clopper-Pearson interval is an example of exact inference.

Here's a quote from Agresti (1992) which quotes R. A. Fisher's Statistical Methods for Research Workers 1st edition (1926) book:

... the traditional machinery of statistical processes is wholly unsuited to the needs of practical research. Not only does it take a cannon to shoot a sparrow, but it misses the sparrow! The elaborate mechanism built on the theory of infinitely large samples is not accurate enough for simple laboratory data. Only by systematically tackling small sample problems on their merits does it seem possible to apply accurate tests to practical data.

Small samples here does not just mean a small n. It also means having a mix of small and large cell counts.

Hypergeometric distribution

Here's the classic set up for a random variable with a hypergeometric probability distribution: Suppose an urn has n balls with a of them being red and b of them being blue. Thus a+b=n. Suppose k ≤ n balls are drawn from the urn without replacement. Let m be the number of red balls drawn out. The random variable m has a hypergeometric distribution with density function

$$P(m) = \frac{\binom{a}{m}\binom{b}{k-m}}{\binom{n}{k}} \quad \text{for } m = 0, 1, \ldots, k, \text{ subject to } m \le a \text{ and } k-m \le b.$$

Note that $\binom{e}{d} = \frac{e!}{d!(e-d)!}$ = "e choose d".
Also, notice that a, n, b, and k are FIXED values. The only random variable is m!

Example: Let n=10, a=4, b=6, k=3, and m=2

$$P(m=2) = \frac{\binom{4}{2}\binom{6}{1}}{\binom{10}{3}} = \frac{6 \times 6}{120} = \frac{3}{10}$$

Example: Urns (tea_taster.R)

Suppose there are n=8 balls in an urn with a=4 of them red and b=4 of them blue. Suppose k=4 balls are drawn from the urn. What is the probability of getting m=3 red balls?

$$P(3) = \frac{\binom{4}{3}\binom{4}{1}}{\binom{8}{4}} = \frac{4 \times 4}{8!/(4!\,4!)} = \frac{16}{70} = 0.2286$$

The entire probability distribution is

 m      0        1        2        3        4
 P(m)   0.0143   0.2286   0.5143   0.2286   0.0143

Is it reasonable to observe m ≥ 3?

R code and output:

> #P(3)
> dhyper(3, 4, 4, 4)
[1] 0.2285714
> #P(0),...,P(4)
> dhyper(0:4, 4, 4, 4)
[1] 0.01428571 0.22857143 0.51428571 0.22857143 0.01428571

In general, the function is dhyper(m, a, b, k).

Fisher's exact test

The hypergeometric distribution can be used with 2×2 tables to test for independence! Below is a 2×2 table.

             Y
         1      2
X   1   n11    n12    n1+
    2   n21    n22    n2+
        n+1    n+2    n

The hypergeometric correspondence is m = n11, a = n1+, b = n2+, and k = n+1.

Suppose n1+, n2+, n+1, n+2, and n are FIXED by the sampling design. This means before the sample is taken or the experiment is conducted, these values are KNOWN. Given these known quantities, how many of the 4 cell counts (n11, n12, n21, and n22) are needed before all of the other cell counts are known?

Since only one of the four cell counts is needed to know the rest of the table counts, n11 can be treated as the only random variable! If you know n11, you know the rest of the table!

Suppose X and Y are independent. The probabilities of observing different n11 values (and thus different 2×2 tables) can be calculated using the hypergeometric distribution:

$$P(n_{11}) = \frac{\binom{n_{1+}}{n_{11}}\binom{n_{2+}}{n_{+1}-n_{11}}}{\binom{n}{n_{+1}}} = \frac{\binom{n_{1+}}{n_{11}}\binom{n_{2+}}{n_{21}}}{\binom{n}{n_{+1}}}$$

The probabilities are calculated under the assumption of independence between X and Y. A low probability indicates that a particular n11 is not likely to be observed. Thus, its corresponding 2×2 table is not likely under independence.

Using the hypergeometric distribution in a test for independence with 2×2 contingency tables is called Fisher's exact test. Note that the hypergeometric is the EXACT distribution for n11. Thus, this is where the name exact inference comes from.

Tea taster experiment

This is a common example discussed often in statistics. See p. 46 of Agresti (2007) for the set up, or "The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century" book by David Salsburg. Below is a "possible" outcome of the observed data (the actual data are apparently unknown).

                        Guess
Poured First      Milk    Tea
  Milk               3      1     4
  Tea                1      3     4
                     4      4     8

Before the experiment, it was decided to have 4 cups with milk poured first and 4 cups with tea poured first. Thus, the row marginal totals are FIXED. Since the taster was told before the experiment that 4 cups had milk poured first and 4 cups had tea poured first, one would think the taster would guess 4 of each type. Thus, the column totals are FIXED.

Questions:
How likely is it to have an experiment with both row and column totals fixed?
Suppose that the taster really can not tell the difference. What does this mean in terms of the problem?
What is the probability that the taster would have guessed correctly three of the milk-poured-first cups? Under the assumption that the taster can not tell the difference, the probability can be found with the hypergeometric distribution: P(3) = 0.2286.
Does guessing 3 or more of the milk-poured-first cups correctly seem reasonable under the assumption that the taster can not tell the difference? P(3) + P(4) = 0.2286 + 0.0143 = 0.2429

What is the p-value of Ho: θ = 1 (independence) vs. Ha: θ > 1 (positive dependence)? Why is this test chosen instead of Ha: θ ≠ 1 or Ha: θ < 1?

Notice the only way to show there is some evidence that the taster can tell the difference is when n11 = 4. The small sample size here is the reason.

R code and output from tea_taster.R:

> n.table<-array(data = c(3, 1, 1, 3), dim = c(2,2), dimnames=list(Actual =
    c("Pour Milk", "Pour Tea"), Guess = c("Pour Milk", "Pour Tea")))
> n.table
           Guess
Actual      Pour Milk Pour Tea
  Pour Milk         3        1
  Pour Tea          1        3

> fisher.test(x = n.table)

        Fisher's Exact Test for Count Data

data:  n.table
p-value = 0.4857
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
   0.2117329 621.9337505
sample estimates:
odds ratio
  6.408309

> fisher.test(n.table, alternative = "greater")

        Fisher's Exact Test for Count Data

data:  n.table
p-value = 0.2429
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 0.3135693       Inf
sample estimates:
odds ratio
  6.408309

The two-tail test p-value is given by fisher.test(). The two-tail test adds all probabilities that are ≤ P(n11); i.e., it sums the table probabilities that are no more likely than the observed table. In this case, this includes P(0), P(1), P(3), and P(4).

Larger than 2×2 tables

Fisher's exact test can be extended to tables larger than 2×2 by using the multiple hypergeometric distribution. With the marginal totals of the contingency table again assumed to be fixed (so that the counts outside the last row and last column determine the rest of the table), the probability of observing a particular set of cell counts is

$$P(\text{table}) = \frac{\prod_{i=1}^{I} n_{i+}! \;\prod_{j=1}^{J} n_{+j}!}{n! \;\prod_{i=1}^{I}\prod_{j=1}^{J} n_{ij}!}$$

Below is the I×J table shown for review:

              Y
         1      2     ...   J-1       J
X   1   n11    n12    ...  n1,J-1    n1J     n1+
    2   n21    n22    ...  n2,J-1    n2J     n2+
    ...
    I-1 nI-1,1 nI-1,2 ...  nI-1,J-1  nI-1,J  nI-1,+
    I   nI1    nI2    ...  nI,J-1    nIJ     nI+
        n+1    n+2    ...  n+,J-1    n+J     n

For 2×2 tables, the multiple hypergeometric simplifies to the hypergeometric.

Example: Table 2.10 of Agresti (1996, p. 45) (tab2.10.R)

> n.table <- array(data = c(0, 1, 0,  7, 1, 8,  0, 1, 0,  0, 1, 0,  0, 1, 0,
    0, 1, 0,  0, 1, 0,  1, 0, 0,  1, 0, 0), dim = c(3, 9))
> n.table
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    0    7    0    0    0    0    0    1    1
[2,]    1    1    1    1    1    1    1    0    0
[3,]    0    8    0    0    0    0    0    0    0

> fisher.test(n.table)

        Fisher's Exact Test for Count Data

data:  n.table
p-value = 0.001505
alternative hypothesis: two.sided

> x.sq<-chisq.test(n.table, correct=F)
Warning message:
Chi-squared approximation may be incorrect in: chisq.test(n.table, correct = F)
> x.sq

        Pearson's Chi-squared test

data:  n.table
X-squared = 22.2857, df = 16, p-value = 0.1342

Notice the difference in the p-values between the two tests.

Permutation tests

Introduction to Modern Nonparametric Statistics by James J. Higgins (2003) is a very good reference on these types of tests.

Similar to Fisher's exact test, it would be nice if we could write out the exact probability distribution for statistics like X² or G² and use these distributions to judge how likely it is to observe the test statistic value under a null hypothesis.
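Returning to the tea-tasting table for a moment, the two-sided fisher.test() p-value reported earlier (0.4857) can be reproduced by hand from the hypergeometric probabilities, using the rule described above (sum the probabilities of all tables that are no more likely than the observed one). A sketch, not part of tea_taster.R:

probs <- dhyper(0:4, 4, 4, 4)        #P(n11 = 0), ..., P(n11 = 4)
obs <- dhyper(3, 4, 4, 4)            #probability of the observed table (n11 = 3)
sum(probs[probs <= obs + 1e-8])      #0.0143 + 0.2286 + 0.2286 + 0.0143 = 0.4857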
In the tea tasting experiment, there are 5 unique 2×2 tables under independence, which produce the following probabilities:

 n11   Table (row 1 / row 2)   P(n11)   X²
  0    0 4 / 4 0               0.0143    8
  1    1 3 / 3 1               0.2286    2
  2    2 2 / 2 2               0.5143    0
  3    3 1 / 1 3               0.2286    2
  4    4 0 / 0 4               0.0143    8

Notice that X² is the same for some tables. Taking this into account, the exact probability distribution of X² can be written as

 X²    P(X²)    CDF
  0    0.5143   0.5143
  2    0.4571   0.9714
  8    0.0286   1.0000

The CDF column represents the "cumulative distribution function." Remember that with a Pearson chi-square test for independence, we would use a χ² distribution with 1 df to approximate this discrete distribution. Below are a table and plot showing how poor this approximation is:

 X²    P(X²)    CDF      χ²(1 df) CDF
  0    0.5143   0.5143   0.0000
  2    0.4571   0.9714   0.8427
  8    0.0286   1.0000   0.9953

[Figure: exact CDF of X² compared to the χ²(1 df) CDF]

My perm_test_motivate.R program does these calculations.

A more general way to see this same exact distribution representation is to consider all possible "permutations" of the row and column numbers. For example, we observed the table:

                        Guess
Poured First      Milk    Tea
  Milk               3      1     4
  Tea                1      3     4
                     4      4     8

There are 8 distinct observations the lady needs to make. We could label these as z1, z2, ..., z8. Suppose we observed the following:

 Row   Column
  1    z1 = 1
  1    z2 = 1
  1    z3 = 1
  1    z4 = 2
  2    z5 = 1
  2    z6 = 2
  2    z7 = 2
  2    z8 = 2

which produces the table above and X² = 2. Under independence, these column numbers could have appeared with any of the row numbers. For example, we could have had

 Row   Column
  1    z2 = 1
  1    z1 = 1
  1    z3 = 1
  1    z4 = 2
  2    z5 = 1
  2    z6 = 2
  2    z7 = 2
  2    z8 = 2

resulting in the same 2×2 table, so that X² = 2 again. Also, we could have had

 Row   Column
  1    z1 = 1
  1    z2 = 1
  1    z7 = 2
  1    z4 = 2
  2    z5 = 1
  2    z6 = 2
  2    z3 = 1
  2    z8 = 2

resulting in a contingency table with all 2's in the cells and X² = 0.

These last two examples are "permutations" of the data, and there are 8! = 40,320 permutations in total. Because of the independence assumption, each of these is equally likely to occur; i.e., each has probability 1/40,320. If we found all possible permutations, we could form a table as follows:

 X²    # of permutations   Proportion
  0    20,736              0.5143
  2    18,432              0.4571
  8     1,152              0.0286

which is the same exact distribution that we saw before! In fact, the numbers of permutations corresponding to n11 = 0, ..., 4 could have been found with

> dhyper(0:4, 4, 4, 4)*factorial(8)
[1]   576  9216 20736  9216   576

in R (combining the n11 values that give the same X² yields the table above).

In order to calculate a p-value, we can use this exact distribution. With X² = 2 observed, the p-value is P(A ≥ 2) = 0.4571 + 0.0286 = 0.4857, where A is a random variable with this exact distribution (in a more mathematical statistics setting, one would write that x² = 2 is observed and the p-value is P(X² ≥ 2)).

Frequently, the number of permutations is going to be so large that we can not calculate every permutation. Instead, we will randomly select a large number of them, say B, and calculate an estimate of the exact distribution from those. This estimate is often referred to as the "permutation distribution." Using this distribution to do a hypothesis test is referred to as a "permutation test."

Below is a description of a general way to find the permutation distribution.
1) Randomly permute the column numbers. Put these back into a data set with the row numbers.
2) Calculate X². Denote this statistic by X²* to avoid confusion with the observed X².
3) Repeat 1) and 2) B times, where B is a large number (1,000 or more).
4) Plot a histogram of the X²*'s.
This serves as a visual estimate of the exact distribution of X². To calculate our p-value, we can get an initial impression of whether it will be small or large by seeing where the observed X² falls on this histogram. To calculate it formally, we can use step 5.
5) The p-value is (1/B) × (# of X²* ≥ X²). Small p-values indicate the observed X² would be unusual to obtain if independence was true.

How can we do all of this in R? First, we will need to put the data into its "raw form" (this is my own term), so that every cell in the contingency table is represented by row and column numbers, as in the earlier listing of z1, ..., z8 for the tea-tasting data. We can then use the sample() function to find each permutation. The next example shows the whole process.

Example: Table 2.10 of Agresti (1996) (tab2.10-v2.R)

> n.table<-array(data = c(0, 1, 0,  7, 1, 8,  0, 1, 0,  0, 1, 0,  0, 1, 0,
    0, 1, 0,  0, 1, 0,  1, 0, 0,  1, 0, 0), dim=c(3,9))
> x.sq<-chisq.test(n.table, correct=F)
Warning message:
Chi-squared approximation may be incorrect in: chisq.test(n.table, correct = F)
> x.sq

        Pearson's Chi-squared test

data:  n.table
X-squared = 22.2857, df = 16, p-value = 0.1342

Note that X² = 22.29.

> ##########################################################
> #Put data into raw form
> all.data<-matrix(data = NA, nrow = 0, ncol = 2)
>
> #Put data in "raw" form
> for (i in 1:nrow(n.table)) {
    for (j in 1:ncol(n.table)) {
      all.data<-rbind(all.data, matrix(data = c(i, j), nrow = n.table[i,j],
                      ncol = 2, byrow=T))
    }
  }
There were 16 warnings (use warnings() to see them)
> #Note that warning messages will be generated since n.table[i,j]=0 sometimes
>
> all.data
      [,1] [,2]
 [1,]    1    2
 [2,]    1    2
 [3,]    1    2
 [4,]    1    2
 [5,]    1    2
 [6,]    1    2
 [7,]    1    2
 [8,]    1    8
 [9,]    1    9
[10,]    2    1
[11,]    2    2
[12,]    2    3
[13,]    2    4
[14,]    2    5
[15,]    2    6
[16,]    2    7
[17,]    3    2
[18,]    3    2
[19,]    3    2
[20,]    3    2
[21,]    3    2
[22,]    3    2
[23,]    3    2
[24,]    3    2

> save<-xtabs(~all.data[,1]+ all.data[,2])
> save
             all.data[, 2]
all.data[, 1] 1 2 3 4 5 6 7 8 9
            1 0 7 0 0 0 0 0 1 1
            2 1 1 1 1 1 1 1 0 0
            3 0 8 0 0 0 0 0 0 0
> rowSums(save)
1 2 3
9 7 8
> colSums(save)
 1  2  3  4  5  6  7  8  9
 1 16  1  1  1  1  1  1  1

This matches the original contingency table, so the raw data part worked. Note what the row and column marginal totals are!

Below is a further explanation of the code used to put the data into raw form:
The "c(i,j)" creates a vector containing the row (i) and column (j) index for the raw data format.
The "matrix( ... )" part tells R to create a matrix with contents "c(i,j)", a number of rows of "n.table[i,j]", a number of columns of 2, and to fill it by row (meaning c(i,j) becomes a 1×2 row). Since c(i,j) is only one vector, R duplicates it as many times as it is told to by specifying "n.table[i,j]" as the number of rows (R calls this recycling).
The "rbind( ... )" tells R to combine everything in "all.data" and "matrix( ... )" by row. Thus, everything that was in "all.data" comes first and the "matrix( ... )" is put below it.
This is done for all rows and columns of the data through using the two for loops.

> ########################################################
> #Do one permutation to illustrate - i.e., find one X^2*
> set.seed(4088)
> all.data.star<-cbind(all.data[,1], sample(all.data[,2], replace=F))
> all.data.star
      [,1] [,2]
 [1,]    1    2
 [2,]    1    2
 [3,]    1    9
 [4,]    1    2
 [5,]    1    2
 [6,]    1    2
 [7,]    1    2
 [8,]    1    4
 [9,]    1    2
[10,]    2    2
[11,]    2    8
[12,]    2    2
[13,]    2    2
[14,]    2    2
[15,]    2    2
[16,]    2    2
[17,]    3    6
[18,]    3    2
[19,]    3    1
[20,]    3    3
[21,]    3    2
[22,]    3    2
[23,]    3    5
[24,]    3    7

> calc.stat<-chisq.test(all.data.star[,1], all.data.star[,2], correct=F)
Warning message:
Chi-squared approximation may be incorrect in: chisq.test(all.data.star[, 1],
  all.data.star[, 2], correct = F)
> calc.stat$statistic
X-squared
 17.33036

> save.star<-xtabs(~all.data.star[,1] + all.data.star[,2])
> save.star
                  all.data.star[, 2]
all.data.star[, 1] 1 2 3 4 5 6 7 8 9
                 1 0 7 0 1 0 0 0 0 1
                 2 0 6 0 0 0 0 0 1 0
                 3 1 3 1 0 1 1 1 0 0
> rowSums(save.star)
1 2 3
9 7 8
> colSums(save.star)
 1  2  3  4  5  6  7  8  9
 1 16  1  1  1  1  1  1  1

Notes:
To illustrate one possible permutation of the data, the all.data.star data set is found. Notice how the column numbers are permuted using the sample() function. The row numbers are held fixed. The row and column numbers are then put back together to form a matrix.
The xtabs(), rowSums(), and colSums() functions' output shows the marginal totals are still the same as with the observed data.
The X²* statistic is 17.33 for this permutation. What is the probability this one permutation would occur?

Suppose I did a different permutation. Let set.seed(4089). For this seed, X²* = 16.46:

> save.star
                  all.data.star[, 2]
all.data.star[, 1] 1 2 3 4 5 6 7 8 9
                 1 0 6 0 0 1 1 1 0 0
                 2 1 4 1 0 0 0 0 0 1
                 3 0 6 0 1 0 0 0 1 0
> rowSums(save.star)
1 2 3
9 7 8
> colSums(save.star)
 1  2  3  4  5  6  7  8  9
 1 16  1  1  1  1  1  1  1

Now, I would like to repeat this process B = 1,000 times to get 1,000 different X²* values. These X²*'s will then represent my permutation distribution.

> #########################################################
> # A simple function and for loop to find the permutation distribution.

> do.it<-function(data.set){
    all.data.star<-cbind(data.set[,1], sample(data.set[,2], replace=F))
    chisq.test(all.data.star[,1], all.data.star[,2], correct=F)$statistic
  }

> summarize<-function(result.set, statistic, df, B) {
    par(mfrow = c(1,2))

    #Histogram
    hist(x = result.set, main = expression(paste("Histogram of ", X^2, " perm. dist.")),
         col = "blue", freq = FALSE)
    curve(expr = dchisq(x = x, df = df), col = "red", add = TRUE)
    segments(x0 = statistic, y0 = -10, x1 = statistic, y1 = 10)

    #QQ-Plot
    chi.quant<-qchisq(p = seq(from = 1/(B+1), to = B/(B+1), by = 1/(B+1)), df = df)
    plot(x = sort(result.set), y = chi.quant, main = expression(paste("QQ-Plot of ",
         X^2, " perm. dist.")))
    abline(a = 0, b = 1)

    par(mfrow = c(1,1))

    #p-value
    mean(result.set>=statistic)
  }

> #Example use of do.it function
> do.it(data.set = all.data)
X-squared
 16.14286
Warning message:
Chi-squared approximation may be incorrect in: chisq.test(all.data.star[, 1],
  all.data.star[, 2], correct = F)

> B<-1000
> results<-matrix(data = NA, nrow = B, ncol = 1)
> set.seed(5333)
> for(i in 1:B) {
    results[i,1]<-do.it(all.data)
  }
There were 50 or more warnings (use warnings() to see the first 50)

> summarize(results, x.sq$statistic, (nrow(n.table)-1)*(ncol(n.table)-1), B)
[1] 0.003

[Figure: histogram and QQ-plot of the X²* permutation distribution]

Notes:
do.it() is a user written function! I have put the sampling part and the calculation of X²* inside of it.
Notice the syntax used with the function. Also, notice the example where I used the function once with the all.data data set. AND, notice the last line of the function gives the X²* value. For all functions written in R, the last line defines what is returned as a result of the function. Notice here the X²* value was printed without me asking for it to be printed!
The for loop is used to repeat the do.it() function B = 1,000 times. The results are then stored in a matrix called results.
The warning messages are just with regard to the χ² approximation to the distribution of each X²*, which is probably not appropriate here.
The set.seed() function is used before the for loop so that the results produced here can be reproduced by others. Notice that it only needs to be set once before the loop.
summarize() is another user written function to help summarize the results in a histogram and a QQ-plot and to find the p-value. Notice again that the last line finds the p-value, and this is returned as the result of the function.
Remember the χ² distribution with 16 df is used with X² for a "regular" Pearson chi-square test for independence. The QQ-plot plots the quantiles of a χ²(16 df) distribution versus the X²* values. If the values fell on a straight line at 45° from the origin, the X²* values would all be equal to the quantiles of a χ²(16 df) distribution, and thus the distribution for X² could be approximated by a χ²(16 df) distribution. As you can see, this does NOT happen here! See qq_plot_chi.square.R for an example where a simulated sample from a chi-square distribution is used.
There is strong evidence against independence since the p-value is 0.003. Agresti (1996) found a p-value of 0.001.

Below are the actual values of X²* obtained.

> table(round(results,2))
15.66  15.8 15.89  15.9 16.14 16.19 16.33  16.4 16.46 16.71 16.74 16.75 16.83
   67   100    30   129    61     6    12    40   123    60    47     7     2
 16.9 17.19 17.33 17.71 17.83 17.89 18.02 18.05 18.24 18.48 18.66 19.23 19.69
   99     7    12     9    31    53     1    32    11     9    16     2     8
 19.9    20 20.08 20.14 20.46 21.02 21.19 22.29 22.31
    2     4     7     5     1     1     3     1     2

Again, one can think of the permutation test as a way to obtain an estimate of the probability distribution function of the discrete random variable X² under Ho. Based upon the above information, we obtain the second column in the table below.

                    Permutation dist.          Chi-square dist.
P(X² ≤ 15.66)       67/1000 = 0.067            0.5231
P(X² ≤ 15.80)       (67+100)/1000 = 0.167      0.5330
P(X² ≤ 15.89)       0.197                      0.5393
   ...                 ...                        ...
P(X² ≤ 21.19)       0.997                      0.8287
P(X² ≤ 22.29)       0.998                      0.8659
P(X² ≤ 22.31)       1                          0.8665

The permutation distribution then replaces the chi-square distribution approximation for X². Below is P(X² ≤ ___ ) using a χ²(16 df) approximation.

> round(pchisq(q = as.numeric(names(table(round(results,2)))),
    df = (nrow(n.table)-1)*(ncol(n.table)-1)),4)
 [1] 0.5231 0.5330 0.5393 0.5400 0.5568 0.5602 0.5698 0.5746 0.5787 0.5954
[11] 0.5974 0.5980 0.6033 0.6079 0.6266 0.6354 0.6588 0.6661 0.6696 0.6773
[21] 0.6790 0.6900 0.7035 0.7133 0.7431 0.7655 0.7752 0.7798 0.7834 0.7860
[31] 0.7998 0.8223 0.8287 0.8659 0.8665

A plot of the cumulative distribution functions is shown below (see program for code).

[Figure: exact (permutation) CDF of X² compared to the χ²(16 df) CDF]

Here's a simpler way to get the p-value:

> set.seed(7709)
> chisq.test(n.table, correct = FALSE, simulate.p.value = TRUE, B = 1000)

        Pearson's Chi-squared test with simulated p-value (based on 1000 replicates)

data:  n.table
X-squared = 22.2857, df = NA, p-value = 0.001
Why did I show the harder way first?
It will help you understand what the chisq.test() function is actually doing.
You can not summarize the results from chisq.test() with a histogram or QQ-plot.
A permutation test is a very general approach for inference. It can be used in many other settings which are not already programmed into a function like chisq.test()! A simple example: suppose you would like to use G² for the test of independence.
Permutation tests are closely related to bootstrap hypothesis tests. See the additional Chapter 2 notes for how one can use functions in the boot package to do permutation tests.

Example: Larry Bird (bird_perm.R)

> #Create contingency table - notice the data is entered by columns
> n.table<-array(c(251, 48, 34, 5), dim=c(2,2), dimnames=list(First =
    c("made", "missed"), Second = c("made", "missed")))
> n.table
        Second
First    made missed
  made    251     34
  missed   48      5

> x.sq<-chisq.test(n.table, correct=F)
> x.sq

        Pearson's Chi-squared test

data:  n.table
X-squared = 0.2727, df = 1, p-value = 0.6015

> #########################################################
> #Find raw data
> all.data<-matrix(data = NA, nrow = 0, ncol = 2)

> #Put data in "raw" form
> for (i in 1:nrow(n.table)) {
    for (j in 1:ncol(n.table)) {
      all.data<-rbind(all.data, matrix(data = c(i,j), nrow = n.table[i,j],
                      ncol = 2, byrow=T))
    }
  }

> #Check
> xtabs(~all.data[,1]+ all.data[,2])
             all.data[, 2]
all.data[, 1]   1   2
            1 251  34
            2  48   5

Here's how the test can be done using the methods demonstrated in the last example. When you do it yourself, you should only use one of these methods unless instructed to do otherwise.

Code for method #1: The same do.it() and summarize() functions are used here, so only partial results are given:

> summarize(result.set = results, statistic = x.sq$statistic,
    df = (nrow(n.table)-1)*(ncol(n.table)-1), B = B)
[1] 0.624

[Figure: histogram and QQ-plot of the X²* permutation distribution]

> #Shows the different X^2* values
> table(round(results,2))
    0  0.17  0.27  0.78  0.98  1.82  2.13  3.31  3.71  5.23  5.74  7.59   8.2 10.39
  190   186   179   101   110    76    66    45    22    13     5     4     2     1

> #chi-square approximation
> round(pchisq(q = as.numeric(names(table(round(results,2)))),
    df = (nrow(n.table)-1)*(ncol(n.table)-1)),4)
 [1] 0.0000 0.3199 0.3967 0.6229 0.6778 0.8227 0.8556 0.9311 0.9459 0.9778
[11] 0.9834 0.9941 0.9958 0.9987

Code and output for method #2:

> #Method #2
> set.seed(8912)
> chisq.test(n.table, correct = FALSE, simulate.p.value = TRUE, B = 1000)

        Pearson's Chi-squared test with simulated p-value (based on 1000 replicates)

data:  n.table
X-squared = 0.2727, df = NA, p-value = 0.659

Notes:
The p-value is 0.624 for method #1 and 0.659 for method #2, indicating there is not sufficient evidence against independence.
The Pearson chi-square test for independence had a p-value of 0.6015. The reason for the general agreement between this test and the permutation test is that the sample size is large enough for the "asymptotic" distribution used (chi-square) to work as an approximate distribution for X². See the QQ-plot.
Notice the "discreteness" of the permutation distribution. Why do you think this is happening?

Below is a plot comparing the cumulative distribution functions (see program for code).

[Figure: exact (permutation) CDF of X² compared to the χ²(1 df) CDF]
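As mentioned above, the same permutation approach works for statistics that chisq.test() does not simulate, such as G². Below is a minimal sketch (not part of bird_perm.R) under the assumption that all.data has been created in raw form as in the Larry Bird example; the helper function g.sq() is my own and is not in the course programs.

g.sq <- function(tab) {
  mu.hat <- outer(rowSums(tab), colSums(tab)) / sum(tab)   #expected counts
  2 * sum(ifelse(tab > 0, tab * log(tab / mu.hat), 0))     #treat 0*log(0) as 0
}
obs.g <- g.sq(xtabs(~ all.data[,1] + all.data[,2]))
set.seed(1861)
B <- 1000
g.star <- replicate(B, {
  perm <- cbind(all.data[,1], sample(all.data[,2]))   #permute the column numbers
  g.sq(xtabs(~ perm[,1] + perm[,2]))
})
mean(g.star >= obs.g)   #permutation p-value based on G^2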
If you are interested in using exact inference for other problems outside of categorical data analysis, there is a nice software package which helps to automate these tests even more than in R. The software is made by the Cytel Corporation and is called StatXact. Also, PROC FREQ in SAS has an EXACT option that will do the test.

2.7 Association in three-way tables

More than two categorical variables may be of interest. In this setting, one can construct contingency tables summarizing the counts of these additional variables. Tests for independence between all of the variables, or between some of them conditional on the other variables, can be constructed. However, it is often more beneficial to look at these types of settings from a modeling point of view. Therefore, the discussion of these settings will mostly be postponed until we get to models that can handle them. What is next is an introduction to what a contingency table would look like for three categorical variables and some important things to look out for in this setting (e.g., Simpson's paradox).

In addition to the categorical variables X and Y, suppose there is a third categorical variable, Z, with k=1,...,K levels. Let n_ijk denote a cell count for the i-th row, j-th column, and k-th layer of a "three-way" contingency table. If X has I=2 levels and Y has J=2 levels, then the following is the contingency table for the counts:

        Z=1                         Z=2                   ...      Z=K
              Y                           Y                               Y
          1      2                    1      2                        1      2
X   1   n111   n121   n1+1      1   n112   n122   n1+2          1   n11K   n12K   n1+K
    2   n211   n221   n2+1      2   n212   n222   n2+2          2   n21K   n22K   n2+K
        n+11   n+21   n++1          n+12   n+22   n++2              n+1K   n+2K   n++K

Notes:
A third subscript is added to the n's to denote the Z variable.
There are other ways to display a "three-way" contingency table. See Table 2.10 of Agresti (2007) for an example.
This table could easily be extended to an I×J×K table.
The table could have also been written in terms of P(X = i, Y = j, Z = k) = π_ijk, or in terms of p_ijk = n_ijk/n, as well.
μ_ijk = E(n_ijk); i.e., the expected frequency for the i-th row, j-th column, and k-th layer.
Properties such as $\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}\pi_{ijk} = 1$ extend to the three-way table.

Z as the control variable

Z often plays the role of a "control" variable. In this case, the purpose is still to understand the relationship between X and Y while controlling for Z. In addition to Z being called a "layer" variable, Z is often called a "stratification" variable. Think of this as the categorical equivalent of an analysis for a randomized complete block design: the levels of X are the treatments, Y is the response, and the levels of Z are the blocks.

Example: Salk vaccine clinical trials

We had the following contingency table set up previously for this example.

            Polio   Polio free
Vaccine
Placebo

X is the drug (vaccine, placebo) and Y is the polio result (polio, polio free). Z could denote the clinical trial centers where the clinical trial takes place. Thus, we could have a separate Vaccine/Placebo × Polio/Polio free table for each center:

              Omaha                      N.Y.                      L.A.
          Polio   Polio free        Polio   Polio free        Polio   Polio free
Vaccine
Placebo

The table above is called a three-way table since three variables are represented in a contingency table format.
Odds ratios can also be found for a particular level of Z. Since there are three categorical variables, the variables of interest are put in the subscript, along with the level of the conditioning variable. For 2×2×K tables,

$$\theta_{XY(k)} = \frac{\mu_{11k}\,\mu_{22k}}{\mu_{12k}\,\mu_{21k}} \quad \text{with estimate} \quad \hat{\theta}_{XY(k)} = \frac{n_{11k}\,n_{22k}}{n_{12k}\,n_{21k}}$$

The notations θ_XY|k and θ_XY(k) are used interchangeably. One could also define P(X=i, Y=j, Z=k) / P(Z=k) = P(X=i, Y=j | Z=k) = π_ij|k and set up the odds ratio as

$$\theta_{XY|k} = \frac{\pi_{11|k}\,\pi_{22|k}}{\pi_{12|k}\,\pi_{21|k}}$$

Conditional and marginal associations

In the Salk vaccine clinical trial example, each individual 2×2 table that relates drug to polio result for a specific clinical trial center is called a "partial table". This is because each table represents "part" of the 2×2×K table. The 2×2 table examined earlier in this chapter (before clinical trial center was taken into account) is called a "marginal table" since it ignores clinical trial center. Remember how the word "margins" was used earlier to denote summing over a categorical variable.

The partial table associations (relationships) between X and Y are also called "conditional associations" since they are conditional on the level of Z. An example of a conditional association measure is θ_XY(k). The marginal table associations between X and Y can be called "marginal associations". An example of a marginal association is the odds ratio calculated in the 2×2 marginal table for the Salk vaccine clinical trial example:

$$\theta_{XY} = \frac{\mu_{11+}\,\mu_{22+}}{\mu_{12+}\,\mu_{21+}} \quad \text{and} \quad \hat{\theta}_{XY} = \frac{n_{11+}\,n_{22+}}{n_{12+}\,n_{21+}}$$

It is important to distinguish between the two types of association. The marginal association can be VERY different from the conditional associations! "Simpson's paradox" occurs when this happens.

Example: Simpson's paradox

This example comes from Appleton et al. (American Statistician, 1996, p. 340-341). There were 1,314 women in the UK who participated in a survey in 1972-4 and were then followed up twenty years later. Information about their age (in 1972-4), smoking status, and survival status was recorded. Below is a marginal table summarizing survival and smoking status.

                 Survival status
                 Dead    Alive
Smoker   Yes      139      443
         No       230      502

The estimated OR, θ̂_XY, is 0.68, and a 95% confidence interval for the population OR is (0.54, 0.88). Therefore, with 95% confidence, the odds of being dead for smokers are estimated to be between 0.54 and 0.88 times the odds for non-smokers. Alternatively, the odds of survival for smokers are between 1.14 and 1.87 times the odds for non-smokers with 95% confidence. Given this information, which would you prefer to be - a smoker or a non-smoker?

Now, let's take age into account. Each age group below has its own partial table of smoking status (Yes/No) by survival status (Dead/Alive):

 Age      Smoker: Dead, Alive    Non-smoker: Dead, Alive    OR-hat
 18-24            2     53                1      61          2.30
 25-34            3    121                5     152          0.75
 35-44           14     95                7     114          2.40
 45-54           27    103               12      66          1.44
 55-64           51     64               40      81          1.61
 65-74           29      7              101      28          1.15
 75+             13      0               64       0          0.21

(Since the 75+ table contains zero cells, its sample OR is not defined directly; the 0.21 shown matches what is obtained after adding 0.5 to each of its cells.)

Notice that most of these odds ratios are greater than 1, indicating the estimated odds of dying are larger for those who smoke than for those who do not smoke. For example, θ̂_XY(18-24) = 2.30. This contradicts the result found from the marginal table!

The most important item to get out of this example is to make sure you account for additional variables, because otherwise you could make incorrect conclusions. Read Agresti's (2007) death penalty example for another illustration of Simpson's paradox.
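The marginal and conditional odds ratios above can be verified directly in R. This is a sketch written for illustration only (it is not one of the course programs); the array name and layout are my own choices.

smoke <- array(c( 2,  1,  53,  61,   3,   5, 121, 152,  14,   7,  95, 114,
                 27, 12, 103,  66,  51,  40,  64,  81,  29, 101,   7,  28,
                 13, 64,   0,   0),
               dim = c(2, 2, 7),
               dimnames = list(Smoker = c("Yes", "No"),
                               Status = c("Dead", "Alive"),
                               Age = c("18-24","25-34","35-44","45-54","55-64","65-74","75+")))
#marginal table (sum over age) and its odds ratio - about 0.68
marg <- apply(smoke, c(1, 2), sum)
marg[1,1]*marg[2,2] / (marg[1,2]*marg[2,1])
#conditional (partial table) odds ratios; 75+ gives NaN because of its zero cells
apply(smoke, 3, function(x) x[1,1]*x[2,2] / (x[1,2]*x[2,1]))
#adding 0.5 to each cell of the 75+ table reproduces the 0.21 reported above
(13.5 * 0.5) / (0.5 * 64.5)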
Conditional independence

X is independent of Y at EACH level of Z; i.e., independence holds in each partial table. More formally, conditional independence can be written as

θ_XY(1) = θ_XY(2) = ... = θ_XY(K) = 1

for a 2×2×K table, or

π_ij|k = π_i+|k × π_+j|k for each i=1,...,I, j=1,...,J, and k=1,...,K.

What is π_i+|k? π_i+|k = Σ_j π_ij|k.

Marginal independence: θ_XY = 1.

See Agresti's (2007) example for another reason why not to look only at the marginal table. There are cases where the marginal and conditional associations are the same. These are discussed in Chapter 7 with respect to loglinear models.

Homogeneous X-Y association

X and Y have the same level of association across all levels of Z. For a 2×2×K table, this means the partial table ORs are the same, but not necessarily equal to 1:

θ_XY(1) = θ_XY(2) = ... = θ_XY(K)

This will be important when discussing the Cochran-Mantel-Haenszel test in Chapter 4.