1. Introduction
2010 Christopher R. Bilder

1.1 Categorical response data

What is a categorical (qualitative) variable? Examples:
- Field goal result – success or failure
- Patient survival – yes or no
- Criminal offense convictions – murder, robbery, assault, …
- Highest attained education level – HS, BS, MS, PhD
- Food for breakfast – cereal, bagel, eggs, …
- Annual income – <15,000, 15,000–<25,000, 25,000–<40,000, ≥40,000
- Religious affiliation

We live in a categorical world!

Types of categorical variables:
1) Ordinal – the categories have an ordering (education, income). There are ways to take advantage of the ordering. Agresti is currently updating his Analysis of Ordinal Categorical Data book on the subject!
2) Nominal – the categories do not have an ordering (food for breakfast, criminal offense convictions, religious affiliation).

Organization of Agresti (2007):
- Chapters 1-2: "Standard methods" of analysis, mostly developed prior to 1960
- Chapters 3-8: Modeling approaches such as logistic regression and loglinear models
- Chapters 9-10: More advanced applications – modeling correlated observations and incorporating random effects
- Chapter 11: Historical overview – see the "feud" between Pearson and Yule.

For a more in-depth discussion of categorical data analysis, see Agresti's (2002) book entitled Categorical Data Analysis.

1.2 Probability distributions for categorical data

Binomial distribution:

  P(Y = y) = \binom{n}{y} \pi^{y} (1-\pi)^{n-y}  for y = 0, 1, …, n

Notes:
- \binom{n}{y} = \frac{n!}{y!(n-y)!} is read "n choose y".
- Y is a random variable denoting the number of "successes" out of n trials; y denotes the observed value of Y.
- Y has a fixed number of possibilities – 0, 1, …, n – and n is a fixed constant.
- \pi is a parameter denoting the probability of a "success". It can take on values between 0 and 1.

Example: Field goal kicking

Suppose a field goal kicker attempts 5 field goals during a game and each field goal has the same probability of being successful (the kick is made).
Also, assume each field goal is attempted under similar conditions; i.e., distance, weather, surface, ….

Below are the characteristics that must be satisfied in order for the binomial distribution to be used:
1) There are n identical trials. (Here, n = 5 field goals attempted under the same conditions.)
2) There are two possible outcomes of a trial, typically referred to as a success or a failure. (Each field goal can be made – success – or missed – failure.)
3) The trials are independent of each other. (The result of one field goal does not affect the result of another field goal.)
4) The probability of success, denoted by \pi, remains constant for each trial; the probability of a failure is 1-\pi. (Suppose the probability a field goal is good is P(success) = \pi = 0.6.)
5) The random variable, Y, represents the number of successes. (Let Y = number of field goals that are good; thus, Y can be 0, 1, 2, 3, 4, or 5.)

Since these 5 items are satisfied, the binomial probability distribution can be used, and Y is called a binomial random variable.

Bernoulli distribution:

  P(Y = y) = \pi^{y} (1-\pi)^{1-y}  for y = 0 or 1

This is a special case of the binomial with n = 1. Suppose Y_1, Y_2, …, Y_n are independent Bernoulli random variables with probability of success \pi. Then X = \sum_{i=1}^{n} Y_i is a binomial random variable with n trials and \pi as the probability of success for each trial.

Mean and variance for a binomial random variable:

  E(Y) = \sum_{y=0}^{n} y P(Y = y)
       = \sum_{y=0}^{n} y \frac{n!}{y!(n-y)!} \pi^{y} (1-\pi)^{n-y}
       = n\pi \sum_{y=1}^{n} \frac{(n-1)!}{(y-1)!(n-y)!} \pi^{y-1} (1-\pi)^{n-y}   (the y = 0 term does not contribute to the sum)
       = n\pi \sum_{x=0}^{n-1} \frac{(n-1)!}{x!(n-1-x)!} \pi^{x} (1-\pi)^{n-1-x}   (where x = y-1)
       = n\pi \cdot 1   (a binomial distribution with n-1 trials is inside the sum)
       = n\pi

Similarly, Var(Y) = n\pi(1-\pi).

Example: Field goal kicking (binomial.R)

Suppose \pi = 0.6 and n = 5. What is the probability distribution?

  P(Y = 0) = \frac{5!}{0!(5-0)!} 0.6^{0} (1-0.6)^{5-0} = 0.4^{5} \approx 0.0102
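The rest of the probability table can be filled in the same way. As an illustrative cross-check (Python, standard library only; the R code that follows uses dbinom() for the same purpose, and the function name binom_pmf here is just for this sketch):

```python
from math import comb

def binom_pmf(y, n, pi):
    """P(Y = y) for a binomial(n, pi) random variable."""
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

n, pi = 5, 0.6
pmf = [binom_pmf(y, n, pi) for y in range(n + 1)]
print([round(p, 5) for p in pmf])   # same values as dbinom(0:5, 5, 0.6)

# Check E(Y) = n*pi and Var(Y) = n*pi*(1-pi) directly from the pmf
mean = sum(y * p for y, p in zip(range(n + 1), pmf))
var = sum((y - mean) ** 2 * p for y, p in zip(range(n + 1), pmf))
print(round(mean, 10), round(var, 10))   # 3.0 and 1.2
```

The mean and variance computed from the pmf agree with the closed forms n\pi = 3 and n\pi(1-\pi) = 1.2.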
For Y = 0, …, 5:

  Y        0       1       2       3       4       5
  P(Y=y)   0.0102  0.0768  0.2304  0.3456  0.2592  0.0778

E(Y) = n\pi = 5(0.6) = 3 and Var(Y) = n\pi(1-\pi) = 5(0.6)(1-0.6) = 1.2

R code and output:

> #n=5, pi=0.6
> pdf<-dbinom(x = 0:5, size = 5, prob = 0.6)
> pdf
[1] 0.01024 0.07680 0.23040 0.34560 0.25920 0.07776
> pdf<-dbinom(0:5, 5, 0.6)
> pdf
[1] 0.01024 0.07680 0.23040 0.34560 0.25920 0.07776
> #Make the printing look a little nicer
> save<-data.frame(Y = 0:5, prob = round(pdf, 4))
> save
  Y   prob
1 0 0.0102
2 1 0.0768
3 2 0.2304
4 3 0.3456
5 4 0.2592
6 5 0.0778
> plot(x = save$Y, y = save$prob, type = "h", xlab = "Y", ylab = "P(Y=y)", main = "Plot of a binomial distribution for n=5, pi=0.6", panel.first = grid(col="gray", lty="dotted"), lwd = 2)
> abline(h = 0)

[Figure: needle plot of P(Y=y) against Y = 0, …, 5 for the binomial distribution with n=5, pi=0.6]

Example: Simulating observations from a binomial distribution (binomial.R)

The purpose of this example is to show how one can "simulate" observing a random sample of observations from a population characterized by a binomial distribution. Why would someone want to do this?

Use the rbinom() function in R:

> #Generate observations from a binomial distribution
> set.seed(4848)
> bin5<-rbinom(n = 1000, size = 5, prob = 0.6)
> bin5[1:20]
 [1] 3 2 4 1 3 1 3 3 3 4 3 3 3 2 3 1 2 2 5 2
> mean(bin5)
[1] 2.991
> var(bin5)
[1] 1.236155
> hist(x = bin5, main = "Binomial with n=5, pi=0.6, 1000 bin. observations", col = "blue")

[Figure: histogram of the 1,000 simulated observations]

Notes: The "0" bar is a little off – hist() with its default breaks groups the 0s and 1s together. Here's how to fix this problem:

> save.count<-table(bin5)
> save.count
bin5
  0   1   2   3   4   5
 12  84 215 362 244  83
> barplot(height = save.count, names = c("0", "1", "2", "3", "4", "5"), main = "Binomial with n=5, pi=0.6, 1000 bin. observations", xlab = "x")

[Figure: bar plot of the simulated counts for x = 0, …, 5]

- The table() function counts the observations.
- The shape of the plot looks similar to the shape of the actual binomial distribution.
- The mean and variance are close to what we expect them to be!

Multinomial distribution

Instead of just two possible outcomes, suppose there are two or more. When this happens, a multinomial distribution is appropriate. The binomial distribution is the special case when there are only two possible outcomes.

Set up:
- n trials
- k different outcomes or categories
- n_i denotes the number of observed responses for category i (the count for the ith category) for i = 1, …, k
- \pi_i denotes the probability of observing the ith category – these are parameters

The multinomial distribution is:

  P(n_1, n_2, …, n_k) = \frac{n!}{n_1! \, n_2! \cdots n_k!} \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_k^{n_k}

Note that n = n_1 + n_2 + … + n_k.

Example: Highest college degree received

Outcomes: None, Associate, Bachelors, Masters, Doctorate. Suppose a random sample of size n is taken from a population of interest. Then n_1 denotes the number of people out of n with no college degree, n_2 denotes the number of people out of n with an associate degree as their highest, and so on.

Example: Sample from a multinomial distribution (multinomial_gen.R)

Suppose k = 5, \pi_1 = 0.25, \pi_2 = 0.35, \pi_3 = 0.2, \pi_4 = 0.1, and \pi_5 = 0.1.

> set.seed(2195)
> save<-rmultinom(n = 1, size = 1000, prob = c(0.25, 0.35, 0.2, 0.1, 0.1))
> save
     [,1]
[1,]  242
[2,]  333
[3,]  188
[4,]  122
[5,]  115
> save/1000
      [,1]
[1,] 0.242
[2,] 0.333
[3,] 0.188
[4,] 0.122
[5,] 0.115
> #Heights of the bars correspond to the columns, so need to transpose save
> barplot(height = t(save), names = c("1", "2", "3", "4", "5"), col = "red", main = "Sample from multinomial distribution with \n pi=(0.25, 0.35, 0.2, 0.1, 0.1)")

[Figure: bar plot of the multinomial sample counts for categories 1-5]

Notes:
- The first option in rmultinom() tells R how many sets of multinomial samples of size = 1000 (in this example) to find.
- The c() function combines or concatenates values into a vector.
- Multinomial sampling will be VERY important later in the course.
- The dmultinom() function can be used to find probabilities with the multinomial distribution. See the program for examples.

Question: Why examine probability distributions?

1.3 Statistical inference for a proportion

Introduction to maximum likelihood estimation (FG kicking)

Suppose the success or failure of a field goal in football can be modeled with a Bernoulli(\pi) distribution. Let Y = 0 if the field goal is a failure and Y = 1 if the field goal is a success. Then the probability distribution for Y is:

  P(Y = y) = \pi^{y} (1-\pi)^{1-y}

where \pi denotes the probability of success.

Suppose we would like to estimate \pi for a 40-yard field goal. Let y_1, …, y_n denote a random sample of observed field goal results at 40 yards; thus, the y_i's are either 0's or 1's. Given the resulting data (y_1, …, y_n), the "likelihood function" measures the plausibility of different values of \pi:

  L(\pi | y_1, …, y_n) = P(Y_1 = y_1) P(Y_2 = y_2) \cdots P(Y_n = y_n)
                       = \prod_{i=1}^{n} P(Y_i = y_i)
                       = \prod_{i=1}^{n} \pi^{y_i} (1-\pi)^{1-y_i}
                       = \pi^{\sum_{i=1}^{n} y_i} (1-\pi)^{n - \sum_{i=1}^{n} y_i}

Suppose \sum_{i=1}^{n} y_i = 4 and n = 10. Given this observed information, we would like to find the corresponding parameter value for \pi that produces the largest probability of obtaining this particular sample.
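One direct way to find this value is a grid search over \pi; the notes use a table and plot to the same end. A sketch in Python (standard library only; the function name likelihood is just for this illustration):

```python
# Evaluate L(pi) = pi^4 (1 - pi)^6 on a fine grid and locate its maximum.
sum_y, n = 4, 10

def likelihood(pi):
    return pi**sum_y * (1 - pi)**(n - sum_y)

grid = [i / 10000 for i in range(1, 10000)]   # pi values in (0, 1)
best = max(grid, key=likelihood)
print(best, round(likelihood(best), 6))       # 0.4 0.001194
```

The maximum lands at \pi = 0.4 with L(0.4) ≈ 0.001194, matching the table and plot below.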
The following table can be formed to help find this parameter value:

  \pi                    0.2       0.3       0.35      0.39      0.4       0.41      0.5
  L(\pi | y_1,…,y_n)     0.000419  0.000953  0.001132  0.001192  0.001194  0.001192  0.000977

[Figure: plot of the likelihood function L(\pi | y_1, …, y_n) against \pi for 0 ≤ \pi ≤ 1]

Note that \pi = 0.4 is the "most plausible" value of \pi for the observed data since this maximizes the likelihood function. Therefore, 0.4 is the maximum likelihood estimate (MLE).

In general, the maximum likelihood estimate can be found as follows:
1. Find the natural log of the likelihood function, log L(\pi | y_1, …, y_n).
2. Take the derivative of log L(\pi | y_1, …, y_n) with respect to \pi.
3. Set the derivative equal to 0 and solve for \pi to find the maximum likelihood estimate.

Note that the solution is the maximum of L(\pi | y_1, …, y_n) provided certain "regularity" conditions hold (see Mood, Graybill, and Boes, 1974).

For the field goal example:

  log L(\pi | y_1, …, y_n) = log[\pi^{\sum y_i} (1-\pi)^{n - \sum y_i}]
                           = \sum_{i=1}^{n} y_i \, log(\pi) + (n - \sum_{i=1}^{n} y_i) \, log(1-\pi)

where log means natural log. Then

  \frac{d}{d\pi} log L(\pi | y_1, …, y_n) = \frac{\sum_{i=1}^{n} y_i}{\pi} - \frac{n - \sum_{i=1}^{n} y_i}{1-\pi} = 0

Solving for \pi:

  (1-\pi) \sum_{i=1}^{n} y_i = \pi (n - \sum_{i=1}^{n} y_i)
  \sum_{i=1}^{n} y_i - \pi \sum_{i=1}^{n} y_i = \pi n - \pi \sum_{i=1}^{n} y_i
  \pi = \frac{\sum_{i=1}^{n} y_i}{n}

Therefore, the maximum likelihood estimator of \pi is the observed proportion of field goals made. To avoid confusion between a parameter and a statistic, we will denote the estimator as p = \sum_{i=1}^{n} y_i / n. Others will often put a ^ on \pi to denote the estimator; Agresti uses p. Agresti derives the estimator from a binomial point of view – see the top of p. 6 for the connection between the Bernoulli and binomial ways of looking at the problem.

Maximum likelihood estimation will be extremely important in this class!!! For additional examples with maximum likelihood estimation, please see Section 9.15 of my STAT 380 lecture notes at http://www.chrisbilder.com/stat380/schedule.htm.
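The large-sample variance of the MLE, developed next, comes from the curvature (second derivative) of the log-likelihood at the maximum. A quick finite-difference check in Python (standard library only; an illustration, with the function name loglik chosen for this sketch):

```python
from math import log

sum_y, n = 4, 10
p = sum_y / n   # the MLE

def loglik(pi):
    return sum_y * log(pi) + (n - sum_y) * log(1 - pi)

# Approximate the second derivative at the MLE by central differences
h = 1e-5
d2 = (loglik(p + h) - 2 * loglik(p) + loglik(p - h)) / h**2

var_from_curvature = -1 / d2
print(round(var_from_curvature, 6))   # ~0.024
print(round(p * (1 - p) / n, 6))      # 0.024
```

The inverse of the negative curvature reproduces the familiar p(1-p)/n = 0.4(0.6)/10 = 0.024.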
Properties of maximum likelihood estimators

- For a large sample, maximum likelihood estimators can be treated as normal random variables.
- For a large sample, the variance of the maximum likelihood estimator can be computed from the second derivative of the log-likelihood function.

Why do you think these two properties are important?

The phrase "for a large sample" can be replaced with the word "asymptotically". You will often hear these results described using the phrase "asymptotic normality of maximum likelihood estimators". You are not responsible for doing derivations like the one in the next example on exams; the example is helpful for understanding how R will be doing these and more complex calculations.

Example: Field goal kicking

  log L(\pi | y_1, …, y_n) = \sum_{i=1}^{n} y_i \, log(\pi) + (n - \sum_{i=1}^{n} y_i) \, log(1-\pi)

Then

  \frac{d}{d\pi} log L(\pi | y_1, …, y_n) = \frac{\sum_{i=1}^{n} y_i}{\pi} - \frac{n - \sum_{i=1}^{n} y_i}{1-\pi}

and

  \frac{d^2}{d\pi^2} log L(\pi | y_1, …, y_n) = -\frac{\sum_{i=1}^{n} y_i}{\pi^2} - \frac{n - \sum_{i=1}^{n} y_i}{(1-\pi)^2}

The large-sample variance of p is

  \left[ -E\left( \frac{d^2 \, log L(\pi | Y_1, …, Y_n)}{d\pi^2} \right) \right]^{-1}  evaluated at \pi = p.

To find this, note that only the Y_i's are random variables, so

  -E\left( -\frac{\sum Y_i}{\pi^2} - \frac{n - \sum Y_i}{(1-\pi)^2} \right) = \frac{E(\sum Y_i)}{\pi^2} + \frac{n - E(\sum Y_i)}{(1-\pi)^2}

Because E(\sum_{i=1}^{n} Y_i) = n\pi, this becomes

  \frac{n\pi}{\pi^2} + \frac{n - n\pi}{(1-\pi)^2} = \frac{n}{\pi} + \frac{n}{1-\pi} = \frac{n}{\pi(1-\pi)}

Since \pi is a parameter, we replace it with its corresponding estimator to obtain n/[p(1-p)]. Thus,

  \left[ -E\left( \frac{d^2 \, log L(\pi | Y_1, …, Y_n)}{d\pi^2} \right) \right]^{-1}_{\pi = p} = \frac{p(1-p)}{n}

This same result is derived on p. 474 of Casella and Berger (2002). See Chapter 18 of Ferguson (1996) for more on the "asymptotic normality" of maximum likelihood estimators.

Likelihood ratio test (LRT)

The LRT is a general way to test hypotheses. The LRT statistic, \Lambda, is the ratio of two likelihood functions. The numerator is the likelihood function maximized over the parameter space restricted under the null hypothesis.
The denominator is the likelihood function maximized over the unrestricted parameter space. The test statistic is written as:

  \Lambda = \frac{\text{Max. lik. when parameters satisfy } H_0}{\text{Max. lik. when parameters satisfy } H_0 \text{ or } H_a}

Wilks (1935, 1938) showed that, for a large sample and under H_0, -2log(\Lambda) can be approximated by a \chi^2_u distribution, where u is the difference in dimension between the alternative and null hypothesis parameter spaces. See Casella and Berger (2002) for more on the LRT. See the chi-square (\chi^2) distribution review in the Chapter 1 additional notes if needed.

Example: Field goal kicking

Continuing the field goal example, suppose the hypothesis test H_0: \pi = 0.5 vs. H_a: \pi \ne 0.5 is of interest. Remember that \sum y_i = 4 and n = 10.

The numerator of \Lambda is the maximum possible value of the likelihood function under the null hypothesis. Since \pi = 0.5 is the null hypothesis, the maximum can be found by just substituting \pi = 0.5 into the likelihood function:

  L(\pi = 0.5 | y_1, …, y_n) = 0.5^{\sum y_i} (1-0.5)^{n - \sum y_i} = 0.5^{4} (0.5)^{10-4} = 0.0009766

The denominator of \Lambda is the maximum possible value of the likelihood function under the null OR alternative hypotheses. Since this includes all possible values of \pi here, the maximum is achieved when the maximum likelihood estimate is substituted for \pi in the likelihood function! As shown previously, the maximum value is 0.001194. Therefore,

  \Lambda = \frac{0.0009766}{0.001194} = 0.8179

Then -2log(\Lambda) = -2log(0.8179) = 0.4020 is the test statistic value. The critical value is \chi^2_{1, 0.95} = 3.84 using \alpha = 0.05. There is not sufficient evidence to reject the hypothesis that \pi = 0.5.

Questions:
- Suppose the ratio is close to 0. What does this say about H_0? Explain.
- Suppose the ratio is close to 1. What does this say about H_0? Explain.
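The LRT arithmetic above is easy to reproduce; here is an illustrative check in Python (standard library only). Small differences in the last digits relative to the notes come from rounding the likelihood values before dividing:

```python
from math import log

sum_y, n, pi0 = 4, 10, 0.5

def likelihood(pi):
    return pi**sum_y * (1 - pi)**(n - sum_y)

p = sum_y / n                            # unrestricted MLE
lam = likelihood(pi0) / likelihood(p)    # LRT ratio (numerator under H0)
stat = -2 * log(lam)
print(round(lam, 4), round(stat, 4))     # 0.8176 0.4027
```

Since 0.4027 < 3.84 = \chi^2_{1, 0.95}, the same conclusion is reached: do not reject H_0.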
Confidence interval for the population proportion

A typical form of a confidence interval for a parameter is

  estimator ± (distributional value) × (standard deviation of the estimator)

In this case, p = \sum_{i=1}^{n} y_i / n is the estimator, the distribution is normal, and the estimated standard deviation is \sqrt{p(1-p)/n}. Thus, the approximate (1-\alpha)100% confidence interval for \pi is

  p \pm Z_{1-\alpha/2} \sqrt{\frac{p(1-p)}{n}}

where Z_{1-\alpha/2} is the 1-\alpha/2 quantile from a standard normal distribution. Confidence intervals which use the asymptotic normality of a maximum likelihood estimator are called "Wald" confidence intervals.

Example: Field goal kicking

Suppose \sum_{i=1}^{n} y_i = 4 and n = 10. Then the 95% confidence interval is

  p \pm Z_{1-\alpha/2} \sqrt{\frac{p(1-p)}{n}} = 0.4 \pm 1.96 \sqrt{\frac{0.4(1-0.4)}{10}}

  0.0964 ≤ \pi ≤ 0.7036

Problems!!!
1) Remember that the interval "works" if the sample size is large. The field goal kicking example has only n = 10!
2) The discreteness of the binomial distribution often makes the normal approximation work poorly, even with large samples. The result is a confidence interval that is often too "liberal": when 95% is stated as the confidence level, the true confidence level is often lower.

The problems with this particular confidence interval have been discussed for a long time in the statistical literature. A few recent papers on this topic include:

- Agresti, A. and Coull, B. A. (1998). Approximate is better than "exact" for interval estimation of binomial proportions. The American Statistician 52(2), 119-126.
- Agresti, A. and Caffo, B. (2000). Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. The American Statistician 54(4), 280-288.
- Agresti, A. and Min, Y. (2001). On small-sample confidence intervals for parameters in discrete distributions. Biometrics 57, 963-971.
- Blaker, H. (2000). Confidence curves and improved exact confidence intervals for discrete distributions. The Canadian Journal of Statistics 28, 783-798.
- Blaker, H. (2001). Corrigenda: Confidence curves and improved exact confidence intervals for discrete distributions. The Canadian Journal of Statistics 29, 681.
- Borkowf, C. B. (2006). Constructing binomial confidence intervals with near nominal coverage by adding a single imaginary failure or success. Statistics in Medicine 25, 3679-3695.
- Brown, L. D., Cai, T. T., and DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science 16(2), 101-133.
- Henderson, M. and Meyer, M. (2001). Exploring the confidence interval for a binomial parameter in a first course in statistical computing. The American Statistician 55(4), 337-344.
- Newcombe, R. G. (2001). Logit confidence intervals and the inverse sinh transformation. The American Statistician 55(3), 200-202.
- Vos, P. W. and Hudson, S. (2005). Evaluation criteria for discrete confidence intervals: beyond coverage and length. The American Statistician 59(2), 137-142.

Brown et al. (2001) serves as a summary of all the proposed methods and gives the following recommendations:

For n ≥ 40, use the "Agresti and Coull" (1998) interval. Let

  \tilde{p} = \frac{\sum_{i=1}^{n} y_i + Z_{1-\alpha/2}^2/2}{n + Z_{1-\alpha/2}^2}

be an adjusted estimate of \pi. The approximate (1-\alpha)100% confidence interval is

  \tilde{p} \pm Z_{1-\alpha/2} \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n + Z_{1-\alpha/2}^2}}

Note that when \alpha = 0.05, Z_{1-\alpha/2} = 1.96 \approx 2. Then

  \tilde{p} \approx \frac{\sum_{i=1}^{n} y_i + 2}{n + 4}

Thus, two successes and two failures are added. Agresti and Coull (1998) refer to this as an "adjusted" Wald confidence interval. When n < 40, the Agresti and Coull interval is generally still better than the Wald interval.

For n < 40, use the Wilson or Jeffrey's prior interval.

Wilson interval:

  \tilde{p} \pm \frac{Z_{1-\alpha/2} \, n^{1/2}}{n + Z_{1-\alpha/2}^2} \sqrt{p(1-p) + \frac{Z_{1-\alpha/2}^2}{4n}}

Where does this come from? Consider the hypothesis test of H_0: \pi = \pi_0 vs. H_a: \pi \ne \pi_0 using the test statistic

  Z = \frac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}

The limits of the Wilson confidence interval come from "inverting" the test.
This means finding the set of \pi_0 such that

  Z_{\alpha/2} \le \frac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \le Z_{1-\alpha/2}

is satisfied. Through solving a quadratic equation (see #1.18 in the homework), the interval limits are derived.

Jeffrey's interval: Please see the additional notes for Chapter 1. This interval will not be on a test, but could be on a project.

Example: Field goal kicking (fg.R)

Below is the code used to calculate each confidence interval.

> sum.y<-4
> n<-10
> alpha<-0.05
> p<-sum.y/n
> #Wald C.I.
> lower<-p-qnorm(p = 1-alpha/2)*sqrt(p*(1-p)/n)
> upper<-p+qnorm(p = 1-alpha/2)*sqrt(p*(1-p)/n)
> cat("The Wald C.I. is:", round(lower,4), "<= pi <=", round(upper,4), "\n \n")
The Wald C.I. is: 0.0964 <= pi <= 0.7036

> #Agresti and Coull C.I.
> p.tilde<-(sum.y+qnorm(p = 1-alpha/2)^2/2)/(n+qnorm(p = 1-alpha/2)^2)
> lower<-p.tilde-qnorm(p = 1-alpha/2)*sqrt(p.tilde*(1-p.tilde)/(n+qnorm(p = 1-alpha/2)^2))
> upper<-p.tilde+qnorm(p = 1-alpha/2)*sqrt(p.tilde*(1-p.tilde)/(n+qnorm(p = 1-alpha/2)^2))
> cat("The Agresti and Coull C.I. is:", round(lower,4), "<= pi <=", round(upper,4), "\n \n")
The Agresti and Coull C.I. is: 0.1671 <= pi <= 0.6884

> #Wilson C.I.
> lower<-p.tilde-qnorm(p = 1-alpha/2)*sqrt(n)/(n+qnorm(p = 1-alpha/2)^2) * sqrt(p*(1-p)+qnorm(p = 1-alpha/2)^2/(4*n))
> upper<-p.tilde+qnorm(p = 1-alpha/2)*sqrt(n)/(n+qnorm(p = 1-alpha/2)^2) * sqrt(p*(1-p)+qnorm(p = 1-alpha/2)^2/(4*n))
> cat("The Wilson C.I. is:", round(lower,4), "<= pi <=", round(upper,4), "\n \n")
The Wilson C.I. is: 0.1682 <= pi <= 0.6873

The intervals produced are:

  The Wald C.I. is: 0.0964 <= pi <= 0.7036
  The Agresti and Coull C.I. is: 0.1671 <= pi <= 0.6884
  The Wilson C.I. is: 0.1682 <= pi <= 0.6873

How useful would these confidence intervals be?

Note that Agresti now gives code for some of these intervals at http://www.stat.ufl.edu/~aa/cda/R/one_sample/R1/index.html. This code contains actual functions to perform the calculations.
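The three intervals in fg.R can also be reproduced in a few lines of Python (standard library only; statistics.NormalDist supplies the normal quantile that qnorm() provides in R). This is an illustrative cross-check, not part of the course code:

```python
from math import sqrt
from statistics import NormalDist

sum_y, n, alpha = 4, 10, 0.05
p = sum_y / n
z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96

# Wald interval: p +/- z * sqrt(p(1-p)/n)
half = z * sqrt(p * (1 - p) / n)
wald = (p - half, p + half)

# Agresti and Coull interval: adjusted estimate p~, same form as Wald
p_tilde = (sum_y + z**2 / 2) / (n + z**2)
half_ac = z * sqrt(p_tilde * (1 - p_tilde) / (n + z**2))
ac = (p_tilde - half_ac, p_tilde + half_ac)

# Wilson interval: same center p~, different half-width
half_w = z * sqrt(n) / (n + z**2) * sqrt(p * (1 - p) + z**2 / (4 * n))
wilson = (p_tilde - half_w, p_tilde + half_w)

for name, ci in [("Wald", wald), ("AC", ac), ("Wilson", wilson)]:
    print(name, round(ci[0], 4), round(ci[1], 4))
```

The printed limits match the R output: 0.0964-0.7036 (Wald), 0.1671-0.6884 (Agresti and Coull), and 0.1682-0.6873 (Wilson).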
Also, the binom package in R provides a simple function to do these calculations as well. Here is an example of how I used its function:

> library(binom)
> binom.confint(x = sum.y, n = n, conf.level = 1-alpha, methods = "all")
          method x  n      mean      lower     upper
1  agresti-coull 4 10 0.4000000 0.16711063 0.6883959
2     asymptotic 4 10 0.4000000 0.09636369 0.7036363
3          bayes 4 10 0.4090909 0.14256735 0.6838697
4        cloglog 4 10 0.4000000 0.12269317 0.6702046
5          exact 4 10 0.4000000 0.12155226 0.7376219
6          logit 4 10 0.4000000 0.15834201 0.7025951
7         probit 4 10 0.4000000 0.14933907 0.7028372
8        profile 4 10 0.4000000 0.14570633 0.6999845
9            lrt 4 10 0.4000000 0.14564246 0.7000216
10     prop.test 4 10 0.4000000 0.13693056 0.7263303
11        wilson 4 10 0.4000000 0.16818033 0.6873262

Below is a comparison of the performance of the four confidence intervals (Brown et al., 2001). The values on the y-axis represent the true confidence level (coverage) of the confidence intervals. Each of the confidence intervals is supposed to be 95%!

[Figure: coverage plots for the four confidence intervals, from Brown et al. (2001)]

What does the "true confidence" or "coverage" level mean? Suppose a random sample of size n = 50 is taken from a population and a 95% Wald confidence interval is calculated. Suppose another random sample of size n = 50 is taken from the same population and a 95% Wald confidence interval is calculated. Suppose this process is repeated 10,000 times. We would expect 9,500 out of the 10,000 (95%) confidence intervals to contain \pi. Unfortunately, this does not often happen; for the Wald interval, it is only guaranteed to happen as n → ∞. The true confidence or coverage level is the percent of the confidence intervals that contain, or "cover", \pi.

How can this "true confidence" or "coverage" level be calculated? Simulate the process of taking 10,000 samples and calculating the confidence interval each time using a computer. This is called a Monte Carlo simulation.
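The same coverage quantity can also be computed exactly, without simulation, by summing the binomial probabilities of all y values whose interval covers \pi. A sketch in Python (standard library only; the function name wald_coverage and the hard-coded 95% quantile are assumptions of this illustration):

```python
from math import comb, sqrt

def wald_coverage(n, pi, z=1.959963984540054):
    """Exact coverage of the nominal 95% Wald interval: sum P(Y = y)
    over every y whose interval (y/n +/- z*sqrt(...)) contains pi."""
    cover = 0.0
    for y in range(n + 1):
        p = y / n
        half = z * sqrt(p * (1 - p) / n)
        if p - half < pi < p + half:
            cover += comb(n, y) * pi**y * (1 - pi)**(n - y)
    return cover

print(round(wald_coverage(n=50, pi=0.184), 4))   # about 0.90, well below 0.95
```

For n = 50 and \pi = 0.184 this gives roughly 0.90, agreeing with the Monte Carlo estimate in CI_sim.R.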
The plots of the previous page do this for many possible values of \pi (0.0005 to 0.9995, by 0.0005). For example, the coverage using the Wald interval is approximately 0.90 for \pi = 0.184.

Note: The authors of the paper actually use a different method which does not require computer simulation (see the additional Chapter 1 notes for further information). However, computer simulation will provide just about the same answer if the process (take a sample, calculate the C.I.) is repeated a large number of times. Plus, simulation can be used in many more situations where it is the only option.

Brown et al. (2001) also discuss other confidence intervals. After the paper, there is a set of responses by other statisticians. Each agrees the Wald interval should NOT be used; however, they are not all in agreement about which other interval to use!

Example: Calculate the estimated true confidence level for the Wald interval (CI_sim.R)

##############################################################
# NAME: Chris Bilder                                         #
# DATE: 1-23-02                                              #
# UPDATE: 12-7-03                                            #
# PURPOSE: C.I. pi simulation                                #
# NOTES: See plot on p. 107 of Brown, et al. (2001)          #
##############################################################

############################################################
# Do one time

> n <- 50
> pi <- 0.184
> sum.y <- rbinom(n = 1, size = n, prob = pi)
> sum.y
[1] 6
> alpha <- 0.05
> p <- sum.y/n
> #Wald C.I.
> lower <- p - qnorm(p = 1 - alpha/2) * sqrt((p * (1 - p))/n)
> upper <- p + qnorm(p = 1 - alpha/2) * sqrt((p * (1 - p))/n)
> cat("The Wald C.I. is:", round(lower, 4), "<= pi <=", round(upper, 4))
The Wald C.I. is: 0.0299 <= pi <= 0.2101

###########################################################
# Repeat 10,000 times

> numb <- 10000
> n <- 50
> pi <- 0.184
> set.seed(4516)
> sum.y <- rbinom(n = numb, size = n, prob = pi)
> alpha <- 0.05
> p <- sum.y/n
> #Wald C.I.
> lower <- p - qnorm(p = 1 - alpha/2) * sqrt((p * (1 - p))/n)
> upper <- p + qnorm(p = 1 - alpha/2) * sqrt((p * (1 - p))/n)
> save <- ifelse(test = pi > lower, yes = ifelse(test = pi < upper, yes = 1, no = 0), no = 0)
> true.conf <- mean(save)
> cat("An estimate of the true confidence level is:", round(true.conf, 4))
An estimate of the true confidence level is: 0.9041

Important: Notice how the calculations are performed using vectors. What we are doing here is actually estimating the probabilities for \sum_{i=1}^{n} y_i = 0, 1, …, n through the use of Monte Carlo simulation. We sum up the probabilities for the values of \sum_{i=1}^{n} y_i whose corresponding confidence interval contains \pi. Below is some R code demonstrating this.

> table(sum.y)
sum.y
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   22
   2   23   90  222  473  759 1180 1319 1441 1341 1091  879  535  324  172   75   52   12    9    1
> temp<-table(sum.y)/numb
> possible.sum.y<-as.integer(names(table(sum.y)))
> p.possible<-possible.sum.y/n
> lower<-p.possible-qnorm(p = 1-alpha/2)*sqrt(p.possible*(1-p.possible)/n)
> upper<-p.possible+qnorm(p = 1-alpha/2)*sqrt(p.possible*(1-p.possible)/n)
> data.frame(lower, upper, from.table=temp)
          lower      upper from.table.sum.y from.table.Freq
1  -0.018805307 0.05880531                1          0.0002
2  -0.014316115 0.09431612                2          0.0023
3  -0.005826784 0.12582678                3          0.0090
4   0.004802744 0.15519726                4          0.0222
5   0.016845771 0.18315423                5          0.0473
6   0.029926913 0.21007309                6          0.0759
7   0.043821869 0.23617813                7          0.1180
8   0.058383853 0.26161615                8          0.1319
9   0.073510628 0.28648937                9          0.1441
10  0.089127694 0.31087231               10          0.1341
11  0.105178893 0.33482111               11          0.1091
12  0.121620771 0.35837923               12          0.0879
13  0.138419025 0.38158098               13          0.0535
14  0.155546145 0.40445385               14          0.0324
15  0.172979816 0.42702018               15          0.0172
16  0.190701784 0.44929822               16          0.0075
17  0.208697040 0.47130296               17          0.0052
18  0.226953233 0.49304677               18          0.0012
19  0.245460214 0.51453979               19          0.0009
20  0.302411087 0.57758891               22          0.0001
> sum(ifelse(test = pi>lower, yes = ifelse(test = pi<upper, yes = 1, no = 0), no = 0)*temp)
[1] 0.9041
> dbinom(x = 10, size = 50, prob = 0.184)
[1] 0.1341070   #About the same as the estimate by MC simulation

Using this idea, one could find the EXACT true confidence level without Monte Carlo simulation! See the additional Chapter 1 notes for details. How can you perform this simulation for more than one \pi and then produce a coverage plot similar to the Brown et al. (2001) plots shown earlier?

True confidence level oscillation

Why does the true confidence level (coverage) oscillate? The reason involves the discreteness of a binomial random variable. Please see my additional Chapter 1 notes for a more in-depth discussion.

Clopper-Pearson interval

First, we need to discuss what a beta probability distribution represents. Let w be a random variable from a beta distribution. Then

  f(w; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} w^{a-1} (1-w)^{b-1}

for 0 < w < 1, a > 0, and b > 0. Note that \Gamma(\cdot) is the "gamma" function, where \Gamma(c) = (c-1)! for an integer c. The gamma function is more generally defined as

  \Gamma(c) = \int_0^{\infty} x^{c-1} e^{-x} dx  for c > 0.

The quantile, denoted by w_{\alpha} = Beta(\alpha; a, b), satisfies

  \int_0^{w_{\alpha}} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} w^{a-1} (1-w)^{b-1} dw = \alpha

Below are plots of various beta distributions from Welsh (1996). The left extreme of each plot denotes 0 and the right extreme of each plot denotes 1, because 0 < w < 1 by definition for the distribution.

[Figure: example beta density plots from Welsh (1996)]

The (1-\alpha)100% Clopper-Pearson confidence interval is

  Beta(\alpha/2; \textstyle\sum y_i, \, n - \sum y_i + 1) \;\le\; \pi \;\le\; Beta(1-\alpha/2; \textstyle\sum y_i + 1, \, n - \sum y_i)

This is derived using the relationship between the binomial and beta distributions (see problem #2.40 on p. 82 of Casella and Berger, 2002). If \sum y_i = 0, the lower endpoint is taken to be 0; if \sum y_i = n, the upper endpoint is taken to be 1.

Often, the interval below is given instead:

  \left[ 1 + \frac{(n - \sum y_i + 1) \, F_{2(n - \sum y_i + 1), \, 2\sum y_i, \, 1-\alpha/2}}{\sum y_i} \right]^{-1} \le \pi \le \left[ 1 + \frac{n - \sum y_i}{(\sum y_i + 1) \, F_{2(\sum y_i + 1), \, 2(n - \sum y_i), \, 1-\alpha/2}} \right]^{-1}

where F_{a, b, 1-\alpha} is the 1-\alpha quantile from an F distribution with a numerator and b denominator degrees of freedom. The two intervals are the same, and one can use the relationship between beta and F random variables to show this. Problem #9.21 of Casella and Berger (2002, p. 454) discusses its derivation.

This Clopper-Pearson (CP) interval is GUARANTEED to have a true confidence (coverage) level of at least 1-\alpha! While this is good, the confidence interval is usually wider than other intervals. Why would a wide interval be bad? Brown et al. (2001, p. 113) say this interval is "wastefully conservative and is not a good choice for practical use". Examine the coverage plot for it (n = 50, p. 113 of Brown et al., 2001):

[Figure: coverage plot for the Clopper-Pearson interval, n = 50, from Brown et al. (2001)]

Maybe this is too harsh? Intervals like the Clopper-Pearson are called "exact" confidence intervals because they use the exact distribution of \sum_{i=1}^{n} y_i (binomial). This does not mean their true confidence level is exactly (1-\alpha)100%, as you can see above.

Example: Field goal kicking (fg.R)

Below is the code used to calculate the Clopper-Pearson confidence interval.

> #Clopper-Pearson
> lower<-qbeta(p = alpha/2, shape1 = sum.y, shape2 = n-sum.y+1)
> upper<-qbeta(p = 1-alpha/2, shape1 = sum.y+1, shape2 = n-sum.y)
> cat("The C-P C.I. is:", round(lower,4), "<= pi <=", round(upper,4), "\n \n")
The C-P C.I. is: 0.1216 <= pi <= 0.7376

The binom package calls the Clopper-Pearson interval the "exact" interval.
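Python's standard library has no beta quantile function like qbeta(), but the Clopper-Pearson limits can be found directly from their defining binomial tail equations by bisection. A sketch (illustrating the definition, not how R computes it; the function names are assumptions of this example):

```python
from math import comb

def binom_cdf(y, n, pi):
    """P(Y <= y) for a binomial(n, pi) random variable."""
    return sum(comb(n, k) * pi**k * (1 - pi)**(n - k) for k in range(y + 1))

def clopper_pearson(y, n, alpha=0.05, tol=1e-10):
    """Lower limit solves P(Y >= y) = alpha/2; upper solves P(Y <= y) = alpha/2."""
    def bisect(f, lo, hi):
        # f is increasing in pi on (lo, hi); shrink the bracket around its root
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if f(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if y == 0 else bisect(
        lambda pi: (1 - binom_cdf(y - 1, n, pi)) - alpha / 2, 0.0, 1.0)
    upper = 1.0 if y == n else bisect(
        lambda pi: alpha / 2 - binom_cdf(y, n, pi), 0.0, 1.0)
    return lower, upper

print(tuple(round(v, 4) for v in clopper_pearson(4, 10)))   # (0.1216, 0.7376)
```

For \sum y_i = 4 and n = 10 this reproduces the qbeta() limits 0.1216 and 0.7376.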
> library(binom)
> binom.confint(x = sum.y, n = n, conf.level = 1-alpha, methods = "exact")
  method x  n mean     lower     upper
1  exact 4 10  0.4 0.1215523 0.7376219

A summary of all of the intervals:

  Name                Lower   Upper
  Wald                0.0964  0.7036
  Agresti and Coull   0.1671  0.6884
  Wilson              0.1682  0.6873
  Clopper-Pearson     0.1216  0.7376
  Jeffrey's           0.1531  0.6963
  Blaker              0.1500  0.7171

There are many other confidence intervals for a proportion! For example, Agresti (2007) discusses a "mid-p" confidence interval that adjusts the Clopper-Pearson interval for being too conservative. However, this interval is no longer guaranteed to have a true confidence level of at least 1-\alpha. This interval and many, many others are discussed in the papers referenced earlier.