1. Introduction
1.1 Categorical response data
What is a categorical (qualitative) variable?
- Field goal result – success or failure
- Patient survival – yes or no
- Criminal offense convictions – murder, robbery, assault, …
- Highest attained education level – HS, BS, MS, PhD
- Food for breakfast – cereal, bagel, eggs, …
- Annual income – <15,000, 15,000–<25,000, 25,000–<40,000, ≥40,000
- Religious affiliation
We live in a categorical world!
Types of categorical variables
1) Ordinal – categories have an ordering
Education, income
There are ways to take advantage of the ordering. Agresti
is currently updating his Analysis of Ordinal Categorical
Data book on the subject!
2) Nominal – categories do not have an ordering
 2010 Christopher R. Bilder
1.2
Food for breakfast, criminal offense convictions, religious
affiliation
Organization of Agresti (2007)
Chapters 1-2: “Standard methods” of analysis mostly
developed prior to 1960
Chapters 3-8: Modeling approaches such as logistic
regression and loglinear models
Chapters 9-10: More advanced applications – modeling
correlated observations and incorporating random
effects
Chapter 11: Historical overview – See the “feud”
between Pearson and Yule.
For a more in depth discussion of categorical data
analysis, see Agresti’s (2002) book entitled Categorical
Data Analysis.
 2010 Christopher R. Bilder
1.3
1.2 Probability distributions for categorical data
Binomial distribution
Binomial distribution: $P(Y=y) = \frac{n!}{y!(n-y)!}\pi^{y}(1-\pi)^{n-y}$ for y = 0, 1, …, n

Notes:
- $\binom{n}{y} = \frac{n!}{y!(n-y)!}$ is read "n choose y"
- Y is a random variable denoting the number of "successes" out of n trials
- y denotes the observed value of Y
- Y has a fixed number of possibilities – 0, 1, …, n
- n is a fixed constant
- π is a parameter denoting the probability of a "success". It can take on values between 0 and 1.
Example: Field goal kicking
Suppose a field goal kicker attempts 5 field goals during
a game and each field goal has the same probability of
being successful (the kick is made). Also, assume each
field goal is attempted under similar conditions; i.e.,
distance, weather, surface,….
 2010 Christopher R. Bilder
1.4
Below are the characteristics that must be satisfied in
order for the binomial distribution to be used.
1) There are n identical trials.
n=5 field goals attempted under the exact same
conditions
2) Two possible outcomes of a trial. These are typically
referred to as a success or failure.
Each field goal can be made (success) or missed
(failure)
3) The trials are independent of each other.
The result of one field goal does not affect the result
of another field goal.
4) The probability of success, denoted by π, remains constant for each trial. The probability of a failure is 1-π.
Suppose the probability a field goal is good is 0.6; i.e., P(success) = π = 0.6.
5) The random variable, Y, represents the number of
successes.
 2010 Christopher R. Bilder
1.5
Let Y = number of field goals that are good. Thus, Y can be 0, 1, 2, 3, 4, or 5.
Since these 5 items are satisfied, the binomial probability
distribution can be used and Y is called a binomial
random variable.
Bernoulli distribution: $P(Y=y) = \pi^{y}(1-\pi)^{1-y}$ for y = 0 or 1

This is a special case of the binomial with n = 1.

Suppose Y₁, Y₂, …, Yₙ are independent Bernoulli random variables with probability of success π. Then $X = \sum_{i=1}^{n} Y_i$ is a binomial random variable with n trials and π as the probability of success for each trial.
Mean and variance for a binomial random variable

$E(Y) = \sum_{y=0}^{n} y\,P(Y=y) = \sum_{y=0}^{n} y\,\frac{n!}{y!(n-y)!}\pi^{y}(1-\pi)^{n-y}$

$= \sum_{y=1}^{n} \frac{n!}{(y-1)!(n-y)!}\pi^{y}(1-\pi)^{n-y}$ since y = 0 does not contribute to the sum

$= n\pi \sum_{y=1}^{n} \frac{(n-1)!}{(y-1)!(n-y)!}\pi^{y-1}(1-\pi)^{n-y}$

$= n\pi \sum_{x=0}^{n-1} \frac{(n-1)!}{x!(n-x-1)!}\pi^{x}(1-\pi)^{n-x-1}$ where x = y-1

$= n\pi \cdot 1$ since a binomial distribution with n-1 trials is inside the sum

$= n\pi$

Var(Y) = nπ(1-π)
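These formulas can be checked numerically. The snippet below is just a sketch (it is not part of binomial.R); it computes E(Y) and Var(Y) directly from the probabilities for the n = 5, π = 0.6 setting used in the next example:

> #Sketch: check E(Y) = n*pi and Var(Y) = n*pi*(1-pi) numerically
> y <- 0:5
> pmf <- dbinom(x = y, size = 5, prob = 0.6)
> sum(y * pmf)           #E(Y) = 5*0.6 = 3
[1] 3
> sum((y - 3)^2 * pmf)   #Var(Y) = 5*0.6*(1-0.6) = 1.2
[1] 1.2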
Example: Field goal kicking (binomial.R)
Suppose =0.6, n=5. What is the probability distribution?
P(Y=0) =
5!
0.60 (1  0.6)50  0.45  0.0102
0!(5  0)!
For Y=0,…,5:
Y
0
1
2
3
P(Y=y)
0.0102
0.0768
0.2304
0.3456
 2010 Christopher R. Bilder
1.7
Y P(Y=y)
4 0.2592
5 0.0778
E(Y) = n = 50.6 = 3 and
Var(Y) = n(1-) = 50.6(1-0.6) = 1.2
R code and output:
> #N=5, pi=0.6
> pdf<-dbinom(x = 0:5, size = 5, prob = 0.6)
> pdf
[1] 0.01024 0.07680 0.23040 0.34560 0.25920 0.07776
> pdf <- dbinom(0:5, 5, 0.6)
> pdf
[1] 0.01024 0.07680 0.23040 0.34560 0.25920 0.07776
> #Make the printing look a little nicer
> save <- data.frame(Y = 0:5, prob = round(pdf, 4))
> save
  Y   prob
1 0 0.0102
2 1 0.0768
3 2 0.2304
4 3 0.3456
5 4 0.2592
6 5 0.0778
> plot(x = save$Y, y = save$prob, type = "h", xlab = "Y", ylab = "P(Y=y)", main = "Plot of a binomial distribution for n=5, pi=0.6", panel.first = grid(col="gray", lty="dotted"), lwd = 2)
> abline(h = 0)
 2010 Christopher R. Bilder
1.8
[Figure: plot of the binomial distribution for n = 5, π = 0.6, with vertical lines of height P(Y=y) at Y = 0, 1, …, 5]
Example: Simulating observations from a binomial
distribution (binomial.R)
The purpose of this example is to show how one can
“simulate” observing a random sample of observations
from a population characterized by a binomial
distribution.
Why would someone want to do this?
Use the rbinom() function in R.
 2010 Christopher R. Bilder
1.9
> #Generate observations from a Binomial distribution
> set.seed(4848)
> bin5<-rbinom(n = 1000, size = 5, prob = 0.6)
> bin5[1:20]
[1] 3 2 4 1 3 1 3 3 3 4 3 3 3 2 3 1 2 2 5 2
> mean(bin5)
[1] 2.991
> var(bin5)
[1] 1.236155
> hist(x = bin5, main = "Binomial with n=5, pi=0.6, 1000 bin. observations", col = "blue")
[Figure: histogram titled "Binomial with n=5, pi=0.6, 1000 bin. observations"; frequency of each simulated value of bin5]
Notes:
- The “0” is a little off. Here’s how to fix this problem:
> save.count<-table(bin5)
> save.count
bin5
   0    1    2    3    4    5 
  12   84  215  362  244   83 
> barplot(height = save.count, names = c("0", "1", "2", "3", "4", "5"), main = "Binomial with n=5, pi=0.6, 1000 bin. observations", xlab = "x")
[Figure: bar plot titled "Binomial with n=5, pi=0.6, 1000 bin. observations"; counts for x = 0, 1, …, 5]
The table() function counts the observations.
- The shape of the histogram looks similar to the shape of the actual binomial distribution.
- The mean and variance are close to what we expect them to be!
 2010 Christopher R. Bilder
1.11
Multinomial distribution
Instead of just two possible outcomes, suppose there are more than two. When this happens, a multinomial distribution is appropriate. The binomial distribution is the special case with only two possible outcomes.
Set up:
- n trials
- k different outcomes or categories
- nᵢ denotes the number of observed responses for category i (the count for the ith category) for i = 1, …, k
- πᵢ denotes the probability of observing the ith category – these are parameters
The multinomial distribution is:

$P(n_1, n_2, \ldots, n_k) = \frac{n!}{n_1!\,n_2!\cdots n_k!}\,\pi_1^{n_1}\pi_2^{n_2}\cdots\pi_k^{n_k}$

Note that n = n₁ + n₂ + … + n_k.
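The dmultinom() function in R evaluates this PMF. As a quick illustration (the values below are chosen just for demonstration and are not from the notes), suppose n = 4 trials fall into k = 3 categories:

> #P(N1=2, N2=1, N3=1) for n = 4 and pi = (0.25, 0.5, 0.25)
> dmultinom(x = c(2, 1, 1), prob = c(0.25, 0.5, 0.25))
[1] 0.09375
> #Same value directly from the formula: 4!/(2!1!1!) * 0.25^2 * 0.5^1 * 0.25^1
> factorial(4)/(factorial(2)*factorial(1)*factorial(1)) * 0.25^2 * 0.5^1 * 0.25^1
[1] 0.09375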
Example: Highest college degree received
Outcomes: None, Associate, Bachelors, Masters,
Doctorate
 2010 Christopher R. Bilder
1.12
Suppose a random sample of size n is taken from a
population of interest.
n1 denotes number of people out of n with no college
degree
n2 denotes number of people out of n with associate
degree as their highest
…
Example: Sample from a multinomial distribution
(multinomial_gen.R)
Suppose k = 5, π₁ = 0.25, π₂ = 0.35, π₃ = 0.2, π₄ = 0.1, and π₅ = 0.1.
> set.seed(2195)
> save<-rmultinom(n = 1, size = 1000, prob = c(0.25, 0.35,
0.2, 0.1, 0.1))
> save
[,1]
[1,] 242
[2,] 333
[3,] 188
[4,] 122
[5,] 115
> save/1000
[,1]
[1,] 0.242
[2,] 0.333
[3,] 0.188
[4,] 0.122
[5,] 0.115
 2010 Christopher R. Bilder
1.13
> #Height of bars correspond to the columns so need to transpose save
> barplot(height = t(save), names = c("1", "2", "3", "4", "5"), col = "red", main = "Sample from multinomial distribution with \n pi=(0.25, 0.35, 0.2, 0.1, 0.1)")
[Figure: bar plot titled "Sample from multinomial distribution with pi=(0.25, 0.35, 0.2, 0.1, 0.1)"; counts for categories 1-5]
Notes:
- The first option in rmultinom() tells R how many sets of multinomial samples of size = 1000 (in this example) to find.
- The c() function combines or concatenates values into a vector.
- Multinomial sampling will be VERY important later in the course.
- The dmultinom() function can be used to find probabilities with the multinomial distribution. See the program for examples.
Question: Why examine probability distributions?
 2010 Christopher R. Bilder
1.15
1.3 Statistical inference for a proportion
Introduction to maximum likelihood estimation (FG kicking)
Suppose the success or failure of a field goal in football
can be modeled with a Bernoulli() distribution. Let Y=0
if the field goal is a failure and Y=1 if the field goal is a
success. Then the probability distribution for Y is:
P(Y=y) = y (1  )1 y
where  denotes the probability of success. Suppose
we would like to estimate  for a 40 yard field goal. Let
y1,…,yn denote a random sample of observed field goal
results at 40 yards. Thus, these yi’s are either 0’s or 1’s.
Given the resulting data (y1,…,yn), the “likelihood
function” measures the plausibility of different values of
:
$L(\pi\,|\,y_1, \ldots, y_n) = P(Y_1 = y_1) \times P(Y_2 = y_2) \times \cdots \times P(Y_n = y_n)$

$= \prod_{i=1}^{n} P(Y_i = y_i) = \prod_{i=1}^{n} \pi^{y_i}(1-\pi)^{1-y_i} = \pi^{\sum_{i=1}^{n} y_i}(1-\pi)^{n-\sum_{i=1}^{n} y_i}$
 2010 Christopher R. Bilder
1.16
Suppose ni1 yi = 4 and n = 10. Given this observed
information, we would like to find the corresponding
parameter value for  that produces the largest
probability of obtaining this particular sample. The
following table can be formed to help find this
parameter value:
(  | y1,...,yn )
0.000419
0.000953
0.001132
0.001192
0.001194
0.001192
0.000977

0.2
0.3
0.35
0.39
0.4
0.41
0.5
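Here is a sketch of how these likelihood values could be computed in R (this snippet is not in the course programs): evaluate L(π | y₁,…,yₙ) over a fine grid of π values and locate the maximum.

> sum.y <- 4
> n <- 10
> pi.grid <- seq(from = 0, to = 1, by = 0.005)
> lik <- pi.grid^sum.y * (1 - pi.grid)^(n - sum.y)
> pi.grid[which.max(lik)]   #Most plausible value of pi
[1] 0.4
> max(lik)                  #Maximum value of the likelihood function
[1] 0.001194394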
[Figure: plot of the likelihood function L(π | y₁,…,yₙ) against π for 0 ≤ π ≤ 1, with its peak near π = 0.4]
Note that  = 0.4 is the “most plausible” value of  for the
observed data since this maximizes the likelihood
function. Therefore, 0.4 is the maximum likelihood
estimate (MLE).
In general, the maximum likelihood estimate can be
found as follows:
1. Find the natural log of the likelihood function,
log (  | y1,...,yn )
2. Take the derivative of log (  | y1,...,yn ) with respect
to .
3. Set the derivative equal to 0 and solve for  to find
the maximum likelihood estimate. Note that the
solution is the maximum of (  | y1,...,yn ) provided
certain “regularity” conditions hold (see Mood,
Graybill, Boes, 1974).
For the field goal example:

$\log L(\pi\,|\,y_1, \ldots, y_n) = \log\left[\pi^{\sum_{i=1}^{n} y_i}(1-\pi)^{n-\sum_{i=1}^{n} y_i}\right] = \left(\sum_{i=1}^{n} y_i\right)\log(\pi) + \left(n - \sum_{i=1}^{n} y_i\right)\log(1-\pi)$

where log means natural log.
 2010 Christopher R. Bilder
1.18
Then

$\frac{\partial \log L(\pi\,|\,y_1, \ldots, y_n)}{\partial \pi} = \frac{\sum_{i=1}^{n} y_i}{\pi} - \frac{n - \sum_{i=1}^{n} y_i}{1-\pi} = 0$

$\Rightarrow (1-\pi)\sum_{i=1}^{n} y_i = \pi\left(n - \sum_{i=1}^{n} y_i\right)$

$\Rightarrow \sum_{i=1}^{n} y_i - \pi\sum_{i=1}^{n} y_i = n\pi - \pi\sum_{i=1}^{n} y_i$

$\Rightarrow \pi = \frac{1}{n}\sum_{i=1}^{n} y_i$
Therefore, the maximum likelihood estimator of π is the proportion of field goals made. To avoid confusion between a parameter and a statistic, we will denote the estimator as $p = \sum_{i=1}^{n} y_i / n$. Often, others will put a ^ on π to denote the estimator. Agresti uses p.

Agresti derives the estimator from a binomial point of view. See the top of p. 6 for the connection between the Bernoulli and binomial ways of looking at the problem.
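As a sketch of how this maximization could be done numerically (again, not part of the course programs), optimize() in R can maximize the log-likelihood directly; its answer agrees with the closed-form solution:

> logL <- function(pi, sum.y, n) {
    sum.y*log(pi) + (n - sum.y)*log(1 - pi)
  }
> save.opt <- optimize(f = logL, interval = c(0.0001, 0.9999), maximum = TRUE, sum.y = 4, n = 10)
> save.opt$maximum   #Approximately 0.4, agreeing with p = 4/10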
 2010 Christopher R. Bilder
1.19
Maximum likelihood estimation will be extremely
important in this class!!!
For additional examples with maximum likelihood estimation,
please see Section 9.15 of my STAT 380 lecture notes at
http://www.chrisbilder.com/stat380/schedule.htm.
Properties of maximum likelihood estimators
- For a large sample, maximum likelihood estimators can be treated as normal random variables.
- For a large sample, the variance of the maximum likelihood estimator can be computed from the second derivative of the log-likelihood function.
Why do you think these two properties are important?
The use of “for a large sample” can also be replaced with
the word “asymptotically”. You will often hear these results
talked about using the phrase “asymptotic normality of
maximum likelihood estimators”.
You are not responsible for doing derivations like those in the next example on exams. The example is helpful for understanding how R does these and more complex calculations.
Example: Field goal kicking
log  (  | y1,...,yn )    ni1 yi  log( )  (n   in1 yi )log(1  )
 log  (  | y1,...,yn )  ni1 yi n   in1 yi
Then




1 
2 log  (  | y1,...,yn )
 ni1 yi n   in1 yi
and
 2 
2


(1  )2
The large sample variance of p is
  2 log  (  | Y1,...,Yn )  
 E 

2

 
 
1
.
p
To find this, note that,
 (1  )2 in1 Yi  (n  in1 Yi )2 
 ni1 Yi n  in1 Yi 
E 2 
 E

2 
(1  ) 
2 (1  )2
 


 ni1 Yi  2in1 Yi  2 in1 Yi  n2  2 in1 Yi 
 E

2 (1  )2


 2010 Christopher R. Bilder
1.21
 ni1 Yi  2in1 Yi  n2 
 E

2
2

(1


)


E  ni1 Yi   2E  in1 Yi   n2
since only the Yi’s

2
2
 (1  )
are random variables
n(1  )
n  2n2  n2 n  n2



2 (1  )2
2 (1  )2 2 (1  )2
n

(1  )
Since  is a parameter, we replace it with its
corresponding estimator to obtain n p(1  p).
  2 log  (  | Y1,...,Yn )  
Thus,  E 

2

 
 
1
p
p(1  p)
=
.
n
This same result is derived on p. 474 of Casella and
Berger (2002).
See Chapter 18 of Ferguson (1996) for more on the
“asymptotic normality” of maximum likelihood estimators.
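To see numerically how R can obtain such variances for more complex models, here is a sketch (not from the course programs) that minimizes the negative log-likelihood with optim() and inverts the hessian it returns; the result matches p(1-p)/n:

> negLogL <- function(pi, sum.y, n) {
    -(sum.y*log(pi) + (n - sum.y)*log(1 - pi))
  }
> fit <- optim(par = 0.5, fn = negLogL, method = "Brent", lower = 0.0001, upper = 0.9999, hessian = TRUE, sum.y = 4, n = 10)
> fit$par           #MLE, approximately 0.4
> 1/fit$hessian     #Inverted observed information, approximately 0.024
> 0.4*(1 - 0.4)/10  #p(1-p)/n for comparison
[1] 0.024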
Likelihood Ratio Test (LRT)
 2010 Christopher R. Bilder
1.22
The LRT is a general way to test hypotheses. The LRT statistic, Λ, is the ratio of two likelihood functions. The numerator is the likelihood function maximized over the parameter space restricted under the null hypothesis. The denominator is the likelihood function maximized over the unrestricted parameter space. The test statistic is written as:

$\Lambda = \frac{\text{Max. lik. when parameters satisfy } H_0}{\text{Max. lik. when parameters satisfy } H_0 \text{ or } H_a}$

Wilks (1935, 1938) shows that -2log(Λ) can be approximated by a $\chi^2_u$ distribution for a large sample and under H₀, where u is the difference in dimension between the alternative and null hypothesis parameter spaces. See Casella and Berger (2002) for more on the LRT.
See the chi-square (χ²) distribution review in the Chapter 1 additional notes if needed.
Example: Field goal kicking
Continuing the field goal example, suppose the hypothesis test H₀: π = 0.5 vs. Hₐ: π ≠ 0.5 is of interest. Remember that $\sum_{i=1}^{n} y_i = 4$ and n = 10.
 2010 Christopher R. Bilder
1.23
The numerator of Λ is the maximum possible value of the likelihood function under the null hypothesis. Since π = 0.5 is the null hypothesis, the maximum can be found by just substituting π = 0.5 into the likelihood function:

$L(\pi = 0.5\,|\,y_1, \ldots, y_n) = 0.5^{\sum y_i}(1-0.5)^{n-\sum y_i} = 0.5^{4}(0.5)^{10-4} = 0.0009766$

The denominator of Λ is the maximum possible value of the likelihood function under the null OR alternative hypotheses. Since this includes all possible values of π here, the maximum is achieved when the maximum likelihood estimate is substituted for π in the likelihood function! As shown previously, the maximum value is 0.001194.

Therefore,

$\Lambda = \frac{\text{Max. lik. when parameters satisfy } H_0}{\text{Max. lik. when parameters satisfy } H_0 \text{ or } H_a} = \frac{0.0009766}{0.001194} = 0.8179$
Then –2log() = -2log(0.8179) = 0.4020 is the test
2
statistic value. The critical value is 1,0.95
= 3.84 using
 2010 Christopher R. Bilder
1.24
=0.05. There is not sufficient evidence to reject the
hypothesis that =0.5.
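A sketch of these LRT computations in R (not in the course programs):

> sum.y <- 4
> n <- 10
> p <- sum.y/n
> lik.Ho <- 0.5^sum.y * (1 - 0.5)^(n - sum.y)   #Numerator: pi fixed at 0.5
> lik.Ha <- p^sum.y * (1 - p)^(n - sum.y)       #Denominator: pi at its MLE
> -2*log(lik.Ho/lik.Ha)     #Test statistic, approximately 0.40
> qchisq(p = 0.95, df = 1)  #Critical value
[1] 3.841459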
Questions:
- Suppose the ratio is close to 0. What does this say about H₀? Explain.
- Suppose the ratio is close to 1. What does this say about H₀? Explain.
Confidence interval for the population proportion π
A typical form of a confidence interval for a parameter is

Estimator ± (distributional value)×(standard deviation of estimator)

In this case, $p = \sum_{i=1}^{n} y_i / n$ is the estimator, the distribution is normal, and the standard deviation is $\sqrt{p(1-p)/n}$. Thus, the approximate (1-α)100% confidence interval for π is

$p \pm Z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}}$

where $Z_{1-\alpha/2}$ is the 1-α/2 quantile from a standard normal distribution.
 2010 Christopher R. Bilder
1.25
Confidence intervals which use the asymptotic normality
of a maximum likelihood estimator are called “Wald”
confidence intervals.
Example: Field goal kicking
Suppose $\sum_{i=1}^{n} y_i = 4$ and n = 10. Then the 95% confidence interval is

$p \pm Z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}} = 0.4 \pm 1.96\sqrt{\frac{0.4(1-0.4)}{10}}$

which gives 0.0964 ≤ π ≤ 0.7036.
Problems!!!
1) Remember, the interval “works” only if the sample size is large. The field goal kicking example has only n = 10!
2) The discreteness of the binomial distribution often makes
the normal approximation work poorly even with large
samples.
The result is a confidence interval that is often too
“liberal”. This means when 95% is stated as the
confidence level, the true confidence level is often lower.
 2010 Christopher R. Bilder
1.26
The problems with this particular confidence interval have
been discussed for a long time in the statistical literature.
A few recent papers on this topic include:
- Agresti, A. and Coull, B. A. (1998). Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician 52(2), 119-126.
- Agresti, A. and Caffo, B. (2000). Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. The American Statistician 54(4), 280-288.
- Agresti, A. and Min, Y. (2001). On small-sample confidence intervals for parameters in discrete distributions. Biometrics 57, 963-971.
- Blaker, H. (2000). Confidence curves and improved exact confidence intervals for discrete distributions. The Canadian Journal of Statistics 28, 783-798.
- Blaker, H. (2001). Corrigenda: Confidence curves and improved exact confidence intervals for discrete distributions. The Canadian Journal of Statistics 29, 681.
- Borkowf, C. B. (2006). Constructing binomial confidence intervals with near nominal coverage by adding a single imaginary failure or success. Statistics in Medicine 25, 3679-3695.
- Brown, L. D., Cai, T. T., and DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science 16(2), 101-133.
- Henderson, M. and Meyer, M. (2001). Exploring the confidence interval for a binomial parameter in a first course in statistical computing. The American Statistician 55(4), 337-344.
- Newcombe, R. G. (2001). Logit confidence intervals and the inverse sinh transformation. The American Statistician 55(3), 200-202.
- Vos, P. W. and Hudson, S. (2005). Evaluation criteria for discrete confidence intervals: Beyond coverage and length. The American Statistician 59(2), 137-142.
Brown et al. (2001) serves as a summary of all the
proposed methods and gives the following
recommendations:
 For n40, use the “Agresti and Coull” (1998) interval.
2
 ni1 yi  Z1 /2 2
Let p 
be an adjusted estimate of .
2
n  Z1 /2
The approximate (1-)100% confidence interval is
p  Z1 / 2
p(1  p)
n  Z12 / 2
Note that when =0.05, Z1-/2=1.962. Then
2
ni1 yi  2 2 in1 yi  2
.
p

2
n4
n2
 2010 Christopher R. Bilder
1.28
Thus, two successes and two failures are added.
Agresti and Coull (1998) refer to this as an “adjusted”
Wald confidence interval.
When n<40, the Agresti and Coull interval is generally
still better than the Wald interval.
- For n < 40, use the Wilson or Jeffrey’s prior interval.

Wilson interval:

$\tilde{p} \pm \frac{Z_{1-\alpha/2}\, n^{1/2}}{n + Z_{1-\alpha/2}^{2}}\sqrt{p(1-p) + \frac{Z_{1-\alpha/2}^{2}}{4n}}$

Where does this come from? Consider the hypothesis test of H₀: π = π₀ vs. Hₐ: π ≠ π₀ using the test statistic

$Z = \frac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}$

The limits of the Wilson confidence interval come from “inverting” the test. This means finding the set of π₀ such that

$Z_{\alpha/2} \le \frac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \le Z_{1-\alpha/2}$

is satisfied. Through solving a quadratic equation (see #1.18 in the homework), the interval limits are derived.
Jeffrey’s interval: Please see the additional notes for
Chapter 1. This interval will not be on a test, but could
be on a project.
Example: Field goal kicking (fg.R)
Below is the code used to calculate each confidence
interval.
> sum.y<-4
> n<-10
> alpha<-0.05
> p<-sum.y/n
> #Wald C.I.
> lower<-p-qnorm(p = 1-alpha/2)*sqrt(p*(1-p)/n)
> upper<-p+qnorm(p = 1-alpha/2)*sqrt(p*(1-p)/n)
> cat("The Wald C.I. is:", round(lower,4) , "<= pi <=", round(upper,4), "\n \n")
The Wald C.I. is: 0.0964 <= pi <= 0.7036
> #Agresti and Coull C.I.
 2010 Christopher R. Bilder
1.30
> p.tilde<-(sum.y+qnorm(p = 1-alpha/2)^2 /2)/(n+qnorm(p = 1-alpha/2)^2)
> lower<-p.tilde-qnorm(p = 1-alpha/2)*sqrt(p.tilde*(1-p.tilde)/(n+qnorm(p = 1-alpha/2)^2))
> upper<-p.tilde+qnorm(p = 1-alpha/2)*sqrt(p.tilde*(1-p.tilde)/(n+qnorm(p = 1-alpha/2)^2))
> cat("The Agresti and Coull C.I. is:", round(lower,4) , "<= pi <=", round(upper,4), "\n \n")
The Agresti and Coull C.I. is: 0.1671 <= pi <= 0.6884
> #Wilson C.I.
> lower<-p.tilde-qnorm(p = 1-alpha/2)*sqrt(n)/(n+qnorm(p = 1-alpha/2)^2) * sqrt(p*(1-p)+qnorm(p = 1-alpha/2)^2/(4*n))
> upper<-p.tilde+qnorm(p = 1-alpha/2)*sqrt(n)/(n+qnorm(p = 1-alpha/2)^2) * sqrt(p*(1-p)+qnorm(p = 1-alpha/2)^2/(4*n))
> cat("The Wilson C.I. is:", round(lower,4) , "<= pi <=", round(upper,4), "\n \n")
The Wilson C.I. is: 0.1682 <= pi <= 0.6873
The intervals produced are:
The Wald C.I. is: 0.0964 <= pi <= 0.7036
The Agresti and Coull C.I. is: 0.1671 <= pi <= 0.6884
The Wilson C.I. is: 0.1682 <= pi <= 0.6873
How useful would these confidence intervals be?
Note that Agresti now gives code for some of these intervals
at http://www.stat.ufl.edu/~aa/cda/R/one_sample/R1/index.html. This
code contains actual functions to perform the calculations.
Also, the binom package in R provides a simple function to
do these calculations as well. Here is an example of how I
used its function:
> library(binom)
 2010 Christopher R. Bilder
1.31
> binom.confint(x = sum.y, n = n, conf.level = 1-alpha, methods = "all")
          method x  n      mean      lower     upper
1  agresti-coull 4 10 0.4000000 0.16711063 0.6883959
2     asymptotic 4 10 0.4000000 0.09636369 0.7036363
3          bayes 4 10 0.4090909 0.14256735 0.6838697
4        cloglog 4 10 0.4000000 0.12269317 0.6702046
5          exact 4 10 0.4000000 0.12155226 0.7376219
6          logit 4 10 0.4000000 0.15834201 0.7025951
7         probit 4 10 0.4000000 0.14933907 0.7028372
8        profile 4 10 0.4000000 0.14570633 0.6999845
9            lrt 4 10 0.4000000 0.14564246 0.7000216
10     prop.test 4 10 0.4000000 0.13693056 0.7263303
11        wilson 4 10 0.4000000 0.16818033 0.6873262
Below is a comparison of the performance of the four confidence intervals (Brown et al., 2001). The values on the y-axis represent the true confidence level (coverage) of the confidence intervals. Each of the confidence intervals is supposed to be 95%!

[Figure: coverage plots for the confidence intervals, from Brown et al. (2001)]
What does the “true confidence” or “coverage” level mean?
- Suppose a random sample of size n = 50 is taken from a population and a 95% Wald confidence interval is calculated.
- Suppose another random sample of size n = 50 is taken from the same population and a 95% Wald confidence interval is calculated.
- Suppose this process is repeated 10,000 times.
- We would expect 9,500 out of 10,000 (95%) confidence intervals to contain π.
- Unfortunately, this does not often happen. It is only guaranteed to happen when n = ∞ for the Wald interval.
- The true confidence or coverage level is the percent of times the confidence intervals contain or “cover” π.
How can this “true confidence” or “coverage” level be
calculated?
Simulate the process of taking 10,000 samples and
calculating the confidence interval each time using a
computer. This is called a Monte Carlo simulation.
The Brown et al. (2001) plots shown earlier do this for many possible values of π (0.0005 to 0.9995 by 0.0005). For example, the coverage using the Wald interval is approximately 0.90 for π = 0.184.
 2010 Christopher R. Bilder
1.34
Note: The authors of the paper actually use a
different method which does not require computer
simulation (see the additional Chapter 1 notes for
further information). However, computer simulation
will provide just about the same answer if the
number of times the process (take a sample,
calculate the C.I.) is repeated a large number of
times. Plus, it can be used in many more situations
where computer simulation is the only option.
Brown et al. (2001) also discuss other confidence intervals. After the paper, there is a set of responses by other statisticians. Each agrees the Wald interval should NOT be used; however, they are not all in agreement on which other interval to use!
Example: Calculate estimated true confidence level for Wald
(CI_sim.R)
##############################################################
# NAME: Chris Bilder
#
# DATE: 1-23-02
#
# UPDATE: 12-7-03
#
# PURPOSE: C.I. pi simulation
#
#
#
# NOTES: See plot on p. 107 of Brown, et al. (2001)
#
##############################################################
############################################################
# Do one time
> n <- 50
> pi <- 0.184
> sum.y <- rbinom(n = 1, size = n, prob = pi)
 2010 Christopher R. Bilder
1.35
> sum.y
[1] 6
> alpha <- 0.05
> p <- sum.y/n
> #Wald C.I.
> lower <- p - qnorm(p = 1 - alpha/2) * sqrt((p * (1 - p))/n)
> upper <- p + qnorm(p = 1 - alpha/2) * sqrt((p * (1 - p))/n)
> cat("The Wald C.I. is:", round(lower, 4), "<= pi <=", round(upper, 4))
The Wald C.I. is: 0.0299 <= pi <= 0.2101
###########################################################
# Repeat 10,000 times
> numb <- 10000
> n <- 50
> pi <- 0.184
> set.seed(4516)
> sum.y <- rbinom(n = numb, size = n, prob = pi)
> alpha <- 0.05
> p <- sum.y/n
> #Wald C.I.
> lower <- p - qnorm(p = 1 - alpha/2) * sqrt((p * (1 - p))/n)
> upper <- p + qnorm(p = 1 - alpha/2) * sqrt((p * (1 - p))/n)
> save <- ifelse(test = pi > lower, yes = ifelse(test = pi <
upper, yes = 1, no = 0), no = 0)
> true.conf <- mean(save)
> cat("An estimate of the true confidence level is:",
round(true.conf, 4))
An estimate of the true confidence level is: 0.9041
Important: Notice how the calculations are performed
using vectors.
What we are doing here is actually estimating the probabilities for $\sum_{i=1}^{n} y_i = 0, 1, \ldots, n$ through the use of Monte Carlo simulation. We sum up the probabilities for the values of $\sum_{i=1}^{n} y_i$ whose corresponding confidence intervals contain π. Below is some R code demonstrating this.
> table(sum.y)
sum.y
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   22 
   2   23   90  222  473  759 1180 1319 1441 1341 1091  879  535  324  172   75   52   12    9    1 
> temp<-table(sum.y)/numb
> possible.sum.y<-as.integer(names(table(sum.y)))
> p.possible<-possible.sum.y/n
> lower<-p.possible-qnorm(p = 1-alpha/2)*sqrt(p.possible*(1-p.possible)/n)
> upper<-p.possible+qnorm(p = 1-alpha/2)*sqrt(p.possible*(1-p.possible)/n)
> data.frame(lower, upper, from.table=temp)
          lower      upper from.table.sum.y from.table.Freq
1  -0.018805307 0.05880531                1          0.0002
2  -0.014316115 0.09431612                2          0.0023
3  -0.005826784 0.12582678                3          0.0090
4   0.004802744 0.15519726                4          0.0222
5   0.016845771 0.18315423                5          0.0473
6   0.029926913 0.21007309                6          0.0759
7   0.043821869 0.23617813                7          0.1180
8   0.058383853 0.26161615                8          0.1319
9   0.073510628 0.28648937                9          0.1441
10  0.089127694 0.31087231               10          0.1341
11  0.105178893 0.33482111               11          0.1091
12  0.121620771 0.35837923               12          0.0879
13  0.138419025 0.38158098               13          0.0535
14  0.155546145 0.40445385               14          0.0324
15  0.172979816 0.42702018               15          0.0172
16  0.190701784 0.44929822               16          0.0075
17  0.208697040 0.47130296               17          0.0052
18  0.226953233 0.49304677               18          0.0012
19  0.245460214 0.51453979               19          0.0009
20  0.302411087 0.57758891               22          0.0001
 2010 Christopher R. Bilder
1.37
> sum(ifelse(test = pi>lower, yes = ifelse(test = pi<upper, yes = 1, no = 0), no = 0)*temp)
[1] 0.9041
> dbinom(x = 10, size = 50, prob = 0.184)  #About the same as the MC estimate of 0.1341
[1] 0.1341070
Using this idea, one could find the EXACT true
confidence level without Monte Carlo simulation! See
the additional Chapter 1 notes for details.
How can you perform this simulation for more than one π and then produce a coverage plot similar to the Brown et al. (2001) plots shown earlier?
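One possible approach (a sketch, not taken from CI_sim.R): loop over a grid of π values, estimate the true confidence level for each π by simulation exactly as above, and then plot the estimates against π.

> numb <- 10000
> n <- 50
> alpha <- 0.05
> pi.seq <- seq(from = 0.005, to = 0.995, by = 0.005)
> true.conf <- numeric(length(pi.seq))
> set.seed(1218)
> for(i in 1:length(pi.seq)) {
    sum.y <- rbinom(n = numb, size = n, prob = pi.seq[i])
    p <- sum.y/n
    lower <- p - qnorm(p = 1 - alpha/2)*sqrt(p*(1 - p)/n)
    upper <- p + qnorm(p = 1 - alpha/2)*sqrt(p*(1 - p)/n)
    true.conf[i] <- mean(pi.seq[i] > lower & pi.seq[i] < upper)
  }
> plot(x = pi.seq, y = true.conf, type = "l", ylim = c(0.85, 1), xlab = "pi", ylab = "Estimated true confidence level")
> abline(h = 1 - alpha, col = "red")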
True confidence level oscillation
Why does the true confidence level (coverage) oscillate?
The reason involves the discreteness of a binomial
random variable. Please see my additional Chapter 1
notes for a more in-depth discussion.
Clopper-Pearson interval
First, we need to discuss what a beta probability
distribution represents. Let w be a random variable from
a Beta distribution. Then
 2010 Christopher R. Bilder
1.38
$f(w; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, w^{a-1}(1-w)^{b-1}$

for 0 < w < 1, a > 0, and b > 0. Note that Γ(⋅) is the “gamma” function, where Γ(c) = (c-1)! for an integer c. The gamma function is more generally defined as

$\Gamma(c) = \int_0^{\infty} x^{c-1}e^{-x}\,dx$ for c > 0.

The α quantile, denoted by $w_\alpha$ and Beta(α; a, b), satisfies

$\alpha = \int_0^{w_\alpha} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, w^{a-1}(1-w)^{b-1}\,dw$
Below are plots of various Beta distributions from Welsh
(1996). The left extreme of each plot denotes 0 and the
right extreme of each plot denotes 1 because 0 < w < 1
by definition for the distribution.
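If the Welsh (1996) plots are not at hand, a few beta densities can be drawn directly in R to see how a and b control the shape (the parameter values below are chosen only for illustration):

> curve(expr = dbeta(x, shape1 = 1, shape2 = 1), from = 0, to = 1, ylim = c(0, 3), xlab = "w", ylab = "f(w; a, b)")  #a = 1, b = 1: uniform
> curve(expr = dbeta(x, shape1 = 2, shape2 = 4), add = TRUE, lty = "dashed")  #a = 2, b = 4: skewed right
> curve(expr = dbeta(x, shape1 = 5, shape2 = 5), add = TRUE, lty = "dotted")  #a = 5, b = 5: symmetric about 0.5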
 2010 Christopher R. Bilder
1.39
The (1-)100% Clopper-Pearson confidence interval is
Beta(/2; ni 1 yi , n- ni 1 yi +1)    Beta(1-/2; ni 1 yi +1, n- ni 1 yi )
This is derived using the relationship between the
binomial and beta distributions (see problem #2.40 on p.
82 of Casella and Berger, 2002). If  ni1 yi = 0, the lower
end point is taken to be 0. If  ni1 yi = n, the upper end
point is taken to be 1.
Often, the interval below is given instead:

$\left[1 + \frac{\left(n-\sum_{i=1}^{n} y_i+1\right) F_{2(n-\sum y_i+1),\,2\sum y_i,\,1-\alpha/2}}{\sum_{i=1}^{n} y_i}\right]^{-1} \le \pi \le \left[1 + \frac{n-\sum_{i=1}^{n} y_i}{\left(\sum_{i=1}^{n} y_i+1\right) F_{2(\sum y_i+1),\,2(n-\sum y_i),\,1-\alpha/2}}\right]^{-1}$

where $F_{a,b,1-\alpha}$ is the 1-α quantile from an F distribution with a numerator and b denominator degrees of freedom. The two intervals are the same, and one can use the relationship between beta and F random variables to show this. Problem #9.21 of Casella and Berger (2002, p. 454) discusses its derivation.
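Here is a quick sketch (not in fg.R) verifying that the beta and F forms give the same lower limit for the field goal example with Σyᵢ = 4, n = 10, and α = 0.05; the upper limit can be checked the same way:

> sum.y <- 4
> n <- 10
> alpha <- 0.05
> qbeta(p = alpha/2, shape1 = sum.y, shape2 = n - sum.y + 1)  #Beta form of the lower limit
[1] 0.1215523
> 1/(1 + (n - sum.y + 1)/sum.y * qf(p = 1 - alpha/2, df1 = 2*(n - sum.y + 1), df2 = 2*sum.y))  #F form
[1] 0.1215523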
This Clopper-Pearson (CP) interval is GUARANTEED to have a true confidence (coverage) level ≥ 1-α! While this is good, the confidence interval is usually wider than other intervals. Why would a wide interval be bad?

Brown et al. (2001, p. 113) say this interval is “wastefully conservative and is not a good choice for practical use”. Examine the coverage plot for it (n = 50, p. 113 of Brown et al. 2001):

[Figure: coverage plot for the Clopper-Pearson interval with n = 50, from Brown et al. (2001)]
Maybe this is too harsh?
Intervals like the Clopper-Pearson are called “exact” confidence intervals because they use the exact distribution of $\sum_{i=1}^{n} y_i$ (binomial). This does not mean their true confidence level is exactly (1-α)100%, as you can see above.
Example: Field goal kicking (fg.R)
Below is the code used to calculate the Clopper-Pearson
confidence interval.
> #Clopper-Pearson
> lower<-qbeta(p = alpha/2, shape1 = sum.y, shape2 = n-sum.y+1)
> upper<-qbeta(p = 1-alpha/2, shape1 = sum.y+1, shape2 = n-sum.y)
> cat("The C-P C.I. is:", round(lower,4) , "<= pi <=", round(upper,4), "\n \n")
The C-P C.I. is: 0.1216 <= pi <= 0.7376
The binom package calls the Clopper-Pearson interval
the “exact” interval.
> library(binom)
> binom.confint(x = sum.y, n = n, conf.level = 1-alpha, methods = "exact")
  method x  n mean     lower     upper
1  exact 4 10  0.4 0.1215523 0.7376219
A summary of all intervals:

Name                 Lower    Upper
Wald                 0.0964   0.7036
Agresti and Coull    0.1671   0.6884
Wilson               0.1682   0.6873
Clopper-Pearson      0.1216   0.7376
Jeffrey's            0.1531   0.6963
Blaker               0.1500   0.7171
There are many other confidence intervals for a proportion! For example, Agresti (2007) discusses a “mid-p” confidence interval that adjusts the Clopper-Pearson interval for being too conservative. However, this interval is no longer guaranteed to have a true confidence level ≥ 1-α. This interval and many, many others are discussed in the papers referenced earlier.
 2010 Christopher R. Bilder