A second look at careful estimates for proportions, including using the exact Binomial distribution

Statistics of Proportions
Assume we have a simple random sample of a two-outcome (for simplicity, 0 and 1)
experiment. With n observations, let k be the number of 1's. We want to find an interval estimate
for p = P[X = 1], where X is a random variable with the distribution we are observing.
1. Binomial distribution (exact)
The number of 1's, call it N, is a random variable with binomial distribution:

$$P[N = j] = \frac{n!}{j!\,(n-j)!}\, p^{j} (1-p)^{n-j}.$$

Let the corresponding cumulative distribution function be denoted by

$$F(l) = \sum_{i=0}^{l} P[N = i] = P[N \le l].$$

Since we observed k 1's, if the true probability is p, we can evaluate the probability of
observing no more or no less than k 1's. Choosing values of p implying the “boundary”
probabilities (for example, respectively, 5% probability of being less than k and 5% of being
more than k), we have an interval estimate.
Here is an example, using a spreadsheet to compute the cumulative distribution function for
binomial distributions (most introductory textbooks list only the probability distribution, and it
will be up to you to add the numbers to get the cdf):
n = 15, k = 6

p       P[N > k]    P[N ≤ k]
0.1     0.00031     0.99969
0.2     0.01806     0.98194
0.25    0.05662     0.94338
0.3     0.13114     0.86886
0.4     0.39019     0.60981
0.5     0.69638     0.30362
0.6     0.90495     0.09505
0.65    0.95781     0.04219
0.7     0.98476     0.01524
0.8     0.99922     0.00078
0.9     1           0.0000028465
In this example, k = 6 “successes” out of n = 15 attempts suggest a confidence interval for p,
at (very roughly) the 90% level, of (0.25, 0.65).
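The table in the example can also be reproduced without a spreadsheet. The following is a minimal sketch, assuming plain Python with only the standard library; the helper names `binom_cdf` and `boundary_p` are hypothetical and not part of the original handout. It sums the binomial probabilities to get F(k) = P[N ≤ k], and then uses bisection to locate the values of p at which P[N > k] equals 5% and 95%, which is one way to read the “boundary” probabilities described above.

```python
from math import comb

def binom_cdf(l, n, p):
    """F(l) = P[N <= l] for N ~ Binomial(n, p), by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(l + 1))

def boundary_p(target, n, k, tol=1e-10):
    """Find p such that P[N > k] = target (hypothetical helper).

    Relies on P[N > k] being increasing in p, so bisection applies.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if 1.0 - binom_cdf(k, n, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n, k = 15, 6

# Reproduce the table: p, P[N > k], P[N <= k]
for p in (0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.65, 0.7, 0.8, 0.9):
    at_most_k = binom_cdf(k, n, p)
    print(f"{p:<5} {1.0 - at_most_k:.5f} {at_most_k:.5f}")

# Boundary values of p for a (very roughly) 90% interval
p_low = boundary_p(0.05, n, k)   # P[N > k] = 5%
p_high = boundary_p(0.95, n, k)  # P[N <= k] = 5%
print(p_low, p_high)
```

The two boundary values should fall just inside the rough (0.25, 0.65) interval read off the table, since the table only samples p on a coarse grid.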
2. Normal Approximation (“precise”)
If p is not too close to 0 or 1, the normal approximation for the sample average (the empirical
proportion) kicks in early, so it is commonly used, except for extremely small samples. Since the
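expectation of a binomial distribution of n observations with probability p is np, and its variance
is np(1 − p), the sample mean $\bar p$ has mean p and variance p(1 − p)/n. Hence, the interval
estimate, at confidence level α, is

$$P\left[\,-z_{\frac{1-\alpha}{2}}\sqrt{\frac{p(1-p)}{n}} \;\le\; \bar p - p \;\le\; z_{\frac{1-\alpha}{2}}\sqrt{\frac{p(1-p)}{n}}\,\right] = \alpha \qquad (1)$$

where $z_{\frac{1-\alpha}{2}}$ is the value that a standard normal variable exceeds with probability (1 − α)/2.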
The glitch in this formula is that the unknown value p is everywhere. The “precise” way to
handle this problem is to look at the line above as two inequalities which, once squared, become
quadratic inequalities in p. Solving these inequalities for p produces our “precise” interval
estimate.
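As an illustration of solving the quadratic, here is a minimal sketch, assuming Python's standard library `statistics.NormalDist` for the normal quantile; the function name `precise_interval` is hypothetical. Squaring the inequality in (1) gives (p̄ − p)² ≤ z² p(1 − p)/n, which rearranges to a quadratic a·p² + b·p + c ≤ 0 in p; its two roots are the interval endpoints.

```python
from math import sqrt
from statistics import NormalDist

def precise_interval(k, n, conf=0.90):
    """Interval for p obtained by solving the squared inequality (1) for p.

    conf plays the role of the confidence level (alpha in the text).
    """
    p_bar = k / n
    # Standard normal value with upper-tail probability (1 - conf) / 2
    z = NormalDist().inv_cdf((1 + conf) / 2)
    # (p_bar - p)^2 <= z^2 p (1 - p) / n  rearranged as  a p^2 + b p + c <= 0
    a = 1 + z * z / n
    b = -(2 * p_bar + z * z / n)
    c = p_bar * p_bar
    root = sqrt(b * b - 4 * a * c)
    return (-b - root) / (2 * a), (-b + root) / (2 * a)

low, high = precise_interval(6, 15, conf=0.90)
print(low, high)
```

For the running example (k = 6, n = 15) the resulting interval contains the empirical proportion 0.4 but is not symmetric around it, which is a consequence of keeping p inside the variance term.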
3. Normal Approximation (“Cautious” or “Pessimistic”)
Since 0 ≤ p ≤ 1, the variance p(1 − p) of each observation is bounded by ¼, and hence its
standard deviation by ½. Since

$$\sqrt{\frac{p(1-p)}{n}} \le \frac{1}{2\sqrt{n}},$$

inserting this value in the inequality (1) generates a
confidence interval that is certainly not smaller (most likely larger) than the “precise” one.
Hence, we can use this as a simple, “pessimistic” estimate.
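A minimal sketch of the “cautious” interval, assuming Python's standard library and a hypothetical function name `cautious_interval`: the unknown standard deviation in (1) is replaced by its upper bound 1/(2√n), so the half-width depends only on n and the confidence level.

```python
from math import sqrt
from statistics import NormalDist

def cautious_interval(k, n, conf=0.90):
    """Interval using the worst-case bound sqrt(p(1-p)/n) <= 1 / (2 sqrt(n))."""
    p_bar = k / n
    # Standard normal value with upper-tail probability (1 - conf) / 2
    z = NormalDist().inv_cdf((1 + conf) / 2)
    half_width = z / (2 * sqrt(n))
    return p_bar - half_width, p_bar + half_width

low, high = cautious_interval(6, 15, conf=0.90)
print(low, high)
```

Note that this sketch does not clip the endpoints to [0, 1]; for empirical proportions near 0 or 1 that would be a sensible extra step.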
4. Normal Approximation (“Sloppy”)
Finally, if you don’t mind the loss of control, you can use the “practical” rule suggested by the
book: use the empirical proportion instead of the “true” one in the expression for the variance. It
will generally yield a smaller interval than the “cautious” version, but you have no way of
knowing whether the result is “optimistic” or “pessimistic”, that is, whether the interval you
come up with is actually too small or too large for the given confidence level.
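For comparison, a minimal sketch of the “sloppy” rule, again assuming Python's standard library and a hypothetical function name `sloppy_interval`: the empirical proportion p̄ is simply plugged into the variance expression in (1).

```python
from math import sqrt
from statistics import NormalDist

def sloppy_interval(k, n, conf=0.90):
    """Interval using the empirical proportion in place of p in the variance."""
    p_bar = k / n
    # Standard normal value with upper-tail probability (1 - conf) / 2
    z = NormalDist().inv_cdf((1 + conf) / 2)
    half_width = z * sqrt(p_bar * (1 - p_bar) / n)
    return p_bar - half_width, p_bar + half_width

low, high = sloppy_interval(6, 15, conf=0.90)
print(low, high)
```

With k = 6 and n = 15 the empirical proportion is 0.4, close to ½, so this interval is only slightly narrower than the “cautious” one; the gap widens as the empirical proportion moves toward 0 or 1.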