Statistics of Proportions

Assume we have a simple random sample of a two-outcome experiment (for simplicity, the outcomes are 0 and 1). With n observations, let k be the number of 1's. We want to find an interval estimate for $p = P[X = 1]$, where X is a random variable with the distribution we are observing.

1. Binomial distribution (exact)

The number of 1's, call it N, is a random variable with binomial distribution:

\[ P[N = j] = \frac{n!}{j!\,(n-j)!}\, p^{j} (1-p)^{n-j} . \]

Let the corresponding cumulative distribution function be denoted by

\[ F(l) = \sum_{i=0}^{l} P[N = i] = P[N \le l] . \]

Since we observed k 1's, if the true probability is p, we can evaluate the probability of observing no more than, or no fewer than, k 1's. Choosing the values of p that give the desired "boundary" probabilities (for example, a 5% probability of observing more than k 1's at the lower end, and a 5% probability of observing no more than k at the upper end), we have an interval estimate.

Here is an example, using a spreadsheet to compute the cumulative distribution function for binomial distributions (most introductory textbooks list only the probability distribution, leaving it to you to add the numbers to get the cdf). The second probability column is the cdf $F(k) = P[N \le k]$, so the two columns in each row add up to 1.

n = 15, k = 6

p       P[N > k]    P[N ≤ k]
0.10    0.00031     0.99969
0.20    0.01806     0.98194
0.25    0.05662     0.94338
0.30    0.13114     0.86886
0.40    0.39019     0.60981
0.50    0.69638     0.30362
0.60    0.90495     0.09505
0.65    0.95781     0.04219
0.70    0.98476     0.01524
0.80    0.99922     0.00078
0.90    1.00000     0.0000028465

This example, with k = 6 "successes" out of n = 15 attempts, suggests a confidence interval for p of (0.25, 0.65) at (very roughly) the 90% level.

2. Normal Approximation ("precise")

If p is not too close to 0 or 1, the normal approximation for the sample average (the empirical proportion) kicks in early, so it is commonly used, except for extremely small samples. Since the expectation of a binomial distribution of n observations with probability p is $np$, and its variance is $np(1-p)$, the sample mean $\bar{p}$ has mean $p$ and variance $\frac{p(1-p)}{n}$.
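Both the table in section 1 and the mean and variance just quoted can be checked directly from the binomial probabilities, without a spreadsheet. The following is a minimal sketch using only the Python standard library; the function names `binom_pmf` and `binom_cdf` and the trial value p = 0.3 in the second half are illustrative choices, not part of the original notes.

```python
from math import comb

def binom_pmf(n, j, p):
    """P[N = j] for a binomial count N with n trials and probability p."""
    return comb(n, j) * p**j * (1 - p)**(n - j)

def binom_cdf(n, l, p):
    """F(l) = P[N <= l], obtained by summing the pmf from 0 to l."""
    return sum(binom_pmf(n, i, p) for i in range(l + 1))

n, k = 15, 6  # the example from section 1

# One row per trial value of p: P[N > k] and P[N <= k].
print("p      P[N>k]    P[N<=k]")
for p_trial in (0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.65, 0.7, 0.8, 0.9):
    at_most_k = binom_cdf(n, k, p_trial)
    print(f"{p_trial:<6} {1 - at_most_k:.5f}   {at_most_k:.5f}")

# Mean and variance of the sample proportion N/n, computed from the pmf.
p = 0.3  # arbitrary illustrative value (an assumption, not from the notes)
mean = sum((j / n) * binom_pmf(n, j, p) for j in range(n + 1))
var = sum((j / n - mean) ** 2 * binom_pmf(n, j, p) for j in range(n + 1))
print("mean of N/n:", mean, "expected:", p)
print("variance of N/n:", var, "expected:", p * (1 - p) / n)
```

Scanning the printed rows for the values of p where each tail probability crosses 5% recovers the rough interval (0.25, 0.65) quoted in section 1.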
Hence, the interval estimate, at confidence level α, follows from

\[ P\left[ -z_{\frac{1-\alpha}{2}} \sqrt{\frac{p(1-p)}{n}} \;\le\; \bar{p} - p \;\le\; z_{\frac{1-\alpha}{2}} \sqrt{\frac{p(1-p)}{n}} \right] = \alpha \tag{1} \]

where $z_{\frac{1-\alpha}{2}}$ is the standard normal value that is exceeded with probability $\frac{1-\alpha}{2}$. The glitch in this formula is that the unknown value p is everywhere. The "precise" way to handle this problem is to look at the line above as two inequalities that, once squared, become a quadratic inequality in p. Solving it for p produces our "precise" interval estimate, whose endpoints are the two roots of the quadratic.

3. Normal Approximation ("Cautious" or "Pessimistic")

Since $0 \le p \le 1$, the variance of each observation, $p(1-p)$, is bounded by ¼, and hence its standard deviation by ½. Since

\[ \sqrt{\frac{p(1-p)}{n}} \le \frac{1}{2\sqrt{n}} , \]

inserting this value in the inequality (1) generates a confidence interval that is certainly not smaller (and most likely larger) than the "precise" one. Hence, we can use this as a simple, "pessimistic" estimate.

4. Normal Approximation ("Sloppy")

Finally, if you don't mind the loss of control, you can use the "practical" rule suggested by the book: use the empirical proportion instead of the "true" one in the expression for the variance. It will generally yield a smaller interval than the "cautious" version, but you have no way of knowing whether the result is "optimistic" or "pessimistic", that is, whether the interval you come up with is actually too small or too large for the given confidence level.
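The three normal-approximation intervals of sections 2 to 4 can be compared on the example from section 1 (k = 6, n = 15). The sketch below is one possible way to compute them, assuming Python 3.8 or later for `statistics.NormalDist`; the function names are illustrative, and the 90% level is chosen only to match the rough level quoted in section 1. The "precise" version solves the quadratic inequality obtained by squaring (1), a result commonly known as the Wilson score interval.

```python
from math import sqrt
from statistics import NormalDist

def z_value(alpha):
    """Standard normal value exceeded with probability (1 - alpha) / 2."""
    return NormalDist().inv_cdf(1 - (1 - alpha) / 2)

def precise_interval(k, n, alpha):
    """Roots in p of (p_bar - p)^2 = z^2 * p * (1 - p) / n (section 2)."""
    p_bar, z = k / n, z_value(alpha)
    c = z * z / n
    # Quadratic in p: (1 + c) p^2 - (2 p_bar + c) p + p_bar^2 <= 0
    a, b = 1 + c, 2 * p_bar + c
    root = sqrt(b * b - 4 * a * p_bar * p_bar)
    return (b - root) / (2 * a), (b + root) / (2 * a)

def cautious_interval(k, n, alpha):
    """Replace sqrt(p(1-p)/n) by its upper bound 1 / (2 sqrt(n)) (section 3)."""
    p_bar, half = k / n, z_value(alpha) / (2 * sqrt(n))
    return p_bar - half, p_bar + half

def sloppy_interval(k, n, alpha):
    """Plug the empirical proportion into the variance (section 4)."""
    p_bar = k / n
    half = z_value(alpha) * sqrt(p_bar * (1 - p_bar) / n)
    return p_bar - half, p_bar + half

n, k, alpha = 15, 6, 0.90  # example from section 1; alpha is an assumed level
precise = precise_interval(k, n, alpha)
cautious = cautious_interval(k, n, alpha)
sloppy = sloppy_interval(k, n, alpha)

for name, (lo, hi) in (("precise", precise), ("cautious", cautious), ("sloppy", sloppy)):
    print(f"{name:<9} ({lo:.4f}, {hi:.4f})  width {hi - lo:.4f}")
```

Note that the cautious and sloppy intervals are symmetric about the empirical proportion and are not clipped to [0, 1] in this sketch, whereas the roots of the quadratic always lie between 0 and 1.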