1 Inference for Population Proportions

So far we have been interested in estimating and answering questions about population means. In those cases the parameter of interest was either a population mean $\mu$ or a difference between two population means, $\mu_1 - \mu_2$. We will now study the analysis of population proportions, for example:

• the proportion of voters who intend to vote for the incumbent prime minister in the next election
• the proportion of cancer patients who will survive at least 5 years after treatment
• the proportion of batteries that last at least 6 hours
• the proportion of students who receive an A in Stat 151

We can also interpret the population proportion as the probability of the event of interest when an individual is chosen at random from the population.

2 Estimation of a Population Proportion p

The sample proportion $\hat{p}$ of a sample is given by

$$\hat{p} = \frac{x}{n},$$

where $x$ denotes the number of members of the sample that have the specified attribute (the number of successes) and $n$ denotes the sample size.

It seems natural to use the sample proportion to estimate a population proportion. To confirm that this is statistically reasonable, we need to study the distribution of $\hat{p}$ (why is this a random variable?). The following result is also called the Central Limit Theorem for proportions.

For samples of size $n$:

1. (mean) $\mu_{\hat{p}} = p$
2. (standard deviation) $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$
3. (shape) If $n$ is large, then $\hat{p}$ is approximately normally distributed.

The first property means that $\hat{p}$ is an unbiased estimator of $p$; the second means that the larger $n$ is, the more likely $\hat{p}$ is to fall "close" to $p$; and the last property lets us construct confidence intervals and tests (yippee!).

Example: A study showed that the proportion of people in the 20 to 34 age group with an IQ (on the Wechsler Intelligence Scale) of over 120 is about 0.35. Calculate the probability that in a sample of 50 there are more than 20 people with an IQ over 120.
For this sample $\hat{p} = 20/50 = 0.4$. We will calculate how likely a sample proportion of 0.4 (or larger) is in a sample of size 50 when the true population proportion is 0.35:

$$
\begin{aligned}
P(\hat{p} \ge 0.4) &= P\left(\frac{\hat{p} - 0.35}{\sqrt{0.35(0.65)/50}} \ge \frac{0.4 - 0.35}{\sqrt{0.35(0.65)/50}}\right) && \text{(standardize)} \\
&= P(Z \ge 0.74) \\
&= 1 - P(Z < 0.74) \\
&= 1 - 0.7704 = 0.2296 && \text{(table II)}
\end{aligned}
$$

The probability that more than 20 out of 50 people (between 20 and 34) have an IQ greater than 120 is about 0.23. Not that unlikely.

2.1 Large-Sample Confidence Interval for a Population Proportion p

Let $p$ be the probability of an event of interest. We saw before that $\hat{p} = x/n$ is an unbiased estimate of $p$, where $x$ is the number of successes in $n$ trials. Usually $p$ is unknown, and based on a random sample we can calculate a $(1-\alpha)100\%$ confidence interval.

A $(1-\alpha)100\%$ Large-Sample Confidence Interval for a Population Proportion p:

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}},$$

where $z_{\alpha/2}$ is the $(1-\alpha/2)$ percentile of a standard normal distribution. Since $p$ is unknown, it is estimated using $\hat{p}$. The sample size is considered large when the normal approximation to the binomial distribution is adequate, namely when the number of successes and the number of failures are both at least five.

Proof:

$$
\begin{aligned}
P\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \le p \le \hat{p} + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}\right)
&= P\left(-z_{\alpha/2} \le \frac{p - \hat{p}}{\sqrt{p(1-p)/n}} \le z_{\alpha/2}\right) \\
&= P\left(\frac{p - \hat{p}}{\sqrt{p(1-p)/n}} \le z_{\alpha/2}\right) - P\left(\frac{p - \hat{p}}{\sqrt{p(1-p)/n}} < -z_{\alpha/2}\right) \\
&= \left(1 - \frac{\alpha}{2}\right) - \left(1 - \left(1 - \frac{\alpha}{2}\right)\right) \\
&= 1 - \alpha,
\end{aligned}
$$

since $(p - \hat{p})/\sqrt{p(1-p)/n}$ is, according to the Central Limit Theorem, approximately standard normally distributed.

Remark: A confidence interval is calculated when $p$ is unknown, so the boundaries are calculated by replacing $p$ with the unbiased estimator $\hat{p}$. This is only appropriate if $n$ is large, and it results in an approximate confidence interval; that is, the probability that the interval covers the parameter is approximately $1 - \alpha$.

So we use: Let $z_{\alpha/2}$ be the $(1-\alpha/2)$ percentile of the standard normal distribution, and let $np > 5$ and $n(1-p) > 5$. Then

$$\left[\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\ ;\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right]$$

is an approximate $(1-\alpha)100\%$ confidence interval for $p$.

Example: Consider flipping a coin 1000 times. Heads was observed in only 400 of the flips. Is this a surprising number if the coin is unbiased? To answer this question, calculate a 95% confidence interval from the data and check whether 0.5 (the probability of heads when tossing an unbiased coin) is in the confidence interval.

First check that the conditions are met: $np = n(1-p) = 1000 \cdot 0.5 = 500 > 5$. We conclude that we can apply the Central Limit Theorem and use the method described above to obtain a confidence interval:

$$
\begin{aligned}
\left[\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\ ;\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right]
&= \left[\frac{400}{1000} - 1.96\sqrt{\frac{0.4 \cdot 0.6}{1000}}\ ;\ 0.4 + 1.96\sqrt{\frac{0.4 \cdot 0.6}{1000}}\right] \\
&= [0.4 - 0.030\ ;\ 0.4 + 0.030] \\
&= [0.37\ ;\ 0.43]
\end{aligned}
$$

We can be 95% confident that the true probability of heads is in the interval [0.37 ; 0.43]. Since 0.5 is not in the interval, it seems unlikely that 0.5 is the true probability of heads. Check the coin to see what makes it biased!

2.2 Choosing the Sample Size

The margin of error for the estimation of $p$ is

$$E = z_{\alpha/2}\sqrt{p(1-p)/n}.$$

Choosing the sample size for estimating a proportion $p$ follows the same argument as finding the sample size for estimating a mean $\mu$, except that the formula is based on a different confidence interval. Suppose a probability $p$ is to be estimated within a margin of error $E$ with a $(1-\alpha)100\%$ confidence interval. Then

$$n \ge \left(\frac{z_{\alpha/2}}{E}\right)^2 p(1-p).$$

Since $p$ is not known, use a guess, or use $p = 0.5$ as a conservative value in this formula (the product $p(1-p)$ is largest at $p = 0.5$).

Example: A poll is to be conducted to find the proportion of Canadians supporting the Liberal party within a margin of error of 3% ($E = 0.03$) at 95% confidence. Then

$$n \ge \left(\frac{1.96}{0.03}\right)^2 0.5(0.5) = 1067.111.$$

A sample size of 1068 would be required to meet this goal. (This is why most polls are based on samples of a little over 1000.)
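The interval and sample-size calculations of sections 2.1 and 2.2 can be reproduced numerically. The following is a minimal sketch using only the Python standard library; the function names (`z_critical`, `proportion_ci`, `required_sample_size`) are illustrative choices and not part of the notes, and the critical value is taken from the exact normal quantile rather than the rounded table value 1.96.

```python
import math
from statistics import NormalDist


def z_critical(confidence: float) -> float:
    """Return z_{alpha/2}, the (1 - alpha/2) quantile of the standard normal."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def proportion_ci(successes: int, n: int, confidence: float = 0.95):
    """Approximate interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    margin = z_critical(confidence) * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def required_sample_size(margin: float, confidence: float = 0.95,
                         p_guess: float = 0.5) -> int:
    """Smallest integer n with n >= (z / E)^2 * p * (1 - p).

    p_guess = 0.5 is the conservative choice described in section 2.2.
    """
    z = z_critical(confidence)
    return math.ceil((z / margin) ** 2 * p_guess * (1.0 - p_guess))


low, high = proportion_ci(400, 1000)   # coin example: 400 heads in 1000 flips
n_poll = required_sample_size(0.03)    # poll example: E = 0.03 at 95%
print(f"95% CI: [{low:.3f} ; {high:.3f}]")
print(f"required n: {n_poll}")
```

Rounding the sample size up with `math.ceil` mirrors the step from 1067.111 to 1068 in the poll example, since any smaller whole number would miss the target margin of error.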
2.3 A Large Sample Test Concerning a Proportion p

To develop a test, we again use the facts we know from the CLT. The point estimator for a proportion is the sample proportion $\hat{p}$. From the Central Limit Theorem we know the following about the sampling distribution of $\hat{p}$:

1. $\mu_{\hat{p}} = p$
2. $\sigma_{\hat{p}} = \sqrt{\dfrac{p(1-p)}{n}}$
3. If $n$ is large, the sampling distribution of $\hat{p}$ is approximately normal.

So we get that

$$z = \frac{\hat{p} - p}{\sqrt{\dfrac{p(1-p)}{n}}}$$

is approximately standard normally distributed for large sample sizes. Using these properties it can be proved that the following procedure is a statistical test which ensures that the probability of a type I error is at most $\alpha$.

A Large Sample Test Concerning a Proportion p

1. Hypotheses:

   Test type     Hypotheses
   Upper tail    $H_0: p \le p_0$ versus $H_a: p > p_0$
   Lower tail    $H_0: p \ge p_0$ versus $H_a: p < p_0$
   Two tail      $H_0: p = p_0$ versus $H_a: p \ne p_0$

   Choose $\alpha$.

2. Assumption: The sample is a random sample and the sample size $n$ is large, that is, $n\hat{p} > 5$ and $n(1-\hat{p}) > 5$.

3. Test statistic: Let $p_0$ be a value between zero and one and define the test statistic

   $$z_0 = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$$

4. P-value and rejection region:

   Test type     P-value                   Rejection region
   Upper tail    $P(z > z_0)$              $z_0 > z_\alpha$
   Lower tail    $P(z < z_0)$              $z_0 < -z_\alpha$
   Two tail      $2 \cdot P(z > |z_0|)$    $|z_0| > z_{\alpha/2}$

   Here $z_\alpha$ is the $(1-\alpha)$ percentile of the standard normal distribution.

5. Decision: If P-value $\le \alpha$, or $z_0$ falls in the rejection region, then reject $H_0$. If P-value $> \alpha$, or $z_0$ does not fall in the rejection region, then do not reject $H_0$.

6. Context.

Example: Suppose that you want to show that the proportion of adults above 40 who participate in fitness activities is below 0.2.

1. You want to test (putting what you want to show into the alternative hypothesis $H_a$) $H_0: p \ge 0.2$ vs. $H_a: p < 0.2$ at a significance level of $\alpha = 0.05$.

2. The sample size is $n = 100$ and the number of people sampled who participate in those activities equals 19, so that $\hat{p} = 0.19$, $n\hat{p} = 19 > 5$ and $n(1-\hat{p}) = 81 > 5$. The assumptions are met (assuming the sample was randomly chosen).

3. Then

   $$z_0 = \frac{0.19 - 0.2}{\sqrt{\dfrac{0.2 \cdot 0.8}{100}}} = -0.25$$

4. Now calculate the P-value. According to the choice of $H_a$ this is a lower tail test, so the P-value is the lower tail probability: P-value $= P(z < -0.25) = 0.4013$ (from table II).

5. Decision: Since P-value $= 0.4013 > 0.05 = \alpha$, $H_0$ is not rejected.

6. Context: At a significance level of 5% the sample data do not provide sufficient evidence that less than 20% of adults 40 and older take part in fitness activities.

2.4 Estimating the Difference between Two Population Proportions

Instead of comparing two population means, let's now compare two population proportions. Assume you want to compare

• the rate of people who play computer games in the age groups 20 to 30 and 30 to 40
• the proportion of defective items manufactured in two production lines

The statistic for estimating the difference between two population proportions that comes to mind is the difference between the sample proportions, $(\hat{p}_1 - \hat{p}_2)$. Let us study the sampling distribution of this statistic in order to construct a confidence interval.

Properties of the Sampling Distribution of the Difference between Two Sample Proportions $(\hat{p}_1 - \hat{p}_2)$

Consider two independent samples of sizes $n_1$ and $n_2$ from binomial populations with parameters $p_1$ and $p_2$, respectively. The sampling distribution of $(\hat{p}_1 - \hat{p}_2)$ has these properties:

1. The mean of $(\hat{p}_1 - \hat{p}_2)$ is

   $$\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$$

   and the standard error is

   $$SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}},$$

   which is estimated by

   $$\widehat{SE} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}.$$

2. The sampling distribution of $(\hat{p}_1 - \hat{p}_2)$ is approximately normal when the sample sizes $n_1$ and $n_2$ are large, that is, when $n_1 p_1 > 5$, $n_1(1-p_1) > 5$, $n_2 p_2 > 5$ and $n_2(1-p_2) > 5$.

These results now lead to the description of the estimation of $(p_1 - p_2)$.

Large Sample Point Estimation of $(p_1 - p_2)$

Point estimate: $(\hat{p}_1 - \hat{p}_2)$

Margin of error: $z_{\alpha/2}\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}$

Large Sample $(1-\alpha)100\%$ Confidence Interval for $(p_1 - p_2)$

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

For this we have to assume again that $n_1$ and $n_2$ are large, that is, $n_1 p_1$, $n_1(1-p_1)$, $n_2 p_2$ and $n_2(1-p_2)$ are all greater than 5.

When we try to apply the tools described above, we find that $p_1$ and $p_2$, the population proportions, are unknown. In order to use the above procedures, we have to replace the population proportions by their estimates $\hat{p}_1$ and $\hat{p}_2$. For a 95% interval, for example, you estimate the margin of error by

$$\pm 1.96\,\widehat{SE} = \pm 1.96\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

and in general use the following

Approximate Large Sample $(1-\alpha)100\%$ Confidence Interval for $(p_1 - p_2)$

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

For this we have to assume again that $n_1$ and $n_2$ are large, that is, $n_1 p_1$, $n_1(1-p_1)$, $n_2 p_2$ and $n_2(1-p_2)$ are all greater than 5.

Example: Suppose we want to compare two therapies. The criterion for the comparison is the probability of surviving at least 5 years after therapy. The study produced the following data:

                     Population 1    Population 2
   n                 100             80
   x                 90              70
   p̂ = x/n           0.9             0.875

That is, 90 out of 100 patients who underwent therapy 1 survived at least 5 years, as did 70 out of 80 patients who underwent therapy 2.

If we use $\hat{p}_1$ as the estimate for $p_1$ and $\hat{p}_2$ as the estimate for $p_2$, we find that $n_1 p_1$, $n_1(1-p_1)$, $n_2 p_2$ and $n_2(1-p_2)$ are all greater than 5. So we can use the formula from above to calculate a 95% confidence interval for $p_1 - p_2$:

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} = 0.025 \pm 1.96\sqrt{\frac{0.9(0.1)}{100} + \frac{0.875(0.125)}{80}} = 0.025 \pm 0.093$$

or [−0.068 ; 0.118].
Since 0 is captured in this interval, we find that these data do not provide evidence that the two therapies result in different probabilities of surviving 5 years. The probabilities may be different, but these data do not show it.

2.5 Statistical Test for Two Population Proportions p1 and p2

Notation:

                   Population proportion    Sample size    Sample proportion
   Population 1    $p_1$                    $n_1$          $\hat{p}_1$
   Population 2    $p_2$                    $n_2$          $\hat{p}_2$

Large-Sample z Test for Comparing p1 and p2

• Hypotheses:

   Test type     Hypotheses
   Upper tail    $H_0: p_1 - p_2 \le 0$ versus $H_a: p_1 - p_2 > 0$
   Lower tail    $H_0: p_1 - p_2 \ge 0$ versus $H_a: p_1 - p_2 < 0$
   Two tail      $H_0: p_1 - p_2 = 0$ versus $H_a: p_1 - p_2 \ne 0$

• Assumption: The samples are random samples and both sample sizes are large: $n_1\hat{p}_1 > 5$, $n_1(1-\hat{p}_1) > 5$, $n_2\hat{p}_2 > 5$, $n_2(1-\hat{p}_2) > 5$.

• Test statistic:

   $$z_0 = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{\hat{p}_c(1-\hat{p}_c)}{n_1} + \dfrac{\hat{p}_c(1-\hat{p}_c)}{n_2}}}$$

   where $\hat{p}_c = (x_1 + x_2)/(n_1 + n_2)$ is the combined sample proportion of both samples, as used in the example below.

• P-value and rejection region:

   Test type     P-value                    Rejection region
   Upper tail    $P(z > z_0)$               $z_0 > z_\alpha$
   Lower tail    $P(z < z_0)$               $z_0 < -z_\alpha$
   Two tail      $2 \cdot P(z < -|z_0|)$    $|z_0| > z_{\alpha/2}$

• Decision

• Context

Example: Determine whether the proportions of red M&M's in the plain and peanut varieties differ, at a significance level of 0.05. The sample:

                           Plain (1)    Peanut (2)
   Sample size             56           32
   Number of red M&M's     12           8

This results in $\hat{p}_1 = 12/56 = 0.214$, $\hat{p}_2 = 8/32 = 0.25$ and $\hat{p}_c = (12 + 8)/(56 + 32) = 20/88 = 0.227$.

1. The question asks for a test of $H_0: p_1 - p_2 = 0$ vs. $H_a: p_1 - p_2 \ne 0$, with $\alpha = 0.05$.

2. Assumption: Since $\hat{p}_1 n_1$, $(1-\hat{p}_1)n_1$, $\hat{p}_2 n_2$ and $(1-\hat{p}_2)n_2$ are all greater than 5, the assumptions are met and the test will deliver a reliable result.

3. Test statistic:

   $$z_0 = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{\hat{p}_c(1-\hat{p}_c)}{n_1} + \dfrac{\hat{p}_c(1-\hat{p}_c)}{n_2}}} = \frac{0.214 - 0.25}{\sqrt{\dfrac{0.227(0.773)}{56} + \dfrac{0.227(0.773)}{32}}} = \frac{-0.036}{\sqrt{0.0031 + 0.0055}} = -0.3882$$

4. Rejection region: With $\alpha = 0.05$ the rejection region for a two-tailed test is $|z_0| > z_{\alpha/2} = 1.96$. Or, using the P-value: two-tailed P-value $= 2P(z > |z_0|) = 2P(z > 0.3882) = 2(1 - 0.6517) = 0.6966$.

5. Decision: Since the P-value is not smaller than $\alpha = 0.05$, do not reject $H_0$ at a significance level of 0.05.

6. Context: At a significance level of 5% we conclude that we do not have enough evidence that the proportion of red M&M's is different for the plain and peanut varieties.
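The test statistics and P-values of sections 2.3 and 2.5 can also be computed directly. The sketch below uses only the Python standard library; the helper names (`p_value`, `one_proportion_z`, `two_proportion_z`) are hypothetical, and the pooled proportion is assumed to be $(x_1 + x_2)/(n_1 + n_2)$, as in the M&M example.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def p_value(z0: float, tail: str) -> float:
    """P-value of a z statistic for tail in {"upper", "lower", "two"}."""
    if tail == "upper":
        return 1.0 - STD_NORMAL.cdf(z0)
    if tail == "lower":
        return STD_NORMAL.cdf(z0)
    if tail == "two":
        return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z0)))
    raise ValueError(f"unknown tail: {tail!r}")


def one_proportion_z(x: int, n: int, p0: float) -> float:
    """z0 = (p_hat - p0) / sqrt(p0 * (1 - p0) / n), as in section 2.3."""
    p_hat = x / n
    return (p_hat - p0) / math.sqrt(p0 * (1.0 - p0) / n)


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """z0 using the pooled proportion p_c = (x1 + x2) / (n1 + n2), section 2.5."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_c = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_c * (1.0 - p_c) * (1.0 / n1 + 1.0 / n2))
    return (p1_hat - p2_hat) / se


z_fit = one_proportion_z(19, 100, 0.2)   # fitness example, lower tail test
z_mm = two_proportion_z(12, 56, 8, 32)   # M&M example, two tail test
print(f"fitness: z0 = {z_fit:.2f}, P-value = {p_value(z_fit, 'lower'):.4f}")
print(f"M&M:     z0 = {z_mm:.4f}, P-value = {p_value(z_mm, 'two'):.4f}")
```

Because the code works with unrounded proportions, the M&M statistic comes out at about −0.385 rather than the −0.3882 obtained by hand from rounded intermediate values. The P-value shifts slightly as well, but the decision not to reject $H_0$ is unchanged.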