Chapter 11 Comparing Two Populations

11.1 Inferences concerning the difference between two population means using independent samples

Many investigations are carried out for the purpose of comparing two populations. For example:

Population 1 = {all nonsmokers}; \(\mu_1\) = the mean life expectancy of nonsmokers.
Population 2 = {all smokers}; \(\mu_2\) = the mean life expectancy of smokers.

A medical researcher may be interested in whether \(\mu_1 > \mu_2\).

Notation

                Mean value    Variance          Standard deviation
Population 1    \(\mu_1\)     \(\sigma_1^2\)    \(\sigma_1\)
Population 2    \(\mu_2\)     \(\sigma_2^2\)    \(\sigma_2\)

                            Sample size    Sample mean      Sample variance    Sample standard deviation
Sample from population 1    \(n_1\)        \(\bar{x}_1\)    \(s_1^2\)          \(s_1\)
Sample from population 2    \(n_2\)        \(\bar{x}_2\)    \(s_2^2\)          \(s_2\)

Comparison of means focuses on the difference \(\mu_1 - \mu_2\):

\(\mu_1 - \mu_2 = 0\) is equivalent to \(\mu_1 = \mu_2\)
\(\mu_1 - \mu_2 > 0\) is equivalent to \(\mu_1 > \mu_2\)
\(\mu_1 - \mu_2 < 0\) is equivalent to \(\mu_1 < \mu_2\)

Definition 11.1 Two samples are said to be independent if the selection of the individuals or objects that make up one sample does not influence the selection of those in the other sample.

Since \(\bar{x}_1\) provides an estimate of \(\mu_1\) and \(\bar{x}_2\) gives an estimate of \(\mu_2\), it is natural to use \(\bar{x}_1 - \bar{x}_2\) as a point estimate of \(\mu_1 - \mu_2\). Our inferential methods will be based on information about the sampling distribution of \(\bar{x}_1 - \bar{x}_2\).

Properties of the sampling distribution of \(\bar{x}_1 - \bar{x}_2\)

If the random samples on which \(\bar{x}_1\) and \(\bar{x}_2\) are based are selected independently, then

1. The mean value of \(\bar{x}_1 - \bar{x}_2\) is

\[ \mu_{\bar{x}_1 - \bar{x}_2} = \mu_{\bar{x}_1} - \mu_{\bar{x}_2} = \mu_1 - \mu_2. \]

That is, the sampling distribution of \(\bar{x}_1 - \bar{x}_2\) is always centered at the value of \(\mu_1 - \mu_2\), so \(\bar{x}_1 - \bar{x}_2\) is an unbiased statistic for estimating \(\mu_1 - \mu_2\).

2. The variance of \(\bar{x}_1 - \bar{x}_2\) is

\[ \sigma^2_{\bar{x}_1 - \bar{x}_2} = \sigma^2_{\bar{x}_1} + \sigma^2_{\bar{x}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}, \]

and the standard deviation of \(\bar{x}_1 - \bar{x}_2\) is

\[ \sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}. \]

3. If both population distributions are normal, \(\bar{x}_1\) and \(\bar{x}_2\) each have a normal distribution, and \(\bar{x}_1 - \bar{x}_2\) also has a normal distribution. Thus,

\[ z = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \]

has the standard normal (z) distribution.

4.
If \(n_1\) and \(n_2\) are both large (\(n_1 \ge 30\) and \(n_2 \ge 30\)), \(\bar{x}_1\) and \(\bar{x}_2\) each have approximately a normal distribution, and \(\bar{x}_1 - \bar{x}_2\) has approximately a normal distribution even when the population distributions are not themselves normal. Thus,

\[ z = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \]

has approximately the standard normal (z) distribution.

Note: Properties 1 and 2 follow from the following general results:

1) The mean value of a difference equals the difference of the two individual mean values.
2) The variance of a difference of independent quantities is the sum of the two individual variances.

Generally, \(\sigma_1^2\) and \(\sigma_2^2\) are unknown, so we must estimate them by the corresponding sample variances, \(s_1^2\) and \(s_2^2\), and use

\[ t = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}. \]

The sampling distribution of t

If (1) the two random samples are independently selected, and (2) the population distributions are normal or \(n_1\) and \(n_2\) are both large (\(n_1 \ge 30\) and \(n_2 \ge 30\)), then the standardized variable

\[ t = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]

has (approximately) a t distribution with degrees of freedom

\[ df = \frac{(V_1 + V_2)^2}{\frac{V_1^2}{n_1 - 1} + \frac{V_2^2}{n_2 - 1}}, \quad \text{where } V_1 = \frac{s_1^2}{n_1} \text{ and } V_2 = \frac{s_2^2}{n_2}. \]

df should be truncated (rounded down) to an integer.

Summary of the two-sample t test for comparing two population means

Null hypothesis: \(H_0\): \(\mu_1 - \mu_2\) = hypothesized value

Test statistic:

\[ t = \frac{\bar{x}_1 - \bar{x}_2 - \text{hypothesized value}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]

The appropriate df for the two-sample t test is

\[ df = \frac{(V_1 + V_2)^2}{\frac{V_1^2}{n_1 - 1} + \frac{V_2^2}{n_2 - 1}}, \quad \text{where } V_1 = \frac{s_1^2}{n_1} \text{ and } V_2 = \frac{s_2^2}{n_2}. \]

df should be truncated (rounded down) to an integer.

Alternative hypothesis and corresponding P-value:

\(H_a\): \(\mu_1 - \mu_2 >\) hypothesized value (upper-tailed test): area under the appropriate t curve to the right of the computed t
\(H_a\): \(\mu_1 - \mu_2 <\) hypothesized value (lower-tailed test): area under the appropriate t curve to the left of the computed t
\(H_a\): \(\mu_1 - \mu_2 \ne\) hypothesized value (two-tailed test): (i) 2(area to the right of the computed t) if t is positive; (ii) 2(area to the left of the computed t) if t is negative

Assumptions:

1.
The two samples are independently selected random samples.
2. \(n_1\) and \(n_2\) are both large (\(n_1 \ge 30\) and \(n_2 \ge 30\)) or the population distributions are at least approximately normal.

The two-sample t confidence interval for the difference between two population means

When

1. the two samples are independent random samples, and
2. the sample sizes are both large (generally \(n_1 \ge 30\) and \(n_2 \ge 30\)) OR the population distributions are approximately normal,

the general formula for a confidence interval for \(\mu_1 - \mu_2\) is

\[ (\bar{x}_1 - \bar{x}_2) \pm (t \text{ critical value}) \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}. \]

The t critical value is based on

\[ df = \frac{(V_1 + V_2)^2}{\frac{V_1^2}{n_1 - 1} + \frac{V_2^2}{n_2 - 1}}, \quad \text{where } V_1 = \frac{s_1^2}{n_1} \text{ and } V_2 = \frac{s_2^2}{n_2}. \]

df should be truncated (rounded down) to an integer. The t critical values for the usual confidence levels are given in Appendix Table 3 on page 708.

Example 11.1 In a study comparing the salaries of workers with college degrees and workers without college degrees, the following summary data resulted from independent random samples of both types of workers.

Group                Sample size    Sample mean salary    Sample standard deviation
College degree       100            $32,500               $4,100
No college degree    75             $25,800               $5,600

(a) Is there sufficient evidence to conclude that college graduates earn at least $5,000 more than workers not having a college degree, on the average? Use \(\alpha = .05\).
(b) Estimate the difference in mean salaries for workers having a college degree and workers not having a college degree using a 95% confidence interval.
(c) What is your conclusion based on the confidence interval in part (b)?

(a)
1. Population characteristics of interest: \(\mu_1\) = true mean salary of workers having a college degree. \(\mu_2\) = true mean salary of workers not having a college degree. \(\mu_1 - \mu_2\) = difference in mean salaries.
2. Null hypothesis: \(H_0\): \(\mu_1 - \mu_2 =\) ?
3. Alternative hypothesis: \(H_a\): \(\mu_1 - \mu_2 >\) ?
4. Significance level: \(\alpha = 0.05\)
5. Test statistic:

\[ t = \frac{\bar{x}_1 - \bar{x}_2 - \text{hypothesized value}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{\bar{x}_1 - \bar{x}_2 - 5{,}000}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]

6.
Assumptions: This test requires two independent random samples and two large sample sizes. Because the given samples were two independent random samples and the sample sizes were \(n_1 = 100\) and \(n_2 = 75\), the two-sample t test is appropriate.
7. Computations: \(n_1 = 100\), \(\bar{x}_1 = 32{,}500\), \(s_1 = 4{,}100\), \(n_2 = 75\), \(\bar{x}_2 = 25{,}800\), and \(s_2 = 5{,}600\), so

\[ t = \frac{(32{,}500 - 25{,}800) - 5{,}000}{\sqrt{\frac{(4{,}100)^2}{100} + \frac{(5{,}600)^2}{75}}} = \frac{1{,}700}{765.6588} \approx 2.22. \]

8. P-value: We first compute the df for the two-sample t test:

\[ V_1 = \frac{s_1^2}{n_1} = \frac{4{,}100^2}{100} = 168{,}100, \quad V_2 = \frac{s_2^2}{n_2} = \frac{5{,}600^2}{75} = 418{,}133.3333, \]

\[ df = \frac{(V_1 + V_2)^2}{\frac{V_1^2}{n_1 - 1} + \frac{V_2^2}{n_2 - 1}} = \frac{(168{,}100 + 418{,}133.3333)^2}{\frac{168{,}100^2}{100 - 1} + \frac{418{,}133.3333^2}{75 - 1}} \approx 129.781. \]

We truncate df to 129. P-value = area under the t curve with df = 129 to the right of 2.22 \(\approx 1 - P(z < 2.22) = 0.0132\).
9. Conclusion: Since P-value = 0.0132 < 0.05 = \(\alpha\), \(H_0\) is rejected at the .05 level. There is sufficient evidence to conclude that college graduates earn at least $5,000 more than workers not having a college degree on the average.

(b) A 95% confidence interval for \(\mu_1 - \mu_2\) is

\[ (\bar{x}_1 - \bar{x}_2) \pm (t \text{ critical value}) \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = (32{,}500 - 25{,}800) \pm 1.96 \sqrt{\frac{(4{,}100)^2}{100} + \frac{(5{,}600)^2}{75}} \]

\[ = 6{,}700 \pm 1.96(765.6588) = 6{,}700 \pm 1{,}500.69 = (5{,}199.31,\ 8{,}200.69). \]

Question: What is the interpretation of the confidence interval?

(c) Since the entire interval is above 5,000, we can conclude that college graduates earn at least $5,000 more than workers not having a college degree on the average, the same conclusion as in part (a).

11.2 Large-sample inferences concerning a difference between two population proportions

Many investigations are carried out to compare the proportion of successes in one population to the proportion of successes in a second population.
Notation

Population 1: Proportion of "successes" = \(\pi_1\)
Population 2: Proportion of "successes" = \(\pi_2\)

                            Sample size    Sample proportion of successes
Sample from population 1    \(n_1\)        \(p_1\)
Sample from population 2    \(n_2\)        \(p_2\)

When comparing the "success" proportions of two populations, it is common to focus on the quantity \(\pi_1 - \pi_2\), the difference between the two proportions. Since \(p_1\) provides an estimate of \(\pi_1\) and \(p_2\) provides an estimate of \(\pi_2\), the obvious choice for an estimate of \(\pi_1 - \pi_2\) is \(p_1 - p_2\). Since the statistic \(p_1 - p_2\) will be the basis for drawing inferences about \(\pi_1 - \pi_2\), we first introduce some properties of its sampling distribution.

Properties of the sampling distribution of \(p_1 - p_2\)

If two random samples are selected independently, the following properties hold:

1. \(\mu_{p_1 - p_2} = \pi_1 - \pi_2\). That is, the sampling distribution of \(p_1 - p_2\) is centered at \(\pi_1 - \pi_2\), so \(p_1 - p_2\) is an unbiased statistic for estimating \(\pi_1 - \pi_2\).

2. The variance and standard deviation of \(p_1 - p_2\) are

\[ \sigma^2_{p_1 - p_2} = \sigma^2_{p_1} + \sigma^2_{p_2} = \frac{\pi_1(1 - \pi_1)}{n_1} + \frac{\pi_2(1 - \pi_2)}{n_2}, \qquad \sigma_{p_1 - p_2} = \sqrt{\frac{\pi_1(1 - \pi_1)}{n_1} + \frac{\pi_2(1 - \pi_2)}{n_2}}. \]

3. If both \(n_1\) and \(n_2\) are large (that is, \(n_1\pi_1 \ge 10\), \(n_1(1 - \pi_1) \ge 10\), \(n_2\pi_2 \ge 10\), and \(n_2(1 - \pi_2) \ge 10\)), then \(p_1\) and \(p_2\) each have approximately a normal distribution, and \(p_1 - p_2\) also has approximately a normal distribution. Thus, the standardized variable

\[ z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\frac{\pi_1(1 - \pi_1)}{n_1} + \frac{\pi_2(1 - \pi_2)}{n_2}}} \]

has approximately the standard normal (z) distribution.

A large-sample test procedure

For comparisons of \(\pi_1\) and \(\pi_2\), the most general null hypothesis of interest has the form

\(H_0\): \(\pi_1 - \pi_2\) = hypothesized value

Since \(H_0\): \(\pi_1 - \pi_2 = 0\), which is equivalent to \(\pi_1 = \pi_2\), is almost always the relevant one in applied problems, we will focus exclusively on it. Let \(\pi\) denote the common value of the two population proportions. Then the z variable obtained by standardizing \(p_1 - p_2\) becomes

\[ z = \frac{p_1 - p_2}{\sqrt{\frac{\pi(1 - \pi)}{n_1} + \frac{\pi(1 - \pi)}{n_2}}}. \]

Unfortunately, this cannot serve as a test statistic, because \(\pi\) is unknown and thus the denominator cannot be computed. A test statistic can be obtained by first estimating \(\pi\) from the sample data and then using this estimate in the denominator of z.
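The sampling-distribution properties of \(p_1 - p_2\) stated above can be checked numerically. The following is only a minimal simulation sketch: the population proportions 0.40 and 0.30, the sample sizes of 100, the number of repetitions, and all function names are hypothetical choices made for illustration and do not come from the notes.

```python
import math
import random
import statistics


def sd_of_difference(pi1, n1, pi2, n2):
    """Theoretical standard deviation of p1 - p2 (Property 2)."""
    return math.sqrt(pi1 * (1 - pi1) / n1 + pi2 * (1 - pi2) / n2)


def sample_sizes_large(pi1, n1, pi2, n2, threshold=10):
    """Large-sample condition of Property 3: all four expected counts >= 10."""
    counts = (n1 * pi1, n1 * (1 - pi1), n2 * pi2, n2 * (1 - pi2))
    return all(c >= threshold for c in counts)


def simulate_differences(pi1, n1, pi2, n2, reps=4000, seed=0):
    """Draw `reps` pairs of independent samples and return the p1 - p2 values."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        p1 = sum(rng.random() < pi1 for _ in range(n1)) / n1
        p2 = sum(rng.random() < pi2 for _ in range(n2)) / n2
        diffs.append(p1 - p2)
    return diffs


# Hypothetical population values, chosen only for illustration.
PI1, N1, PI2, N2 = 0.40, 100, 0.30, 100

diffs = simulate_differences(PI1, N1, PI2, N2)
print("large-sample condition met:", sample_sizes_large(PI1, N1, PI2, N2))
print("simulated mean of p1 - p2: ", round(statistics.mean(diffs), 4))
print("theoretical mean pi1 - pi2:", round(PI1 - PI2, 4))
print("simulated sd of p1 - p2:   ", round(statistics.pstdev(diffs), 4))
print("theoretical sd:            ", round(sd_of_difference(PI1, N1, PI2, N2), 4))
```

With a few thousand repetitions, the simulated mean and standard deviation should land close to \(\pi_1 - \pi_2\) and \(\sigma_{p_1 - p_2}\), which is what Properties 1 and 2 predict.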
When \(\pi_1 = \pi_2\), either \(p_1\) or \(p_2\) separately gives an estimate of the common proportion \(\pi\). However, a better estimate than either of these is a weighted average of the two.

Definition 11.2 The combined estimate of the common population proportion is

\[ p_c = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{\text{total number of S's in the two samples}}{\text{total sample size}}. \]

The test statistic for testing \(H_0\): \(\pi_1 - \pi_2 = 0\) results from using \(p_c\) in place of \(\pi\) in the standardized variable z. When \(H_0\) is true,

\[ z = \frac{p_1 - p_2}{\sqrt{\frac{p_c(1 - p_c)}{n_1} + \frac{p_c(1 - p_c)}{n_2}}} \]

has approximately the standard normal distribution. Thus we have the following test procedure.

Summary of large-sample z tests for \(\pi_1 - \pi_2 = 0\)

Null hypothesis: \(H_0\): \(\pi_1 - \pi_2 = 0\)

Test statistic:

\[ z = \frac{p_1 - p_2}{\sqrt{\frac{p_c(1 - p_c)}{n_1} + \frac{p_c(1 - p_c)}{n_2}}} \]

Alternative hypothesis and corresponding P-value:

\(H_a\): \(\pi_1 - \pi_2 > 0\) (upper-tailed test): area under the z curve to the right of the computed z
\(H_a\): \(\pi_1 - \pi_2 < 0\) (lower-tailed test): area under the z curve to the left of the computed z
\(H_a\): \(\pi_1 - \pi_2 \ne 0\) (two-tailed test): (i) 2(area to the right of the computed z) if z is positive; (ii) 2(area to the left of the computed z) if z is negative

Assumptions:

1. The samples are independent random samples.
2. Both sample sizes are large: \(n_1 p_1 \ge 10\), \(n_1(1 - p_1) \ge 10\), \(n_2 p_2 \ge 10\), and \(n_2(1 - p_2) \ge 10\).

A confidence interval

A large-sample confidence interval for \(\pi_1 - \pi_2\) is a special case of the general z interval formula

Point estimate \(\pm\) (z critical value)(estimated standard deviation)

The statistic \(p_1 - p_2\) gives a point estimate of \(\pi_1 - \pi_2\), and the standard deviation of this statistic is

\[ \sigma_{p_1 - p_2} = \sqrt{\frac{\pi_1(1 - \pi_1)}{n_1} + \frac{\pi_2(1 - \pi_2)}{n_2}}. \]

An estimated standard deviation is obtained by using the sample proportions \(p_1\) and \(p_2\) in place of \(\pi_1\) and \(\pi_2\), respectively.

A large-sample confidence interval for \(\pi_1 - \pi_2\)

When

1. the samples are independent random samples, and
2. both sample sizes are large: \(n_1 p_1 \ge 10\), \(n_1(1 - p_1) \ge 10\), \(n_2 p_2 \ge 10\), and \(n_2(1 - p_2) \ge 10\),
a large-sample confidence interval for \(\pi_1 - \pi_2\) is

\[ (p_1 - p_2) \pm (z \text{ critical value}) \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}. \]

Example 11.2 Independent samples of the opinions of registered Republican and Democratic party members were taken concerning the weakening of the Endangered Species Act (ESA). The results are listed in the table below.

Party affiliation    For weakening ESA    Against weakening ESA
Republican           38                   62
Democrat             26                   74

Let \(\pi_1\) denote the proportion of Republicans who are in favor of weakening the Endangered Species Act and \(\pi_2\) denote the proportion of Democrats who are in favor of weakening the Endangered Species Act.

(a) Give point estimates of \(\pi_1\), \(\pi_2\), and \(\pi_1 - \pi_2\).
(b) Is there sufficient evidence to conclude that a higher proportion of Republicans would like to weaken the Endangered Species Act? Use \(\alpha = .05\).
(c) Compute a 95% confidence interval for \(\pi_1 - \pi_2\).
(d) What is your conclusion based on the confidence interval in part (c)?

(a) The point estimates for \(\pi_1\), \(\pi_2\), and \(\pi_1 - \pi_2\) are \(p_1 = 38 / (38 + 62) = 0.38\), \(p_2 = 26 / (26 + 74) = 0.26\), and \(p_1 - p_2 = 0.38 - 0.26 = 0.12\).

(b)
1. Population characteristics of interest: \(\pi_1\) = the proportion of Republicans who are in favor of weakening the Endangered Species Act. \(\pi_2\) = the proportion of Democrats who are in favor of weakening the Endangered Species Act. \(\pi_1 - \pi_2\) = the difference between the proportions of Republicans and Democrats who are in favor of weakening the Endangered Species Act.
2. Null hypothesis: \(H_0\): \(\pi_1 - \pi_2 = 0\)
3. Alternative hypothesis: \(H_a\): \(\pi_1 - \pi_2 > 0\)
4. Significance level: \(\alpha = 0.05\)
5. Test statistic:

\[ z = \frac{p_1 - p_2}{\sqrt{\frac{p_c(1 - p_c)}{n_1} + \frac{p_c(1 - p_c)}{n_2}}} \]

6. Assumptions: This test requires two independent random samples and two large sample sizes. The given samples were two independent random samples with sample sizes of \(n_1 = 38 + 62 = 100\) and \(n_2 = 26 + 74 = 100\). Since \(n_1 p_1 = 38 > 10\), \(n_1(1 - p_1) = 62 > 10\), \(n_2 p_2 = 26 > 10\), and \(n_2(1 - p_2) = 74 > 10\), the large-sample z test for \(\pi_1 - \pi_2 = 0\) is appropriate.
7.
Calculations: \(n_1 = n_2 = 100\), \(p_1 = 0.38\), \(p_2 = 0.26\), so

\[ p_c = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{100(0.38) + 100(0.26)}{100 + 100} = 0.32, \]

\[ z = \frac{0.38 - 0.26}{\sqrt{\frac{0.32(1 - 0.32)}{100} + \frac{0.32(1 - 0.32)}{100}}} = \frac{0.12}{0.066} = 1.82. \]

8. P-value: P-value = the area under the z curve to the right of 1.82 = 1 - 0.9656 = 0.0344.
9. Conclusion: Since P-value = 0.0344 < 0.05 = \(\alpha\), \(H_0\) is rejected. There is sufficient evidence to conclude that a higher proportion of Republicans would like to weaken the Endangered Species Act.

(c) A 95% confidence interval for \(\pi_1 - \pi_2\) is

\[ (p_1 - p_2) \pm (z \text{ critical value}) \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} = (0.38 - 0.26) \pm (1.96) \sqrt{\frac{(0.38)(1 - 0.38)}{100} + \frac{(0.26)(1 - 0.26)}{100}} \]

\[ = 0.12 \pm 1.96(0.0654) = 0.12 \pm 0.1282 = (-0.0082,\ 0.2482). \]

Question: What is the interpretation of the confidence interval?

(d) Since 0 is in this interval, there is not sufficient evidence to conclude that a higher proportion of Republicans would like to weaken the Endangered Species Act.

Question: Why does the conclusion based on the confidence interval differ from the one based on the hypothesis test?
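The arithmetic of Example 11.2 can be reproduced with a short script. This is a sketch only: the function names `pooled_z_test` and `difference_ci` are hypothetical and not part of the notes, and the normal areas come from Python's standard-library `statistics.NormalDist` rather than the z table. The script works with unrounded values, so the last digit may differ slightly from the hand calculation.

```python
import math
from statistics import NormalDist


def pooled_z_test(x1, n1, x2, n2):
    """Upper-tailed large-sample z test of H0: pi1 - pi2 = 0.

    x1 and x2 are the numbers of successes in the two samples.
    """
    p1, p2 = x1 / n1, x2 / n2
    pc = (x1 + x2) / (n1 + n2)  # combined estimate (Definition 11.2)
    se = math.sqrt(pc * (1 - pc) / n1 + pc * (1 - pc) / n2)
    z = (p1 - p2) / se
    p_value = 1 - NormalDist().cdf(z)  # area to the right of the computed z
    return pc, z, p_value


def difference_ci(x1, n1, x2, n2, level=0.95):
    """Large-sample confidence interval for pi1 - pi2 (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z_crit * se
    return (p1 - p2) - margin, (p1 - p2) + margin


# Data from Example 11.2: 38 of 100 Republicans and 26 of 100 Democrats in favor.
pc, z, p_value = pooled_z_test(38, 100, 26, 100)
lower, upper = difference_ci(38, 100, 26, 100)

print(f"pc = {pc:.2f}, z = {z:.2f}, P-value = {p_value:.4f}")
print(f"95% CI for pi1 - pi2: ({lower:.4f}, {upper:.4f})")
```

Note that the test uses the combined estimate \(p_c\) in its standard error, while the interval uses \(p_1\) and \(p_2\) separately, exactly as in the two summaries in this section.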