STAT 651 Lecture #16
Copyright (c) Bani K. Mallick

Topics in Lecture #16

Inference about two population proportions

Book Sections Covered in Lecture #16

Chapter 10.3

Lecture #15 Review: Categorical Data

In general, we can discuss a problem where the outcome is binary, the success probability is p, and the number of experiments is n.

X = the number of successes in the experiment
p̂ = the fraction of successes in the experiment

The number of successes X in n experiments, each with probability of success p, is called a binomial random variable. There is a formula for this:

$$\Pr(X = k) = \frac{n!}{k!\,(n-k)!}\; p^{k} (1-p)^{n-k}$$

0! = 1, 1! = 1, 2! = 2 x 1 = 2, 3! = 3 x 2 x 1 = 6, 4! = 4 x 3 x 2 x 1 = 24, etc.

The fraction of successes in n experiments, each with probability of success p, also has a formula:

$$\Pr(\hat{p} = k/n) = \frac{n!}{k!\,(n-k)!}\; p^{k} (1-p)^{n-k}$$

The binomial formula is used to understand the properties of the sample fraction, e.g., its standard deviation.

Lecture #15 Review

If you code your attribute as “0” and “1” in SPSS, then the sample fraction p̂ is the same as the sample mean of these “data”.

For example, let the “data” be 0,1,0,0,0,1,0,1. Then n = 8, and p̂ = 3/8.

What is the sample mean of these “data”?

$$\bar{X} = 3/8 = \hat{p}$$

Lecture #15 Review: Categorical Data

(1−α)100% CI for the population fraction:

$$\hat{p} \pm z_{\alpha/2}\, \hat{\sigma}_{\hat{p}}, \qquad \hat{\sigma}_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

$z_{\alpha/2}$ is found by looking up 1−α/2 in Table 1.
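The review formulas above can be tried out directly. The following is a minimal sketch in Python, not part of the original lecture (which uses SPSS); the function names `binomial_pmf` and `proportion_ci` are made up for illustration, and the Table 1 lookup is replaced by the standard library's normal quantile function.

```python
from math import comb, sqrt
from statistics import NormalDist, mean


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Pr(X = k) = n! / (k! (n-k)!) * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def proportion_ci(successes: int, n: int, alpha: float = 0.05):
    """Large-sample (1 - alpha)100% CI: p_hat +/- z_{alpha/2} * sqrt(p_hat (1 - p_hat) / n)."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - alpha / 2)  # stands in for the Table 1 lookup
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# The 0/1 "data" from the review slide: the sample mean equals the sample fraction.
data = [0, 1, 0, 0, 0, 1, 0, 1]
p_hat = sum(data) / len(data)
sample_mean = mean(data)
print(p_hat, sample_mean)

# Hypothetical numbers (not from the lecture): 50 successes in 100 trials.
low, high = proportion_ci(50, 100)
print(round(low, 3), round(high, 3))
```

The interval is a large-sample approximation, so the eight-observation example is used here only to show that the mean of 0/1 data is the sample fraction.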

Lecture #15 Review: Sample Size Calculations

If you want a (1−α)100% CI to be p̂ ± E, you should set

$$n = \frac{z_{\alpha/2}^{2}\; p(1-p)}{E^{2}}$$

The small problem is that you do not know p. You have two choices:

   Make a guess for p
   Set p = 0.50 and calculate (most conservative, since it results in the largest sample size)

Comparison of Two Population Proportions

In some cases, we may want to compare two populations with success probabilities p1 and p2.

The null hypothesis is H0: p1 = p2. This is the same as H0: p1 - p2 = 0.

There are two ways to test this hypothesis.

One is via what is called a chi-squared statistic, which gives you only a p-value.

This is bad: why? If you reject, you have no idea how different the populations are!

The other way is to form a CI for the difference in population proportions p1 - p2.

The estimate of this difference is simply the difference in the sample fractions:

$$\hat{p}_1 - \hat{p}_2$$

The standard error of the difference in the sample fractions:

$$\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

The usual way to form a CI is to replace the unknown population fractions by the sample fractions.

The estimated standard error of the difference in the sample fractions:

$$\hat{\sigma}_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

The (1−α)100% CI then is

$$\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\, \hat{\sigma}_{\hat{p}_1 - \hat{p}_2}$$
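As a rough illustration of the sample-size rule and the CI for a difference in proportions, here is a minimal Python sketch. It is not from the lecture: the function names `sample_size` and `two_proportion_ci` are hypothetical, and the margin E = 0.03 is an invented example. The survey numbers passed to `two_proportion_ci` are the ones quoted in the Boxers versus Briefs example that follows.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size(margin: float, p: float = 0.5, alpha: float = 0.05) -> int:
    """n = z_{alpha/2}^2 * p (1 - p) / E^2, rounded up; p = 0.5 is the conservative choice."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(z**2 * p * (1 - p) / margin**2)


def two_proportion_ci(p1_hat: float, n1: int, p2_hat: float, n2: int, alpha: float = 0.05):
    """(1 - alpha)100% CI for p1 - p2 using the estimated standard error."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z * se, diff + z * se


# Hypothetical target: a 95% CI of p_hat +/- 0.03 with the conservative p = 0.50.
n_needed = sample_size(0.03)
print(n_needed)

# Sample sizes and sample fractions from the Boxers versus Briefs example.
low, high = two_proportion_ci(0.7345, 177, 0.4681, 188)
print(round(low, 2), round(high, 2))
```

Because p(1 − p) is largest at p = 0.50, any other guess for p gives a smaller required n, which is why that choice is the conservative one.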

Comparison of Two Population Proportions: Boxers versus Briefs

Most books force you to compute this by hand.

For female preferences in men: n1 = 177, p̂1 = 0.7345
For male preferences: n2 = 188, p̂2 = 0.4681

Think the populations are different?

$$\hat{p}_1 - \hat{p}_2 = 0.2664$$

The estimated standard error of the difference in the sample fractions is

$$\hat{\sigma}_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} = \sqrt{0.001102 + 0.001324} = 0.04926$$

Putting this together we get that the 95% CI is

0.2664 - 1.96 * 0.04926 = 0.17 up to the value 0.2664 + 1.96 * 0.04926 = 0.36

So, the 95% CI is from 0.17 to 0.36.

What is this a CI for? The difference in the population fractions preferring boxers, which is estimated to be from 0.17 to 0.36.

What is the conclusion? More females prefer men to wear boxers than do males, by 17 to 36 percentage points.

Comparison of Two Population Proportions

Remarkably, but perhaps not surprisingly, you do not have to compute these confidence intervals by hand!

The idea: simply pretend, and I do mean pretend, that the binary outcomes are real numbers and run your ordinary t-test CI, reading the unequal variance line.

The results will be slightly different from your hand calculations, but actually a bit more accurate.

Illustration with the Boxers Problem

The value “1” indicates a preference for boxers. Note how women have a higher preference for boxers than do men, in this sample.

Group Statistics: Boxer versus Briefs Preference

Gender    N     Mean    Std. Deviation   Std. Error Mean
Female    177   .7345   .4429            3.329E-02
Male      188   .4681   .5003            3.649E-02

Independent Samples Test: Boxer versus Briefs Preference

Levene's Test for Equality of Variances: F = 49.523, Sig. = .000

t-test for Equality of Means (95% Confidence Interval of the Difference)

                              t       df        Sig. (2-tailed)   Mean Difference   Std. Error Difference   Lower   Upper
Equal variances assumed       5.373   363       .000              .2664             4.957E-02               .1689   .3639
Equal variances not assumed   5.393   361.642   .000              .2664             4.939E-02               .1692   .3635

Difference in sample means = 0.2664
Standard error of this difference = 0.04939

p-value = 0.000. Note how you use the unequal variances p-value.

The hand CI is 0.17 to 0.36: note the similarities! The 95% CI from SPSS is 0.1692 to 0.3635, nearly the same as the hand calculation.

Men and women have different preferences at even 99.9% confidence.

US Availability and Rating: Are Better Beers More Widely Available?

The “data” are coded as 0 = not widely available, 1 = widely available.

Group Statistics: Availability in the U.S.

Very Good versus Other   N    Mean   Std. Deviation   Std. Error Mean
Very Good                11   0.45   .52              .16
Fair or Good             24   0.75   .44              9.03E-02

With the “data” coded as 0 and 1, this means that in the sample, 45% of the very good beers and 75% of the fair/good beers were widely available.

Independent Samples Test: Availability in the U.S.

Levene's Test for Equality of Variances: F = 3.169, Sig. = .084

t-test for Equality of Means (95% Confidence Interval of the Difference)

                              t        df       Sig. (2-tailed)   Mean Difference   Std. Error Difference   Lower   Upper
Equal variances assumed       -1.734   33       .092              -.30              .17                     -.64    5.12E-02
Equal variances not assumed   -1.628   16.864   .122              -.30              .18                     -.68    8.77E-02

The Sig. (2-tailed) value is the p-value for the hypothesis that the two population fractions are the same.

Comparison of Two Population Proportions

Note that the p-values were > 0.10. What does this mean?
There is no evidence that those beers which are very good have any more or less national availability than those which are good or fair.

Construction Example

The construction example was based on a survey made available to me.

I will look at the percentages of males sampled in Texas and in states outside of Texas.

If these were random samples, they would be a measure of how different states are in their gender distributions in the construction industry.

Construction Data: Gender Differences by Texas or Not (1 = male)

Group Statistics: Sex

State: Texas or Not   N     Mean   Std. Deviation   Std. Error Mean
Outside Texas         274   .86    .34              2.07E-02
Texas                 173   .26    .44              3.35E-02

Independent Samples Test: Sex

Levene's Test for Equality of Variances: F = 43.713, Sig. = .000

t-test for Equality of Means (95% Confidence Interval of the Difference)

                              t        df        Sig. (2-tailed)   Mean Difference   Std. Error Difference   Lower   Upper
Equal variances assumed       16.260   445       .000              .60               3.72E-02                .53     .68
Equal variances not assumed   15.379   300.960   .000              .60               3.93E-02                .53     .68

Something strange: 86% of the sample outside Texas is male, while 26% of the sample in Texas is male.

Not surprising: p-value = 0.000

Comparison of Two Population Proportions

Please study the slides for the next lecture before coming to class. The material is somewhat difficult, and if you do not look at the slides and try to understand them, you will find my lecture all but impossible to understand.
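Returning to the Boxers illustration, the "pretend the 0/1 outcomes are real numbers" shortcut can be checked outside SPSS. The sketch below is an assumption-laden reconstruction rather than the lecture's own computation: the counts 130 of 177 and 88 of 188 are inferred from the reported sample means (.7345 and .4681) and are not stated in the slides, and the function name `welch` is made up. It computes the unequal-variance (Welch) standard error, t statistic, and Satterthwaite degrees of freedom, and compares them with the hand formula.

```python
from math import sqrt
from statistics import mean, variance

# ASSUMPTION: counts inferred from the reported means, not given in the slides.
female = [1.0] * 130 + [0.0] * 47   # n1 = 177, mean about .7345
male = [1.0] * 88 + [0.0] * 100     # n2 = 188, mean about .4681


def welch(x, y):
    """Unequal-variance t-test pieces: difference, standard error, t, and df."""
    a = variance(x) / len(x)  # sample variance uses the n - 1 divisor
    b = variance(y) / len(y)
    diff = mean(x) - mean(y)
    se = sqrt(a + b)
    df = (a + b) ** 2 / (a**2 / (len(x) - 1) + b**2 / (len(y) - 1))
    return diff, se, diff / se, df


diff, se, t, df = welch(female, male)

# Hand formula from the lecture: p_hat (1 - p_hat) / n in each group.
p1, p2 = mean(female), mean(male)
hand_se = sqrt(p1 * (1 - p1) / len(female) + p2 * (1 - p2) / len(male))

print(round(diff, 4), round(se, 5), round(t, 3), round(df, 1), round(hand_se, 5))
```

For 0/1 data the sample variance equals n/(n − 1) times p̂(1 − p̂), so the t-test standard error is slightly larger than the hand-computed one; that is the source of the small difference between the two intervals.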