Professor Green
Intro. Stats

Confidence Intervals and Hypothesis Testing

Choosing the appropriate test. When formulating statistical tests, ask yourself the following questions:

• Is the underlying population from which the data are drawn normal?
• Is the population standard deviation known? Is it assumed?
• Is the sample size sufficiently large so as to make the shape of the underlying population distribution irrelevant?
• Are the observations independent of one another?

The answers to these questions will inform your choice of which probability distribution to use for your test. For example, if the sample is small, the underlying distribution is normal, and σ is known, the distribution of the sample mean is normal. Or if the sample is small and the underlying distribution is normal but σ is unknown, the distribution of the sample mean follows a t-distribution (see chp. 7). Note that all of the tests we will consider in this class assume that the observations are independent (an assumption that is often violated in time-series data, where one observation depends on the observation before it).

Confidence intervals. You will be expected to understand the mechanics of how to calculate, say, a 95% confidence interval based on a normal approximation, and Chapter 6 provides numerous examples. More important, however, is an understanding of what confidence intervals represent. If the test statistics have been chosen appropriately (see above), a 95% interval makes the following claim about the procedure by which the interval was constructed: if this procedure were replicated a large number of times, 95% of the intervals would bracket the true population parameter (e.g., the population mean). The interval varies from one sample to the next. Yet we have enough information from just one sample to say something about what would happen if we had an infinite number of samples of size N.

Hypothesis tests.
The tricky part about hypothesis testing is formulating a meaningful and testable null hypothesis. Often the meaningful null hypothesis is the skeptic’s position (e.g., the draft lottery during the Vietnam War was random), and the burden of proof falls to the person making the affirmative claim. Sometimes, however, the skeptic’s position is a diffuse range of possible conjectures, in which case the null hypothesis generally becomes the position that can be stated precisely. (Hence the presumption that is accorded certain forms of social science that make precise predictions.)

Here are some examples of hypothesis tests. Hearts is a four-player card game. If all players have equal skill, any given player would be expected to win 25% of the games. In October of 1997, I played 120 games of the Windows95 version of hearts. The null hypothesis is that I am equal in skill to the 3 other computerized players. In formal terms, this claim implies that

H0: π = .25
Ha: π is not equal to .25

Note that I could reject the null hypothesis if I performed significantly better or worse than 30 out of 120. Note also that once the null hypothesis is formulated, the rest of the test proceeds on the assumption that it is true. We ask, “Could the data we have observed in our sample have been generated by the underlying process depicted in the null hypothesis?” We construct the confidence interval based on the assumption that π = .25. Recall that the standard error for proportions sampled from a population in which ‘successes’ occur 25% of the time is

standard error = sqrt(π(1−π)/n) = sqrt((.25)(.75)/120) = .0395

Thus, our interval is centered at .25 with a standard deviation of .0395. If the sample proportions are distributed normally over repeated experiments, 95% of the time the sample proportion will fall within +/- 1.96 standard deviations of the mean. This interval spans from .173 to .327 or, in terms of victories, from 20.7 to 39.3.
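This hand calculation is easy to check by machine. Python is not the course software (we use Minitab below), so treat this only as a sketch of the arithmetic under H0:

```python
import math

# H0: pi = .25 over n = 120 hearts games (figures from the text above)
pi, n = 0.25, 120
se = math.sqrt(pi * (1 - pi) / n)            # standard error under H0
lo, hi = pi - 1.96 * se, pi + 1.96 * se      # 95% interval centered at .25
print(round(se, 4))                          # 0.0395
print(round(lo, 3), round(hi, 3))            # 0.173 0.327
print(round(lo * n, 1), round(hi * n, 1))    # 20.7 39.3 victories
```

Note that the interval is built around the hypothesized .25, not around the observed win rate; that distinction between test and confidence interval is taken up below.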
So, if I accept the conventional view that I should reject a null hypothesis if the discrepancy between it and the observed data would occur less than 5% of the time due to random chance, then I will reject the hypothesis of equal playing talent if I observe a number of victories less than 21 or greater than 39. Note that all of these steps have proceeded without any information about what the actual outcome was! As it happens, I won 58 games. So reject that null hypothesis.

When we ask Minitab to perform these calculations, we get a mélange of results. Some reflect the logic of hypothesis testing, and others the logic of confidence intervals. It’s important to see the subtle differences between the two. When using dichotomous data, we pull down the menu Stats > Basic Stats > 1 Sample Proportions. Note that we don’t want to rely on the SE MEAN that comes out of Basic Stats > Display Descriptive Stats, because that uses a slightly different formula.

Test and Confidence Interval for One Proportion

Test of p = 0.25 vs p not = 0.25
Success = 1

Variable     X    N  Sample p              95.0% CI  Exact P-Value
hearts1     58  120  0.483333  (0.391172, 0.576336)          0.000

The alternative hypothesis I specified under “Options” was “not equal to.” The p-value here is the probability that we would observe a test statistic (.48) this different from .25 if the data had in fact been generated by a true population π of .25. The fact that this number is so small means that we would very, very seldom observe such a result by random chance. The confidence interval is created in a slightly different way. Here, the best guess of my population parameter π is .48. The interval around this guess is approximately +/- 2 standard errors in each direction. More on the “approximate” part in just a moment. Notice from the standard error formula that the probability that one would observe 25% wins if the true π = .48 is not the same as the probability that one would observe 48% wins if the true π = .25. Why not?
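An exact p-value like Minitab’s can be computed straight from the binomial distribution rather than the normal approximation. The sketch below assumes one common convention for a two-sided exact test (sum the probabilities of all outcomes no more likely than the one observed); for data this extreme, any convention gives essentially zero:

```python
from math import comb

def exact_two_sided_p(x, n, p0):
    """Sum P(X = k) over every k whose probability under H0 is no
    greater than that of the observed count x (one common convention
    for an exact two-sided binomial p-value)."""
    pmf = [comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)]
    return sum(prob for prob in pmf if prob <= pmf[x])

# 58 wins in 120 games under H0: pi = .25 (figures from the text)
p = exact_two_sided_p(58, 120, 0.25)
print(p < 0.001)  # True -- consistent with Minitab's "Exact P-Value 0.000"
```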
Now let’s consider another hypothesis: I’m no worse at bridge than the Bicycle bridge computer (which is darn good). Bridge is a four-player game, but players play on teams of two; so being as good as the computer means winning 50% of the time. I play 66 games. Thus,

H0: π = .5
Ha: π is less than .5

Note that this is now a one-sided test because the alternative hypothesis is a “less than” statement. In other words, if I win every game, the evidence will not reject the null hypothesis in favor of the alternative hypothesis. Again, I construct the hypothesis test on the assumption that H0 is true. Following the same procedure as before, we obtain a standard error:

standard error = sqrt(π(1−π)/n) = sqrt((.5)(.5)/66) = .0615

To reject the null hypothesis at the 5% level, we must observe a winning rate more than 1.65 × .0615 = .102 below the expected rate of .50. Winning 39.8% of the time means winning 26.3 games. Thus, we would reject the null hypothesis that I’m as good as the computer if I win 26 or fewer games. Fortunately, I won 28. Whew!

Again, we can get Minitab to do some of the heavy lifting. The alternative hypothesis is that the proportion is below .50 (which tells Minitab to perform a one-sided test). Note that the resulting p-value is .13, which is greater than .05. Hence, we cannot reject the null hypothesis, because results like this (.42) could be attributable to chance if the data had been generated by a true parameter of .50.

Test and Confidence Interval for One Proportion

Test of p = 0.5 vs p < 0.5

Variable     X   N  Sample p              95.0% CI  Exact P-Value
bridge      28  66  0.424242  (0.303402, 0.552106)          0.134

Notice how the confidence interval is constructed. The interval is centered at .4242; the 95% interval ranges from .303 to .552, which is not quite symmetric around .4242. The reason is that Minitab is using the exact binomial distribution rather than the normal approximation to the binomial.
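For a one-sided test the exact p-value has a simpler form: under H0 the number of wins is Binomial(66, .5), and we just add up the probability of doing this badly or worse. A Python sketch of that sum (again, only a check on the Minitab output):

```python
from math import comb

# Exact one-sided p-value: P(X <= 28) when X ~ Binomial(66, .5),
# i.e., the chance of 28 or fewer wins if H0 were true
n, x, p0 = 66, 28, 0.5
p_value = sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(x + 1))
print(round(p_value, 3))  # about 0.134, in line with Minitab's Exact P-Value
```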
If we want the normal approximation, we must select it as an option:

Test and Confidence Interval for One Proportion

Test of p = 0.5 vs p < 0.5
Success = 1

Variable     X   N  Sample p              95.0% CI  Z-Value  P-Value
bridge      28  66  0.424242  (0.305008, 0.543477)    -1.23    0.109

Now the interval is centered around .4242, assuming a standard error of sqrt(.42(1−.42)/66) = .0608. Now +/- 1.96 times this standard error creates the interval reported by Minitab. The Z value tells us that the test statistic (.42) is 1.23 standard errors below the hypothesized parameter of .50.

Testing two sample proportions. One year later, I returned to my hearts program. Whether due to rustiness or diminished intelligence, my success rate proved to be 19 wins in 60 games. Could my earlier hearts record and my current record have been generated by the same level of hearts proficiency? If my true level of hearts acumen had been unchanged, our best guess of it would be 77 wins out of 180 tries, or .4278. The difference between my earlier and later proportions of success was .483 − .317 = .166. The null hypothesis is that

H0: π1 = π2 (or equivalently, π1 − π2 = 0)
Ha: π1 > π2

The alternative hypothesis, in other words, is that I am losing it. Under the null hypothesis, the difference between the two proportions is expected to be zero with a standard error of

standard error = sqrt(π(1−π)/n1 + π(1−π)/n2) = sqrt(.245/120 + .245/60) = .0782

Thus, .166 is (.1666/.0782) = 2.13 standard errors away from zero. Using a normal distribution table, we see that only 1.7% of the area under the curve lies 2.13 standard deviations to the right of zero. So it seems that my hearts talent has diminished over time. Pathetic.
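The two-sample arithmetic can likewise be replicated in a few lines of Python (as a check on the hand calculation; the pooled proportion 77/180 plays the role of π in the standard error formula):

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Two-sample test of H0: pi1 = pi2 vs Ha: pi1 > pi2 (hearts data above)
x1, n1 = 58, 120   # October 1997 record
x2, n2 = 19, 60    # record one year later
p_pool = (x1 + x2) / (n1 + n2)                         # 77/180 = .4278
se = math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))  # SE of the difference
z = (x1/n1 - x2/n2) / se
p_value = 1 - normal_cdf(z)                            # upper-tail area
print(round(se, 4), round(z, 2), round(p_value, 3))    # 0.0782 2.13 0.017
```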
Test and Confidence Interval for Two Proportions

Variable     X    N  Sample p
hearts1     58  120  0.483333
hearts2     19   60  0.316667

Estimate for p(hearts1) - p(hearts2): 0.166667
95% CI for p(hearts1) - p(hearts2): (0.0188550, 0.314478)
Test for p(hearts1) - p(hearts2) = 0 (vs > 0): Z = 2.13  P-Value = 0.017

Sample means based on continuous variables. The same principles apply to confidence intervals and hypothesis tests involving a sample mean. We know the sampling distribution for a mean calculated from a sample of size n given a known σ. If the population from which the sample is drawn is normal, so is the sampling distribution; otherwise, the sampling distribution becomes approximately normal when n is large. Again, to construct a 95% interval around a sample mean, we multiply σ/sqrt(n) by +/- 1.96 and add the resulting values to the mean. If we wish to perform a hypothesis test, we build the interval around the value stated in H0. If the standard deviation of the underlying population(s) is not known, the t-distribution is used instead of the normal. The reason is that some of the information in the sample is used to estimate σ. We will meet this distribution in future weeks. It is very close to a normal for n > 120.

Perils of hypothesis testing. Note that in recent years, rigid hypothesis tests have come under attack. Research literatures become skewed by articles reporting significant test results; essays that do otherwise fail to earn publication; and, of course, the reader may not be shown lots of insignificant tests that were performed but not disclosed. Furthermore, significance levels used to judge hypotheses tend to be arbitrary. Consequently, it is becoming increasingly common to report test statistics, standard errors, and p-values, leaving the readers to draw their own conclusions. Two other problems warrant mention.
If one is on the lookout for weird results (e.g., the strong relationship between party identification and zodiac sign in one General Social Survey), one may be able to show post hoc that such results could not be attributed to chance. Note, however, that the “1-in-20” reasoning assumed by classical hypothesis testing is valid only if we assume that the data are not drawn in a tendentious manner. Finally, one must take care not to confuse substantive significance with “statistical significance.” With sufficiently large sample sizes, even minuscule differences in sample proportions may prove to be greater than random chance would suggest. But who cares? Conversely, one should not be too quick to dismiss interesting results on the grounds that they could be due to chance. Time will tell.