Professor Green
Intro. Stats
Confidence Intervals and Hypothesis Testing
Choosing the appropriate test. When formulating statistical tests, ask yourself the following
questions:
- Is the underlying population from which the data are drawn normal?
- Is the population standard deviation known? Is it assumed?
- Is the sample size sufficiently large so as to make the shape of the underlying population
  distribution irrelevant?
- Are the observations independent of one another?
The answers to these questions will inform your choice of which probability distribution to use for
your test. For example, if the sample is small, the underlying distribution is normal, and σ is
known, the distribution of the sample mean is normal. Or if the sample is small, the underlying
distribution is normal, but σ is unknown, the distribution of the sample mean follows a
t-distribution (see Chapter 7). Note that all of the tests we will consider in the class assume that the
observations are independent (an assumption that is often violated in time-series data, where one
observation depends on the observation before it).
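The checklist above can be sketched as a small decision function. This is my own encoding of the logic in these notes, not a standard routine; the function name and return labels are illustrative.

```python
def choose_distribution(pop_normal, sigma_known, large_n):
    """Pick the sampling distribution for a test of a mean,
    following the checklist above (observations assumed independent)."""
    if sigma_known and (pop_normal or large_n):
        return "normal"          # z-based interval or test
    if pop_normal and not sigma_known:
        return "t"               # t-based interval or test
    if large_n:
        return "t"               # safe default; close to normal for large n
    return "no standard test"    # small n, non-normal population, sigma unknown

print(choose_distribution(pop_normal=True, sigma_known=True, large_n=False))   # normal
print(choose_distribution(pop_normal=True, sigma_known=False, large_n=False))  # t
```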
Confidence intervals. You will be expected to understand the mechanics of how to calculate, say,
a 95% confidence interval based on a normal approximation, and Chapter 6 provides numerous
examples. More important, however, is an understanding of what confidence intervals represent.
If the test statistics have been chosen appropriately (see above), a 95% interval makes the
following claim about the procedure by which the interval was constructed: if this procedure were
replicated a large number of times, 95% of the intervals would bracket the true population
parameter (e.g., the population mean).
The interval varies from one sample to the next. Yet we have enough information from just one
sample to say something about what would happen if we had an infinite number of samples of size
N.
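The coverage claim can be checked by simulation: draw many samples of size N from a population with a known mean, build a 95% interval around each sample mean, and count how often the interval brackets the truth. The population parameters and sample size below are made up for illustration.

```python
import math
import random

random.seed(1)
mu, sigma, N, reps = 100, 15, 25, 2000
hits = 0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(N)]
    xbar = sum(sample) / N
    half = 1.96 * sigma / math.sqrt(N)   # sigma known, so a normal interval
    if xbar - half <= mu <= xbar + half:
        hits += 1
print(hits / reps)  # close to 0.95
```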
Hypothesis tests. The tricky part about hypothesis testing is formulating a meaningful and
testable null hypothesis. Often the meaningful null hypothesis is the skeptic’s position (e.g., the
draft lottery during the Vietnam War was random), and the burden of proof falls to the person
making the affirmative claim. Sometimes, however, the skeptic’s position is a diffuse range of
possible conjectures, in which case the null hypothesis generally becomes the position that can be
stated precisely. (Hence the presumption that is accorded certain forms of social science that
make precise predictions.)
Here are some examples of hypothesis tests.
Hearts is a four-player card game. If all players have equal skill, any given player would be
expected to win 25% of the games. In October of 1997, I played 120 games of the Windows95
version of hearts. The null hypothesis is that I am equal in skill to the 3 other computerized
players. In formal terms, this claim implies that
H0: π = .25
Ha: π ≠ .25
Note that I could reject the null hypothesis if I performed significantly better or worse than 30 out
of 120.
Note also that once the null hypothesis is formulated, the rest of the test proceeds on the
assumption that it is true. We ask, “Could the data we have observed in our sample have been
generated by the underlying process depicted in the null hypothesis?”
We construct the confidence interval based on the assumption that π=.25. Recall that the
standard deviation for proportions sampled from a population in which ‘successes’ occur 25% of
the time is
standard error = sqrt(π(1−π)/n) = sqrt((.25)(.75)/120) = .0395
Thus, our interval is centered at .25 with a standard deviation of .0395. If the sample proportions
are distributed normally over repeated experiments, 95% of the time, the sample proportion will
fall +/- 1.96 standard deviations around the mean. This interval spans from (.173 to .327) or, in
terms of victories, from 20.7 to 39.3. So, if I accept the conventional view that I should reject a
null hypothesis if the discrepancy between it and the observed data would occur less than 5% of
the time due to random chance, then I will reject the hypothesis of equal playing talent if I observe
a number of victories less than 21 or greater than 39. Note that all of these steps have
proceeded without any information about what the actual outcome was!
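The arithmetic above can be transcribed directly into a few lines of Python (variable names are mine):

```python
import math

# Rejection region for the hearts test, computed before seeing the data.
pi0, n = 0.25, 120
se = math.sqrt(pi0 * (1 - pi0) / n)          # standard error under H0
lo, hi = pi0 - 1.96 * se, pi0 + 1.96 * se    # 95% interval around pi0
print(round(se, 4), round(lo * n, 1), round(hi * n, 1))  # 0.0395 20.7 39.3
```

Any win total below 21 or above 39 falls outside this interval and rejects the null hypothesis.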
As it happens, I won 58 games. So reject that null hypothesis.
When we ask Minitab to perform these calculations, we get a melange of results. Some reflect the
logic of hypothesis testing, and others the logic of confidence intervals. It’s important to see the
subtle differences between the two. When using dichotomous data, we pull down the menu Stats
> Basic Stats > 1 Sample Proportions. Note that we don’t want to rely on the SE MEAN that
comes out of Basic Stats > Display Descriptive Stats, because that uses a slightly different
formula.
Test and Confidence Interval for One Proportion
Test of p = 0.25 vs p not = 0.25
Success = 1

Variable      X      N  Sample p             95.0 % CI  Exact P-Value
hearts1      58    120  0.483333  (0.391172, 0.576336)          0.000
The null hypothesis I specified under “Options” was “not equal to.” The p-value here is the
probability that we would observe a test statistic (.48) this different from .25 if the data had in fact
been generated by a true population π of .25. The fact that this number is so small means that we
would very, very seldom observe such a result by random chance.
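Minitab's "Exact" p-value comes from the binomial distribution itself rather than the normal approximation. One common convention, doubling the upper-tail probability, can be computed with the standard library (I am assuming this convention is close to Minitab's exact method):

```python
from math import comb

# Exact probability of 58 or more wins in 120 games if pi = .25,
# doubled for a two-sided test (a common convention; an assumption
# on my part about Minitab's exact method).
n, x, pi0 = 120, 58, 0.25
upper_tail = sum(comb(n, k) * pi0**k * (1 - pi0)**(n - k)
                 for k in range(x, n + 1))
p_two_sided = min(1.0, 2 * upper_tail)
print(p_two_sided)  # effectively 0.000 at three decimals
```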
The confidence interval is created in a slightly different way. Here, the best guess of my
population parameter π is .48. The interval around this guess is approximately +/- 2 standard
errors in each direction. More on the “approximate” part in just a moment. Notice from the
standard error formula that the probability that one would observe 25% wins if the true π = .48 is
not the same as the probability that one would observe 48% wins if the true π = .25. Why not?
Now let’s consider another hypothesis: I’m no worse at bridge than the Bicycle bridge
computer (which is darn good). Bridge is a four-player game, but players play on teams of two;
so being as good as the computer means winning 50% of the time. I play 66 games. Thus,
H0: π = .5
Ha: π < .5
Note that this is now a one-sided test because the alternative hypothesis is a “less than” statement.
In other words, if I win every game, the evidence will not reject the null hypothesis in favor of the
alternative hypothesis.
Again, I construct the hypothesis test on the assumption that H0 is true. Following the same
procedure as before, we obtain a standard error:
standard error = sqrt(π(1−π)/n) = sqrt((.5)(.5)/66) = .0615
To reject the null hypothesis at the 5% level, we must observe a number of wins less than
1.65*.0615=.102 below the expected rate of .50. Winning 39.8% of the time means winning 26.3
games. Thus, we would reject the null hypothesis that I’m as good as the computer if I win 26 or
fewer games. Fortunately, I won 28. Whew!
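The one-sided cutoff works out the same way in Python (a sketch of the arithmetic above; variable names are mine):

```python
import math

# One-sided 5% cutoff for the bridge test.
pi0, n = 0.5, 66
se = math.sqrt(pi0 * (1 - pi0) / n)    # standard error under H0
cutoff_prop = pi0 - 1.65 * se          # reject below this win rate
cutoff_games = cutoff_prop * n         # 26.3, so reject at 26 or fewer wins
print(round(se, 4), round(cutoff_games, 1))  # 0.0615 26.3
wins = 28
print("reject H0" if wins <= 26 else "fail to reject H0")  # fail to reject H0
```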
Again, we can get Minitab to do some of the heavy lifting. The alternative hypothesis is
that the proportion is below .50 (which tells Minitab to perform a one-sided test). Note that the
resulting p-value is .13, which is greater than .05. Hence, we cannot reject the null hypothesis
because results like this (.42) could be attributable to chance if the data had been generated by a
true parameter of .50.
Test and Confidence Interval for One Proportion
Test of p = 0.5 vs p < 0.5

Variable      X      N  Sample p             95.0 % CI  Exact P-Value
bridge       28     66  0.424242  (0.303402, 0.552106)          0.134
Notice how the confidence interval is constructed. The interval brackets .4242, with the 95%
interval ranging from .303 to .552, which is not quite symmetric around .4242. The reason is that
Minitab is using the exact binomial distribution rather than the normal approximation to the
binomial. If we want the normal approximation, we must select it as an option:
Test and Confidence Interval for One Proportion
Test of p = 0.5 vs p < 0.5
Success = 1

Variable      X      N  Sample p             95.0 % CI  Z-Value  P-Value
bridge       28     66  0.424242  (0.305008, 0.543477)    -1.23    0.109
Now the interval is centered around .4242, with a standard error of sqrt(.42(1−.42)/66) = .0608.
Multiplying this standard error by +/-1.96 creates the interval reported by Minitab. The Z-value
tells us that the test statistic (.42) is 1.23 standard errors below the hypothesized parameter of .50.
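These numbers can be reproduced with the standard library's NormalDist. Note the two different standard errors: the test uses π = .5 from the null hypothesis, while the normal-approximation confidence interval uses the sample proportion .4242.

```python
import math
from statistics import NormalDist

x, n, pi0 = 28, 66, 0.5
p_hat = x / n                                # 0.4242
se_null = math.sqrt(pi0 * (1 - pi0) / n)     # 0.0615, used for the test
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)  # 0.0608, used for the CI
z = (p_hat - pi0) / se_null
p_value = NormalDist().cdf(z)                # one-sided, lower tail
ci = (p_hat - 1.96 * se_hat, p_hat + 1.96 * se_hat)
print(round(z, 2), round(p_value, 3))        # -1.23 0.109
print(tuple(round(v, 4) for v in ci))        # (0.305, 0.5435)
```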
Testing two sample proportions. One year later, I returned to my hearts program. Whether due
to rustiness or diminished intelligence, my success rate proved to be 19 wins in 60 games. Could
my earlier hearts record and my current record have been generated by the same level of hearts
proficiency?
If my true level of hearts acumen had been unchanged, our best guess of it would be 77
wins out of 180 tries, or .4278. The difference between my earlier and later proportions of
success was .4833 - .3167 = .1667. The null hypothesis is that
H0: π1 = π2 (or equivalently, π1 - π2 = 0)
Ha: π1 > π2
The alternative hypothesis, in other words, is that I am losing it. Under the null hypothesis, the
difference between the two proportions is expected to be zero with a standard error of
standard error = sqrt(π(1−π)/n1 + π(1−π)/n2) = sqrt(.245/120 + .245/60) = .0782
Thus, .1667 is (.1667/.0782) = 2.13 standard errors away from zero. Using a normal distribution
table, we see that only 1.7% of the area under the curve lies 2.13 standard deviations to the right
of zero. So it seems that my hearts talent has diminished over time. Pathetic.
Test and Confidence Interval for Two Proportions

Variable      X      N  Sample p
hearts1      58    120  0.483333
hearts2      19     60  0.316667

Estimate for p(hearts1) - p(hearts2): 0.166667
95% CI for p(hearts1) - p(hearts2): (0.0188550, 0.314478)
Test for p(hearts1) - p(hearts2) = 0 (vs > 0): Z = 2.13  P-Value = 0.017
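The two-sample calculation follows the same pattern: pool the two samples to estimate π under the null hypothesis, then measure the observed difference in standard-error units (a sketch; variable names are mine).

```python
import math
from statistics import NormalDist

# Two-sample test for a change in hearts proficiency, pooling under H0.
x1, n1, x2, n2 = 58, 120, 19, 60
p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)                             # 77/180 = .4278
se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))  # .0782
z = (p1 - p2) / se
p_value = 1 - NormalDist().cdf(z)                          # one-sided, upper tail
print(round(z, 2), round(p_value, 3))                      # 2.13 0.017
```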
Sample means based on continuous variables. The same principles apply to confidence intervals
and hypothesis tests involving a sample mean. We know the sampling distribution for a mean
calculated from a sample of size n given a known σ. If the population from which the sample is
drawn is normal, so is the sampling distribution; otherwise, the sampling distribution becomes
approximately normal when n is large. Again, to construct a 95% interval around a sample mean,
we multiply σ/sqrt(n) by +/- 1.96 and add the resulting values to the mean. If we wish to perform
a hypothesis test, we build the interval around the value stated in H0.
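For the mean, the recipe looks like this (the sample mean, σ, n, and hypothesized mean below are made-up numbers for illustration, not from the notes):

```python
import math

# 95% interval for a sample mean with known sigma (illustrative numbers).
xbar, sigma, n = 52.0, 10.0, 64
half = 1.96 * sigma / math.sqrt(n)        # 1.96 * 10/8 = 2.45
print(xbar - half, xbar + half)           # interval around the sample mean
# For a hypothesis test of H0: mu = 50, center the interval at 50 instead:
lo, hi = 50 - half, 50 + half             # (47.55, 52.45)
print("reject H0" if not (lo <= xbar <= hi) else "fail to reject H0")
```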
If the standard deviation of the underlying population(s) is not known, the t-distribution is used
instead of the normal. The reason is that some of the information in the sample is used to estimate
σ. We will meet this distribution in future weeks. It is very close to a normal for n > 120.
Perils of hypothesis testing. Note that in recent years, rigid hypothesis tests have come under
attack. Research literatures become skewed by articles reporting significant test results; studies
that fail to find significance often go unpublished; and, of course, the reader may not be shown the
many insignificant tests that were performed but not disclosed. Furthermore, significance levels used to
judge hypotheses tend to be arbitrary. Consequently, it is becoming increasingly common to
report test statistics, standard errors, and p-values — leaving the readers to draw their own
conclusions.
Two other problems warrant mention. If one is on the lookout for weird results — e.g., the
strong relationship between party identification and zodiac sign in one General Social Survey —
one may be able to show post hoc that such results could not be attributed to chance. Note,
however, that the “1-in-20” reasoning assumed by classical hypothesis testing is valid only if we
assume that the data are not drawn in a tendentious manner. Finally, one must take care not to
confuse substantive significance with “statistical significance.” With sufficiently large sample
sizes, even minuscule differences in sample proportions may prove to be greater than random
chance would suggest. But who cares? Conversely, one should not be too quick to dismiss
interesting results on the grounds that they could be due to chance. Time will tell.