HYPOTHESIS TESTING: SEEING GENERAL PRINCIPLES
THROUGH THE ONE-SAMPLE T-TEST
It’s often helpful to look again at class material but in a slightly different way. In this
case I want to explore, again, the general idea of hypothesis testing but from a different direction.
The goal is to help you establish your intuition and understanding more firmly by reviewing the
material in a different order than it was presented in class.
Let’s begin by thinking about how we might construct a test of the idea that the true mean
body length of adult killifish, Fundulus grandis, in the marshes near the FSU Marine Lab
(FSUML) is 100 mm. Now, clearly, if we catch and measure 50 adult F. grandis at the FSUML
and they range in size from 55 mm to 75 mm with an average of 64 mm, we’d probably be
willing to bet, without doing any statistics, that the average size is NOT 100 mm. We come to
this conclusion because we find it hard to believe that, in a sample of 50 fish, we wouldn’t even
catch one fish that was 100 mm and that the average of those 50 fish could be 64 mm if the real
mean were 100 mm. In other words, a difference between the actual average (64 mm) and our
hypothesized average (100 mm) of 36 mm is too large, compared to the range of variation we
found, to be consistent with the hypothesis. Sure, it could happen, but we’d have to have caught
a very aberrant sample of adults, which might happen only once in a very, very large number of
trials. In this case we’d reject the hypothesis as being extremely unlikely.
This is of course the essence of hypothesis testing - we encounter a result that is so
unlikely under the original, or null, hypothesis that we cannot believe that hypothesis to be true.
All statistical hypothesis tests in the frequentist mode of operation (as opposed to a pure
Bayesian mode) work in this manner.
But what if our sample of 50 adults ranged from 80 to 110 mm with an average of 90
mm? Now we might have several adults whose size was greater than 100 mm even though the
average size was below 100 mm. It could be that a sample with this average might occur
reasonably often even if the true mean was 100 mm. In other words, we might believe that this
result is not sufficiently odd under the null hypothesis; it’s not sufficiently unlikely, if you will,
to cause us to reject the null hypothesis. In thinking about the problem purely at this intuitive
level, we might be critical of our reasoning at this point - wondering if we have a clear idea of
when we would find a result sufficiently unlikely to cause us to reject our null hypothesis. That
is, how unlikely does a result have to be before we reject the original (= null) hypothesis?
Now, we could, without reference to any numerical method, decide on a criterion for
rejection. That is, we might say that we’ll reject the null hypothesis if the probability of a result
like the one we have, or a result even more extreme than the one we have, is no greater than one-in-twenty, or 0.05. This gives us a veneer of objectivity except for the untidy fact that we don’t
have a way to estimate this probability. Clearly what we need is some function of the data
whose distribution under the null hypothesis can be found. If we had that distribution, then we
would know how unlikely any particular result might be under that null hypothesis.
So let’s build an intuitively appealing function of the data. That is, let’s build a function
of the data that has the intuitive property of revealing something about the likely truth of the null
hypothesis. First, we could agree that the further the average deviates from the one predicted by
the null hypothesis, the less likely we are to believe that the null hypothesis is true. So our
function should include some variation on the idea of that difference,
X̄ − μ₀
where X̄ refers to the sample average and μ₀ refers to the mean value specified by the null
hypothesis.
But the credence we’d give a particular absolute difference might depend on how
variable the data appeared. A deviation from the prediction of 10 mm might be more convincing
if our sample of 50 adult fish varied only over a range of 10 mm, say from 85 to 95 mm with an
average of 90 mm. If they varied more widely, say from 80 to 110 with an average of 90, we’d
likely be less confident in declaring the null hypothesis false, as we stated earlier. So we might
like to weight X̄ − μ₀ by some measure of the spread in the data. Using the range (maximum − minimum) might not be a good idea because it is very sensitive to one or two odd points in the sample.
But we know that the variance is a good measure of the spread of a distribution, so we could use
the sample variance, s², as a weight. The higher the variance, the less confident we are in a
given difference, and the lower the variance, the more confident we are. We could then weight
X̄ − μ₀ by the variance; more specifically, we would weight by the inverse of the variance, which
would give us an index with properties we like (the index rises as X̄ − μ₀ rises and rises as s² falls). All seems well, except that variances are in squared units while a difference is in unsquared units, and it’s always nice to work in the same unit of measurement. So we can weight by the
sample standard deviation, s.
Our function, (X̄ − μ₀) / s, is appealing. Now all we need do is figure out its
distribution under the null hypothesis.
Now, if we think about it, we realize that this looks like something we’ve seen before.
The numerator is a linear combination of a random variable and a constant and so we know how
to find its expected value, at least. It’s a function of a sample average and a constant, and we
know something about the distribution of sample averages and linear combinations of averages,
at least as sample size goes up (central limit theorem). In fact, we know that an average will
have a normal distribution with expected value μ and variance σ²/n. And so we can see that
we’re dangerously close to a quantity whose distribution we, or some smart statistician, could
find. So one way to approach this is to realize that
(X̄ − μ₀) / ( s / √n )
looks like it ought to be close to some variation on a normal distribution. Another way to
approach it is to realize that if we squared all of this we’d be close to something we called an F-distribution a few weeks ago (see the section on squares of centered normal distributions). One
way or another, it turns out that this index, which we call a t-statistic, has a distribution that can
be found. It’s also a nice, dimensionless quantity, which is a nice property for a general index.
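If you’d like to see this index in code, here is a minimal sketch in Python (using numpy); the sample itself is simulated, and every number in it is made up purely for illustration, not taken from real FSUML data:

    import numpy as np

    # Hypothetical sample of 50 adult F. grandis body lengths in mm;
    # the values are simulated here purely for illustration.
    rng = np.random.default_rng(1)
    lengths = rng.normal(loc=90.0, scale=8.0, size=50)

    mu_0 = 100.0                    # mean specified by the null hypothesis
    x_bar = lengths.mean()          # sample average (X-bar)
    s = lengths.std(ddof=1)         # sample standard deviation
    n = len(lengths)

    t_stat = (x_bar - mu_0) / (s / np.sqrt(n))
    print(f"X-bar = {x_bar:.1f} mm, s = {s:.1f} mm, t = {t_stat:.2f}")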
Of course, we could find that distribution in a practical way. We could erect a
distribution of F. grandis body sizes (any distribution would do) with a true mean of 100 mm
and some variance, simulate 1,000 samples of 50 fish from this original distribution, and
calculate a t-statistic for each of those 1,000 samples, thereby obtaining the distribution of
our t-statistic under the null hypothesis. Clearly the numerator of our t-statistic will be
centered on 0 because we will have about as many sample averages above 100 mm as below 100 mm.
We would reject the null hypothesis that the true mean is 100 mm if we took a real sample whose
t-statistic fell into the “rare” range of possible t-statistics.
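Here is a rough sketch of that simulation in Python, assuming, purely for illustration, that body lengths are normally distributed with a true mean of 100 mm and a standard deviation of 10 mm (the seed, sample size, and number of simulations are likewise arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(42)
    mu_0, sigma, n, n_sims = 100.0, 10.0, 50, 1000   # illustrative choices

    # Simulate 1,000 samples of 50 fish drawn from a distribution whose true
    # mean equals the null value, and compute a t-statistic for each sample.
    samples = rng.normal(mu_0, sigma, size=(n_sims, n))
    t_null = (samples.mean(axis=1) - mu_0) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

    # The numerator straddles 0, so the simulated t-values are centered near 0.
    print(f"mean of simulated t-statistics: {t_null.mean():.3f}")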
What is the “rare” range? Well, if we say that we’re not sure if the true mean would be
greater or less than 100 mm, then the “rare” range of t-statistics would be those values that are so
large or so small that they would occur only rarely, say no more than 5% of the time. If we mean
“no more than 5% of the time” then we mean that the cumulative probability associated with all
“rare” values must not exceed 5%. So we find the 2.5% thresholds on either side of our t-statistic’s distribution and take these values as our thresholds - we reject the null hypothesis if
we obtain a t-statistic value greater or less than these thresholds. If we choose a directional
alternative hypothesis, that is, we want to test our null against an alternative that the true mean is
larger (or smaller) than the null hypothesis, we would adjust our threshold values. In this case,
we are looking for a result in one direction that would occur no more than 5% of the time, so we
find the critical value beyond which no more than 5% of the observations will fall.
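Continuing the same illustrative simulation (repeated here so the sketch stands on its own), the two-sided and one-sided thresholds can be read directly off the simulated t-values:

    import numpy as np

    rng = np.random.default_rng(42)
    mu_0, sigma, n, n_sims = 100.0, 10.0, 50, 1000   # illustrative choices

    samples = rng.normal(mu_0, sigma, size=(n_sims, n))       # null is true
    t_null = (samples.mean(axis=1) - mu_0) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

    # Non-directional test: put 2.5% of the "rare" values in each tail.
    lower, upper = np.percentile(t_null, [2.5, 97.5])
    # Directional test (alternative: true mean > 100 mm): all 5% in one tail.
    one_sided = np.percentile(t_null, 95)

    print(f"two-sided thresholds: {lower:.2f} and {upper:.2f}")
    print(f"one-sided threshold:  {one_sided:.2f}")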
Of course, the distributions of t-statistics can be derived analytically for different sample
sizes (related to the degrees of freedom used as a parameter of the t-distribution) so we don’t
have to do simulations. But you can see how the whole process unfolds by thinking about the
simulated distributions.
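For comparison, here is a sketch of the analytic route, using scipy’s t-distribution with n − 1 degrees of freedom; the sample size of 50 matches the running example:

    from scipy import stats

    n = 50
    df = n - 1                                   # degrees of freedom for the one-sample t-test

    two_sided = stats.t.ppf([0.025, 0.975], df)  # roughly -2.01 and +2.01 for df = 49
    one_sided = stats.t.ppf(0.95, df)            # roughly +1.68 for df = 49

    print("two-sided critical values:", two_sided.round(2))
    print("one-sided critical value: ", round(one_sided, 2))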
Consider what happens if the true adult size distribution were such that the true mean
adult body size was 110 mm. We could simulate samples of size 50 and find the distribution of
the t-statistic if the true mean was 110 mm. Remember that we’re still calculating the numerator
of our t-statistic as X̄ − 100 mm because that’s our null hypothesis. You can see, I hope, that
with a true mean of 110 mm and a null hypothesis of 100 mm, we will accumulate more t-values
whose numerator is in excess of 0 than we did under the null hypothesis, so our t-distribution
will shift to the right of the distribution under the null hypothesis. We can calculate the
probability of a type II error (accepting the null hypothesis when it is wrong) and power
(probability of rejecting the null hypothesis when it is false) from that distribution simply by
counting the proportion of samples falling to either side of the critical t-value that we chose when
we specified the null hypothesis and the type of alternative (directional or not). In general, with
analytic solutions, we could calculate these probabilities without counting up our simulation
results, but you get the idea. You can also see immediately, I hope, as we showed in class, that
choosing a directional or non-directional alternative changes the power of the test.
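A sketch of that calculation by simulation is below; the true mean of 110 mm follows the example above, while the standard deviation of 25 mm is an arbitrary choice made only to keep the estimated power away from 1:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    mu_0, mu_true, sigma, n, n_sims = 100.0, 110.0, 25.0, 50, 1000   # illustrative

    # Simulate under the ALTERNATIVE (true mean 110 mm) but keep computing the
    # numerator of the t-statistic as X-bar minus 100 mm, the null value.
    samples = rng.normal(mu_true, sigma, size=(n_sims, n))
    t_vals = (samples.mean(axis=1) - mu_0) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

    t_crit = stats.t.ppf(0.975, df=n - 1)        # two-sided test, alpha = 0.05
    power = np.mean(np.abs(t_vals) > t_crit)     # proportion of simulations that reject
    type_II = 1.0 - power                        # proportion that fail to reject a false null

    print(f"estimated power: {power:.3f}, estimated type II error rate: {type_II:.3f}")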
Now consider what happens as the true mean adult size increases. If the true mean were
120 mm, and we repeated this process, we’d accumulate even more positive t-values than we did
when we simulated a true mean of 110 mm, the distribution would shift even further to the right
of our distribution under the null hypothesis, and our power would increase. And it would
happen even more as the true mean moved to 130 mm, 140 mm, etc. So as the true mean
increases, so does the power of our test. This is a critical lesson: power is a function of the true
alternative hypothesis. In practical terms, we can see that, for a given sample size and variance
within the data, we have more power to detect a large difference from the null hypothesis than
we do to detect a small difference. Put another way, we can draw a curve of power on the
vertical axis against “magnitude of difference” on the horizontal axis. The curve would rise and
eventually hit an asymptote - at some point, when the true mean is a lot higher than 100 mm,
almost all of the distribution of the t-statistic under the alternative hypothesis will be to the right
of the distribution under the null hypothesis and we can’t increase power much more.
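One standard analytic route to such a power curve uses the noncentral t-distribution, which we did not cover in class; the sketch below, with the same arbitrary standard deviation of 25 mm and n = 50, shows power rising toward its asymptote as the true mean moves away from 100 mm:

    import numpy as np
    from scipy import stats

    mu_0, sigma, n, alpha = 100.0, 25.0, 50, 0.05     # illustrative values
    df = n - 1
    t_crit = stats.t.ppf(1 - alpha / 2, df)           # two-sided critical value

    for mu_true in (100, 105, 110, 120, 130, 140):
        ncp = (mu_true - mu_0) * np.sqrt(n) / sigma   # noncentrality parameter
        # Power = P(|t| > t_crit) when t follows a noncentral t-distribution.
        power = stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)
        print(f"true mean {mu_true:3d} mm -> power {power:.3f}")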
I suppose this is a good juncture at which to state that this idea of power vs. magnitude
of difference is NOT something we discussed in class.
What we did discuss was how power is affected by the sample size and the variance in
the data. We saw that as the sample size increases, the distribution of the t-statistic under any
particular alternative hypothesis shifts to the right. To make sure you see why this happens,
remember that we can rewrite the t-statistic as
(X̄ − μ₀) √n / s
So you can see that with a specific alternative, the numerator of the t-statistic increases as the
square root of the sample size increases, moving the distribution of the t-statistic to the right. As
the variation within the data increases, the t-statistic decreases because s is in the denominator.
What happens with decreases in s is NOT that the whole distribution shifts location; rather, the
width of the t-distribution under both the null and alternative hypotheses decreases. As the shape
of the distribution changes, so do the probability of a type II error and the power: the higher the
variance, the lower the power.
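The sketch below pulls these pieces together; the helper function and all of its numbers are illustrative only, and the power formula again relies on the noncentral t-distribution rather than anything we derived in class:

    import numpy as np
    from scipy import stats

    def power_one_sample_t(mu_true, mu_0, sigma, n, alpha=0.05):
        """Two-sided power of a one-sample t-test (illustrative helper)."""
        df = n - 1
        t_crit = stats.t.ppf(1 - alpha / 2, df)
        ncp = (mu_true - mu_0) * np.sqrt(n) / sigma
        return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

    # Power rises with sample size (difference and spread held fixed)...
    for n in (10, 25, 50, 100):
        print(f"n = {n:3d}, sigma = 25 mm: power = {power_one_sample_t(110, 100, 25, n):.3f}")

    # ...and falls as the variation in the data increases (sample size held fixed).
    for sigma in (10, 25, 50):
        print(f"n =  50, sigma = {sigma:2d} mm: power = {power_one_sample_t(110, 100, sigma, 50):.3f}")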
It’s easy to see these effects when thinking about a one-sample t-test because the effects
of each factor on the test statistic and its distribution under the null and alternative hypotheses
are relatively easy to visualize. The effects are the same for any and all test statistics, whether
we talk about analysis of variance, regression, or other test machineries. The key is to realize
that power is a function of the specific alternative hypothesis, sample size, and the variance of
the data. Knowing these facts, you can then study how different choices of experiment and
experimental design can affect each of these elements and design the optimal experiment or
observational hypothesis test for the problem you’re investigating.