Transcript
HARVARDx | HARPH525T114-G003200_TCPT
In the last module we talked about how the central limit theorem tells us what the statistical distribution of the difference of the sample averages is. But we were stuck, because we were using an equation that included two quantities that we don't know: the population standard deviations. So what we do in practice is compute the sample standard deviations.
So here's the formula for the sample variance. There's a minus one here, which makes this a better estimate of the true variance, and there's some statistical theory that backs that up, which we're not going to get into in this module. So basically what you do is you take the sample average, and you compute the average squared distance of the sample measurements from that sample average. It turns out that this is a pretty good estimate of the real population standard deviation, especially when M is large.
So we stick in those quantities, SX and SY, and now we have a fraction, a ratio, that we can actually compute, because every piece of it comes from the sample. And we know what M and N are, because we decided how big to make the samples. And the central limit theorem tells us that when N and M are large enough, and there are different rules of thumb for this (one is that 30 is enough, but that doesn't always work),
then this ratio, called the t-statistic, is normal with mean zero and standard deviation 1. So now we can actually answer that question. We saw 3.3, and if we compute this quantity for that 3.3, it turns out to be a relatively big number. It actually gives us a p-value of about 0.02. So the probability of seeing the t-statistic we saw, under the null hypothesis that there's no difference between men and women, is actually quite small. And that probability comes from the central limit approximation; that's where we get it from. We compute a number, and then we ask ourselves: what's the probability of seeing a number that big when the quantity we're observing is normally distributed?
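Here's a rough sketch of that whole computation in R. The samples X and Y are made-up stand-ins for the men's and women's heights, since the course data isn't shown in this transcript:

    # t-statistic and normal-approximation p-value, on made-up samples
    set.seed(1)
    X <- rnorm(25, mean = 70, sd = 3)   # say, men's heights
    Y <- rnorm(25, mean = 65, sd = 3)   # say, women's heights
    M <- length(X); N <- length(Y)
    # difference of averages divided by its estimated standard error
    tstat <- (mean(X) - mean(Y)) / sqrt(var(X)/M + var(Y)/N)
    # two-sided p-value under the central limit (normal) approximation
    2 * (1 - pnorm(abs(tstat)))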
All right. Now, the central limit theorem depends on N and M being very large. It's an asymptotic result, which is what we call it in statistics and in mathematics. And when you have 10, you're not that close to infinity, so this approximation might not necessarily work. In one of the modules, you're going to get to write your own code to check whether it holds or not. Now, there is still hope when we have a small sample size, if the distribution of the population is approximately normal.
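As a preview of the kind of check you'll write, one approach is a simulation like this (the normal population here is made up just to have something to draw from):

    # simulate the null distribution of the t-statistic with small samples
    set.seed(1)
    B <- 10000
    tstats <- replicate(B, {
      X <- rnorm(10); Y <- rnorm(10)   # N = M = 10, far from infinity
      (mean(X) - mean(Y)) / sqrt(var(X)/10 + var(Y)/10)
    })
    # compare the simulated distribution to the standard normal
    qqnorm(tstats); abline(0, 1)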
OK, so now we're not talking about the central limit theorem anymore. Now we're talking about the original distribution of the data, which in this case, because it's heights, is actually approximately normal. If the original data were zeros and ones, like coin tosses, this wouldn't work. But for height it works pretty well. So what is this new mathematical result we're going to use? Well, it turns out that if the original population distribution is normal, then the t-statistic, you can show, follows a specific distribution. It's not normal. It's actually called the t-distribution, and that's where the statistic gets its name.
And in R, you can very easily find out the probability of observing a number like the one we observed when the distribution under the null is the t-distribution. It's not normal, it's t. But again, the t-distribution, we know what it is; you can look it up in a table or using a computer. There's one little detail about the t-distribution that you need to know to actually compute it, and it's called the degrees of freedom. In this case, I'm just telling you that the degrees of freedom for this statistic would be M plus N minus 2. But that is something you can learn more about by studying the t-distribution and its properties in a statistics book.
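Continuing the earlier sketch, the t-distribution version just swaps pnorm for pt with those degrees of freedom:

    # two-sided p-value from the t-distribution, reusing tstat, M, N from above
    2 * (1 - pt(abs(tstat), df = M + N - 2))
    # R's built-in test does the same thing; var.equal = TRUE gives M + N - 2 df
    t.test(X, Y, var.equal = TRUE)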
But in any case, if these assumptions hold, we can use the t-distribution instead of the normal. And in this case, the p-value actually becomes a little bigger than the 0.02 we got with the normal distribution. And that makes sense, because when we use the t-statistic, we're adding variability in the denominator: those standard deviations are estimates, and they can also move. So it makes the probability of seeing bigger numbers larger.
So now we're going to reveal the true answer. The actual difference was 5.2; we got 3.3. That's actually something we could have predicted: that our answer could be this far from the truth. We got the answer right, because based on our p-value we did conclude that there was a difference in heights. The actual difference is 5.2, which is not 3.3, but we can actually see that that is in agreement with our central limit theorem approximations.
How do we do that? Well, we can create confidence intervals. And actually, I'm going to highly recommend that when you're reporting results, when possible, you report confidence intervals instead of just p-values. Why is that? Because when you report a confidence interval, you actually give the reader an idea of how large the observed difference or effect size is. When you report a p-value, you only report the probability of seeing something that large under the null, which could be very unlikely even for a very small and scientifically insignificant difference.
So what is a confidence interval? Just to give an example with the sample average: we know that the standard error of X bar is this quantity here, the sample standard deviation divided by the square root of the sample size. Then, because we know that X bar is normally distributed, or if we assume that it's normally distributed, this interval shown down here, the sample average plus or minus about two standard errors, is going to fall on the true population mean 95% of the time. So this gives us an idea, when we report it like this, when we say 3.3 plus or minus 1.2, or whatever it is that we get: it gives us an idea of what our estimate is and how much it varies. So it gives us an idea of the possibilities for what the true population mean is.
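In R, that interval for a single sample average might look like this (still using the made-up X from the earlier sketch):

    # 95% confidence interval for the population mean, normal approximation
    se <- sd(X) / sqrt(length(X))
    mean(X) + c(-1, 1) * qnorm(0.975) * se   # qnorm(0.975) is about 1.96, the "2"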
So just like we did this for the sample mean, we can actually also do it for the difference, which is what we're interested in here. I'm not going to show you the equations, but it's pretty straightforward how you get from here to there. The other thing we want to point out is that this 2 comes from the normal approximation, right? We say a normal distribution has 95% of its data within two standard deviations of its mean. If instead we're using the t-distribution approximation, then this number is going to change. It's going to get a little bigger.
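For the difference, the same idea applies; here's a sketch where the only change is swapping the normal quantile for the slightly larger t quantile (again reusing X, Y, M, N from above):

    # 95% CI for the difference of averages
    se_diff <- sqrt(var(X)/M + var(Y)/N)
    d <- mean(X) - mean(Y)
    d + c(-1, 1) * qnorm(0.975) * se_diff               # normal approximation
    d + c(-1, 1) * qt(0.975, df = M + N - 2) * se_diff  # t-distribution, a bit wider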
So that's something to keep in mind: when are we using the normal distribution, and when are we using the t-distribution approximation? You're going to have some modules and some labs to practice this and to get a better idea of how this all works in practice.