HARVARDx | HARPH525T114-G003200_TCPT

In the last module we talked about how the central limit theorem tells us the statistical distribution of the difference of the sample averages. But we were stuck, because we were using an equation that included two quantities that we don't know: the population standard deviations. So what we do in practice is compute the sample standard deviation. Here's the formula for the sample variance. There's a minus one here, which makes this a better estimate of the true standard deviation, and there's some statistical theory that backs that up, which we're not going to get into in this module. So basically what you do is take the sample average and compute the average distance of the sample measurements from that sample average, and it turns out that this is a pretty good estimate of the real population standard deviation, especially when M is large.

So we stick in those quantities, SX and SY, and now we have a ratio that we can actually compute from the sample. And we know what M and N are, because we decided how big to make the samples. The central limit theorem tells us that when N and M are large enough (there are different rules of thumb; one is that 30 is enough, but that doesn't always work), this ratio, called the t-statistic, is normal with mean zero and standard deviation 1. So now we can actually answer that question. We saw 3.3, and if we compute this quantity, it turns out to be a relatively big number. It actually gives us a p-value of about 0.02. So the probability of seeing the t-statistic we saw, under the null hypothesis that there's no difference between men and women, is actually quite small. And that probability comes from the central limit approximation; that's where we get it from.
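The lecture carries out these computations in R on the course data, which isn't shown here. As a rough sketch of the same steps in Python, with simulated heights standing in for the real samples (the means, standard deviations, and sample sizes below are made up for illustration, so the printed t-statistic and p-value will not match the 3.3 and 0.02 from the lecture):

```python
import math
import random

def sample_sd(xs):
    # Sample standard deviation: note the n - 1 in the denominator,
    # which makes this a better estimate of the population SD.
    n = len(xs)
    m = sum(xs) / n
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))

def t_statistic(x, y):
    # t = (mean(x) - mean(y)) / sqrt(sx^2 / M + sy^2 / N)
    M, N = len(x), len(y)
    mx, my = sum(x) / M, sum(y) / N
    sx, sy = sample_sd(x), sample_sd(y)
    return (mx - my) / math.sqrt(sx ** 2 / M + sy ** 2 / N)

# Hypothetical samples standing in for the course's height data:
random.seed(1)
men = [random.gauss(69, 3) for _ in range(10)]
women = [random.gauss(64, 3) for _ in range(10)]

t = t_statistic(men, women)

# Two-sided p-value under the CLT: P(|Z| > |t|) for Z standard normal.
phi = 0.5 * (1 + math.erf(abs(t) / math.sqrt(2)))
p_normal = 2 * (1 - phi)
print(t, p_normal)
```

The denominator of the t-statistic plugs in the computable SX and SY where the unknown population standard deviations used to be, which is exactly the substitution described above.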
We compute a number, and then we ask ourselves: what's the probability of seeing a number that big when the quantity we're observing is normally distributed? All right, now, the central limit theorem depends on N and M being very large. It's an asymptotic result, as we call it in statistics and mathematics. And when we have 10, you're not that close to infinity. So this approximation might not necessarily work. In one of the modules, you're going to get to write your own code to check whether it holds or not.

Now, there is still hope when we have a small sample size, if the distribution of the population is approximately normal. OK, so now we're not talking about the central limit theorem anymore. Now we're talking about the original distribution of the data, which in this case, because it's heights, is actually approximately normal. If the original data were zeros and ones, like coin tosses, this wouldn't work. But for height it works pretty well. So what is this new mathematical result we're going to use? Well, it turns out that if the original population distribution is normal, then the t-statistic, you can show, follows a specific distribution. It's not normal. It's called the t-distribution, and that's where the statistic gets its name. And in R, you can very easily find the probability of observing a number like the one we observed when the distribution under the null is t-distributed. It's not normal, it's t. But again, we know what the t-distribution is; you can look it up in a table or using a computer. There's one little detail about the t-distribution that you need to know to actually compute it, and it's called the degrees of freedom. In this case, I'm just telling you that the degrees of freedom for this statistic would be M plus N minus 2. But that is something you can learn more about by studying the t-distribution and its properties in a statistics book.
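The lecture does this lookup with R's t-distribution functions. As a self-contained sketch of the same idea in Python, the tail probability can be obtained by numerically integrating the t density; the t value of 2.5 below is an arbitrary illustration, not the course's actual statistic:

```python
import math

def t_two_sided_p(t, df, steps=20000, upper=60.0):
    # Two-sided p-value P(|T| > |t|) under the t-distribution with df
    # degrees of freedom, via trapezoid-rule integration of the density.
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def f(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    a = abs(t)
    h = (upper - a) / steps
    tail = (f(a) + f(upper)) / 2 + sum(f(a + i * h) for i in range(1, steps))
    return 2 * tail * h

# Two samples of size M = N = 10 give df = M + N - 2 = 18.
p_t = t_two_sided_p(2.5, 18)
# The same statistic under the normal approximation gives a smaller p-value:
p_normal = 2 * (1 - 0.5 * (1 + math.erf(2.5 / math.sqrt(2))))
print(p_t, p_normal)
```

Comparing the two printed values shows the t-based p-value coming out larger than the normal-based one for the same statistic, which is the behavior discussed next.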
But in any case, if these assumptions hold, we can use a t-distribution instead of the normal. And in this case, the p-value actually becomes a little bigger than the 0.02 we got with the normal distribution. That makes sense, because when we use the t-statistic, we're adding variability in the denominator: these are estimates, and they also can move. So it makes the probability of seeing bigger numbers larger.

So now we're going to reveal the true answer. The actual difference was 5.2. We got 3.3. That's actually something we could have predicted: that our answer could be this far from the truth. We got the answer right, because based on our p-value we did conclude that there was a difference in heights. The actual difference is 5.2, which is not 3.3, but we can see that that is in agreement with our central limit theorem approximations. How do we do that? Well, we can create confidence intervals. And I'm going to highly recommend that when you're reporting results, when possible, you report confidence intervals instead of just p-values. Why is that? Because when you report a confidence interval, you actually give the reader an idea of how large the observed difference or effect size is. When you report a p-value, you only report the probability of seeing something that large under the null, which could be very unlikely, but for a very small and scientifically insignificant difference.

So what is a confidence interval? Just to give an example with the sample average: we know that the standard error of X bar is this quantity here. Then, because we know that it's normally distributed, or if we assume that it's normally distributed, this interval, shown down here, is going to fall on the true population mean 95% of the time. So this gives us an idea, when we report it like this, when we say 3.3 plus or minus 1.2, or whatever it is that we get.
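To illustrate the 95% coverage claim, here is a small Python simulation sketch (the population mean, standard deviation, and sample size are made up, not the course data): it builds the interval x-bar plus or minus 1.96 times s over the square root of n over and over, and checks how often it falls on the true mean.

```python
import math
import random

def confidence_interval(xs, z=1.96):
    # Normal-approximation 95% CI for the population mean:
    # x-bar +/- z * s / sqrt(n), with z = 1.96 (roughly "the 2").
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    se = sd / math.sqrt(n)
    return mean - z * se, mean + z * se

# Simulate: draw many samples from a normal population with known mean,
# and count how often the interval covers (falls on) the true mean.
random.seed(2)
mu, sigma, n = 64.0, 3.0, 30
covered = 0
for _ in range(2000):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    lo, hi = confidence_interval(sample)
    if lo <= mu <= hi:
        covered += 1
print(covered / 2000)  # close to 0.95
```

The observed coverage sits near 95%, which is exactly what "falls on the true population mean 95% of the time" means in practice.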
That gives us an idea of what our estimate is and how much it varies, so it gives us an idea of the possibilities for what the true population mean is. And just like we did this for the sample mean, we can also do it for the difference, which is what we're interested in here. I'm not going to show you the equations, but it's pretty straightforward how to get from here to there. The other thing we want to point out is that this 2 comes from the normal approximation, right? A normal distribution has 95% of its data within two standard deviations of its mean. If instead we're using the t-distribution approximation, then this number is going to change; it's going to get a little bigger. So that's something to keep in mind: when are we using the normal distribution, and when are we using the t-distribution approximation? You're going to have some modules and some labs to practice this and to get a better idea of how this all works in practice.