HARVARDx | HARPH525T114-G001400_TCPT

In this module, we're going to introduce the normal approximation. It's also called the bell curve, and it's also called the Gaussian distribution. So here's a formula for it. It tells us that the proportion of numbers between a and b is given by this integral on the right.

Now, there are two numbers I want you to focus on in this equation on the right: mu and sigma. Those are called the average and the standard deviation. The average is sometimes called the mean. You might hear me use those two interchangeably throughout this class, mean and average. Now notice that those are the only two numbers we need to know. They're the only two numbers that define the integral on the right. So if a list of numbers follows a normal distribution, and we know mu and we know sigma, we know everything there is to know about this distribution. We can answer the question, what proportion of x is there between any a and any b?

So if the distribution of your data is approximated by a normal distribution, then you know, for any interval, what proportion of your data is in that interval. For example, you're going to know that about 95% of your data is within two standard deviations of the average. You're going to know that about 99% of your data is within 2.5 standard deviations of the average. And you know that about 68% is within one standard deviation of the average.

So let's use our example. With height, we can compute the mean, which is about 68 inches, and we can compute the standard deviation, which is about three inches. And, as we will see soon, it is approximately normal. So if someone asks us what proportion of our adults are between 65 inches and 71 inches, we don't have to go back and count.
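As a rough check on the proportions quoted for the height example, here is a minimal Python sketch that evaluates the proportion between a and b under a normal distribution. It uses the standard library's `statistics.NormalDist`; the mean of 68 inches and standard deviation of 3 inches are the rounded values from the talk, and the helper name `proportion_between` is made up for illustration.

```python
from statistics import NormalDist

def proportion_between(a, b, mu, sigma):
    """Proportion of a normal(mu, sigma) distribution that lies between a and b."""
    dist = NormalDist(mu=mu, sigma=sigma)
    return dist.cdf(b) - dist.cdf(a)

# Rounded values quoted in the talk (assumed here, not computed from real data).
mu, sigma = 68.0, 3.0

within_1_sd = proportion_between(mu - sigma, mu + sigma, mu, sigma)  # 65 to 71 inches
within_2_sd = proportion_between(mu - 2 * sigma, mu + 2 * sigma, mu, sigma)
within_2_5_sd = proportion_between(mu - 2.5 * sigma, mu + 2.5 * sigma, mu, sigma)

print(f"within 1 SD:   {within_1_sd:.3f}")
print(f"within 2 SD:   {within_2_sd:.3f}")
print(f"within 2.5 SD: {within_2_5_sd:.3f}")
```

The three proportions come out at roughly 68%, 95%, and 99%, which is why no counting is needed once the mean and standard deviation are known.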
Because we know it's approximated by a normal distribution, and because the interval we were given extends one standard deviation on either side of the average, we know that 68% is going to be very close to the right answer.

So the average, just to define it very quickly, is defined by this formula here. You basically add up your heights, and you divide by the total number of adults, and that's going to give you the average height. The standard deviation is like an average distance from the mean. Notice that here, we're computing the square. This is a mathematical convenience. This is called the variance; it's the standard deviation squared. And what the standard deviation does, again, is take, roughly, an average of the distance between each individual and the mean.

So let's look at the distribution of heights. We show it with a histogram. And let's show a normal distribution with the mean and the standard deviation obtained from the data. And you can see that it looks like a pretty good approximation.

To look at this a little bit more rigorously, we can look at what is called a quantile-quantile plot. So what we do is compute, for example, percentiles. So we go 1%, 2%, 3%. So we compute, what is the number for which 1% of the data is below that number? Then 2%, 3%, 4%, and we compute what are called the percentiles. We do that for our data set, and we also do that for the normal distribution. And what we're plotting here is, on the x-axis, the percentiles for the normal distribution, and on the y-axis, the percentiles for our data. If these two have the same distribution, then these points should fall on the identity line, and in this case they are pretty close. So this is telling us that this is a pretty good approximation.

Now, the final concept I want to describe today is standard units. It's a very useful concept.
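Before moving on, the average, the standard deviation, and the quantile-quantile comparison described above can be sketched in Python as follows. This is only an illustration: the heights are simulated with `random.gauss` as a stand-in for the course data, the helper `sample_percentile` is a hypothetical name, and it uses a simple rank rule rather than any particular textbook definition of a percentile.

```python
import random
from statistics import NormalDist, fmean, pstdev

# Simulated heights in inches: a stand-in for the course data, which is not shown here.
random.seed(0)
heights = [random.gauss(68.0, 3.0) for _ in range(10_000)]

# Average: add up the heights and divide by the number of adults.
mean = fmean(heights)
# Standard deviation: square root of the average squared distance from the mean.
sd = pstdev(heights, mu=mean)

def sample_percentile(sorted_values, p):
    """Value below which roughly a fraction p of the data falls (simple rank rule)."""
    index = min(len(sorted_values) - 1, max(0, round(p * len(sorted_values)) - 1))
    return sorted_values[index]

sorted_heights = sorted(heights)
fitted = NormalDist(mu=mean, sigma=sd)

probs = [k / 100 for k in range(1, 100)]  # 1%, 2%, ..., 99%
theoretical = [fitted.inv_cdf(p) for p in probs]  # x-axis of the Q-Q plot
observed = [sample_percentile(sorted_heights, p) for p in probs]  # y-axis of the Q-Q plot

# If the data are close to normal, the points sit near the identity line.
max_gap = max(abs(t - o) for t, o in zip(theoretical, observed))
print(f"mean = {mean:.2f}, sd = {sd:.2f}, largest gap from identity line = {max_gap:.2f} inches")
```

Plotting `observed` against `theoretical` with any plotting tool would give the quantile-quantile plot; here the largest gap from the identity line serves as a crude numeric summary of how close the points are.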
If you know your data is approximately normal, then you can convert it to what are called standard units, by subtracting the mean and dividing by the standard deviation for each point in your distribution. Those are now going to have a normal distribution with mean 0 and standard deviation 1. So now, when you talk about a 3, or a 2, or a 1, you're not talking about inches or pounds, or whatever other unit. You're basically saying, how many standard deviations away am I from the mean? So if you have a Z of 2, you're quite tall. Or if you have a Z of 3, then you're extremely tall.

So when we convert data to Z scores, we no longer have to think about units. If we have a 2, it is a big number. Regardless of whether it's weight or inches, or whatever it is, we know that, if it's normally distributed, only about 5% of the data is more than two standard deviations away from the mean. So it's a very convenient transformation that we use over and over in this class.

So in summary, if your data is approximated by the normal distribution, then all you need to know about your data is the mean and standard deviation. So those are considered a summary of the data. You say what's the mean, what's the standard deviation, and with those two numbers, you can describe the proportion of any interval.

Now, that is only if your data is normally distributed. In other cases (imagine income distribution, or gene expression without taking the log), the quantities are not normally distributed. Therefore, the mean and standard deviation can't be considered summaries in the same way. They may be useful for certain calculations, but they will not be summaries of the data in the same way as they are when the data is normally distributed.
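To make the standard-units conversion from this module concrete, here is a minimal Python sketch. The heights are made-up values chosen purely for illustration, and `to_standard_units` is a hypothetical helper name, not something from the course materials.

```python
from statistics import fmean, pstdev

def to_standard_units(values):
    """Subtract the mean and divide by the standard deviation for each value."""
    mean = fmean(values)
    sd = pstdev(values, mu=mean)
    return [(v - mean) / sd for v in values]

# Made-up heights in inches, purely for illustration.
heights = [62.0, 65.0, 66.5, 68.0, 68.0, 69.5, 71.0, 74.0]
z = to_standard_units(heights)

# Each z value says how many standard deviations a height is from the mean.
for h, score in zip(heights, z):
    print(f"{h:5.1f} inches -> z = {score:+.2f}")

# After the conversion, the values have mean 0 and standard deviation 1.
print(f"mean of z = {fmean(z):.2f}, sd of z = {pstdev(z):.2f}")
```

The same function applies unchanged to weights in pounds or any other unit, which is the point of the transformation: the resulting numbers are unit-free.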