HARVARDx | HARPH525T114-G001400_TCPT
In this module, we're going to introduce the normal approximation. It's also called the bell curve, it's also
called the Gaussian distribution.
So here's a formula for it. It tells us that the proportion of numbers between a and b is given by this
integral on the right. Now, there are two numbers I want you to focus on in this equation on the right:
mu and sigma. Those are called the average and the standard deviation. The average is sometimes
called the mean. You might hear me use those two interchangeably throughout this class, mean and
average.
Now notice that those are the only two numbers we need to know. They're the only two numbers that
define the integral on the right. So if a list of numbers follows a normal distribution, and we know mu
and we know sigma, we know everything there is to know about this distribution. We can answer the
question, what proportion of x is there between any a and any b?
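The slide with the formula is not reproduced in this transcript. Assuming it shows the standard normal density, the statement about the proportion of numbers between a and b would read as follows, with mu and sigma as the only two parameters:

```latex
\Pr(a < x < b) = \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) \, dx
```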
So if the distribution of your data is approximated by a normal distribution, then you know for your data,
for any interval, what proportion of your data is in that interval. For example, you're going to know that
95% of your data is within two standard deviations of the average. You're going to know that 99% of
your data is within 2.5 standard deviations of the average. And you know that 68% are within one
standard deviation from the average.
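These rules of thumb can be checked numerically. The following is a minimal sketch using Python's standard library; the helper name `proportion_within` is made up for illustration and is not from the lecture.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
# The proportions below are the same for any mu and sigma.
std_normal = NormalDist(mu=0.0, sigma=1.0)

def proportion_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 2.5):
    print(f"within {k} SD: {proportion_within(k):.4f}")
```

The three values come out at roughly 0.68, 0.95, and 0.99, matching the figures quoted above.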
So let's use our example. With height, we can compute the mean-- it's about 68 inches-- and we can
compute the standard deviation-- it's about three inches. And, as we will see soon, it is approximately
normal. So if someone asks us, what proportion of your data, what proportion of your adults are
between 65 inches and 71 inches, we don't have to go back and count.
Because we know if it's approximated by a normal distribution, and because we know that this interval
we were given was one standard deviation away from the average, we know that 68% is going to be
very close to the right answer.
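As a quick check of the height example, here is a small sketch that treats heights as exactly normal with mean 68 and standard deviation 3 (the rounded values from the lecture) and computes the proportion between 65 and 71 inches.

```python
from statistics import NormalDist

# Normal approximation to adult heights, using the rounded values
# quoted in the lecture: mean 68 inches, SD 3 inches.
heights = NormalDist(mu=68, sigma=3)

# Proportion between 65 and 71 inches (one SD either side of the mean).
prop_65_to_71 = heights.cdf(71) - heights.cdf(65)
print(f"{prop_65_to_71:.3f}")
```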
So the average, just to define it very quickly, is defined by this formula here. You basically add up your
heights, and you divide by the total number of adults, and that's going to give you the average height.
The standard deviation is like an average distance from the mean. Notice that here, we're computing
the square. This is a mathematical convenience. This is called a variance, it's the standard deviation
squared. And what it does, again, is take, roughly, an average of the distance between each individual
and the mean.
So let's look at the distribution of heights. We show those with a histogram. And let's show a normal
distribution with mean obtained from the data and standard deviation obtained from the data. And you
can see that it looks like a pretty good approximation.
To look at this a little bit more rigorously, we can look at what is called a quantile-quantile plot. So what
we do is, we compute, for example, the percentiles. So we go 1%, 2%, 3%. So we compute, what is the
number for which 1% of the data is below that number? 2%, 3%, 4%, and we compute what are called
the percentiles.
We do that for our data set, and we also do that for the normal distribution. And what we're plotting here
is, on the x-axis, the percentiles for the normal distribution, and on the y-axis, the percentiles for our
data. If these two have the same distribution, then these points should fall on the identity line, and in
this case they are pretty close. So this is telling us that this is a pretty good approximation.
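The points of a quantile-quantile plot can be computed without any plotting library. The sketch below uses simulated heights as a stand-in for real data (an assumption for illustration) and compares their percentiles with those of a normal distribution that has the same mean and standard deviation.

```python
from statistics import NormalDist, fmean, pstdev, quantiles

# Simulated heights standing in for real data (illustration only).
data = NormalDist(mu=68, sigma=3).samples(2000, seed=1)

# Sample percentiles: the 99 cut points at 1%, 2%, ..., 99%.
sample_percentiles = quantiles(data, n=100)

# Matching percentiles of a normal distribution with the data's mean and SD.
fitted = NormalDist(mu=fmean(data), sigma=pstdev(data))
normal_percentiles = [fitted.inv_cdf(p / 100) for p in range(1, 100)]

# Each pair is one point on the plot (x = normal, y = data).
qq_points = list(zip(normal_percentiles, sample_percentiles))

# Largest vertical gap between the points and the identity line.
max_gap = max(abs(y - x) for x, y in qq_points)
print(f"largest deviation from the identity line: {max_gap:.2f} inches")
```

A small gap relative to the standard deviation suggests the normal approximation is reasonable; plotting `qq_points` against the identity line shows the same thing visually.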
Now, the final concept I want to describe today is standard units. It's a very useful concept. If you know
your data is approximately normal, then you can convert it to what are called standard units, by
subtracting the mean and dividing by the standard deviation for each point in your distribution.
Those are now going to have a normal distribution with mean 0 and standard deviation 1. So now, when
you talk about a 3, or a 2, or a 1, you're not talking about inches or pounds, or whatever other unit.
You're basically saying, how many standard deviations away am I from the mean?
So if you have a Z of 2, you're quite tall. Or if you have a Z of 3, then you're extremely tall. So when we
convert data to Z scores, we no longer have to think about units. If we have a 2, it is a big number.
Regardless of whether it's pounds or inches, or whatever it is, we know that, if it's normally distributed, only about 5%
of the data is more than two standard deviations away from the mean. So it's a very convenient
transformation that we use over and over in this class.
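Converting to standard units is a one-line calculation. The following sketch uses a short list of made-up heights (not the course data) and also computes the proportion of a normal distribution that lies more than two standard deviations from the mean.

```python
from statistics import NormalDist, fmean, pstdev

# Hypothetical heights in inches, made up for illustration only.
heights = [63.0, 65.5, 67.0, 68.0, 69.5, 71.0, 72.0]

mu = fmean(heights)      # average
sigma = pstdev(heights)  # standard deviation (dividing by n)

# Standard units: subtract the mean, divide by the standard deviation.
z_scores = [(x - mu) / sigma for x in heights]

# After the conversion the values have mean 0 and SD 1, with no units.
print(f"mean of z = {fmean(z_scores):.2f}, sd of z = {pstdev(z_scores):.2f}")

# Under the normal approximation, the proportion more than 2 SDs
# from the mean, counting both tails.
tail_beyond_2 = 2 * (1 - NormalDist().cdf(2))
print(f"proportion beyond 2 SD: {tail_beyond_2:.3f}")
```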
So in summary, if your data is approximated by the normal distribution, then all you need to know about
your data is the mean and standard deviation. So those are considered a summary of the data. You say
what's the mean, what's the standard deviation-- with those two numbers, you can describe the
proportion of any interval.
Now, that is only if your data is normally distributed. In other cases-- imagine income distribution, or
gene expression without taking the log-- these quantities are not normally distributed. Therefore, the
mean and standard deviation can't be considered summaries in the same way. They may be useful for
certain calculations, but they will not be summaries of the data in the same way as they are when the
data is normally distributed.
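To see how the summary can mislead, here is a sketch under an assumed model that is not from the lecture: a quantity whose logarithm is standard normal (a log-normal distribution, a common rough model for skewed data such as income). It computes the true proportion within one standard deviation of the mean, which the normal rule would put at about 68%.

```python
import math
from statistics import NormalDist

# Assumption for illustration: the log of the quantity is standard normal.
log_scale = NormalDist(mu=0.0, sigma=1.0)

# Mean and SD of the quantity on its original (unlogged) scale,
# from the standard log-normal formulas.
mean = math.exp(log_scale.mean + log_scale.variance / 2)
sd = math.sqrt(
    (math.exp(log_scale.variance) - 1)
    * math.exp(2 * log_scale.mean + log_scale.variance)
)

def prop_below(x):
    """Proportion of the log-normal quantity below x (zero for x <= 0)."""
    return log_scale.cdf(math.log(x)) if x > 0 else 0.0

# True proportion within one SD of the mean.
within_one_sd = prop_below(mean + sd) - prop_below(mean - sd)
print(f"within one SD of the mean: {within_one_sd:.3f} (normal rule: 0.683)")
```

Under this assumed model the proportion is well above 68%, so the mean and standard deviation alone would not describe the intervals correctly.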