0.1 Normal Distribution
So far, we have described the distribution of a quantitative variable as follows:
1. Obtain a plot
2. Examine the plot for the overall pattern and deviations from it
3. Calculate an appropriate numerical summary (center, spread)
In addition to this, one may use software to obtain a smooth curve to describe the distribution.
Such curves are called density curves.
A smooth density curve is an idealization that pictures the overall pattern of the data but
ignores minor irregularities as well as any outliers.
For histograms you have to choose the class intervals; a smooth curve removes the need for
this choice.
Result: The density curve pictures the same pattern as the histogram; it is a smooth
approximation to the bars of the histogram.
In general:
Density curve
• A density curve is always on or above the horizontal axis
• The area between the horizontal axis and a density curve always equals 1.
• The area under the curve above a certain interval is the proportion of all observations
that fall in that range.
The histogram pictures the result in the sample, whereas the density curve shows the distribution in the whole population.
In that sense we can think of the density curve as the underlying "model" for the population
the actual observations are coming from. (We randomly pick from all observations the n
measurements we see in our sample, that is, our data set.)
But if that is so, we understand that the mean of the actual observations x̄ is different from
the mean of the underlying population or density curve.
We call the mean of the population or density curve µ (lower case Greek letter mu).
The same holds for the standard deviation, for the actual sample it is called s and the standard
deviation of the population is σ (lower case Greek letter sigma).
One shape of distribution curve shall be discussed in more depth:
1 The normal distributions
One important class of density curves are called normal curves. These curves are symmetric,
unimodal, and bell-shaped. For every combination of a mean µ and a standard deviation σ
there is a different curve.
The mean is the center of the distribution and the standard deviation σ controls the spread of
the distribution. The larger σ the larger the spread.
For any value x, the height of the density curve of a normal distribution is given by
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$
with π = 3.1415... and e = 2.7182...
We will not use this formula in the future, but it is important to see how this distribution is
mathematically described.
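To make the density formula concrete, the following minimal sketch evaluates the height of a normal density curve and approximates the total area under it numerically. The function names `normal_density` and `area_under_curve` are illustrative choices, not part of the original notes, and the midpoint rule is just one simple way to approximate an area.

```python
import math


def normal_density(x, mu, sigma):
    """Height of the N(mu, sigma) density curve at the value x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def area_under_curve(mu, sigma, lo, hi, steps=100_000):
    """Approximate the area under the density between lo and hi (midpoint rule)."""
    width = (hi - lo) / steps
    heights = (normal_density(lo + (i + 0.5) * width, mu, sigma) for i in range(steps))
    return sum(heights) * width


# Height at the mean of N(0, 1), which is 1 / sqrt(2 * pi).
peak = normal_density(0.0, 0.0, 1.0)

# Area from -10 to 10 covers essentially the whole curve for N(0, 1).
total_area = area_under_curve(0.0, 1.0, -10.0, 10.0)

print(round(peak, 4))
print(round(total_area, 4))
```

The area comes out very close to 1, as required for any density curve.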
Three reasons that make the normal distribution so important:
1. Normal distributions can often be used to describe distribution of real data.
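2. It can be proved mathematically (central limit theorem) that for large samples the
distribution of many statistics, such as the sample mean, is approximately normal. This gives
the normal distribution a central role in statistics.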
3. Many statistical tools developed for the normal distribution work very well for data that
follow any symmetric distribution.
For all normal distributions the following rule holds.
The 68–95–99.7 rule
• Approximately 68% of the observations fall within σ of the mean µ.
• Approximately 95% of the observations fall within 2σ of the mean µ.
• Approximately 99.7% of the observations fall within 3σ of the mean µ.
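The rule can be checked numerically. This is a small sketch using `statistics.NormalDist` from the Python standard library; the helper name `proportion_within` is an illustrative assumption.

```python
from statistics import NormalDist


def proportion_within(k, mu=0.0, sigma=1.0):
    """Proportion of an N(mu, sigma) distribution within k standard deviations of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    # Roughly 0.68, 0.95 and 0.997, whatever mu and sigma are.
    print(k, round(proportion_within(k), 4))
```

Calling `proportion_within(2, mu=100.0, sigma=5.0)` gives the same proportion as for the default mean and standard deviation, which illustrates that the rule holds for every normal distribution.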
Notation:
A normal distribution with mean µ and standard deviation σ will be abbreviated by N (µ, σ).
1.1 z–Score as tool for standardization
The z–score is a measure of relative standing.
If you obtained a score on an achievement test, you might want to know your standing in relation
to other people who have taken the test. Are you below or above the mean? How far above or
below are you in relation to the other people?
Definition:
If x is an observation from a distribution with mean µ and standard deviation σ, the standardized value or z-score of x is
$$z = \frac{x-\mu}{\sigma}$$
The z–score tells how many standard deviations the observation falls from the mean, and in
which direction. The standardization makes numbers from different contexts comparable.
The z–score is especially useful for distributions of observations that are approximately normal.
In this case, according to the 68-95-99.7 rule, a z–score outside the interval from −2 to 2
occurs in only about 5% of the cases, and a z–score outside the interval from −3 to 3 in only
about 0.3% of the cases.
Example:
Two graduate students:
accounting major gets job offer for $35000
advertising major gets job offer for $33000
accounting: x̄ = 36000 and s = 1500
advertising: x̄ = 32500 and s = 1000
$$z_{\text{acc}} = \frac{35000 - 36000}{1500} = -0.67$$
$$z_{\text{adv}} = \frac{33000 - 32500}{1000} = 0.5$$
The advertising major can be happier about the job offer than the accounting major.
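The comparison of the two job offers can be reproduced in a few lines. This is a minimal sketch; the function name `z_score` is an illustrative choice.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


accounting_z = z_score(35000, 36000, 1500)
advertising_z = z_score(33000, 32500, 1000)

print(round(accounting_z, 2))   # -0.67
print(round(advertising_z, 2))  # 0.5

# The offer with the larger z-score stands better relative to its own field.
print(advertising_z > accounting_z)  # True
```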
The z–score results from the original observation through a linear transformation
$$z = -\frac{\mu}{\sigma} + \frac{1}{\sigma}\,x$$
that is, z = a + b x with a = −µ/σ and b = 1/σ.
Then from the result we derived earlier, we know that the distribution of the z–score has the
same shape as the distribution of the original value.
We also conclude that the mean of the z–score is
$$\mu_z = a + b\,\mu = -\frac{\mu}{\sigma} + \frac{1}{\sigma}\,\mu = 0$$
and the standard deviation of the z–score is
$$\sigma_z = b\,\sigma = \frac{1}{\sigma}\,\sigma = 1$$
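The same effect can be seen on a data set: standardizing every observation gives values with mean 0 and standard deviation 1. The sketch below uses made-up measurements and, as an assumption, standardizes with the sample mean x̄ and sample standard deviation s in place of µ and σ.

```python
from statistics import mean, stdev

# Hypothetical measurements, made up for illustration only.
observations = [12.0, 15.5, 9.0, 20.0, 14.5]

center = mean(observations)   # sample mean, standing in for mu
spread = stdev(observations)  # sample standard deviation, standing in for sigma

# Linear transformation z = a + b * x with a = -center / spread, b = 1 / spread.
a = -center / spread
b = 1.0 / spread
z_scores = [a + b * x for x in observations]

print(mean(z_scores))   # numerically very close to 0
print(stdev(z_scores))  # numerically very close to 1
```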
Definition: The Normal distribution with µ = 0 and standard deviation σ = 1 is called the
Standard Normal Distribution, N (0, 1).
Theorem: If a variable X has a normal distribution with mean µ and standard deviation σ,
then the standardized variable
$$Z = \frac{X-\mu}{\sigma}$$
has a standard normal distribution.
1.2 Cumulative Proportions
The areas under a normal density curve represent proportions of observations from that specific
normal distribution.
For many statistical tools it is necessary to be able to determine such proportions.
First we will present how to compute these for a Standard Normal Distribution.
Since the normal distribution is a continuous distribution, the following holds for every
normally distributed variable X and every value z:
P (X < z) = P (X ≤ z) = area under the curve from −∞ to z.
The area under the density curve of a normally distributed random variable is very hard to calculate.
There is no simple formula that can be used to calculate the area.
Table A from the textbook tabulates, for many different values z*, the area under the curve
from −∞ to z*, which is called the cumulative area or cumulative proportion, for standard
normally distributed variables.
From now on, Z denotes a standard normally distributed variable (µ = 0 and σ = 1).
Using the table you find that,
• P (Z < −1.75) = P (Z ≤ −1.75) = 0.0401
• P (Z > 1.34) = 1 − P (Z ≤ 1.34) = 1 − 0.9099 = 0.0901
The shaded area equals 0.0901.
• P (−1 ≤ Z ≤ 1) = P (Z ≤ 1) − P (Z ≤ −1) = 0.8413 − 0.1587 = 0.6826 (compare with
the 68-95-99.7 rule).
The shaded area equals 0.6826.
The first proportion can be interpreted as meaning that, in a long sequence of observations
from a Standard Normal distribution, about 4.01% of the observed values will be smaller than
-1.75.
Try this for different values!
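If no table is at hand, cumulative proportions for the standard normal distribution can be computed with software instead. The sketch below uses `statistics.NormalDist` from the Python standard library as a stand-in for Table A; the variable names are illustrative. Results may differ from table values in the last digit because the table entries are rounded.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
Z = NormalDist(0.0, 1.0)

# Area to the left of -1.75.
left_area = Z.cdf(-1.75)

# Area to the right of 1.34, using the complement of the area to the left.
right_area = 1.0 - Z.cdf(1.34)

# Area between -1 and 1, as a difference of two cumulative areas.
middle_area = Z.cdf(1.0) - Z.cdf(-1.0)

print(round(left_area, 4))
print(round(right_area, 4))
print(round(middle_area, 4))
```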
Until this point we covered how to find areas under the standard normal density curve. It
remains to show how to find these for any normal distribution, hopefully using the results for
the standard normal distribution.
Lemma: If X is normally distributed with mean µ and standard deviation σ, then the standardized
variable
$$Z = \frac{X-\mu}{\sigma}$$
is normally distributed with µ = 0 and σ = 1, or Z ∼ N (0, 1).
The following example illustrates how the probability and the percentiles can be calculated by
using the standardization process from the Lemma.
Example: Let X be normally distributed with µ = 100 and σ = 5, X ∼ N (100, 5).
1. Calculate the area under the curve between 98 and 107 for the distribution chosen above.
$$P(98 < X < 107) = P\left(\frac{98-100}{5} < \frac{X-100}{5} < \frac{107-100}{5}\right) = P\left(-\frac{2}{5} < Z < \frac{7}{5}\right) = P(-0.4 < Z < 1.4)$$
This can be calculated using Table A: P (−0.4 < Z < 1.4) = P (Z < 1.4) − P (Z <
−0.4) = 0.9192 − 0.3446 = 0.5746.
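The standardization step can be checked against a direct computation. In this sketch, `statistics.NormalDist` plays the role of Table A, and the variable names are illustrative. Both routes should agree, up to the rounding of table values.

```python
from statistics import NormalDist

mu, sigma = 100.0, 5.0
lower, upper = 98.0, 107.0

# Route 1: standardize the interval limits, then use the standard normal distribution.
standard_normal = NormalDist(0.0, 1.0)
z_lower = (lower - mu) / sigma  # -0.4
z_upper = (upper - mu) / sigma  # 1.4
via_standardization = standard_normal.cdf(z_upper) - standard_normal.cdf(z_lower)

# Route 2: work with N(100, 5) directly.
original = NormalDist(mu, sigma)
direct = original.cdf(upper) - original.cdf(lower)

print(round(via_standardization, 4))
print(round(direct, 4))
```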
Inverse normal calculations
In other situations, we will not be given an interval for which we want to find the area above,
but an area will be given and we are to find the value on the measurement scale so that the
cumulative area up to this value matches the area given.
Definition:
For any particular number r between 0 and 1, the rth percentile x_r of a distribution is a value
such that the cumulative area for x_r is r.
If X is a variable, the rth percentile x_r is given by:
P (X ≤ x_r) = r
To determine the percentiles for a standard normal distribution, we can use Table A from the
textbook again.
• Suppose we want to describe the values that make up the smallest 2%. So we are looking
for the 0.02th percentile z_0.02, with
P (Z ≤ z_0.02) = 0.02.
So look in the body of Table A for the cumulative area 0.0200. The closest entry you will
find is 0.0202, for z = −2.05. This is the best approximation you can find from the table.
The result is that the smallest 2% of the values of a standard normal distribution fall
into the interval (−∞, −2.05].
• Suppose now we are interested in the largest 5%. So we are looking for z*, with
P (Z > z*) = 0.05
In Table A we can only find areas to the left of a given value, so the first step is to determine
the area to the left of z*:
P (Z ≤ z*) = 1 − 0.05 = 0.95
That tells us that in fact z* = z_0.95, the 0.95th percentile.
Checking the table we find the values 0.9495 and 0.9505, with 0.95 exactly in the middle, so
we take the average of the corresponding z values and get
$$z_{0.95} = \frac{1.64 + 1.65}{2} = 1.645$$
• And now we are interested in the most extreme 5%. That means we are looking for the
values outside the middle 95%. Since the normal distribution is symmetric, the most extreme
5% can be split up into the lower 2.5% and the upper 2.5%. Symmetry about 0 implies that
−z_0.025 = z_0.975.
In Table A we find z_0.025 = −1.96, so that z_0.975 = 1.96.
The result is that the 5% most extreme values can be found outside the interval
[−1.96, 1.96].
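The three table lookups above can be cross-checked with an inverse normal calculation in software. This sketch uses `NormalDist.inv_cdf` from the Python standard library in place of a reverse lookup in Table A; the variable names are illustrative. The computed values are more precise than the table, so small differences in the second or third decimal are expected.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1 by default).
Z = NormalDist()

# Smallest 2%: cumulative area 0.02 to the left.
z_002 = Z.inv_cdf(0.02)

# Largest 5%: area 0.05 to the right means area 0.95 to the left.
z_095 = Z.inv_cdf(1.0 - 0.05)

# Most extreme 5%: 2.5% in each tail.
z_0025 = Z.inv_cdf(0.025)
z_0975 = Z.inv_cdf(0.975)

print(round(z_002, 2))
print(round(z_095, 3))
print(round(z_0025, 2), round(z_0975, 2))
```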
The following example illustrates how percentiles of any normal distribution can be calculated
by using the standardization process from the Lemma.
Example: Let X be normally distributed with µ = 100 and σ = 5, X ∼ N (100, 5).
1. Find the 0.3th percentile for this distribution, that is x_0.3. Use
$$0.3 = P(X \le x_{0.3}) = P\left(\frac{X-100}{5} \le \frac{x_{0.3}-100}{5}\right) = P\left(Z \le \frac{x_{0.3}-100}{5}\right)$$
But then (x_0.3 − 100)/5 equals the 0.3th percentile of a standard normal distribution, which
can be found in Table A. The cumulative area closest to 0.3000 in the table is 0.3015, for
z = −0.52, so
$$\frac{x_{0.3}-100}{5} = -0.52$$
This is equivalent to x_0.3 = −0.52 · 5 + 100 = 100 − 2.60 = 97.4.
So the lower 30% of the values of a normally distributed random variable with mean µ = 100 and
σ = 5 fall into the interval (−∞, 97.4].
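The percentile calculation for a general normal distribution can be sketched in code by reversing the standardization: find the percentile z_r of the standard normal distribution, then transform back with x_r = µ + σ · z_r. The helper name `percentile_via_standardization` is an illustrative assumption, and `NormalDist.inv_cdf` stands in for the table lookup, so the result can differ slightly from a value based on a rounded table entry.

```python
from statistics import NormalDist


def percentile_via_standardization(r, mu, sigma):
    """Return x_r with P(X <= x_r) = r for X ~ N(mu, sigma)."""
    z_r = NormalDist().inv_cdf(r)  # percentile of the standard normal distribution
    return mu + sigma * z_r        # undo the standardization z = (x - mu) / sigma


x_03 = percentile_via_standardization(0.3, 100.0, 5.0)

# Check: the cumulative area up to x_03 under N(100, 5) should be 0.3 again.
check = NormalDist(100.0, 5.0).cdf(x_03)

print(round(x_03, 2))
print(round(check, 4))
```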