Chapter 6: Normal Distributions
Y. Butterworth
Ch. 6 Notes – Foothill M10

Section   Title
1         Graphs of Normal Probability Distributions
2         Standard Units & Areas Under Normal Dist.
3         Areas Under Any Normal Curve
4         Normal Approximation to the Binomial Dist.
§6.1 Graphs of Normal Probability Distributions
The defining formula, called a density function, for all normally distributed random
variables is as follows:
$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left[\frac{x-\mu}{\sigma}\right]^{2}} $$
where μ = mean and σ = std. dev.
The graph of this distribution has the following characteristics:
1) It is bell-shaped
2) It is symmetric about the vertical line x = μ
3) The x-axis forms a horizontal asymptote (the curve approaches, but never touches or crosses, the x-axis)
4) The inflection points (the points where the concavity changes sign) lie one standard deviation from the mean, at x = μ – σ and x = μ + σ
5) Area under the curve is equal to 1
6) There is no area under the curve for a single value, so P(X = x) is 0 for a continuous random variable. We can only determine the area under the curve for an interval: P(X < x) or P(X > x) or P(xS < X < xB), where xS is the smaller value and xB the bigger one. This idea can get messy as we are really talking about a Calculus concept here.
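The characteristics above can be checked numerically. Here is a minimal sketch in Python (not part of the TI-83/84 or EXCEL work in these notes); the helper name normal_pdf and the values μ = 69, σ = 2.8 are chosen only for illustration.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density f(x) = exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * sqrt(2*pi))."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 69.0, 2.8  # illustrative values only

# The hand-written formula should agree with the standard library's density.
library_value = NormalDist(mu, sigma).pdf(70.0)
formula_value = normal_pdf(70.0, mu, sigma)

# Symmetry about x = mu: equal heights at equal distances either side.
left = normal_pdf(mu - 1.5, mu, sigma)
right = normal_pdf(mu + 1.5, mu, sigma)

# Total area: midpoint Riemann sum over mu - 6*sigma to mu + 6*sigma.
step = 0.01
n_steps = int(12 * sigma / step)
start = mu - 6 * sigma
area = step * sum(normal_pdf(start + (i + 0.5) * step, mu, sigma)
                  for i in range(n_steps))

print(f"formula {formula_value:.6f} vs library {library_value:.6f}")
print(f"left {left:.6f}, right {right:.6f}, area {area:.6f}")
```

The area comes out very close to 1 rather than exactly 1 because the sum stops 6 standard deviations from the mean, while the curve itself never reaches the x-axis.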
Here is a picture of a normal curve. We’ll use it to point out some of the key concepts we
are discussing here.
The normal density curve changes shape in two ways:
1) Based on its location along the x-axis – change based upon μ
2) Based on its width and height – change based upon σ
Let’s look at some examples using technology:
Example: Using your TI-83 or 84 we will plot 3 normal density curves with different means.
Curve 1: μ = 0 & σ = 1
Curve 2: μ = 1 & σ = 1
Curve 3: μ = 2 & σ = 1
a) Find μ – 3.5σ for curve 1
b) Find μ + 3.5σ for curve 3
c) In the Y= menu input the following for each curve, so that you have Y1, Y2 & Y3 with equations for curve 1, 2 & 3 respectively.
DISTR → normalpdf(x, μ, σ)
Note: The x is the key X,T,θ,n. Make sure the equal signs of Y1, Y2 & Y3 are highlighted (if not, move the cursor over and press ENTER).
d) In the WINDOW menu change the x-min to a)’s calculation & the x-max to b)’s calculation
e) In the ZOOM menu select ZOOMFIT
f) What do we notice?
Example: Using the TI-83/84 we will plot 3 normal density curves with different standard deviations.
Curve 1: μ = 0 & σ = 0.5
Curve 2: μ = 0 & σ = 1.5
Curve 3: μ = 0 & σ = 2.5
a) Find μ – 3.5σ for curve 3
b) Find μ + 3.5σ for curve 3
c) In the Y= menu input the following for each curve, so that you have Y1, Y2 & Y3 with equations for curve 1, 2 & 3 respectively.
DISTR → normalpdf(x, μ, σ)
Note: The x is the key X,T,θ,n. Make sure the equal signs of Y1, Y2 & Y3 are highlighted (if not, move the cursor over and press ENTER).
d) In the WINDOW menu change the x-min to a)’s calculation & the x-max to b)’s calculation
e) In the ZOOM menu select ZOOMFIT
f) What do we notice this time? If you need to know which graph is which, use TRACE and the up arrow key to scroll through them. The cursor will blink on the graph and the function will appear at the top.
Note: This is not as easy with EXCEL, as we must create a column with the x-values from μ – 3.5σ to μ + 3.5σ in increments of 0.2σ. In a second column, we need to create the y-values, which come from pasting the Statistical function NORMDIST(x, μ, σ, false), where x is the value from the x-column. You will have to replicate this for the 3 curves. Once the data is there you can create a chart using the Chart Wizard. You will use a SCATTERPLOT with smooth, connected points. To input the values you will want to choose SERIES at the first dialogue box, and I advise the use of the column/row selection for inputting the data. After entering the x & y values click ADD and do the next until you have all 3. I may show my finished example in class. It depends on whether my Mac version and the PC will play “nice”.
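For anyone without a TI or EXCEL handy, the two-column table described in the note can be sketched in Python. This is only an illustration: density_table is a made-up helper name, and NormalDist(μ, σ).pdf(x) from the standard library stands in for normalpdf / NORMDIST(x, μ, σ, false).

```python
from statistics import NormalDist

def density_table(mu, sigma, half_width=3.5, step=0.2):
    """(x, f(x)) pairs from mu - 3.5*sigma to mu + 3.5*sigma in steps of 0.2*sigma."""
    dist = NormalDist(mu, sigma)
    n_steps = round(2 * half_width / step)  # 35 steps, so 36 points
    xs = [mu + (-half_width + k * step) * sigma for k in range(n_steps + 1)]
    return [(x, dist.pdf(x)) for x in xs]

# The three curves from the second example (same mean, different std. dev.).
curves = {"Curve 1": (0, 0.5), "Curve 2": (0, 1.5), "Curve 3": (0, 2.5)}

for name, (mu, sigma) in curves.items():
    table = density_table(mu, sigma)
    tallest = max(y for _, y in table)
    print(f"{name}: x from {table[0][0]:.2f} to {table[-1][0]:.2f}, "
          f"tallest tabulated point {tallest:.3f}")
```

Comparing the three printed lines shows the same thing the graphs do: as σ grows the curve spreads out along the x-axis and its peak gets lower.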
Now, let’s recall the Empirical Rule that we studied earlier. Recall that your book had
only loosely referred to this rule and we had developed the entire thing. We will now do
a couple of examples that are like those that we did in §3.2 (p. 11 of my notes). This time
we will be finding probability based upon the Empirical Rule, because those percentages
in the Empirical Rule represent the area under the curve of a normally distributed random
variable, and we know the area under the curve represents probability.
Empirical Rule
68% of the data will fall within 1 standard deviation of the mean
95% of the data will fall within 2 standard deviations of the mean
99.7% of the data will fall within 3 standard deviations of the mean
[Figure: normal curve with the x-axis marked at μ – 2σ, μ – σ, μ, μ + σ and μ + 2σ]
Example: The heights of men in the US are normally distributed with a mean of 69 inches and a standard deviation of 2.8 inches. What percentage of men are:
a) Between 63.4 inches and 74.6 inches?
b) Taller than 69 inches?
c) Shorter than 60.6 inches?
Note: To answer these questions we are using the Empirical Rule and finding the probability based upon
the number of standard deviations from the mean, in other words, based on the z-score that we discussed in
§3.2.
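The idea in the note can be sketched in Python: first count how many standard deviations each height is from the mean, then compare the Empirical Rule percentage with the exact area under the curve. The variable names are illustrative, and NormalDist from the standard library is used here in place of a table or calculator.

```python
from statistics import NormalDist

mu, sigma = 69.0, 2.8  # men's heights from the example, in inches

def z_score(x):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

for height in (63.4, 74.6, 69.0, 60.6):
    print(f"{height} in. is {z_score(height):+.1f} std. dev. from the mean")

heights = NormalDist(mu, sigma)
p_between = heights.cdf(74.6) - heights.cdf(63.4)  # a) within 2 std. dev.
p_taller = 1 - heights.cdf(69.0)                   # b) above the mean
p_shorter = heights.cdf(60.6)                      # c) more than 3 std. dev. below

print(f"a) {p_between:.4f}   b) {p_taller:.4f}   c) {p_shorter:.4f}")
```

The exact areas differ slightly from the Empirical Rule values because 68%, 95% and 99.7% are themselves rounded.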
Example: The number of days of gestation for a human is normally distributed with a mean of 268 days and a standard deviation of 15 days. What is the probability of a gestation lasting:
a) Between 253 and 283 days?
b) Longer than 268 days?
c) Less than 223 days?
Also in this section is a discussion of control charts. A control chart is a time series chart that plots the observed values against horizontal lines at the accepted mean (μ), 2 standard deviations from the mean (μ ± 2σ) and 3 standard deviations from the mean (μ ± 3σ). The values used for μ & σ are the “accepted values” (meaning that they may not be known population values, but after many statistical experiments they have come to be the accepted or average mean and the accepted standard deviation). Control charts help to establish when data is no longer conforming to an accepted distribution. Why would looking at a control chart help?
1) Anything outside 3 standard deviations is considered a rare event, since only 0.3% of all data points would lie outside those “bounds”
2) 95% of all data points will lie within 2 standard deviations
If we have a violation of 1) then we have to seriously consider whether the distribution is still conforming to the accepted one. The probability that a data point outside 3σ is a “false alarm” is 0.003, which means it is highly unlikely to be a “false alarm”.
If we have too many points on one side of the mean or the other, that is also an indication that our mean isn’t where it is expected to be, and is also a cause for alarm. Using probability we can determine that if 9 or more consecutive points are all above or all below the mean, even if they are still within 2σ, then the probability of a “false alarm” is 0.004, which still means it is highly unlikely to be a false alarm.
Finally, if 2 of 3 consecutive points lie between 2σ and 3σ above the mean, or between 2σ and 3σ below it, this is also cause for alarm. Given the small percentage of data that we expect to see in that region, it is very doubtful that we would see 2 of 3 consecutive data points there by chance. Probability again tells us that there is a 0.004 chance that it is a “false alarm”.
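The first two “false alarm” probabilities can be reproduced with a short calculation. This is a sketch under stated assumptions, not the book’s derivation: it assumes the process really is normal with the accepted μ & σ, and that consecutive points are independent, each equally likely to land above or below the mean.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, std. dev. 1

# Signal 1: a single point more than 3 standard deviations from the mean.
# Both tails count, so double the left-tail area below z = -3.
p_beyond_3 = 2 * z.cdf(-3)

# Signal 2: nine consecutive points on the same side of the mean.
# Each point is above (or below) the mean with probability 0.5; either
# side triggers the signal, so double it.
p_run_of_9 = 2 * 0.5 ** 9

print(f"beyond 3 std. dev.: {p_beyond_3:.4f}")
print(f"run of 9 on one side: {p_run_of_9:.4f}")
```

The exact tail area is a little under the 0.003 quoted above because the Empirical Rule rounds the area within 3 standard deviations to 99.7%.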
There is a nice graphic on p. 242 of your book that shows the Out-of-Control Signals for
a control chart.
Example:
Let’s look at #14 on p. 247
The manager of Motel 11 has 316 rooms in Palo Alto, CA. From observation over a long period of time, she knows that on an average night, 268 rooms will be rented. The long-term standard deviation is 12 rooms. This distribution is approximately mound-shaped and symmetrical.
a) For 10 consecutive nights, the following numbers of rooms were rented each night:
Night   1    2    3    4    5    6    7    8    9    10
Rooms   234  258  265  271  283  267  290  286  263  240
Step 1: Calculate μ±2σ and μ±3σ
Step 2: Draw the time series plot with nights on the x-axis and the values of μ, μ±2σ and μ±3σ on the y-axis
[Chart: blank control chart grid; y-axis “rooms” from 200 to 340, x-axis nights from 0 to 11; legend: mean, 2 std above, 2 std below, 3 std above, 3 std below]
Step 3: Are any of the Out-of-Control Signals present?
Your Turn: You do the same with the part b) data. You can use the chart above to draw the Control Chart.
Night   1    2    3    4    5    6    7    8    9    10
Rooms   238  245  261  269  273  250  241  230  215  217
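The three steps can also be sketched in Python as a way to check a hand-drawn chart. The function name out_of_control_signals is made up for this sketch. It reads the third signal as “at least 2 of 3 consecutive points more than 2σ from the mean on the same side”, which is an assumption about how the book words it. The part a) data is used here; the part b) data can be swapped in the same way.

```python
def out_of_control_signals(values, mu, sigma):
    """Return, for each signal, the nights (1-based) at which it fires."""
    signals = {"beyond_3_sigma": [], "run_of_9": [], "two_of_three_beyond_2_sigma": []}
    z = [(v - mu) / sigma for v in values]
    for i, zi in enumerate(z):
        night = i + 1
        # Signal 1: one point more than 3 std. dev. from the mean.
        if abs(zi) > 3:
            signals["beyond_3_sigma"].append(night)
        # Signal 2: this point ends a run of 9 on one side of the mean.
        if i >= 8:
            run = z[i - 8 : i + 1]
            if all(w > 0 for w in run) or all(w < 0 for w in run):
                signals["run_of_9"].append(night)
        # Signal 3 (assumed wording): 2 of the last 3 points beyond 2 std. dev.
        # on the same side of the mean.
        if i >= 2:
            last3 = z[i - 2 : i + 1]
            if sum(w > 2 for w in last3) >= 2 or sum(w < -2 for w in last3) >= 2:
                signals["two_of_three_beyond_2_sigma"].append(night)
    return signals

mu, sigma = 268, 12  # accepted mean and std. dev. for Motel 11

# Step 1: the control limits
limits = (mu - 3 * sigma, mu - 2 * sigma, mu + 2 * sigma, mu + 3 * sigma)
print("limits:", limits)

# Step 3: check the part a) data
part_a = [234, 258, 265, 271, 283, 267, 290, 286, 263, 240]
print(out_of_control_signals(part_a, mu, sigma))
```

An empty list next to a signal name means that signal never fired for the data supplied.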
§6.2 Standard Units & Areas Under Normal Distribution
In this section, we finally get to the z-score and discuss the difference between a raw
score and a z-score. We’ve already discussed the z-score multiple times, but I will reiterate. A raw score is the value of a random variable from a sample. This value must
be standardized to find the probability under the normal distribution when using a table.
Standardizing just means finding the z-score, finding the value of the raw score in terms
of how many standard deviations it is from the mean. As we mentioned before, this helps
in comparing 2 raw scores to one another and also to compare two raw scores from
different distributions!!
Z-Scores
$$ z = \frac{x - \mu}{\sigma} $$
x = raw score
μ = mean
σ = std. dev.
Note: This is valid for a sample too.
Interpreting
Positive z → Above the mean
Negative z → Below the mean
| z | is the distance from the mean, in standard deviations
When comparing two scores, the one with the larger | z | is the more unusual
To find normal probabilities with tables we have to standardize and then use the Standard Normal Table to look up the probability that a standard normal random variable is below a certain value (this is for a left-tail table, which most are). However, we will be using our calculators to find the area under the curve. We will still practice all the skills just as if we were going to use a table. We will learn the following:
1) How to write the desired probability using probability notation and the random variable, X.
2) How to write the desired probability using probability notation and the standardized random variable, Z.
3) How to find the probability of being less than a raw score as if we were going to use a left-tail table.
4) How to find the probability of being greater than a raw score as if we were going to use a left-tail table.
5) How to find the probability of being between two raw scores as if we were going to use a left-tail table.
Example: Fawns in Mesa Verde Nat’l Park have a body weight that is approximately normally distributed with a mean of 27.2 kg and a standard deviation of 4.3 kg (based upon info from The Mule Deer of Mesa Verde Nat’l Park, by GW Mierau and JL Schmidt, Mesa Verde Museum Ass.)
If X is the weight of a fawn in kilograms, answer all the following questions based on the given information.
a) Write the correct probability notation for this scenario & draw a picture for the area under the curve:
What is the probability that a randomly chosen fawn weighs at most 30 kg?
b) Write the correct probability notation for this scenario & draw a picture for the area under the curve:
What is the probability that a fawn weighs more than 19 kg?
c) Write the correct probability notation for this scenario & draw a picture for the area under the curve:
What is the probability that a fawn weighs between 32 and 35 kg?
d) Using the probability notation in a), rewrite your probability in terms of the standardized random variable, Z. Does the picture change?
e) Using the probability in c), rewrite your probability in terms of the standardized random variable, Z.
f) If you were given the information that the standardized random variable was at least 1.28 for a randomly chosen fawn, write this using probability notation and then convert it back to a probability for the random variable, X.
When working with a left-tail table we can only find the area under the curve to the left of a given value, x. Because of this limitation we must learn to rewrite probabilities in terms of the random variable, X, being less than a value. Here are the short-cuts:
P(X < x): Straight table look up.
P(X > x) = 1 – P(X < x) [Technically the subtracted term needs to be P(X ≤ x), but there is no area under a single value x, so for any continuous distribution it doesn’t matter.]
P(xS < X < xB) = P(X < xB) – P(X < xS)
Example:
Using the last example, rewrite parts b & c so that you could look
up the probabilities in a left-tail table.
a)
b)
Now we will find probabilities. Since we have a TI-83/84 we can use the DISTR menu and normalcdf(left limit, right limit, mean, std dev). If you leave off the mean and the std dev the calculator will assume a standard normal, with mean 0 and std dev 1. When looking for a left tail, use a very large negative number (e.g. -1×10^10) for the left limit, and when looking for a right tail use a very large positive number (e.g. 1×10^10) for the right limit.
Example: Now, using the fawn example, find the probabilities in a), b) & c).
a)
b)
c)
d) Use the standardized score you found in part d) and make sure it agrees with the probability that you found in a). Make sure that you take into account that the mean is now 0 and the std. dev. is now 1.
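The left-tail short-cuts can be sketched in Python, where NormalDist(...).cdf(x) plays the part of the left-tail table: it only ever gives the area to the left of x, so the “greater than” and “between” cases have to be rewritten exactly as above. This is an illustration alongside the calculator work, not a replacement for normalcdf, and the variable names are made up.

```python
from statistics import NormalDist

fawn = NormalDist(mu=27.2, sigma=4.3)  # fawn weight in kg, from the example

# a) at most 30 kg: P(X < 30) is a straight left-tail look up
p_at_most_30 = fawn.cdf(30)

# b) more than 19 kg: P(X > 19) = 1 - P(X < 19)
p_more_than_19 = 1 - fawn.cdf(19)

# c) between 32 and 35 kg: P(32 < X < 35) = P(X < 35) - P(X < 32)
p_between_32_35 = fawn.cdf(35) - fawn.cdf(32)

print(f"a) {p_at_most_30:.4f}")
print(f"b) {p_more_than_19:.4f}")
print(f"c) {p_between_32_35:.4f}")
```

Because a left-tail area runs from minus infinity, there is no need for the very large negative left limit that normalcdf requires.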
§6.3 Areas Under Normal Curves
This section discusses how to use the Standard Normal Table to find the probability for any normally distributed random variable, regardless of its mean and standard deviation. We will be using the TI-83/84 to find normal probabilities, but we will also discuss how to find probabilities for any normal distribution using a standard normal, just in case you ever find yourself without a calculator that will calculate these probabilities.
1) Translate the random variable to a z-score
2) Look up in a left-tail table, translating if necessary (a picture will help)
a) P(X < x) = P(Z < z)
b) P(X > x) = 1 – P(X < x) = 1 – P(Z < z)
c) P(xS < X < xB) = P(X < xB) – P(X < xS) = P(Z < zB) – P(Z < zS)
Recall: For finding probabilities with the TI-83/84, use the DISTR menu and normalcdf(lower bound, upper bound, mean, std. dev.). Remember, if you don’t put in the mean and std. dev. the calculator will assume a std. normal.
Example: Let’s do an example where we draw the pictures and prepare for the look up using a Z-score, then we’ll use our calculator to find the probabilities with mean 0 and std dev 1 and compare it with the answer from the original mean and std. dev.
Porphyrin is a pigment in blood protoplasm and other body fluids that is significant in body energy and storage. Let x be a random variable that represents the number of milligrams of porphyrin per deciliter of blood. In healthy adults, x is approximately normally distributed with mean of 38 and std. dev. of 12 (Diagnostic Tests with Nursing Implications, edited by S. Loeb, Springhouse Press). This is problem #26 p. 269 of your text.
a)
• Write the probability that a randomly chosen individual will have a porphyrin level lower than 60 using probability notation.
• Next, convert it to a probability using Z.
• Sketch the picture of the probability that you are trying to find.
• Use your calculator to find the probability, using the z-score.
• Compare it to the answer for the probability of x under the original mean and std. dev.
b) Answer all the same questions as a) for the probability that a randomly chosen individual will have a porphyrin level above 16.
c) Answer all the same questions as a) for the probability that a randomly chosen individual will have a porphyrin level between 16 and 60.
d) Answer all the same questions as a) for the probability that a randomly chosen individual will have a porphyrin level greater than 60.
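The comparison asked for in each part can be sketched in Python: compute each probability once from the original mean and std. dev., and once from the z-scores with a standard normal. The names x_dist, z_dist and z_of are illustrative only, and this stands alongside the calculator work rather than replacing it.

```python
from statistics import NormalDist

MU, SIGMA = 38, 12            # porphyrin level in mg/dL, from the example
x_dist = NormalDist(MU, SIGMA)
z_dist = NormalDist()         # standard normal: mean 0, std. dev. 1

def z_of(x):
    """Translate a raw score to a z-score."""
    return (x - MU) / SIGMA

# Each entry: (probability using X, same probability using Z)
parts = {
    "a) P(X < 60)": (x_dist.cdf(60),
                     z_dist.cdf(z_of(60))),
    "b) P(X > 16)": (1 - x_dist.cdf(16),
                     1 - z_dist.cdf(z_of(16))),
    "c) P(16 < X < 60)": (x_dist.cdf(60) - x_dist.cdf(16),
                          z_dist.cdf(z_of(60)) - z_dist.cdf(z_of(16))),
    "d) P(X > 60)": (1 - x_dist.cdf(60),
                     1 - z_dist.cdf(z_of(60))),
}

for label, (using_x, using_z) in parts.items():
    print(f"{label}: {using_x:.4f} using X, {using_z:.4f} using Z")
```

The two columns agree because standardizing only relabels the horizontal axis; the shaded area itself does not change.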
Our next topic is how to find the value of the random variable given the probability of
that random variable’s occurrence. As before, I’m going to teach you based upon doing
it with a table, so that if you ever find yourself without the TI-83/84 you could still do the
problem.
1) Find the z based upon the reverse table look up (look in the body)
2) Use x = zσ + μ to find x, since z = (x – μ)/σ
Since you will have your calculator, we’ll practice all but the table look-up. You will use the DISTR menu again and this time invNorm(probability, μ, σ). Just as before, if you leave off the mean and std dev, the calculator will assume a std. normal.
Just as before we will draw pictures to help us with the concepts.
1) P(X < x) = known probability
[Figure: normal curve with the area to the left of x shaded and labeled “known probability”]
2) P(X > x) = known probability
[Figure: normal curve with the area to the right of x shaded and labeled “known probability”, converted to a second curve with the area to the left of x labeled “1 – known”]
SO, P(X > x) = 1 – P(X < x) = known ∴ P(X < x) = 1 – known
3) P(-x < X < x) = known
[Figure: normal curve with the area between -x and x shaded and labeled “known prob”, and a second curve with the area to the left of -x labeled “(1 – known)/2”]
So, if –x & x are symmetric about μ, the area in the two tails is 1 – known, so the area below –x is half that probability, (1 – known)/2.
SO, P(-x < X < x) = 1 – 2P(X < -x) = known ∴ P(X < -x) = (1 – known)/2
Example: Human gestation periods are approximately normally distributed with a mean of 268 days and std. dev. of 15 days. Premature babies are those that are in the lowest 4%. Find the length of gestation, in days, that separates the premature babies from the non-premature babies.
Example: The number of years before replacement is necessary for a specific brand of CD players is approximately normally distributed with a mean of 9.4 years and std. dev. of 4.2 years. Find the length of time, in years, that represents the cut-off points for the amount of time that these CD players “consistently” last.
Recall that your book terms “consistently” within _____ std. dev. See p.11 Ch. 5 notes!
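The three pictures translate directly into left-tail areas that can be handed to an inverse look up. Below is a minimal sketch in Python, where NormalDist(...).inv_cdf(p) plays the role of invNorm(p, μ, σ). The helper names right_tail_cutoff and central_cutoffs are made up for illustration, and the gestation example is used for case 1.

```python
from statistics import NormalDist

gestation = NormalDist(mu=268, sigma=15)  # days, from the example

# Case 1: P(X < x) = known. The lowest 4% is already a left-tail area.
cutoff = gestation.inv_cdf(0.04)

# Same answer the table way: reverse look up z, then x = z*sigma + mu.
z = NormalDist().inv_cdf(0.04)
cutoff_via_z = z * 15 + 268

def right_tail_cutoff(dist, known):
    """Case 2: P(X > x) = known, so look up the left-tail area 1 - known."""
    return dist.inv_cdf(1 - known)

def central_cutoffs(dist, known):
    """Case 3: middle area = known, so each tail holds (1 - known)/2."""
    tail = (1 - known) / 2
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

print(f"premature cut-off: {cutoff:.1f} days (via z: {cutoff_via_z:.1f})")
print("top 10% starts at:", round(right_tail_cutoff(gestation, 0.10), 1))
print("middle 95% runs between:",
      tuple(round(v, 1) for v in central_cutoffs(gestation, 0.95)))
```

The 10% and 95% figures in the last two lines are arbitrary illustrations of cases 2 and 3, not values taken from the examples.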
Example:
Most exhibition shows open in the morning and close in the late
evening. A study of Saturday arrival times showed that the
average arrival time was 3 hours and 48 minutes after the doors
opened, and the std. dev. was estimated at about 52 minutes.
Assume the arrival times have a normal distribution.
Remember units must be consistent! This is problem #37 from p. 270 of our
text.
*a)
At what time, after the doors open, will all but 10% of the
people who are coming to the Sat. show have arrived?
b)
At what time, after the doors open, will only 15% of the people
who are coming to the Sat. show have arrived?
*c)
What are the arrival times that represent the cut-off points for
which the “usual” amount of people will have arrived? See p. 11
of Ch. 5 notes!
*Note: I have changed the problem or added the problem to the original.
One of the questions that is of concern when analyzing data is whether we have data that
is approximately normally distributed. When we can answer this question in the
affirmative we have an easier time analyzing data, especially small data sets that don’t
conform to the requirements of the Central Limit Theorem for their normality. We are
now going to see some ways of checking for Normality.
Approximately Normal?
1) Histogram
≈ Bell Shaped (Symmetric)
2) Outliers
No more than 1 outlier, where outliers are values below Q1 – 1.5·IQR or above Q3 + 1.5·IQR
3) Skewness
Pearson’s Index = 3(x-bar – x-tilde)/s is between -1 & 1 (x-bar = mean, x-tilde = median, s = std. dev.)
4) Normal Probability (Quantile) Plot
Z-scores (x’s) plotted against values of the R.V., x (y’s), form ≈ a straight line
Technology can help us with:
1) Histograms, as we have already seen
BTW here are the instructions for doing a histogram with EXCEL’s data analysis package. I had forgotten that a histogram was possible with EXCEL.
Histogram
Input Upper Class Limits into another column
Tools → Data Analysis → Histogram
Highlight the data in the 1st column of your original workbook
Click on Bin Range and highlight the column containing upper class limits
Uncheck the labels box below Bin Range
Check New Workbook
Check Chart Output
Click on OK
2) Pearson’s Skewness: EXCEL & the TI are used to get the mean, median & std. dev., but the calculation must be done by hand
3) Normal Probability Plot: In both EXCEL & the TI we can force the issue
Enter the raw scores & find the Z-Score for each
In EXCEL: =STANDARDIZE(x, μ, σ)
Now, use a scatterplot with just dots and plot the standardized values against the x values (make sure the std. vals are in the 1st column & the x vals are in the 2nd column)
In the TI’s DATA EDITOR, with the cursor on L2, type in (L1-mean)/std dev, where L1 contains the data and L2 will contain the z-scores
Now, use the STAT PLOT menu and turn on the first plot type, the one that looks like dots, by putting your cursor on the graph and pressing ENTER. In the X-List enter L2, and in the Y-List enter L1. Mark with the square. ZOOM and choose ZOOMSTAT.
Example: Now let’s test the normality of the following data which represents the average number of murders per capita for a large city.
12.6, 9.5, 14.9, 11.9, 11.0, 10.7, 9.7, 10.5, 8.2, 7.8, 9.5, 13.1, 8.2, 7.1, 9.2, 8.3, 8.7, 9.6, 10.9, 9.4, 12.1, 15.6, 12.0
a) Create a histogram of the data
b) Calculate Pearson’s Skewness
c) Are there any outliers?
d) Make a Normal Probability Plot
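The by-hand calculations for Pearson’s Skewness and the outlier check can be sketched in Python as a way to check your work. The variable names are illustrative. One assumption to note: statistics.quantiles uses its default “exclusive” quartile method, which may not match the TI’s or EXCEL’s quartiles for every data set.

```python
from statistics import mean, median, stdev, quantiles

rates = [12.6, 9.5, 14.9, 11.9, 11.0, 10.7, 9.7, 10.5, 8.2, 7.8, 9.5, 13.1,
         8.2, 7.1, 9.2, 8.3, 8.7, 9.6, 10.9, 9.4, 12.1, 15.6, 12.0]

# Pearson's Index = 3(x-bar - x-tilde)/s, using the sample std. dev.
x_bar, x_tilde, s = mean(rates), median(rates), stdev(rates)
pearson_index = 3 * (x_bar - x_tilde) / s

# Outlier fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, _, q3 = quantiles(rates, n=4)  # assumption: default "exclusive" method
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [r for r in rates if r < low_fence or r > high_fence]

print(f"mean {x_bar:.3f}, median {x_tilde}, std. dev. {s:.3f}")
print(f"Pearson's Index {pearson_index:.3f}")
print(f"fences {low_fence:.2f} to {high_fence:.2f}, outliers: {outliers}")
```

Compare the printed index with the -1 to 1 guideline and the outlier list with the “no more than 1” guideline when deciding whether the data look approximately normal.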
§6.4 Normal Approximation to the Binomial
This approximation is used appropriately when the following conditions are
met.
np ≥ 5 and nq ≥ 5
or
n > 30
If X ~ Binomial(n, p, q), then X is approximately N(μ, σ), where
μ = np
σ = √(npq)
The only trouble with the normal approximation to the binomial is the discrepancy
between the types of distributions. The binomial is discrete. Remember the “chunky”
probability histograms that we used to show the shape of the distribution? Well, this
causes some problems because there is area under the curve for a given value in the
binomial, whereas in a continuous distribution such as the normal, there is no area under
the curve for any particular value. As such, we have to “fake” the area. This is done with
a continuity correction. This takes into account that each bar on the binomial is one unit
wide, so for each value we subtract 0.5 and add 0.5 to the value to approximate that
interval. Below is what it will look like on the normal curve.
In each line, X on the left is Binomially Dist. and X on the right is Normal with μ = np & σ = √(npq):
P(X = x) ≈ P(x – 0.5 < X < x + 0.5)
P(X > x) ≈ P(X > x + 0.5)
P(X ≥ x) ≈ P(X > x – 0.5)
P(X < x) ≈ P(X < x – 0.5)
P(X ≤ x) ≈ P(X < x + 0.5)
[Figures: for each case, the shaded area on the normal curve starts or ends at x – 0.5 or x + 0.5 rather than at x]
Now, all we need to do is try some examples!!
Example: *Use the normal approximation to the binomial to find the probability that at least 70 of 100 mosquitoes will be killed with an insect spray when the probability of killing them with the spray is 0.75. Use your calculator to calculate the actual probability too and compare.
Step 1: What are n, p & x?
Step 2: Calculate μ & σ
Step 3: Define the probability in terms of the binomial & do the continuity correction.
Step 4: Find the probability using the normal distribution
Your Turn: *If 23% of all patients with high blood pressure have bad side
effects from a certain medicine, use the normal approximation to
find the probability that among 120 patients more than 32 will
have bad side effects.
*Note: Both examples are from Wadpole p. 230 Ex. 25 & p. 238 Ex. 23.
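The four steps for the mosquito example can be sketched in Python, with the exact binomial probability computed alongside for comparison. This is one way to check the calculator work, not the only approach, and the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

# Step 1: n, p & x
n, p, x = 100, 0.75, 70
q = 1 - p

# Step 2: mu & sigma
mu = n * p
sigma = sqrt(n * p * q)

# Step 3: "at least 70" is P(X >= 70) for the binomial; the continuity
# correction turns it into P(X > 69.5) for the normal.
boundary = x - 0.5

# Step 4: right-tail area under the normal curve
approx = 1 - NormalDist(mu, sigma).cdf(boundary)

# Actual binomial probability: add up P(X = k) for k = 70, ..., 100
exact = sum(comb(n, k) * p**k * q**(n - k) for k in range(x, n + 1))

print(f"mu = {mu}, sigma = {sigma:.4f}")
print(f"normal approximation: {approx:.4f}")
print(f"exact binomial:       {exact:.4f}")
```

For the Your Turn problem, “more than 32” is P(X > 32), so the boundary becomes 32 + 0.5 instead of x – 0.5.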