a. High value = 0.5, low value = 0
b. High value = 1.25, low value = −1.5
c. High value = 7.5, low value = 6.25
d. High value = 20.3, low value = 13.6
4. A spinner that yields the values 0 to 10, equally
spaced, is shown:
[Figure: circular spinner dial with the marks 0 through 9 equally spaced around its rim.]
a. Explain why when you spin the spinner,
the exact value that comes up should have
the continuous uniform distribution.
b. What are the mean and standard deviation
of this random variable?
c. You try the spinner a few times, and you
suspect that there is something wrong with
it, namely, that it favors stopping in certain areas. You spin it 100 times, and collect
these data:
Interval    Frequency
0 to 1      14 times
1 to 2      15 times
2 to 3      13 times
3 to 4      16 times
4 to 5      11 times
5 to 6      6 times
6 to 7      4 times
7 to 8      3 times
8 to 9      8 times
9 to 10     10 times
Perform a chi-square test to test whether
the spinner is really giving continuous uniform data.
5. Every time the phone rings, you check the
exact position of the second hand of the analog
clock on the wall. Explain how this is a continuous uniform variable. What are the upper
and lower bounds?
6. The airport shuttle bus will arrive randomly
in the next 3 minutes. Find the following
probabilities.
a. p(wait at least two minutes)
b. p(wait less than 1/2 minute)
c. p(wait between 1 and 2 minutes).
For additional exercises, see page 727.
8.6 THE NORMAL DISTRIBUTION
The most famous distribution is the normal distribution. It is also called the
Gaussian distribution, after the famous mathematician Carl Friedrich Gauss.
More colloquially, it is often referred to as the bell-shaped curve. As seen in
Figure 8.4, this density is certainly bell-shaped.
The normal curve fits an amazing number of situations. The reason for
this lies in the central limit theorem, which is explained in Chapter 11. Figure 8.9 shows the relative frequency histogram of the heights of 102 women
in a particular college statistics class. Figure 8.10 is the relative frequency histogram of the body temperatures of 130 people. The superimposed normal
curves follow both histograms reasonably well.
Figure 8.9 Heights of 102 women. [Relative frequency histogram; horizontal axis: height (inches), 55 to 75.]
Figure 8.10 Body temperatures of 130 people (data are from Allen L. Shoemaker, “What’s
Normal? -- Temperature, Gender, and Heart Rate,”
Journal of Statistics Education, July 1996). [Relative frequency histogram; horizontal axis: body temperature (°F), 96 to 101.]
Figure 8.11 Number of dogs and cats owned by 165 students. [Relative frequency histogram; horizontal axis: number of dogs and cats owned in lifetime, −10 to 40.]
Statistics and Individual Differences
The French scientist A. Quetelet (1796–1874) first noted that if you measure the
heights of a large group of people, the relative frequency polygon of these heights
will resemble the bell-shaped curve of a normal distribution. He also studied the
distributions of other characteristics, such as weight, chest girth, and arm length, and
found that all of these follow nearly the same shape of distribution. This property
of the distribution of physiological characteristics was found to occur so frequently
that the British scientist Sir Francis Galton (1822–1911) coined the term normal to
describe these distributions.
This property of most physiological characteristics has many practical applications.
For example, the designer of an airplane cockpit must arrange it so that
most pilots are comfortable and can reach all of the controls. Clearly, this requires
knowledge of average heights, average arm lengths, and so on, as well as knowledge
of the variability around these averages so that most pilots will be accommodated.
The normal curve superimposed on Figure 8.11 does not fit its histogram well.
This relative frequency histogram shows the total number of dogs and cats that
the 165 people in the statistics class owned during their lives. The largest
number was 40. Note that this is not a physiological measurement. We try to fit
it with a normal curve. The normal curve tends to be centered around the bulk of
the data, which in Figure 8.11 is in the range from 0 to 5. But in order to extend
adequately out to the larger numbers, it also has to extend well below
zero, because the normal curve is always symmetric in shape about its
midpoint. The area to the left of 0 under the best-fitting normal curve in
the graph is about 0.22, which means that 22% of the people should have
fewer than 0 dogs and cats. One cannot have a negative number of pets,
so the normal distribution is not appropriate here. The problem is not merely
that the fitted curve assigns some probability to fewer than 0 cats and dogs;
a probability of 0.01, say, would be tolerable. Rather, the problem is that
the probability is as large as 0.22, indicating a serious error in fitting the
relative frequency histogram of the data with the normal curve. We never
seek a perfect fit of distribution to data, but we always seek a good fit!
From our viewpoint that data represent a sample from a population, we
expect the relative frequency histogram of the sample to fit the population
distribution well.
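The 0.22 area quoted for the pet data can be checked numerically. The sketch below assumes that the best-fitting curve uses the sample mean 4.03 and sample SD 5.18 listed for the dogs and cats data in Table 8.8; the helper name normal_cdf is introduced here only for illustration and is built on the standard library function math.erf.

```python
import math

def normal_cdf(x, mean, sd):
    """Area under the normal curve with the given mean and SD to the left of x."""
    z = (x - mean) / sd  # distance from the mean in SD units
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Assumption: the fitted curve uses the sample mean and SD from Table 8.8.
pet_mean, pet_sd = 4.03, 5.18

# Area to the left of 0, i.e. the share of people the curve says own
# "fewer than 0" pets.
area_below_zero = normal_cdf(0.0, pet_mean, pet_sd)
print(round(area_below_zero, 2))  # 0.22
```

Under that assumption the computed area agrees with the value read from the graph.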
If you look at the normal curves in the figures, you see that they all
have the same shape but different centers (theoretical means) and spreads
(theoretical standard deviations). For the batting averages (Figure 8.4), the
best-fitting normal curve covers the data from about 0.200 to 0.360; for
the heights (Figure 8.9), it covers the data from about 60 to 71 inches; for the
body temperatures (Figure 8.10), it covers the data from about 96 degrees to
101 degrees. For any mean and standard deviation there is a normal curve,
which is one reason the normal is so versatile. The three normal curves in
Figure 8.12 have different means, −10, 0, and 10, but the same standard
deviation of 5. The three normal curves in Figure 8.13, on the other hand,
have the same mean of 0 but three different standard deviations: 1, 5, and
10. The one with standard deviation 1 is the tall narrow curve; the one with
standard deviation 10 is the low flat curve.
Figure 8.12 Normal curves having a standard deviation of 5. [Vertical axis: normal density, 0.0 to 0.08; horizontal axis: −30 to 30.]
Figure 8.13 Normal curves having a mean of 0. [Curves labeled SD = 1, SD = 5, and SD = 10; vertical axis: normal density, 0.0 to 0.4; horizontal axis: −30 to 30.]
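Why a small standard deviation gives a tall narrow curve and a large one gives a low flat curve can be seen from the normal density formula, which this section does not state. The sketch below uses the standard formula, exp(−z²/2) / (SD × √(2π)) with z = (x − mean)/SD; the function name normal_density is chosen here for illustration.

```python
import math

def normal_density(x, mean, sd):
    """Height of the normal curve with the given mean and SD at the point x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Peak height (the height at the mean) for several standard deviations.
peaks = {sd: normal_density(0.0, 0.0, sd) for sd in (1, 2, 5, 10)}
for sd, height in peaks.items():
    print(f"SD = {sd:2d}: peak height = {height:.3f}")

# Changing the mean slides the curve sideways without changing its height.
same_height = [normal_density(m, m, 5) for m in (-10, 0, 10)]
```

The peak is inversely proportional to the standard deviation: roughly 0.4 for SD = 1 and roughly 0.08 for SD = 5, which is consistent with the vertical scales of Figures 8.13 and 8.12.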
These normal curves are symmetric around their mean; that is, the
part of the curve to the right of the mean is a mirror image of the part of
the curve to the left. The calculus-based formula for the theoretical mean
shows that the mean of a distribution is the geometrical center of gravity, or
balance point, of the density curve of the distribution (this is a property of all
theoretical means of distributions). Thus in a symmetrical distribution the
balance point, or theoretical mean, is located right at the theoretical median,
namely, the point that has half the area on either side. Can you think up
examples of curves in which the center of gravity (the mean) does not equal
the theoretical median?
Figure 8.14 Standard normal curve probabilities. [Areas under the curve: 0.025 below −2 × SD; 0.135 from −2 × SD to −1 × SD; 0.34 from −1 × SD to the mean; 0.34 from the mean to 1 × SD; 0.135 from 1 × SD to 2 × SD; 0.025 above 2 × SD.]
Because all normal curves have the same shape, the areas under normal
curves are the same whenever the horizontal axis is given in terms of distances
from the mean in standard deviation units. For example, no matter what
the mean and standard deviation are, the area, and hence the probability,
between the mean and the mean plus one standard deviation is about 0.34.
By symmetry, the area between the mean and the mean minus one standard
deviation is also about 0.34. See Figure 8.14. Putting those areas together, the
probability of being between the mean minus one standard deviation and the
mean plus one standard deviation is about 0.34 + 0.34 = 0.68, that is,
approximately 2/3.
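The claim that these areas do not depend on the particular mean and standard deviation can be checked with a short calculation. This is a sketch only: the helper names are introduced here for illustration, and the second curve borrows the sample mean 65.56 and SD 2.75 reported for the heights data.

```python
import math

def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def area_between(lo, hi, mean, sd):
    """Area under the normal curve (mean, sd) between lo and hi."""
    return (standard_normal_cdf((hi - mean) / sd)
            - standard_normal_cdf((lo - mean) / sd))

# Mean to mean + 1 SD, for two very different normal curves.
a_standard = area_between(0.0, 1.0, 0.0, 1.0)
a_heights = area_between(65.56, 65.56 + 2.75, 65.56, 2.75)

# Mean - 1 SD to mean + 1 SD.
within_one_sd = area_between(-1.0, 1.0, 0.0, 1.0)

print(round(a_standard, 2), round(a_heights, 2), round(within_one_sd, 2))
```

Both one-sided areas round to 0.34, and the two-sided area rounds to 0.68, in line with Figure 8.14.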
Table 8.8 looks at the examples we have seen so far. For the batting
averages the mean was 0.263 and the standard deviation was 0.029. So
Mean − SD = 0.263 − 0.029 = 0.234
Mean + SD = 0.263 + 0.029 = 0.292
It turns out that 178 of the 263 players had batting averages between 0.234
and 0.292, which is a proportion of 0.68, exactly what the normal curve
would indicate. In the case of the heights, the mean minus one standard
deviation and the mean plus one standard deviation are 62.81 and 68.31,
respectively. A proportion of 0.66 of the women had heights in that range—
very close to the normal curve’s value of 0.68. Similarly, 90 of 130, or 69%
of the people measured, had temperatures between 97.52 and 98.98. Those
examples have histograms that look reasonably normal. The dogs and cats
histogram did not look normal. Going down one standard deviation from
the mean gives −1.15, an impossible number of pets. And 89% of the people
are in the range from the mean minus the standard deviation to the mean
plus the standard deviation, which is a lot more than the normal distribution
would predict. This is simply more evidence that the normal distribution
does not fit the pet data well.
Table 8.8  The (Mean ± SD): 67% Rule Empirically Demonstrated

                                                          Actual proportion lying between
                    Mean*    SD*     Mean − SD  Mean + SD  (mean − SD) and (mean + SD)
Batting averages     0.263   0.029     0.234      0.292    178/263 = 0.68
Heights             65.56    2.75     62.81      68.31     67/102 = 0.66
Temperatures        98.25    0.73     97.52      98.98     90/130 = 0.69
Dogs and cats        4.03    5.18     −1.15       9.21     147/165 = 0.89

*The means and SDs are sample means and sample SDs.
Table 8.9  The (Mean ± 2SD): 95% Rule Empirically Demonstrated

                                                                          Actual proportion between
                    Mean     SD      Mean − (2 × SD)  Mean + (2 × SD)  (mean − 2 × SD) and (mean + 2 × SD)
Batting averages     0.263   0.029        0.205            0.321       251/263 = 0.95
Heights             65.56    2.75        60.06            71.06        99/102 = 0.97
Temperatures        98.25    0.73        96.79            99.71        123/130 = 0.95
Dogs and cats        4.03    5.18        −6.33            14.39        159/165 = 0.96
According to the normal curve in Figure 8.14, going from the mean
minus two standard deviations to the mean plus two standard deviations
should yield 13.5% + 34% + 34% + 13.5% = 95% of the data. How does
that work in the examples? For the batting averages,
Mean − 2 × SD = 0.263 − 2 × 0.029 = 0.205
Mean + 2 × SD = 0.263 + 2 × 0.029 = 0.321
Of the 263 players, 251 had batting averages between 0.205 and 0.321, which
is 251/263 = 95%. Table 8.9 shows this and the other examples. This time,
even in the case of the dogs and cats data, about 95% of the data are in the
range. But this was merely a lucky chance occurrence, because the dogs and
cats data do not follow a normal distribution and hence cannot be expected
to follow this 95% rule.
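The comparisons in Tables 8.8 and 8.9 can be reproduced from the summary numbers alone. The sketch below takes the means, SDs, and counts exactly as printed in the tables (the raw data are not available here), and uses the fact that the normal-curve area within k standard deviations of the mean equals erf(k/√2); the names area_within and datasets are illustrative.

```python
import math

def area_within(k):
    """Normal-curve area within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

# name: (mean, SD, observations, count within 1 SD, count within 2 SD),
# copied from Tables 8.8 and 8.9.
datasets = {
    "Batting averages": (0.263, 0.029, 263, 178, 251),
    "Heights": (65.56, 2.75, 102, 67, 99),
    "Temperatures": (98.25, 0.73, 130, 90, 123),
    "Dogs and cats": (4.03, 5.18, 165, 147, 159),
}

for name, (mean, sd, n, in_1sd, in_2sd) in datasets.items():
    for k, count in ((1, in_1sd), (2, in_2sd)):
        lo, hi = mean - k * sd, mean + k * sd
        print(f"{name:17s} k={k}: [{lo:.3f}, {hi:.3f}] "
              f"observed {count / n:.2f}, normal {area_within(k):.2f}")
```

For the heights, the upper two-SD limit works out to 65.56 + 2 × 2.75 = 71.06. The pet data stand out at one SD (0.89 observed against 0.68 predicted) but not at two SDs, which matches the discussion above.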
The conclusion is that if the histogram of a set of data approximately
follows a normal curve, then just by knowing the mean and the standard
deviation of the data, one can estimate the proportion of the data in the
range from the mean minus one standard deviation to the mean plus one
standard deviation, or from the mean minus two standard deviations to the
mean plus two standard deviations. In fact,
we can do much more along these lines, as in the following.
The z Statistic: One normal curve is the most used of them all. It is called
the standard normal curve, or standard normal density, and it is the one
with theoretical mean 0 and theoretical standard deviation 1. Figure 8.15
shows the standard normal curve.
Any kind of normal data (that is, data whose relative frequency histogram is bell-shaped) can be turned into standard normal data (bell-shaped
distribution having a mean of 0 and standard deviation of 1) by standardizing the observations, which means for each observation subtracting the
mean of the observations and then dividing by the standard deviation of the
observations. Such a standardized value is called a z statistic, often called a
standardized score or a z score. As an example, the data columns in Table 8.10
show the calories per hot dog in 20 brands of beef hot dogs. The hot dogs
averaged 156.85 calories, and the standard deviation was 22.07 calories. The
z columns give the corresponding z statistics:
z statistic = (data value − mean) / (standard deviation)
It is preferable to insert the theoretical mean and standard deviation in the
z-statistic formula. However, because the theoretical mean and standard
deviation are usually not known, z statistics typically use the sample mean
and standard deviation, quantities known to estimate their theoretical
counterparts well.
Figure 8.15 Standard normal curve. [Horizontal axis: z, −3 to 3.]
Table 8.10  Hot-Dog Calorie Data and z Statistics

Datum     z     Datum     z     Datum     z     Datum     z     Datum     z
186     1.32    184     1.23    175     0.82    141    −0.72    131    −1.17
181     1.09    190     1.50    148    −0.40    153    −0.17    149    −0.36
176     0.87    158     0.05    152    −0.22    190     1.50    135    −0.99
149    −0.36    139    −0.81    111    −2.08    157     0.01    132    −1.13

Sources: David S. Moore and George P. McCabe, Introduction to the Practice of Statistics (New York:
Freeman, 1989); and Consumer Reports, June 1986, pp. 366–367.
The first brand of hot dog has 186 calories per hot dog. Using the sample
mean and standard deviation, its z statistic is (186 − 156.85)/22.07 = 1.32.
That is, this brand's calorie count is 1.32 standard deviations above
the average (about 157 calories per hot dog). There is one brand with
111 calories per hot dog. Its z statistic is (111 − 156.85)/22.07 = −2.08,
which means it has 2.08 standard deviations fewer calories per hot dog than
the average. Note that this brand of hot dog falls outside the 95% range of
mean ± 2 standard deviations. In summary, the z statistic for an observation
is the number of standard deviations above or below the average. Very often
scientists will report data as z statistics instead of reporting raw data. This is
especially the case if the original scale is rather arbitrary or it is important to
see at a glance where each individual observation ranks relative to the rest
of the data.
The z statistics now conform to the standard normal curve in that the
average of the z statistics is 0, the standard deviation of the z statistics is 1,
and the shape of the histogram is, provided the original histogram also was,
bell-shaped.
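The standardization of the hot-dog data can be sketched in a few lines using Python's standard statistics module. One assumption is worth flagging: the quoted standard deviation of 22.07 matches the version that divides by n (statistics.pstdev); the version that divides by n − 1 (statistics.stdev) gives about 22.64 for these data.

```python
import statistics

# Calories per hot dog for the 20 brands in Table 8.10.
calories = [186, 181, 176, 149, 184, 190, 158, 139, 175, 148,
            152, 111, 141, 153, 190, 157, 131, 149, 135, 132]

mean = statistics.mean(calories)   # 156.85
sd = statistics.pstdev(calories)   # about 22.07 (divides by n)

# z statistic: number of standard deviations above or below the mean.
z_scores = [(x - mean) / sd for x in calories]

print(round(z_scores[0], 2))       # first brand, 186 calories: 1.32
print(round(min(z_scores), 2))     # 111-calorie brand: -2.08

# After standardizing, the mean is 0 and the standard deviation is 1
# (up to floating-point rounding).
z_mean = statistics.mean(z_scores)
z_sd = statistics.pstdev(z_scores)
```

Standardizing shifts and rescales the data but does not change the shape of the histogram, so the z statistics look bell-shaped only if the original data did.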
One important use of z statistics comes in the next section, where we
use the normal curve to estimate proportions for any set of normal-looking
data, no matter what the mean and standard deviation are, using just one
table of probabilities, namely, the table for the standard normal (Table E).
SECTION 8.6 EXERCISES
1. A random variable is assumed to have the
normal distribution with mean 5.0 and standard deviation 1.5.
a. State a range in which 68% of the observations of this random variable will lie.
b. State a range in which 95% of the observations of this random variable will lie.
2. Another random variable is assumed to have
the normal distribution, this time with mean
−3.5 and standard deviation 0.5.
a. State a range in which 68% of the observations of this random variable will lie.
b. State a range in which 95% of the observations of this random variable will lie.
3. A random variable was observed 100 times.
The observations were as follows:
43.50164  42.17374  47.51862  44.4694   49.31204
44.50963  41.58814  44.30996  42.2318   46.89344
48.0677   44.96988  43.53012  45.09659  47.52947
41.21647  39.81531  41.78741  42.98898  50.89797
43.29784  43.77065  45.71922  41.54672  47.38623
43.56211  43.80182  46.89443  40.79547  44.01106
41.97859  47.93914  53.50043  41.1815   43.72759
48.56094  40.81974  46.58111  46.51913  41.1561
45.62537  44.88452  41.70193  48.42776  45.43842
41.4995   48.26565  47.51046  47.30188  42.12499
44.73116  43.92343  40.97019  52.83752  42.42954
49.37738  45.81967  41.13073  42.17687  51.1321
41.49603  45.58613  44.5927   45.38359  42.97241
49.5567   44.59925  44.26008  45.75363  44.42216
44.54217  40.9955   39.9309   48.91718  40.75161
47.00198  45.50033  49.58127  50.40565  47.77691
43.03819  47.79572  42.30571  42.79054  48.66678
44.49416  40.48248  43.50889  41.08607  40.68009
48.65868  44.94595  39.40947  42.92722  45.72352
41.54957  45.28159  43.87346  44.24591  43.61332
Before gathering the data, the researcher hypothesized that the data would follow the
normal distribution with mean 45 and standard deviation 3.
a. If the hypothesis is correct, what percentage of the observations should lie between
42 and 48?
b. What percentage of the values really lie
between 42 and 48?
c. If the hypothesis is correct, what percentage of the observations should lie between
39 and 51?
d. What percentage of the values really lie
between 39 and 51?
e. Does the researcher’s hypothesis seem reasonable?
4. A random variable X has a normal distribution with mean 10 and standard deviation 2.
Calculate the following probabilities:
a. p(X < 10)
b. p(X > 10)
c. p(8 < X < 10)
d. p(6 < X < 12)
5. The following 50 observations were made of
the weights (kg) of 12-year-old boys:
40.96  42.99  38.53  44.81  37.06
44.57  42.11  41.06  48.64  44.61
36.13  50.87  31.00  33.54  44.39
34.20  39.69  38.15  42.96  49.99
37.14  32.82  40.61  52.28  45.82
41.86  37.64  32.04  31.54  40.73
34.85  49.89  54.18  39.58  41.62
38.93  29.71  34.87  26.82  31.02
34.66  38.73  33.01  50.05  35.13
41.42  38.07  44.10  35.03  49.15
The mean of this set of data is 40, and the
standard deviation is 6.43.
a. What percentage of the observations fall
within one standard deviation of the
mean?
b. What percentage of the observations fall
within two standard deviations of the
mean?
c. Do these data seem to follow the normal
distribution in the sense of obeying the 67%
and 95% rules?
6. A random variable X has the standard normal
distribution.
a. What is the mean of X?
b. What is the standard deviation of X?
c. What is the probability of X being between
−1 and 1?
7. A random variable has the normal distribution with mean 3 and standard deviation 2.
Convert the following observations into standard units:
a. 3.234
b. 5.193
c. 1.401
d. −0.0184
8. Repeat Exercise 7, this time for a normal random variable having mean 6 and standard
deviation 3.
a. 5.290
b. 2.816
c. 8.791
d. 10.271
9. Repeat Exercise 7, this time for a normal random variable having mean −5 and standard
deviation 4.
a. −4.823
b. −5.972
c. −11.732
d. 1.672
8.7 USING THE NORMAL-CURVE TABLE
Since the normal curve is so useful, tables have been prepared that provide
the area under the curve to the left of a given value of z. Recall that we used
similar tables to find chi-square probabilities. Remember from the previous
section how you can change any set of observations to z statistics. If you can