Chapter 6: The standard deviation as a ruler and the normal models
A high school cross-country runner competes in two races (on different courses) and wants
to know whether she did better in the first or second race. Even though most of the runners
competed in both races, the results of the races were quite different (because the courses
were different). How can performances be compared?¹
Alternatively, race times can be compared by transforming the times to z-scores (also called
standardized scores). The transformation of a race time to a z-score has the effect of measuring the difference between the race time and the mean race time using a single scale,
the units of which are called standard deviation units. The standardized scores can then be
compared without worrying about differences between race courses, weather conditions, and
so on.
For example, a race time that is one standard deviation larger than the mean is not unusually slow, whereas a race time that is three standard deviations below the mean is unusually fast.
If the race time is represented by y, then the z-score is defined to be

z = (y − ȳ)/s,

where ȳ and s are the mean and standard deviation of all the runners' times.
Suppose that race I times have a mean of ȳ_I = 72.3 minutes with a standard deviation of s_I = 13.8 minutes, and the race II times have a mean of ȳ_II = 59.0 minutes with a standard deviation of s_II = 18.5 minutes. Two examples follow.
1. Times for races I and II were 82 and 71 minutes, respectively. The better time, relative to the competitors, is determined by computing the z-scores:

Race I: z = (82 − 72.3)/13.8 = 9.7/13.8 = .70

Race II: z = (71 − 59)/18.5 = .65

In race I, the time of 82 minutes was .70 standard deviations above the race I mean, and in race II the time of 71 minutes was .65 standard deviations above the race II mean of 59 minutes. Her performance relative to the other runners was nearly the same in the two races because the standardized times are nearly equal (.70 and .65).
¹ If the times of all racers were available, then her percentiles could be computed and compared.
2. Consider another runner with a time of 50 minutes in race I and a time of 38 minutes in race II. In which race was his relative performance better, based on the z-score measure?

Race I: z = (50 − 72.3)/13.8 = −22.3/13.8 = −1.62

Race II: z = (38 − 59)/18.5 = −1.14

His relative performance was better (faster) in race I, because his race I time was farther below the race I mean, as measured in standard deviation units.
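For readers who like to check such computations with software, here is a minimal Python sketch; the function name standardize and the dictionaries holding the race summaries are illustrative choices, while the means, standard deviations, and race times are the ones used above.

```python
def standardize(y, mean, sd):
    """Return the z-score of y relative to the given mean and standard deviation."""
    return (y - mean) / sd

# Race summaries from the example above (means and standard deviations of all runners' times)
race_I = {"mean": 72.3, "sd": 13.8}
race_II = {"mean": 59.0, "sd": 18.5}

# Runner 1: 82 minutes in race I, 71 minutes in race II
print(f"{standardize(82, **race_I):.2f}")   # 0.70
print(f"{standardize(71, **race_II):.2f}")  # 0.65

# Runner 2: 50 minutes in race I, 38 minutes in race II
print(f"{standardize(50, **race_I):.2f}")   # -1.62
print(f"{standardize(38, **race_II):.2f}")  # -1.14
```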
Remarks:
• z-scores are unit-less.
• Computing the z-score involves subtracting a measure of center (the mean) and dividing by a measure of spread (the standard deviation). Resistant measures of center and spread might be used in place of the mean and standard deviation. However, it's necessary to use the mean and standard deviation in order to extract accurate information about relative performance.
The median and IQR are used for the related purpose of identifying outliers. The IQR outlier rule identifies an observation as an outlier if it is larger than Q3 + 1.5 × IQR or smaller than Q1 − 1.5 × IQR (a short code sketch of this rule follows these remarks).
• z-scores have several useful properties, provided the data distribution is symmetric, unimodal, and without outliers.
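As promised above, here is a minimal Python sketch of the IQR outlier rule; the function name iqr_outliers, the reliance on numpy's percentile function, and the small example data set are illustrative choices (numpy's quartiles can differ slightly from quartiles computed by the textbook's method).

```python
import numpy as np

def iqr_outliers(data):
    """Return the observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

# Made-up example: 25 sits far above the rest of the data and is flagged
print(iqr_outliers([5, 6, 6, 7, 7, 8, 8, 9, 25]))  # [25]
```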
Example: The dotplot below shows the record times as of 1984 for n = 35 Scottish hill races.² Two variables affecting the record times are the distance in miles and the climb (total elevation gained over the route), in feet. To adjust for distance, the pace (time/distance) is shown in the histogram below. The median pace is M = 6.77 minutes/mile, or 6 minutes, 46.3 seconds per mile; in contrast, the mean pace is 7.016 minutes/mile. The median distance is 6 miles, and the median climb is 1250 feet. The dotplot shows the data. It's not shown in the dotplot, but the Bens of Jura race is 16.0 miles and involves climbing 7500 feet over 7 summits.

[Figure: dotplot of the race data by race name, from Acmony to Two Breweries; horizontal axis: Pace (minutes/mile).]

² The Scottish hill races are instances of fell running, also known as mountain running and hill running, a sport of running and racing, off road, over upland country where the gradient climbed is a significant component of the difficulty. The name arises from the origins of the English sport on the fells of northern Britain, especially those in the Lake District. Fell races are organized on the premise that contenders possess mountain navigation skills and carry adequate survival equipment as prescribed by the organizer.
There is some interest in understanding how the record times vary with distance and elevation gain, since the times reflect attributes related to human endurance. For example, as humans (and presumably other animals) approach the limits of endurance, there is very little gain in performance per unit effort compared to the gain per unit effort when not near the limits of endurance.
A closer look at the z-score transformation: Computing z-scores involves shifting and rescaling a datum y. It's helpful to examine the effects of shifting and re-scaling a set of data values to understand the effect of the z-score transformation.
1. Shifting: Shifting a set of data means adding or subtracting a constant from every
datum. Numerical summaries of the pace data (minutes/mile) are shown below, along with a histogram.

Min   3.87
Q1    5.96
M     6.73
Q3    7.96
Max  12.78
IQR   2.00
ȳ     7.01
s     1.68

[Figure: histogram of Pace (minutes/mile); vertical axis: Frequency.]
Suppose 1 (minute/mile) is added to each datum. How would this affect:
• the median?³
• the mean?⁴
• the IQR?⁵
• the standard deviation?⁶
• the 5-number summary?⁷
³ 1 (minute/mile) larger.
⁴ 1 (minute/mile) larger.
⁵ No change.
⁶ No change.
⁷ Each value would be 1 (minute/mile) larger.
2. Scaling: Scaling a data set is carried out by multiplying all values by the same constant.⁸ Suppose pace is converted from minutes/mile to minutes/km. Since 1 mile ≈ 1.609 km, the conversion is accomplished by dividing each value by 1.609 (equivalently, multiplying by about 0.62). How would this affect:
• the median?⁹
• the mean?¹⁰
• the IQR?¹¹
• the standard deviation?¹²
• the 5-number summary?¹³
3. Standardizing: Standardizing data into z-scores shifts the data by subtracting the mean ȳ and scales the shifted data by dividing by the standard deviation s. Note that the mean and standard deviation of the resulting z-scores are z̄ = 0 and s_z = 1.
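The effects of shifting, scaling, and standardizing can be verified numerically. Below is a minimal sketch assuming Python with numpy; the seven pace values are made up for illustration and are not the hill-race data.

```python
import numpy as np

def summarize(x):
    """Return (median, IQR, mean, sample sd) for a 1-D array."""
    q1, q3 = np.percentile(x, [25, 75])
    return np.median(x), q3 - q1, np.mean(x), np.std(x, ddof=1)

pace = np.array([5.2, 6.1, 6.7, 7.0, 7.4, 8.3, 9.9])   # illustrative values only

shifted = pace + 1                            # 1. shifting: add 1 (minute/mile) to each datum
scaled = pace / 1.609                         # 2. scaling: convert minutes/mile to minutes/km
z = (pace - pace.mean()) / pace.std(ddof=1)   # 3. standardizing

print(summarize(pace))
print(summarize(shifted))   # median and mean are 1 larger; IQR and sd unchanged
print(summarize(scaled))    # every summary is divided by 1.609
print(round(z.mean(), 10), round(z.std(ddof=1), 10))  # (approximately) 0 and 1
```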
Normal models: The normal distribution is an idealized model that is often used for
distributions that are unimodal and roughly symmetric (sometimes called mound-shaped).
• Histograms constructed from a quantitative variable often appear to be bell-shaped. The histogram below is an example.¹⁴

[Figure: histogram of Verbal IQ score; vertical axis: Frequency.]

• Visually, the normal distribution is a bell-shaped curve.

• A normal model is completely specified by the mean µ and standard deviation σ.¹⁵ You only need to know that the data are normal in distribution and the values of µ and σ to know exactly what the distribution looks like.

⁸ Scaling can be carried out by division as well, since multiplication by 1/α is division by α.
⁹ Divided by 1.609 (multiplied by about 0.62).
¹⁰ Divided by 1.609.
¹¹ Divided by 1.609.
¹² Divided by 1.609.
¹³ Each value is divided by 1.609.
¹⁴ The data are verbal IQ scores measured on 2287 eighth-grade pupils in the Netherlands. Snijders, T. A. B. and Bosker, R. J. (1999) Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modelling. London: Sage.
¹⁵ Greek letters are used instead of the symbols ȳ and s. The symbols ȳ and s are used only when computed from data, whereas µ and σ are not computed from data.
• The normal model is remarkably useful as it provides a reasonably good approximation
in many situations. Examples include the distributions of SAT scores, weights of
newborn babies, and heights of adult American women.
• Normal models are easy to use for comparing a single value to a large set of values,
and for characterizing the distribution of a variable. Examples of questions that can
be answered using the normal model include: how does a SAT score of 624 compare
to other scores? Have SAT math scores improved in the past 10 years?
Why use a model? Why not just use the actual distribution (e.g., histogram)?
1. A model is compact. Saying that SAT math scores are approximately normal with
mean µ = 500 and standard deviation σ = 100 (written N (500, 100)) is much more
compact than giving the entire set of SAT scores.
2. It’s easier to work with a model than the full data set and it’s often possible to derive
simple and useful results.
3. Occasionally, there are theoretical reasons why a model should be the correct one.
Here, we will use the normal model only when it is justified by our data.
The normal distribution function: The normal distribution is a curve that describes the relative frequency of values in a population. The equation describing the curve is

f(x) = (1 / (σ√(2π))) exp[ −(x − µ)² / (2σ²) ],   for −∞ < x < ∞.

If the graph of a normal distribution is needed, f(x) is plotted against x for a large number of values. We won't use the distribution function in this class since there are tables and computers that immediately provide f(x).

[Figure: two normal density curves, N(0, 1) on the left (x from −4 to 4) and N(500, 100) on the right (x from 200 to 800).]
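If a value of f(x) is ever needed, the formula can be evaluated directly. The sketch below assumes Python with scipy available (the text itself only requires Table Z or software such as StatCrunch); it compares a hand-coded density, normal_pdf, with scipy.stats.norm.pdf.

```python
import math
from scipy.stats import norm

def normal_pdf(x, mu, sigma):
    """The normal density f(x) with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

for x in (0, 1, 2):
    print(normal_pdf(x, 0, 1), norm.pdf(x, loc=0, scale=1))  # the two columns agree

print(normal_pdf(500, 500, 100))  # peak height of the N(500, 100) curve, about 0.004
```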
• Normal curves have the same shape regardless of the values of µ and σ. Two normal distribution curves are shown above. The left figure corresponds to µ = 0 and σ = 1, and the right figure corresponds to µ = 500 and σ = 100.
• Suppose that the distribution of some data can be approximated by the N(µ, σ) model and we want to characterize the distribution. A standardization process is used since the standardized values will follow the N(0, 1) (standard normal) model and it is easy to work with the N(0, 1) distribution.
• Given a realization (value) x from the N(µ, σ) distribution, the standardized score is computed using the formula

z = (x − µ)/σ.   (1)
• For example, if SAT scores are N (500, 100), then the standardized SAT scores will be
N (0, 1), as shown above. The transformation of some scores using formula (1) is
800 → 3 since (800 − 500)/100 = 3
600 → 1 since (600 − 500)/100 = 1
400 → −1 since (400 − 500)/100 = −1
200 → −3 since (200 − 500)/100 = −3
The empirical (68-95-99.7) rule: This rule is used to quickly characterize a normal distribution. It states that
1. 68% of the values fall within 1 standard deviation of µ.
2. 95% of the values fall within 2 standard deviations of µ.
3. 99.7% of the values fall within 3 standard deviations of µ.
The following figure shows the empirical rule visually.
[Figure: standard normal (z) curve illustrating the empirical rule: 68% of the area lies within 1 standard deviation of the mean, 13.5% in each band between 1 and 2 standard deviations, and 2.35% in each band between 2 and 3 standard deviations.]
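The 68-95-99.7 percentages can be verified with software. A minimal sketch, assuming Python with scipy.stats.norm (after standardizing, any normal model gives the same percentages):

```python
from scipy.stats import norm

# Area under the standard normal curve within k standard deviations of the mean
for k in (1, 2, 3):
    area = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sd: {area:.4f}")   # 0.6827, 0.9545, 0.9973
```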
Example: Suppose that the SAT math scores are approximately N(500, 100) in distribution.
1. According to the empirical rule, about 95% of SAT math scores fall between what two values?¹⁶
2. About what percentage of scores are below 400?¹⁷
3. About what percentage of scores are above 700?¹⁸
4. About what percentage of scores are between 300 and 400?¹⁹
5. What percentile is 400? What percentile is 700?²⁰
¹⁶ 300 and 700, because the empirical rule says that 95% falls within 2 standard deviations of the mean. For the SAT scores, 2 standard deviations below the mean is µ − 2σ = 500 − 2 × 100 = 300, and 2 standard deviations above the mean is µ + 2σ = 500 + 2 × 100 = 700.
¹⁷ 16%, because 400 is 1 standard deviation below the mean and about 16% of the distribution is less than 1 standard deviation below the mean.
¹⁸ 700 is 2 standard deviations above the mean. 2.5% of the distribution is to the right of 2 standard deviations above the mean, so the percentage above 700 is 2.5%.
¹⁹ 16% are below 400 and 2.5% are below 300. The difference is 13.5%.
²⁰ 400 is the 16th percentile because 16% of the scores are less than 400. 700 is the 97.5th percentile since 2.5% of the scores are greater than 700.
6. About what percentage of scores is below 650?²¹
Calculating proportions with the normal distribution: It's necessary to go beyond the empirical rule and the 68, 95, and 99.7 percentages to answer more general questions. As above, the proportion of a set of data between two values corresponds to the area under the normal curve. Remember that the area under the entire normal curve is 1.
Areas under normal curves are determined using a calculator, software (StatCrunch), or
using a table of the standard normal distribution given in Table Z on pages A-110 and A-111
in the back of the text. A similar table can be opened (and saved) by going to the URL
http://www.math.umt.edu/steele/STAT451/normal.pdf. Use of the table is illustrated here.
Example: Suppose the annual snowfall amounts (inches) in Missoula are accurately modeled
by a normal distribution with mean 46 inches and standard deviation 18 inches.
Question: According to the normal model, what proportion of years have snowfall amounts below 25 inches?

Answer: Let y = annual snowfall (inches). We want the area to the left of 25. Convert this to a statement about the N(0, 1) distribution by computing the z-score for 25:

z = (y − µ)/σ = (25 − 46)/18 = −1.17.

The area to the left of y = 25 under the N(46, 18) distribution is the same as the area to the left of z = −1.17 under the N(0, 1) distribution. From Table Z, this area is 0.1210. If the normal model is accurate for the distribution of annual snowfall, then about 12.1% of years in Missoula have snowfall totals that are less than 25 inches.

[Figure: normal density curve for annual snowfall, N(46, 18); horizontal axis: Inches, 0 to 80.]

²¹ We're not ready to answer this question.
Question: According to the normal model, approximately what proportion of years have between 25 and 50 inches of snowfall? Refer to the figure above.

Answer: The area is the same as the area between the corresponding z-scores on the N(0, 1) distribution:

z = (25 − 46)/18 = −1.17 and z = (50 − 46)/18 = .22.

The area between these two values is the area to the left of .22 minus the area to the left of −1.17. From Table Z, the area to the left of −1.17 is .1210 and the area to the left of .22 is .5871. The proportion is

.5871 − .1210 = .4661.
Hence, the normal model estimates that about 46.6% of years have snowfall amounts between 25 and 50 inches. This proportion appears consistent with the figure.
Approximately what proportion of Missoula annual snowfalls are greater than 50 inches?²²
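The Table Z calculations for the snowfall model can also be done directly with software. A minimal sketch, assuming Python with scipy.stats.norm; the answers differ from the table-based ones only because Table Z requires rounding the z-scores to two decimal places.

```python
from scipy.stats import norm

mu, sigma = 46, 18   # the N(46, 18) snowfall model

below_25 = norm.cdf(25, loc=mu, scale=sigma)                           # about 0.122
between_25_and_50 = norm.cdf(50, mu, sigma) - norm.cdf(25, mu, sigma)  # about 0.466
above_50 = norm.sf(50, loc=mu, scale=sigma)                            # about 0.412

print(round(below_25, 3), round(between_25_and_50, 3), round(above_50, 3))
```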
Reversing the standard normal transformation: Sometimes, instead of wanting to know what proportion of values are in some interval, we want to know what value corresponds to some proportion (the reverse question).

Question: According to the normal model, what is the 10th percentile of snowfall amounts? (i.e., what is the snowfall amount such that only 10% of snowfalls are smaller?)

Answer: First, sketch the distribution and guess roughly where the 10th percentile is. Call it the unknown 10th percentile y. The sketch provides a check to see if the answer is reasonable. It should be clear that the 10th percentile here is less than 28 inches (28 inches is 1 standard deviation below the mean).

[Figure: normal density curve for annual snowfall, N(46, 18); horizontal axis: Inches, 0 to 80.]

• First, determine the 10th percentile of the standard normal distribution using Table Z. Look for the value z such that the area to the left of this value is about .10. The closest tabled value is .1003, and it corresponds to z = −1.28. So, the 10th percentile is 1.28 standard deviations below the mean.

• If the 10th percentile y of the N(46, 18) distribution were standardized, then the z-score would be z = −1.28. Mathematically,

(y − 46)/18 = −1.28.

The 10th percentile is y, and solving for y yields

y = 46 − 1.28(18) = 46 − 23.04 = 22.96 inches.

²² Find the proportion below 50 and subtract that proportion from 1. This yields 1 − .5871 = .4129, or 41.3%.
Sketching the normal distributions is very helpful for solving these problems.
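The reverse (percentile) lookup is equally quick with software. A minimal sketch, again assuming scipy.stats.norm; norm.ppf is the inverse of norm.cdf, and the answer differs slightly from 22.96 above because Table Z rounds the z value to −1.28.

```python
from scipy.stats import norm

# 10th percentile of the standard normal, then un-standardize by hand
z10 = norm.ppf(0.10)       # about -1.2816 (Table Z rounds this to -1.28)
print(46 + z10 * 18)       # about 22.93 inches

# Or ask for the 10th percentile of N(46, 18) directly
print(norm.ppf(0.10, loc=46, scale=18))  # same answer, about 22.93 inches
```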
Example: The weight of a certain candy bar is advertised as 8 ounces. The actual weights of
these candy bars are not all 8 ounces, but vary. The distribution of actual weights is closely
approximated by a normal distribution with mean 8.2 ounces and standard deviation .13
ounces.
1. According to the normal distribution, what proportion of the candy bars weigh less than the advertised weight of 8 ounces?

Answer: z = (8 − 8.2)/.13 = −1.54, and the area to the left of z = −1.54 is .0618. Thus, the proportion of bars that weigh less than 8 ounces is .0618.

[Figure: normal density curve for candy bar weight, N(8.2, .13); horizontal axis: Ounces, 7.8 to 8.6.]

2. According to the normal distribution, what proportion of the candy bars weigh between 7.9 and 8.1 ounces?

Answer: z = (8.1 − 8.2)/.13 = −.77, and the area to the left of z = −.77 is .2206. Also,

z = (7.9 − 8.2)/.13 = −2.31,

and the area to the left of −2.31 is .0104. Finally, the proportion of bars that weigh between 7.9 and 8.1 ounces is

.2206 − .0104 = .2102.

3. According to the normal distribution, what is the weight such that only 1% of candy bars weigh more than this weight? (What percentile is being asked for?)

Answer: the 99th percentile. The 99th percentile of the N(0, 1) distribution is 2.33. Set

z = 2.33 = (x − 8.2)/.13.

Solving for x leads to .13 × 2.33 + 8.2 = x, and so x = 8.50 ounces is the 99th percentile.

4. According to the normal model, how unlikely would it be to obtain a chocolate bar that weighed 8.8 ounces?

Answer: z = (8.8 − 8.2)/.13 = 4.615. The area to the right of 4.615 is less than .0001. Less than 1 in 10,000 candy bars are at least 8.8 ounces according to the normal model. It's not impossible, but extremely unlikely.
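All four candy-bar answers can be checked with software. A minimal sketch, assuming scipy.stats.norm; the printed values agree with the table-based answers up to rounding.

```python
from scipy.stats import norm

mu, sigma = 8.2, 0.13   # the N(8.2, .13) model for candy bar weights (ounces)

print(norm.cdf(8.0, mu, sigma))                             # 1. about 0.062
print(norm.cdf(8.1, mu, sigma) - norm.cdf(7.9, mu, sigma))  # 2. about 0.210
print(norm.ppf(0.99, mu, sigma))                            # 3. 99th percentile, about 8.50 ounces
print(norm.sf(8.8, mu, sigma))                              # 4. about 2e-06, far below .0001
```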