Download Sec 3.2 Navidi

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
SECTION 3.2
MEASURES OF SPREAD
Section 3.2 - Objectives
1. Compute the range of a data set
2. Compute the variance of a population and a sample
3. Compute the standard deviation of a population and a sample
4. Approximate the standard deviation using grouped data
5. Use the Empirical Rule to summarize data that are unimodal and approximately symmetric
6. Use Chebyshev’s Inequality to describe a data set
7. Compute the coefficient of variation
Objective 1
Compute the range of a data set
The Range
The range of a data set is a measure of spread. That is, it
measures how spread out the data are.
The range of a data set is the difference between the
largest and the smallest value.
Range = Largest Value – Smallest Value
Example 3.10
The following table presents the average monthly temperature, in degrees
Fahrenheit, for the cities of San Francisco and St. Louis. Compute the range for
each city.
City            Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
San Francisco    51   54   55   56   58   60   60   61   63   62   58   52
St. Louis        30   35   44   57   66   75   79   78   70   59   45   35
Source: National Weather Service
Solution
The largest value for San Francisco is 63 and the smallest is 51.
The range for San Francisco is 63 – 51 = 12.
The largest value for St. Louis is 79 and the smallest is 30.
The range for St. Louis is 79 – 30 = 49.
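The arithmetic above can be checked with a short Python sketch. The list names and the helper `data_range` are illustrative choices, not part of the original slides; the temperatures are copied from the table in Example 3.10.

```python
# Average monthly temperatures (degrees Fahrenheit), January through December,
# copied from the table in Example 3.10.
san_francisco = [51, 54, 55, 56, 58, 60, 60, 61, 63, 62, 58, 52]
st_louis = [30, 35, 44, 57, 66, 75, 79, 78, 70, 59, 45, 35]


def data_range(values):
    """Return the largest value minus the smallest value."""
    return max(values) - min(values)


sf_range = data_range(san_francisco)
stl_range = data_range(st_louis)

print("San Francisco range:", sf_range)  # 12
print("St. Louis range:", stl_range)     # 49
```

Only `max` and `min` enter the computation, which is why the range ignores every value in between.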
The Range is Not Used in Practice
Although the range is easy to compute, it is not often used in
practice. The reason is that the range involves only two
values from the data set: the largest and the smallest.
The measures of spread that are most often used are the
variance and the standard deviation, which use every value
in the data set.
Objective 2
Compute the variance of a population and a
sample
Variance
When a data set has a small amount of spread, like the San
Francisco temperatures, most of the values will be close to
the mean. When a data set has a larger amount of spread,
more of the data values will be far from the mean.
The variance is a measure of how far the values in a data
set are from the mean, on the average.
The variance is computed slightly differently for populations
and samples. The population variance is presented first.
Definition: Population Variance
Let x₁, x₂, x₃, …, x_N denote the values in a population of
size N. Let μ denote the population mean. The population
variance, denoted by σ², is
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$
Procedure for Computing the Population Variance
Following is the procedure for computing the population variance of a data set:
Step 1: Compute the population mean μ.
Step 2: For each population value xᵢ, compute xᵢ – μ. This is called the deviation
for the value xᵢ.
Step 3: Square the deviations to obtain the quantity (xᵢ – μ)².
Step 4: Sum the squared deviations to obtain the quantity Σ(xᵢ – μ)².
Step 5: Divide the sum obtained in Step 4 by the population size N to obtain
the population variance σ².
Example 3.11
Compute the population variance for the San Francisco temperatures.
Month           Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
San Francisco    51   54   55   56   58   60   60   61   63   62   58   52
Solution
Step 1: Compute the population mean μ.
$$\mu = \frac{\sum x_i}{N} = \frac{51 + 54 + 55 + 56 + 58 + 60 + 60 + 61 + 63 + 62 + 58 + 52}{12} = 57.5$$
Step 2: For each population value xᵢ, compute xᵢ – μ. These values are shown
in the second row below.
xᵢ             51     54     55     56     58     60     60     61     63     62     58     52
xᵢ – μ       –6.5   –3.5   –2.5   –1.5    0.5    2.5    2.5    3.5    5.5    4.5    0.5   –5.5
Solution
Step 3: Square the deviations to obtain the quantity (xᵢ – μ)². These values
are shown in the third row.
xᵢ             51     54     55     56     58     60     60     61     63     62     58     52
xᵢ – μ       –6.5   –3.5   –2.5   –1.5    0.5    2.5    2.5    3.5    5.5    4.5    0.5   –5.5
(xᵢ – μ)²   42.25  12.25   6.25   2.25   0.25   6.25   6.25  12.25  30.25  20.25   0.25  30.25
Solution
Step 4: Sum the squared deviations to obtain the quantity Σ(xᵢ – μ)².
$$\sum (x_i - \mu)^2 = 42.25 + 12.25 + 6.25 + 2.25 + 0.25 + 6.25 + 6.25 + 12.25 + 30.25 + 20.25 + 0.25 + 30.25 = 169$$
Step 5: Divide the sum obtained in Step 4 by the population size N to obtain
the population variance σ².
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{169}{12} = 14.083$$
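The five steps of this solution can be sketched in Python as follows. The function name `population_variance` and the variable names are assumptions made for illustration; the data are the San Francisco temperatures from the example.

```python
# San Francisco average monthly temperatures (degrees Fahrenheit).
temps = [51, 54, 55, 56, 58, 60, 60, 61, 63, 62, 58, 52]


def population_variance(values):
    """Follow the five-step procedure for the population variance."""
    n = len(values)
    mu = sum(values) / n                      # Step 1: population mean
    deviations = [x - mu for x in values]     # Step 2: deviations x_i - mu
    squared = [d ** 2 for d in deviations]    # Step 3: squared deviations
    total = sum(squared)                      # Step 4: sum of squared deviations
    return total / n                          # Step 5: divide by N


sigma_squared = population_variance(temps)
print(round(sigma_squared, 3))  # 14.083
```

The standard library function `statistics.pvariance` performs the same division by N and could be used as a cross-check.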
Sample Variance
When the data values come from a sample rather than a
population, the variance is called the sample variance. The
procedure for computing the sample variance is a bit different
from the one used to compute a population variance.
In the formula, the mean μ is replaced by the sample mean x̄ and
the denominator is n – 1 instead of N. The sample variance is
denoted by s².
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
Why Divide by n – 1?
When computing the sample variance, we use the sample mean to
compute the deviations. For the population variance we use the
population mean for the deviations.
It turns out that the deviations using the sample mean tend to be a
bit smaller than the deviations using the population mean. If we
were to divide by n when computing a sample variance, the value
would tend to be a bit smaller than the population variance.
It can be shown mathematically that the appropriate correction is
to divide the sum of the squared deviations by n – 1 rather than n.
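The tendency described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population (the faces of a die), the sample size of 5, the number of trials, and the seed are all hypothetical choices, and samples are drawn with replacement.

```python
import random


def average_variance_estimates(population, sample_size, trials, seed=0):
    """Draw many samples (with replacement) and average two variance estimates:
    one dividing by n, the other dividing by n - 1."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    total_div_n = 0.0
    total_div_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.choice(population) for _ in range(sample_size)]
        xbar = sum(sample) / sample_size
        sum_sq_dev = sum((x - xbar) ** 2 for x in sample)
        total_div_n += sum_sq_dev / sample_size
        total_div_n_minus_1 += sum_sq_dev / (sample_size - 1)
    return total_div_n / trials, total_div_n_minus_1 / trials


# Hypothetical population: the six faces of a die.
population = [1, 2, 3, 4, 5, 6]
mu = sum(population) / len(population)
pop_var = sum((x - mu) ** 2 for x in population) / len(population)

avg_div_n, avg_div_n_minus_1 = average_variance_estimates(
    population, sample_size=5, trials=20000
)
print("population variance:      ", round(pop_var, 3))
print("average when dividing by n:", round(avg_div_n, 3))
print("average when dividing by n - 1:", round(avg_div_n_minus_1, 3))
```

Under these assumptions, the average of the divide-by-n estimates should come out noticeably below the population variance, while the divide-by-(n – 1) average should land close to it.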
Example 3.12
A company that manufactures batteries is testing a new type of battery
designed for laptop computers. They measure the lifetimes, in hours, of six
batteries, and the results are presented in the following table. Find the sample
variance of the lifetimes.
Battery Lifetime (hours):  3  4  6  5  4  2
Solution
Step 1: Compute the sample mean.
$$\bar{x} = \frac{\sum x_i}{n} = \frac{3 + 4 + 6 + 5 + 4 + 2}{6} = 4$$
Solution (continued):
Step 2: For each sample value xᵢ, compute xᵢ – x̄. These values are shown
in the second row below.
xᵢ          3    4    6    5    4    2
xᵢ – x̄     –1    0    2    1    0   –2
Step 3: Square the deviations to obtain the quantity (xᵢ – x̄)². These values
are shown in the third row.
xᵢ          3    4    6    5    4    2
xᵢ – x̄     –1    0    2    1    0   –2
(xᵢ – x̄)²   1    0    4    1    0    4
Solution (continued):
Step 4: Sum the squared deviations to obtain the quantity Σ(xᵢ – x̄)².
$$\sum (x_i - \bar{x})^2 = 1 + 0 + 4 + 1 + 0 + 4 = 10$$
Step 5: Divide the sum obtained in Step 4 by n – 1 to obtain the sample
variance s².
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{10}{6 - 1} = \frac{10}{5} = 2$$
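As a check on this solution, the sample variance can be computed in Python both by hand and with the standard library. The variable names are illustrative; `statistics.variance` divides by n – 1, which matches the sample variance formula used here.

```python
import statistics

# Battery lifetimes in hours, from Example 3.12.
lifetimes = [3, 4, 6, 5, 4, 2]

n = len(lifetimes)
xbar = sum(lifetimes) / n                              # Step 1: sample mean
sum_sq_dev = sum((x - xbar) ** 2 for x in lifetimes)   # Steps 2-4
s_squared = sum_sq_dev / (n - 1)                       # Step 5: divide by n - 1

print(xbar, sum_sq_dev, s_squared)  # 4.0 10.0 2.0

# Cross-check with the standard library (also uses the n - 1 denominator).
library_value = statistics.variance(lifetimes)
```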
Objective 3
Compute the standard deviation of a population
and a sample
Standard Deviation
Because the variance is computed using squared deviations, the units of
the variance are the squared units of the data. For example, in the Battery
Lifetime example, the units of the data are hours, and the units of the
variance are squared hours. In most situations, it is better to use a
measure of spread that has the same units as the data.
We do this simply by taking the square root of the variance. This
quantity is called the standard deviation. The standard deviation of a
sample is denoted by s, and the standard deviation of a population is
denoted by σ.
$$s = \sqrt{s^2} \qquad \sigma = \sqrt{\sigma^2}$$
Example 3.13
Recall that in the Battery Lifetime example, the sample variance was computed
as s² = 2. Find the sample standard deviation.
Battery Lifetime (hours):  3  4  6  5  4  2
Solution
The sample standard deviation, s, is the square root of the sample variance.
$$s = \sqrt{s^2} = \sqrt{2} = 1.414$$
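The same square-root step can be sketched in Python. The variable names are illustrative; `statistics.stdev` and `statistics.pstdev` are the standard library's sample and population standard deviations, shown side by side for comparison only.

```python
import math
import statistics

# Battery lifetimes in hours.
lifetimes = [3, 4, 6, 5, 4, 2]

s_squared = statistics.variance(lifetimes)   # sample variance (n - 1 denominator)
s = math.sqrt(s_squared)                     # sample standard deviation

print(round(s, 3))  # 1.414

# The library can also compute both standard deviations directly.
s_library = statistics.stdev(lifetimes)      # sample: divides by n - 1
sigma_library = statistics.pstdev(lifetimes) # population: divides by N
```

Treating the same six values as a whole population would give a slightly smaller result, since the sum of squared deviations is divided by 6 instead of 5.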
Standard Deviation on the TI-84 PLUS
The following steps will compute the standard deviation for both sample data
and population data on the TI-84 PLUS Calculator:
Step 1: Enter the data into L1 in the data editor.
Step 2: Run the 1-Var Stats command (the same command used for means and
medians), selecting L1 as the location of the data.
Example – 3.14 (TI-84 PLUS Calculator)
Using the TI-84 PLUS Calculator, find the sample variance.
A company that manufactures batteries is testing a new type of battery
designed for laptop computers. They measure the lifetimes, in hours, of six
batteries, and the results are presented in the following table.
Battery Lifetime (hours):  3  4  6  5  4  2
Solution:
Running the 1-Var Stats command, we find the
sample standard deviation to be s = 1.414213562.
We square this quantity to find the sample variance:
s² = (1.414213562)² = 2.
Standard Deviation & Resistance
Recall that a statistic is resistant if its value is not affected
much by extreme data values.
The standard deviation is not resistant.
That is, the standard deviation is affected by extreme data
values.
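A small sketch can make the lack of resistance concrete. The second data set below is hypothetical: it reuses the battery lifetimes but replaces the value 6 with an invented extreme value of 60, as might happen with a recording error.

```python
import statistics

# Battery lifetimes in hours, from the earlier example.
lifetimes = [3, 4, 6, 5, 4, 2]

# Hypothetical data set: the value 6 has been replaced by an extreme value, 60.
with_extreme = [3, 4, 60, 5, 4, 2]

s_original = statistics.stdev(lifetimes)
s_extreme = statistics.stdev(with_extreme)

print(round(s_original, 3))  # 1.414
print(round(s_extreme, 1))   # roughly 23 hours
```

Changing a single value moves the standard deviation from about 1.4 hours to roughly 23 hours, which is the behavior the term "not resistant" describes.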
Objective 4
Approximate the standard deviation using
grouped data
Approximating the Standard Deviation
Sometimes we don’t have access to the raw data in a data
set, but we are given a frequency distribution. In these cases
we can approximate the standard deviation.
Approximating the Standard Deviation
Following is the procedure for approximating the standard deviation:
Step 1: Compute the midpoint of each class and approximate the mean of the
frequency distribution.
Step 2: For each class, subtract the mean from the class midpoint to obtain
(Midpoint – Mean).
Step 3: For each class, square the differences obtained in Step 2 to obtain
(Midpoint – Mean)², and multiply by the frequency to obtain
(Midpoint – Mean)² × (Frequency).
Step 4: Add the products (Midpoint – Mean)² × (Frequency) over all classes.
Step 5: To compute the population variance, divide the sum obtained in Step 4 by
n. To compute the sample variance, divide the sum obtained in Step 4 by
n – 1.
Step 6: Take the square root of the variance obtained in Step 5. The result is the
standard deviation.
Example 3.16
The following table presents the number of text messages sent via
cell phone by a sample of 50 high school students. Approximate
the sample standard deviation of the number of messages sent.
Number of Text Messages Sent   Frequency
0 – 49                         10
50 – 99                         5
100 – 149                      13
150 – 199                      11
200 – 249                       7
250 – 299                       4
Solution
Step 1: Compute the midpoint of each class. Recall from
Section 3.1 that the sample mean was computed
as 137.
Number of Text Messages Sent   Class Midpoint
0 – 49                          25
50 – 99                         75
100 – 149                      125
150 – 199                      175
200 – 249                      225
250 – 299                      275
Solution
Step 2: For each class, subtract the mean from the class
midpoint to obtain (Midpoint – Mean).
Number of Text Messages Sent   Class Midpoint   (Midpoint – Mean)
0 – 49                          25              –112
50 – 99                         75               –62
100 – 149                      125               –12
150 – 199                      175                38
200 – 249                      225                88
250 – 299                      275               138
Solution
Step 3: For each class, square the differences obtained in Step 2 to
obtain (Midpoint – Mean)², and multiply by the frequency to
obtain (Midpoint – Mean)² × (Frequency).
Number of Text Messages Sent   Frequency   (Midpoint – Mean)   (Midpoint – Mean)² × (Frequency)
0 – 49                         10          –112                125,440
50 – 99                         5           –62                 19,220
100 – 149                      13           –12                  1,872
150 – 199                      11            38                 15,884
200 – 249                       7            88                 54,208
250 – 299                       4           138                 76,176
Solution
Step 4: Add the products (Midpoint – Mean)² × (Frequency) over all
classes.
$$\sum \left[ (\text{Midpoint} - \text{Mean})^2 \times \text{Frequency} \right] = 125{,}440 + 19{,}220 + 1{,}872 + 15{,}884 + 54{,}208 + 76{,}176 = 292{,}800$$
Solution
Step 5: Since we are computing the sample variance, we
divide the sum obtained in Step 4 by n – 1.
$$s^2 = \frac{\sum \left[ (\text{Midpoint} - \text{Mean})^2 \times \text{Frequency} \right]}{n - 1} = \frac{292{,}800}{50 - 1} = 5975.51020$$
Step 6: Take the square root of the variance to obtain the
standard deviation.
$$s = \sqrt{s^2} = \sqrt{5975.51020} = 77.30142$$
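The six steps of this example can be sketched in Python. The list and variable names are illustrative; the midpoints and frequencies are copied from the tables above, and the midpoints are entered directly rather than derived from the class limits.

```python
import math

# Class midpoints and frequencies from Example 3.16.
midpoints = [25, 75, 125, 175, 225, 275]
frequencies = [10, 5, 13, 11, 7, 4]

n = sum(frequencies)  # sample size

# Step 1: approximate the mean from the midpoints and frequencies.
mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n

# Steps 2-4: sum of (Midpoint - Mean)^2 x Frequency over all classes.
weighted_sum = sum((m - mean) ** 2 * f for m, f in zip(midpoints, frequencies))

# Step 5: sample variance, dividing by n - 1.
sample_variance = weighted_sum / (n - 1)

# Step 6: square root of the variance.
s = math.sqrt(sample_variance)

print(n, mean, weighted_sum)  # 50 137.0 292800.0
print(round(s, 5))            # 77.30142
```

For a population rather than a sample, Step 5 would divide by `n` instead of `n - 1`.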
Grouped Data on the TI-84 PLUS
The same procedure used to compute the mean for grouped data in a
frequency distribution may be used to compute the standard deviation.
Enter the midpoint for each class into L1 and the corresponding frequencies into
L2. Next, select the 1-Var Stats command and enter L1 in the List field and L2
in the FreqList field, if using Stats Wizards. If you are not using Stats
Wizards, you may run the 1-Var Stats command followed by L1, comma, L2.
Example 3.16 using TI-84 PLUS
Class Midpoint   Frequency
 25              10
 75               5
125              13
175              11
225               7
275               4
In the TI-84 PLUS output for Example 3.16, the value of s represents
the approximate sample standard deviation. In this example,
s = 77.30142.
Objective 5
Use the Empirical Rule to summarize data that
are unimodal and approximately symmetric
Bell-Shaped Histogram
Many histograms have a single mode near the center of the
data, and are approximately symmetric. Such histograms
are often referred to as bell-shaped.
The Empirical Rule
When a data set has a bell-shaped histogram, it is often possible to use the
standard deviation to provide an approximate description of the data using a
rule known as The Empirical Rule.
When a population has a histogram that is approximately bell-shaped, then:
• Approximately 68% of the data will be within one standard deviation of the
mean. In other words, approximately 68% of the data will be in the interval
µ – σ to µ + σ.
• Approximately 95% of the data will be within two standard deviations of the
mean. In other words, approximately 95% of the data will be in the interval
µ – 2σ to µ + 2σ.
• All, or almost all, of the data will be within three standard deviations of the
mean. In other words, all, or almost all, of the data will be in the interval
µ – 3σ to µ + 3σ.
Example 3.17
The following table presents the U.S. Census Bureau projection for the percentage of
the population aged 65 and over for each state and the District of Columbia.
Compute the population mean and standard deviation and use The Empirical Rule to
describe the data.
State            Percent   State            Percent   State            Percent
Alabama          14.1      Alaska            8.1      Arizona          13.9
Arkansas         14.3      California       11.5      Colorado         10.7
Connecticut      14.4      Delaware         14.1      D.C.             11.5
Florida          17.8      Georgia          10.2      Hawaii           14.3
Idaho            12        Illinois         12.4      Indiana          12.7
Iowa             14.9      Kansas           13.4      Kentucky         13.1
Louisiana        12.6      Maine            15.6      Maryland         12.2
Massachusetts    13.7      Michigan         12.8      Minnesota        12.4
Mississippi      12.8      Missouri         13.9      Montana          15
Nebraska         13.8      Nevada           12.3      New Hampshire    12.6
New Jersey       13.7      New Mexico       14.1      New York         13.6
North Carolina   12.4      North Dakota     15.3      Ohio             13.7
Oklahoma         13.8      Oregon           13        Pennsylvania     15.5
Rhode Island     14.1      South Carolina   13.6      South Dakota     14.6
Tennessee        13.3      Texas            10.5      Utah              9
Vermont          14.3      Virginia         12.4      Washington       12.2
West Virginia    16        Wisconsin        13.5      Wyoming          14
Solution
We first note that the histogram is
approximately bell-shaped.
We may use the TI-84 PLUS Calculator – or other technology – to compute the
population mean and population standard deviation.
Mean: µ = 13.24901961
Standard Deviation: σ = 1.682711694
Solution
Next, we compute the quantities:
$$\mu - \sigma = 13.24901961 - 1.682711694 = 11.57$$
$$\mu + \sigma = 13.24901961 + 1.682711694 = 14.93$$
Approximately 68% of the data values are between these.
$$\mu - 2\sigma = 13.24901961 - 2(1.682711694) = 9.88$$
$$\mu + 2\sigma = 13.24901961 + 2(1.682711694) = 16.61$$
Approximately 95% of the data values are between these.
$$\mu - 3\sigma = 13.24901961 - 3(1.682711694) = 8.20$$
$$\mu + 3\sigma = 13.24901961 + 3(1.682711694) = 18.30$$
Almost all of the data values are between these.
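These intervals, and the share of the 51 values that actually fall inside each one, can be checked with a short Python sketch. The variable names are illustrative; the values are copied from the table in Example 3.17, and `statistics.pstdev` is used because the example treats the data as a population.

```python
import statistics

# Percentage of the population aged 65 and over, one value per state plus D.C.,
# copied from the table in Example 3.17.
percent_65_plus = [
    14.1, 14.3, 14.4, 17.8, 12, 14.9, 12.6, 13.7, 12.8, 13.8, 13.7, 12.4, 13.8,
    14.1, 13.3, 14.3, 16, 8.1, 11.5, 14.1, 10.2, 12.4, 13.4, 15.6, 12.8, 13.9,
    12.3, 14.1, 15.3, 13, 13.6, 10.5, 12.4, 13.5, 13.9, 10.7, 11.5, 14.3, 12.7,
    13.1, 12.2, 12.4, 15, 12.6, 13.6, 13.7, 15.5, 14.6, 9, 12.2, 14,
]

n = len(percent_65_plus)
mu = sum(percent_65_plus) / n
sigma = statistics.pstdev(percent_65_plus)  # population standard deviation

# Count how many values lie within k standard deviations of the mean.
within = {}
for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    within[k] = sum(low <= x <= high for x in percent_65_plus)
    share = 100 * within[k] / n
    print(f"k={k}: {low:.2f} to {high:.2f} holds {within[k]} of {n} values ({share:.0f}%)")
```

The observed shares will not match 68% and 95% exactly, which is consistent with the rule being stated as an approximation for bell-shaped data.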
Objective 6
Use Chebyshev’s Inequality to describe a data
set
Any Data Set
When a distribution is bell-shaped, we use The Empirical
Rule to approximate the proportion of data within one or
two standard deviations of the mean. Another rule, called
Chebyshev’s Inequality, holds for any data set.
Chebyshev’s Inequality
In any data set, the proportion of the data that is within K standard
deviations of the mean is at least 1 – 1/K². Specifically, by setting
K = 2 or K = 3, we obtain the following results.
• At least 3/4 (75%) of the data are within two standard
deviations of the mean.
• At least 8/9 (89%) of the data are within three standard
deviations of the mean.
Example 3.20
As part of a public health study, systolic blood pressure was measured for a
large group of people. The mean was 120 and the standard deviation was 10.
What information does Chebyshev’s Inequality provide about these data?
Solution
We compute the following:
$$\bar{x} - 2s = 120 - 2(10) = 100$$
$$\bar{x} + 2s = 120 + 2(10) = 140$$
$$\bar{x} - 3s = 120 - 3(10) = 90$$
$$\bar{x} + 3s = 120 + 3(10) = 150$$
We conclude:
• At least 3/4 (75%) of the people had systolic blood pressures between 100
and 140.
• At least 8/9 (89%) of the people had systolic blood pressures between 90
and 150.
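The bound and the two intervals can be reproduced with a short Python sketch. The function name `chebyshev_bound` is an illustrative choice, and rejecting K ≤ 1 is an assumption of this sketch: for such K the expression 1 – 1/K² is zero or negative, so the bound says nothing useful.

```python
def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        # Assumption of this sketch: the bound is only informative for k > 1.
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


# Values from Example 3.20: mean 120, standard deviation 10.
mean, sd = 120, 10

results = {}
for k in (2, 3):
    lower, upper = mean - k * sd, mean + k * sd
    results[k] = (lower, upper, chebyshev_bound(k))
    print(f"At least {chebyshev_bound(k):.0%} of values lie between {lower} and {upper}")
```

Because the inequality holds for any data set, these are guaranteed minimums; bell-shaped data will usually have considerably more of its values inside each interval.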
Objective 7
Compute the coefficient of variation
Coefficient of Variation
The coefficient of variation (CV for short) tells how large
the standard deviation is relative to the mean. It can be
used to compare the spreads of data sets whose values
have different units.
The coefficient of variation is found by dividing the standard
deviation by the mean.
$$\text{CV} = \frac{\sigma}{\mu}$$
Example 3.21
National Weather Service records show that over a thirty-year
period, the annual precipitation in Atlanta, Georgia had a mean
of 49.8 inches with a standard deviation of 7.6 inches, and the
annual temperature had a mean of 62.2 degrees Fahrenheit with
a standard deviation of 1.3 degrees. Compute the coefficient of
variation for precipitation and for temperature. Which has greater
spread relative to its mean?
Solution
We compute the following:
$$\text{CV for precipitation} = \frac{\text{standard deviation for precipitation}}{\text{mean precipitation}} = \frac{7.6}{49.8} = 0.15$$
$$\text{CV for temperature} = \frac{\text{standard deviation for temperature}}{\text{mean temperature}} = \frac{1.3}{62.2} = 0.02$$
The CV for precipitation is larger than the CV for temperature. Therefore
precipitation has a greater spread relative to its mean.
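The comparison can be sketched in Python as follows. The function name `coefficient_of_variation` is an illustrative choice; the means and standard deviations are the Atlanta figures quoted in Example 3.21.

```python
def coefficient_of_variation(standard_deviation, mean):
    """Return the standard deviation divided by the mean."""
    return standard_deviation / mean


# Atlanta figures from Example 3.21.
cv_precipitation = coefficient_of_variation(7.6, 49.8)   # inches
cv_temperature = coefficient_of_variation(1.3, 62.2)     # degrees Fahrenheit

print(round(cv_precipitation, 2))  # 0.15
print(round(cv_temperature, 2))    # 0.02
```

Because each ratio divides a quantity by another quantity in the same units, the units cancel, which is what makes it possible to compare inches with degrees.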
Do You Know…
• How to compute the range of a data set?
• How to compute the variance of a population and a sample and the appropriate notation?
• How to compute the standard deviation of a population and a sample and the appropriate notation?
• How to approximate the standard deviation using grouped data?
• How to use the Empirical Rule to summarize data?
• How to use Chebyshev’s Inequality to describe a data set?
• How to compute the coefficient of variation?