9781133105060_App_C1.qxp
12/27/11
1:31 PM
Page C1
Appendix C
Further Concepts in Statistics
C.1 Representing Data and Linear Modeling
■ Use stem-and-leaf plots to organize and compare sets of data.
■ Use histograms and frequency distributions to organize and represent data.
■ Use scatter plots to represent and analyze data.
■ Fit lines to data.
Stem-and-Leaf Plots
Statistics is the branch of mathematics that studies techniques for collecting, organizing, and interpreting data. In this section, you will study several ways to organize and
interpret data.
One type of plot that can be used to organize sets of numbers by hand is a stem-and-leaf plot. A set of test scores and the corresponding stem-and-leaf plot are shown
below.
Test Scores
93, 70, 76, 58, 86, 93, 82, 78, 83, 86,
64, 78, 76, 66, 83, 83, 96, 74, 69, 76,
64, 74, 79, 76, 88, 76, 81, 82, 74, 70
Stems | Leaves
5     | 8
6     | 4 4 6 9
7     | 0 0 4 4 4 6 6 6 6 6 8 8 9
8     | 1 2 2 3 3 3 6 6 8
9     | 3 3 6
Key: 5 | 8 = 58
Note from the key in the stem-and-leaf plot that the leaves represent the units digits of
the numbers and the stems represent the tens digits. Stem-and-leaf plots can also be
used to compare two sets of data, as shown in the following example.
Example 1
Comparing Two Sets of Data
Use a stem-and-leaf plot to compare the test scores given above with the test scores
below. Which set of test scores is better?
90, 81, 70, 62, 64, 73, 81, 92, 73, 81, 92, 93, 83, 75, 76,
83, 94, 96, 86, 77, 77, 86, 96, 86, 77, 86, 87, 87, 79, 88
SOLUTION
Begin by ordering the second set of scores.
62, 64, 70, 73, 73, 75, 76, 77, 77, 77, 79, 81, 81, 81, 83,
83, 86, 86, 86, 86, 87, 87, 88, 90, 92, 92, 93, 94, 96, 96
Now that the data have been ordered, you can construct a double stem-and-leaf plot by
letting the leaves to the right of the stems represent the units digits for the first group of
test scores and letting the leaves to the left of the stems represent the units digits for the
second group of test scores.
     Leaves (2nd Group) | Stems | Leaves (1st Group)
                        |   5   | 8
                    4 2 |   6   | 4 4 6 9
      9 7 7 7 6 5 3 3 0 |   7   | 0 0 4 4 4 6 6 6 6 6 8 8 9
8 7 7 6 6 6 6 3 3 1 1 1 |   8   | 1 2 2 3 3 3 6 6 8
          6 6 4 3 2 2 0 |   9   | 3 3 6
By comparing the two sets of leaves, you can see that the second group of test scores is
better than the first group.
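The construction in Example 1 can also be sketched in code. The following is a minimal illustration, assuming whole-number scores; the function name `stem_and_leaf` and the variable names are hypothetical and do not come from the text.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (units digits)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

first = [93, 70, 76, 58, 86, 93, 82, 78, 83, 86,
         64, 78, 76, 66, 83, 83, 96, 74, 69, 76,
         64, 74, 79, 76, 88, 76, 81, 82, 74, 70]
second = [90, 81, 70, 62, 64, 73, 81, 92, 73, 81, 92, 93, 83, 75, 76,
          83, 94, 96, 86, 77, 77, 86, 96, 86, 77, 86, 87, 87, 79, 88]

right = stem_and_leaf(first)   # leaves to the right of the stems
left = stem_and_leaf(second)   # leaves to the left of the stems

# Print a double stem-and-leaf plot; left-hand leaves read outward from the stem.
for stem in sorted(set(left) | set(right)):
    l = " ".join(str(d) for d in reversed(left.get(stem, [])))
    r = " ".join(str(d) for d in right.get(stem, []))
    print(f"{l:>23} | {stem} | {r}")
```

Counting leaves by stem shows the second group has 19 scores of 80 or above, compared with 12 in the first group, which supports the conclusion of the example.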
Example 2
Using a Stem-and-Leaf Plot
The table below shows the percent of the population of each state and the District
of Columbia that was at least 65 years old in July 2009. Use a stem-and-leaf plot to
organize the data. (Source: U.S. Census Bureau)
AL 13.8    AK  7.6    AZ 13.1    AR 14.3    CA 11.2
CO 10.6    CT 13.9    DE 14.3    DC 11.7    FL 17.2
GA 10.3    HI 14.5    ID 12.1    IL 12.4    IN 12.9
IA 14.8    KS 13.0    KY 13.2    LA 12.3    ME 15.6
MD 12.2    MA 13.6    MI 13.4    MN 12.7    MS 12.8
MO 13.7    MT 14.6    NE 13.4    NV 11.6    NH 13.5
NJ 13.5    NM 13.0    NY 13.4    NC 12.7    ND 14.7
OH 13.9    OK 13.5    OR 13.5    PA 15.4    RI 14.3
SC 13.7    SD 14.5    TN 13.4    TX 10.3    UT  9.0
VT 14.5    VA 12.2    WA 12.1    WV 15.8    WI 13.5
WY 12.3
SOLUTION Begin by ordering the numbers, as shown below.
7.6, 9.0, 10.3, 10.3, 10.6, 11.2, 11.6, 11.7, 12.1, 12.1, 12.2,
12.2, 12.3, 12.3, 12.4, 12.7, 12.7, 12.8, 12.9, 13.0, 13.0, 13.1,
13.2, 13.4, 13.4, 13.4, 13.4, 13.5, 13.5, 13.5, 13.5, 13.5, 13.6,
13.7, 13.7, 13.8, 13.9, 13.9, 14.3, 14.3, 14.3, 14.5, 14.5, 14.5,
14.6, 14.7, 14.8, 15.4, 15.6, 15.8, 17.2
Next, construct a stem-and-leaf plot using the leaves to represent the digits to the right
of the decimal points.
Stems | Leaves
7     | 6                                          Alaska has the lowest percent.
8     |
9     | 0
10    | 3 3 6
11    | 2 6 7
12    | 1 1 2 2 3 3 4 7 7 8 9
13    | 0 0 1 2 4 4 4 4 5 5 5 5 5 6 7 7 8 9 9
14    | 3 3 3 5 5 5 6 7 8
15    | 4 6 8
16    |
17    | 2                                          Florida has the highest percent.
Key: 7 | 6 = 7.6
Histograms and Frequency Distributions
When you want to organize large sets of data, such as those given in Example 2, it
is useful to group the data into intervals and plot the frequency of the data in each
interval. For instance, the frequency distribution and histogram shown in Figure C.1
represent the data given in Example 2.
Frequency Distribution
Interval    Tally
[6, 8)      |
[8, 10)     |
[10, 12)    |||| |
[12, 14)    |||| |||| |||| |||| |||| ||||
[14, 16)    |||| |||| ||
[16, 18)    |
Histogram (vertical axis: number of states, including District of Columbia; horizontal axis: percent of population 65 or older)
FIGURE C.1
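The tallies in the frequency distribution can be checked with a short script. This is a sketch only, assuming half-open intervals of width 2 that start at 6, as in Figure C.1; the helper name `frequency_distribution` is hypothetical.

```python
# Ordered percents from Example 2 (50 states and the District of Columbia).
percents = [
    7.6, 9.0, 10.3, 10.3, 10.6, 11.2, 11.6, 11.7, 12.1, 12.1, 12.2,
    12.2, 12.3, 12.3, 12.4, 12.7, 12.7, 12.8, 12.9, 13.0, 13.0, 13.1,
    13.2, 13.4, 13.4, 13.4, 13.4, 13.5, 13.5, 13.5, 13.5, 13.5, 13.6,
    13.7, 13.7, 13.8, 13.9, 13.9, 14.3, 14.3, 14.3, 14.5, 14.5, 14.5,
    14.6, 14.7, 14.8, 15.4, 15.6, 15.8, 17.2,
]

def frequency_distribution(data, start, width, bins):
    """Count the values falling in [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for x in data:
        index = int((x - start) // width)
        if 0 <= index < bins:
            counts[index] += 1
    return counts

counts = frequency_distribution(percents, start=6, width=2, bins=6)
for i, c in enumerate(counts):
    low = 6 + 2 * i
    print(f"[{low}, {low + 2})  {'|' * c}  ({c})")
```

The printed counts correspond to the bar heights of the histogram.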
A histogram has a portion of a real number line as its horizontal axis. A bar graph
is similar to a histogram, except that the rectangles (bars) can be either horizontal or
vertical and the labels of the bars are not necessarily numbers.
Another difference between a bar graph and a histogram is that the bars in a bar
graph are usually separated by spaces, whereas the bars in a histogram are not separated
by spaces.
Example 3
Constructing a Bar Graph
The data below give the average monthly precipitation (in inches) in Houston, Texas.
Construct a bar graph for these data. What can you conclude? (Source: National
Climatic Data Center)
January   3.7    February  3.0    March      3.4
April     3.6    May       5.2    June       5.4
July      3.2    August    3.8    September  4.3
October   4.5    November  4.2    December   3.7
SOLUTION To create a bar graph, begin by drawing a vertical axis to represent the
precipitation and a horizontal axis to represent the months. The bar graph is shown in
Figure C.2. From the graph, you can see that Houston receives a fairly consistent
amount of rain throughout the year—the driest month tends to be February and the
wettest month tends to be June.
Bar graph (vertical axis: average monthly precipitation, in inches; horizontal axis: month)
FIGURE C.2
Scatter Plots
Scatter plot (vertical axis: cable revenue R, in billions of dollars; horizontal axis: year t, with 2 ↔ 2002)
FIGURE C.3
Many real-life situations involve finding relationships between two variables, such as
the year and the total revenue of the cable television industry. In a typical situation, data
are collected and written as a set of ordered pairs. The graph of such a set is called a
scatter plot.
From the scatter plot in Figure C.3, it appears that the points describe a relationship
that is nearly linear. (The relationship is not exactly linear because the total revenue did
not increase by precisely the same amount each year.) A mathematical equation that
approximates the relationship between t and R is called a mathematical model. When
developing a mathematical model, you strive for two (often conflicting) goals—accuracy
and simplicity. For the data in Figure C.3, a linear model of the form R = at + b
appears to be best. It is simple and relatively accurate.
Consider a collection of ordered pairs of the form (x, y). If y tends to increase as x
increases, the collection is said to have a positive correlation. If y tends to decrease as
x increases, the collection is said to have a negative correlation. Figure C.4, on the next
page, shows three examples: one with a positive correlation, one with a negative
correlation, and one with no (discernible) correlation.
Three scatter plots: Positive Correlation, Negative Correlation, No Correlation
FIGURE C.4
Fitting a Line to Data
Finding a linear model that represents the relationship described by a scatter plot is
called fitting a line to data. You can do this graphically by simply sketching the line
that appears to fit the points, finding two points on the line, and then finding the
equation of the line that passes through the two points.
Example 4
Fitting a Line to Data
Find a linear model that relates the year with the total revenue R (in billions of dollars)
for the U.S. cable television industry for the years 2002 through 2009. (Source: SNL
Kagan)
Year         2002   2003   2004   2005   2006   2007   2008   2009
Revenue, R   48.0   53.2   58.6   64.9   71.9   78.9   85.2   89.5
SOLUTION Let t represent the year, with t = 2 corresponding to 2002. After plotting
the data from the table, draw the line that you think best represents the data, as shown
in Figure C.5. Two points that lie on this line are (3, 53.2) and (8, 85.2). Using the
point-slope form, you can find the equation of the line to be
\[ R = \left(\frac{85.2 - 53.2}{8 - 3}\right)(t - 3) + 53.2 = 6.4t + 34. \qquad \text{Linear model} \]
STUDY TIP The model in Example 4 is based on the two data points chosen. If different
points are chosen, the model may change somewhat. For instance, if you choose (2, 48.0)
and (6, 71.9), the new model is R = 5.975t + 36.05.
Scatter plot with fitted line (vertical axis: cable revenue R, in billions of dollars; horizontal axis: year t, with 2 ↔ 2002)
FIGURE C.5
Once you have found a model, you can measure how well the model fits the data
by comparing the actual values with the values given by the model, as shown in the
table below.
t          2      3      4      5      6      7      8      9
Actual R   48.0   53.2   58.6   64.9   71.9   78.9   85.2   89.5
Model R    46.8   53.2   59.6   66.0   72.4   78.8   85.2   91.6
The sum of the squares of the differences between the actual values and the model
values is the sum of the squared differences. The model that has the least sum is called
the least squares regression line for the data. For the model in Example 4, the sum of
the squared differences is 8.32. The least squares regression line for the data is
R = 6.17t + 34.8.    Best-fitting linear model
Its sum of squared differences is approximately 4.69.
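The two sums quoted above can be reproduced numerically. The sketch below assumes the data of Example 4; the helper names `line_through` and `sum_squared_differences` are hypothetical and chosen only for illustration.

```python
years = [2, 3, 4, 5, 6, 7, 8, 9]                        # t = 2 corresponds to 2002
revenue = [48.0, 53.2, 58.6, 64.9, 71.9, 78.9, 85.2, 89.5]

def line_through(p, q):
    """Return slope and intercept of the line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

def sum_squared_differences(a, b, xs, ys):
    """Sum of squared differences between actual values and the model a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Model sketched by eye through (3, 53.2) and (8, 85.2).
a, b = line_through((3, 53.2), (8, 85.2))
ssd_sketch = sum_squared_differences(a, b, years, revenue)

# Least squares regression line quoted in the text (rounded coefficients).
ssd_best = sum_squared_differences(6.17, 34.8, years, revenue)

print(f"R = {a:.1f}t + {b:.1f}, sum of squared differences = {ssd_sketch:.2f}")
print(f"R = 6.17t + 34.8, sum of squared differences = {ssd_best:.2f}")
```

Swapping in the points (2, 48.0) and (6, 71.9) from the Study Tip gives a different line, which can be compared in the same way.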
Least Squares Regression Line
The least squares regression line for the points (x_1, y_1), (x_2, y_2), (x_3, y_3), . . . , (x_n, y_n) is
given by y = ax + b. The slope a and y-intercept b are given by
\[ a = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \quad \text{and} \quad b = \frac{1}{n}\left(\sum_{i=1}^{n} y_i - a\sum_{i=1}^{n} x_i\right). \]
Example 5
Finding the Least Squares Regression Line
Find the least squares regression line for the points (−3, 0), (−1, 1), (0, 2), and (2, 3).
SOLUTION Begin by constructing a table like the one shown below.
x      y     xy    x^2
−3     0     0     9
−1     1     −1    1
0      2     0     0
2      3     6     4
\[ \sum_{i=1}^{n} x_i = -2 \qquad \sum_{i=1}^{n} y_i = 6 \qquad \sum_{i=1}^{n} x_i y_i = 5 \qquad \sum_{i=1}^{n} x_i^2 = 14 \]
Applying the formulas for the least squares regression line with n = 4 produces
\[ a = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = \frac{4(5) - (-2)(6)}{4(14) - (-2)^2} = \frac{32}{52} = \frac{8}{13} \]
and
\[ b = \frac{1}{n}\left(\sum_{i=1}^{n} y_i - a\sum_{i=1}^{n} x_i\right) = \frac{1}{4}\left[6 - \frac{8}{13}(-2)\right] = \frac{94}{52} = \frac{47}{26}. \]
So, the least squares regression line is y = (8/13)x + 47/26, shown in Figure C.6.
Graph of the line y = (8/13)x + 47/26
FIGURE C.6
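The computation in Example 5 can be checked with exact rational arithmetic. This is a minimal sketch of the slope and intercept formulas stated in this section; the function name `least_squares` is hypothetical.

```python
from fractions import Fraction

def least_squares(points):
    """Return (a, b) for the least squares line y = a*x + b, as exact fractions."""
    n = len(points)
    sx = sum(Fraction(x) for x, _ in points)        # sum of x_i
    sy = sum(Fraction(y) for _, y in points)        # sum of y_i
    sxy = sum(Fraction(x) * y for x, y in points)   # sum of x_i * y_i
    sxx = sum(Fraction(x) ** 2 for x, _ in points)  # sum of x_i squared
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b = (sy - a * sx) / n
    return a, b

a, b = least_squares([(-3, 0), (-1, 1), (0, 2), (2, 3)])
print(f"y = ({a})x + {b}")
```

Using `Fraction` rather than floating-point numbers keeps the result in the same form as the hand computation, so the two can be compared directly.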
Many graphing utilities have built-in least squares regression programs. If your
calculator has such a program, try using it to duplicate the results shown in the following example.
Example 6
Finding the Least Squares Regression Line
The ordered pairs (w, h) shown below represent the shoe sizes w and the heights h (in
inches) of 25 men. Use the regression feature of a graphing utility to find the least
squares regression line for these data.
(10.0, 70.5), (10.5, 71.0), (9.5, 69.0), (11.0, 72.0), (12.0, 74.0),
(8.5, 67.0), (9.0, 68.5), (13.0, 76.0), (10.5, 71.5), (10.5, 70.5),
(10.0, 71.0), (9.5, 70.0), (10.0, 71.0), (10.5, 71.0), (11.0, 71.5),
(12.0, 73.5), (12.5, 75.0), (11.0, 72.0), (9.0, 68.0), (10.0, 70.0),
(13.0, 75.5), (10.5, 72.0), (10.5, 71.0), (11.0, 73.0), (8.5, 67.5)
SOLUTION Enter the data into a graphing utility. Then, use the regression feature
of the graphing utility to obtain the model shown in Figure C.7. So, the least squares
regression line for the data is
h = 1.84w + 51.9.
In Figure C.8, this line is plotted with the data. Notice that the plot does not have
25 points because some of the ordered pairs graph as the same point.
FIGURE C.7 (graphing utility screen showing the regression model)
FIGURE C.8 (graphing utility plot of the data and the regression line; viewing window 8 to 14 by 50 to 90)
When you use a graphing utility or a computer program to find the least squares
regression line for a set of data, the output may include an r-value. For instance, the r-value
from Example 6 was r ≈ 0.981. This number is called the correlation coefficient
of the data and gives a measure of how well the model fits the data. Correlation
coefficients vary between −1 and 1. Basically, the closer |r| is to 1, the better the points
can be described by a line. Three examples are shown in Figure C.9.
Three scatter plots: r = 0.981, r = −0.866, r = 0.190
FIGURE C.9
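For readers without a graphing utility, the results of Example 6 can be approximated in code. The sketch below uses the slope and intercept formulas from this section. The formula for r is the standard Pearson correlation coefficient, which this section does not state, so treat it as an assumption about what the graphing utility reports. The function name `regression_and_r` is hypothetical.

```python
from math import sqrt

# Shoe size w and height h (in inches) for the 25 men in Example 6.
pairs = [
    (10.0, 70.5), (10.5, 71.0), (9.5, 69.0), (11.0, 72.0), (12.0, 74.0),
    (8.5, 67.0), (9.0, 68.5), (13.0, 76.0), (10.5, 71.5), (10.5, 70.5),
    (10.0, 71.0), (9.5, 70.0), (10.0, 71.0), (10.5, 71.0), (11.0, 71.5),
    (12.0, 73.5), (12.5, 75.0), (11.0, 72.0), (9.0, 68.0), (10.0, 70.0),
    (13.0, 75.5), (10.5, 72.0), (10.5, 71.0), (11.0, 73.0), (8.5, 67.5),
]

def regression_and_r(points):
    """Return slope a, intercept b, and correlation coefficient r."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    syy = sum(y * y for _, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b = (sy - a * sx) / n
    # Pearson correlation coefficient (assumed definition of the r-value).
    r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return a, b, r

a, b, r = regression_and_r(pairs)
print(f"h = {a:.2f}w + {b:.1f}, r = {r:.3f}")
print("distinct points:", len(set(pairs)))
```

The count of distinct points is smaller than 25, which matches the remark that some ordered pairs graph as the same point.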
Exercises C.1
Exam Scores In Exercises 1 and 2, use the data below
which represent the scores on two 100-point exams for
a math class of 30 students. See Examples 1 and 2.
Exam #1: 77, 100, 77, 70, 83, 89, 87, 85, 81, 84, 81, 78,
89, 78, 88, 85, 90, 92, 75, 81, 85, 100, 98, 81, 78, 75, 85,
89, 82, 75
Exam #2: 76, 78, 73, 59, 70, 81, 71, 66, 66, 73, 68, 67,
63, 67, 77, 84, 87, 71, 78, 78, 90, 80, 77, 70, 80, 64, 74,
68, 68, 68
1. Use a stem-and-leaf plot to organize the scores for
Exam #1.
2. Construct a double stem-and-leaf plot to compare the
scores for Exam #1 and Exam #2. Which set of test
scores is better?
3. Cancer Incidence The table shows the estimated
numbers of new cancer cases (in thousands) in the 50
states in 2010. Use a stem-and-leaf plot to organize the
data. (Source: American Cancer Society, Inc.)
AL 24     AK 3      AZ 30     AR 15     CA 157
CO 21     CT 21     DE 5      FL 107    GA 40
HI 7      ID 7      IL 64     IN 33     IA 17
KS 14     KY 24     LA 21     ME 9      MD 28
MA 36     MI 56     MN 25     MS 14     MO 31
MT 6      NE 9      NV 12     NH 8      NJ 48
NM 9      NY 103    NC 45     ND 3      OH 64
OK 19     OR 21     PA 75     RI 6      SC 23
SD 4      TN 33     TX 101    UT 10     VT 4
VA 36     WA 35     WV 11     WI 30     WY 3
4. Snowfall The data give the seasonal snowfalls (in
inches) for Lincoln, Nebraska for the seasons 1970–1971
through 2009–2010 (the amounts are listed in order by
year). Use a frequency distribution and a histogram to
organize the data. (Source: National Weather Service)
49.0, 21.6, 29.2, 33.6, 42.1, 21.1, 21.8, 31.0, 34.4, 23.3,
13.0, 32.3, 38.0, 47.5, 21.5, 18.9, 15.7, 13.0, 19.1, 18.7,
25.8, 23.8, 32.1, 21.3, 21.7, 30.7, 29.0, 44.6, 24.4, 11.7,
37.9, 29.5, 31.7, 35.9, 16.3, 19.5, 31.0, 20.4, 19.2, 41.6
5. Bus Fares The data below give the base prices of bus
fare in selected U.S. cities. Construct a bar graph for
these data. (Source: American Public Transportation
Association)
Seattle      $2.25    Atlanta  $2.00
Houston      $1.25    Dallas   $1.75
New York     $2.25    Denver   $2.25
Los Angeles  $1.50    Chicago  $2.00
6. Melanoma Incidence The data below give the places
of origin and the estimated numbers of new melanoma
cases in 2010. Construct a bar graph for these data.
(Source: American Cancer Society, Inc.)
California  8030    Florida     4980
Michigan    2240    New York    4050
Texas       3570    Washington  1930
Crop Yield In Exercises 7–10, use the data in the table,
where x is the number of units of fertilizer applied to
sample plots and y is the yield (in bushels) of a crop.
x   0    1    2    3    4    5    6    7    8
y   58   60   59   61   63   66   65   67   70
7. Sketch a scatter plot of the data.
8. Determine whether the points are positively correlated,
are negatively correlated, or have no discernible
correlation.
9. Sketch a linear model that you think best represents the
data. Find an equation of the line you sketched. Use the
line to predict the yield when 10 units of fertilizer are
used.
10. Can the model found in Exercise 9 be used to predict
yields for arbitrarily large values of x? Explain.
Speed of Sound In Exercises 11–14, use the data in
the table, where h is altitude (in thousands of feet) and v
is the speed of sound (in feet per second).
h   0      5      10     15     20     25     30    35
v   1116   1097   1077   1057   1037   1016   995   973
11. Sketch a scatter plot of the data.
12. Determine whether the points are positively correlated, are
negatively correlated, or have no discernible correlation.
13. Sketch a linear model that you think best represents the
data. Find an equation of the line you sketched. Use the
line to estimate the speed of sound at an altitude of
27,000 feet.
14. The speed of sound at an altitude of 70,000 feet is
approximately 968 feet per second. What does this
suggest about the validity of using the model in Exercise
13 to extrapolate beyond the data given in the table?
Fitting Lines to Data In Exercises 15 and 16, (a) sketch
a scatter plot of the points, (b) find an equation of the
linear model you think best represents the data and find
the sum of the squared differences, and (c) use the formulas in this section to find the least squares regression
line for the data and the sum of the squared differences.
See Examples 4 and 5.
15. (−1, 0), (0, 1), (1, 3), (2, 3)
16. (0, 4), (1, 3), (2, 2), (4, 1)
Finding the Least Squares Regression Line In
Exercises 17–20, sketch a scatter plot of the points, use
the formulas in this section to find the least squares
regression line for the data, and sketch the graph of the
line. See Example 5.
17. (−2, 0), (−1, 1), (0, 1), (2, 2)
18. (−3, 1), (−1, 2), (0, 2), (1, 3), (3, 5)
19. (1, 5), (2, 8), (3, 13), (4, 16), (5, 22), (6, 26)
20. (1, 10), (2, 8), (3, 8), (4, 6), (5, 5), (6, 3)
Finding the Least Squares Regression Line
In Exercises 21–24, use the regression feature of a
graphing utility to find the least squares regression
line for the data. Graph the data and the regression
line in the same viewing window. See Example 6.
21. (0, 23), (1, 20), (2, 19), (3, 17), (4, 15), (5, 11), (6, 10)
22. (4, 52.8), (5, 54.7), (6, 55.7), (7, 57.8), (8, 60.2), (9, 63.1),
(10, 66.5)
23. (−10, 5.1), (−5, 9.8), (0, 17.5), (2, 25.4), (4, 32.8), (6, 38.7),
(8, 44.2), (10, 50.5)
24. (−10, 213.5), (−5, 174.9), (0, 141.7), (5, 119.7), (8, 102.4),
(10, 87.6)
25. Advertising The management of a department store
ran an experiment to determine if a relationship existed
between sales S (in thousands of dollars) and the amount
spent on advertising x (in thousands of dollars). The
following data were collected.
x   1     2     3     4     5     6     7     8
S   405   423   455   466   492   510   525   559
(a) Use the regression feature of a graphing utility to
find the least squares regression line for the data.
Use the model to estimate sales when $4500 is
spent on advertising.
(b) Make a scatter plot of the data and sketch the graph
of the regression line.
(c) Use a graphing utility or computer to determine the
correlation coefficient.
26. Horses The table shows the heights (in hands) and
corresponding lengths (in inches) of horses in a stable.
Height   17   16   16.2   15.3   15.1   16.3
Length   77   73   74     71     69     75
(a) Use the regression feature of a graphing utility to
find the least squares regression line for the data.
Let h represent the height and l represent the length.
Use the model to predict the length of a horse that
is 15.5 hands tall.
(b) Make a scatter plot of the data and sketch the graph
of the regression line.
(c) Use a graphing utility or computer to determine the
correlation coefficient.
C.2 Measures of Central Tendency and Dispersion
■ Find and interpret the mean, median, and mode of a set of data.
■ Determine the measure of central tendency that best represents a set of data.
■ Find the standard deviation of a set of data.
Mean, Median, and Mode
In many real-life situations, it is helpful to describe data by a single number that is most
representative of the entire collection of numbers. Such a number is called a measure
of central tendency. Here are three of the most commonly used measures of central
tendency.
1. The mean, or average, of n numbers is the sum of the numbers divided by n.
2. The median of n numbers is the middle number when the numbers are written in
order. If n is even, the median is the average of the two middle numbers.
3. The mode of n numbers is the number that occurs most frequently. If two numbers
tie for most frequent occurrence, the collection has two modes and is called bimodal.
Example 1
Finding Measures of Central Tendency
You are interviewing for a job. The interviewer tells you that the average income of the
company’s 25 employees is $60,849. The actual annual incomes of the 25 employees
are shown. What are the mean, median, and mode of the incomes? Was the interviewer
telling you the truth?
$17,305,   $478,320,   $45,678,    $18,980,   $17,408,
$25,676,   $28,906,    $12,500,    $24,540,   $33,450,
$12,500,   $33,855,    $37,450,    $20,432,   $28,956,
$34,983,   $36,540,    $250,921,   $36,853,   $16,430,
$32,654,   $98,213,    $48,980,    $94,024,   $35,671
SOLUTION The mean of the incomes is
\[ \text{Mean} = \frac{17{,}305 + 478{,}320 + 45{,}678 + 18{,}980 + \cdots + 35{,}671}{25} = \frac{1{,}521{,}225}{25} = \$60{,}849. \]
To find the median, order the incomes.
$12,500,   $12,500,   $16,430,   $17,305,    $17,408,
$18,980,   $20,432,   $24,540,   $25,676,    $28,906,
$28,956,   $32,654,   $33,450,   $33,855,    $34,983,
$35,671,   $36,540,   $36,853,   $37,450,    $45,678,
$48,980,   $94,024,   $98,213,   $250,921,   $478,320
From this list, you can see the median income is $33,450. You can also see that $12,500
is the only income that occurs more than once. So, the mode is $12,500. Technically,
the interviewer was telling the truth because the average is (generally) defined to be the
mean. However, of the three measures of central tendency
Mean: $60,849
Median: $33,450
Mode: $12,500
it seems clear that the median is most representative. The mean is inflated by the two
highest salaries.
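The three measures in Example 1 can be verified with Python's standard `statistics` module. This is a sketch for checking the arithmetic, not part of the original solution; the variable names are illustrative.

```python
from statistics import mean, median, mode

# Annual incomes (in dollars) of the 25 employees in Example 1.
incomes = [
    17305, 478320, 45678, 18980, 17408,
    25676, 28906, 12500, 24540, 33450,
    12500, 33855, 37450, 20432, 28956,
    34983, 36540, 250921, 36853, 16430,
    32654, 98213, 48980, 94024, 35671,
]

print("mean:  ", mean(incomes))
print("median:", median(incomes))
print("mode:  ", mode(incomes))

# Dropping the two highest salaries shows how strongly they pull the mean upward.
trimmed = sorted(incomes)[:-2]
print("mean without the two highest salaries:", round(mean(trimmed)))
```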
Choosing a Measure of Central Tendency
Which of the three measures of central tendency is the most representative? The answer
is that it depends on the distribution of the data and the way in which you plan to use
the data.
For instance, in Example 1, the mean salary of $60,849 does not seem very representative to a potential employee. To a city income tax collector who wants to estimate
1% of the total income of the 25 employees, however, the mean is precisely the right
measure.
Example 2
Choosing a Measure of Central Tendency
Which measure of central tendency is the most representative for each situation based
on the data shown in the frequency distribution?
a. The data represent a student’s scores on 20 lab assignments worth 10 points each.
The professor will use the data to determine the lab grade for the student.
Score       5   6   7   8   9   10
Frequency   3   0   3   2   1   11
b. The data represent the ages of the students in a class. The professor will use the data
to determine the typical age of a student in the class.
Age         18   19   20   21   22   82
Frequency   7    5    5    6    4    1
SOLUTION
a. For these data, the mean is 8.55, the median is 10, and the mode is 10. Of these, the
mean is the most representative measure because it takes into account all of the student’s lab assignment scores.
b. For these data, the mean is about 22.04, the median is 20, and the mode is 18. The
mean is greater than most of the data, because it is affected by the extreme value of
82. The mode corresponds to the youngest age in the class. The median appears to
be the most representative measure, because it is central to most of the data.
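The values quoted in the solution can be recomputed from the two frequency distributions. The sketch below expands each table into a plain list before applying the standard `statistics` functions; the helper name `expand` is hypothetical.

```python
from statistics import mean, median, multimode

def expand(table):
    """Turn a {value: frequency} table into a flat list of values."""
    return [value for value, freq in table.items() for _ in range(freq)]

lab_scores = expand({5: 3, 6: 0, 7: 3, 8: 2, 9: 1, 10: 11})
ages = expand({18: 7, 19: 5, 20: 5, 21: 6, 22: 4, 82: 1})

for name, data in (("lab scores", lab_scores), ("ages", ages)):
    print(name, "mean:", round(mean(data), 2),
          "median:", median(data), "mode(s):", multimode(data))
```

`multimode` returns every value tied for most frequent, so a bimodal collection would show two entries.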
Variance and Standard Deviation
Very different sets of numbers can have the same mean. You will now study two
measures of dispersion, which give you an idea of how much the numbers in the set
differ from the mean of the set. These two measures are called the variance of the set
and the standard deviation of the set.
Definitions of Variance and Standard Deviation
Consider a set of numbers {x_1, x_2, . . . , x_n} with a mean of x̄. The variance of the
set is
\[ v = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n} \]
and the standard deviation of the set is
\[ \sigma = \sqrt{v} \]
(σ is the lowercase Greek letter sigma).
The standard deviation of a set is a measure of how much a typical number in the
set differs from the mean. The greater the standard deviation, the more the numbers in
the set vary from the mean. For instance, each of the following sets has a mean of 5.
{5, 5, 5, 5}, {4, 4, 6, 6}, and {3, 3, 7, 7}
The standard deviations of the sets are 0, 1, and 2.
\[ \sigma_1 = \sqrt{\frac{(5 - 5)^2 + (5 - 5)^2 + (5 - 5)^2 + (5 - 5)^2}{4}} = 0 \]
\[ \sigma_2 = \sqrt{\frac{(4 - 5)^2 + (4 - 5)^2 + (6 - 5)^2 + (6 - 5)^2}{4}} = 1 \]
\[ \sigma_3 = \sqrt{\frac{(3 - 5)^2 + (3 - 5)^2 + (7 - 5)^2 + (7 - 5)^2}{4}} = 2 \]
Example 3
Estimations of Standard Deviation
Consider the three sets of data represented by the bar graphs in Figure C.10. Which set
has the smallest standard deviation? Which has the largest?
Three bar graphs: Set A, Set B, Set C (vertical axis: frequency; horizontal axis: number, 1 through 7)
FIGURE C.10
SOLUTION Of the three sets, the numbers in set A are grouped most closely to
the center and the numbers in set C are the most dispersed. So, set A has the smallest
standard deviation and set C has the largest standard deviation.
Example 4
Finding Standard Deviation
Find the standard deviation of each set shown in Example 3.
SOLUTION Because of the symmetry of each bar graph, you can conclude that each
has a mean of x̄ = 4. The standard deviation of set A is
\[ \sigma = \sqrt{\frac{(-3)^2 + 2(-2)^2 + 3(-1)^2 + 5(0)^2 + 3(1)^2 + 2(2)^2 + (3)^2}{17}} \approx 1.53. \]
The standard deviation of set B is
\[ \sigma = \sqrt{\frac{2(-3)^2 + 2(-2)^2 + 2(-1)^2 + 2(0)^2 + 2(1)^2 + 2(2)^2 + 2(3)^2}{14}} = 2. \]
The standard deviation of set C is
\[ \sigma = \sqrt{\frac{5(-3)^2 + 4(-2)^2 + 3(-1)^2 + 2(0)^2 + 3(1)^2 + 4(2)^2 + 5(3)^2}{26}} \approx 2.22. \]
These values confirm the results of Example 3. That is, set A has the smallest standard
deviation and set C has the largest.
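The three standard deviations can be recomputed from frequency tables. In the sketch below, the frequencies for the numbers 1 through 7 are read off from the coefficients in the Example 4 computations (an assumption, since the bar graphs themselves are not reproduced here). The function name `std_from_frequencies` is hypothetical.

```python
from math import sqrt

def std_from_frequencies(freq):
    """Population standard deviation of data given as {value: frequency}."""
    n = sum(freq.values())
    mean = sum(value * f for value, f in freq.items()) / n
    variance = sum(f * (value - mean) ** 2 for value, f in freq.items()) / n
    return sqrt(variance)

# Frequencies of the numbers 1..7, inferred from the Example 4 formulas.
set_a = dict(zip(range(1, 8), [1, 2, 3, 5, 3, 2, 1]))
set_b = dict(zip(range(1, 8), [2, 2, 2, 2, 2, 2, 2]))
set_c = dict(zip(range(1, 8), [5, 4, 3, 2, 3, 4, 5]))

for name, freq in (("A", set_a), ("B", set_b), ("C", set_c)):
    print(f"set {name}: sigma = {std_from_frequencies(freq):.2f}")
```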
The following alternative formula provides a more efficient way to compute the
standard deviation.
Alternative Formula for Standard Deviation
The standard deviation of {x_1, x_2, . . . , x_n} is
\[ \sigma = \sqrt{\frac{x_1^2 + x_2^2 + \cdots + x_n^2}{n} - \bar{x}^2}. \]
Because of lengthy computations, this formula is difficult to verify. Conceptually,
however, the process is straightforward. It consists of showing that the expressions
\[ \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}} \]
and
\[ \sqrt{\frac{x_1^2 + x_2^2 + \cdots + x_n^2}{n} - \bar{x}^2} \]
are equivalent. Try verifying this equivalence for the set {x_1, x_2, x_3} with
x̄ = (x_1 + x_2 + x_3)/3.
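One way to see the equivalence for any n, offered here as a sketch rather than as the text's own argument, is to expand the square under the summation sign and use the fact that the mean satisfies \(\frac{1}{n}\sum x_i = \bar{x}\). The summation notation is the same as in the least squares formulas of Section C.1.

```latex
\begin{align*}
\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \frac{1}{n}\sum_{i=1}^{n} \left(x_i^2 - 2\bar{x}\,x_i + \bar{x}^2\right) \\
  &= \frac{1}{n}\sum_{i=1}^{n} x_i^2 \;-\; 2\bar{x}\cdot\frac{1}{n}\sum_{i=1}^{n} x_i \;+\; \bar{x}^2 \\
  &= \frac{1}{n}\sum_{i=1}^{n} x_i^2 \;-\; 2\bar{x}^2 + \bar{x}^2 \\
  &= \frac{x_1^2 + x_2^2 + \cdots + x_n^2}{n} - \bar{x}^2 .
\end{align*}
```

Taking the square root of both sides gives the two expressions for the standard deviation.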
Example 5
Using the Alternative Formula
Use the alternative formula for standard deviation to find the standard deviation of the
set of numbers.
5, 6, 6, 7, 7, 8, 8, 8, 9, 10
SOLUTION Begin by finding the mean of the set, which is 7.4. So, the standard
deviation is
\[ \sigma = \sqrt{\frac{5^2 + 2(6^2) + 2(7^2) + 3(8^2) + 9^2 + 10^2}{10} - (7.4)^2} = \sqrt{\frac{568}{10} - 54.76} = \sqrt{2.04} \approx 1.43. \]
You can use the statistical features of a graphing utility to check this result.
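As one possible check in place of a graphing utility, the sketch below evaluates the alternative formula directly and compares it with `statistics.pstdev`, which computes the population standard deviation from the definition. The function name `std_alternative` is hypothetical.

```python
from math import isclose, sqrt
from statistics import pstdev

def std_alternative(data):
    """Standard deviation via sqrt(mean of squares - square of mean)."""
    n = len(data)
    mean = sum(data) / n
    return sqrt(sum(x * x for x in data) / n - mean * mean)

data = [5, 6, 6, 7, 7, 8, 8, 8, 9, 10]
sigma = std_alternative(data)
print(f"alternative formula: {sigma:.2f}")
print(f"definition (pstdev): {pstdev(data):.2f}")
```

In floating-point arithmetic the alternative formula subtracts two nearly equal quantities when the mean is large relative to the spread, so the definition-based computation is usually the safer choice for such data.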
A well-known theorem in statistics, called Chebychev’s Theorem, states that at
least
\[ 1 - \frac{1}{k^2} \]
of the numbers in a distribution must lie within k standard deviations of the mean, where k > 1.
So, at least 75% of the numbers in a collection must lie within two standard deviations of the
mean, and at least 88.9% of the numbers must lie within three standard deviations of
the mean. For most distributions, these percentages are low. For instance, in all three
distributions shown in Example 3, 100% of the numbers lie within two standard
deviations of the mean.
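The bound and an actual proportion can be compared in a few lines. This sketch assumes k > 1 and uses the population standard deviation defined in this section; the function names `chebychev_bound` and `fraction_within` are hypothetical.

```python
from math import sqrt

def chebychev_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

def fraction_within(data, k):
    """Actual fraction of values within k standard deviations of the mean."""
    n = len(data)
    mean = sum(data) / n
    sigma = sqrt(sum((x - mean) ** 2 for x in data) / n)
    return sum(1 for x in data if abs(x - mean) <= k * sigma) / n

print(chebychev_bound(2))             # guaranteed minimum for k = 2
print(round(chebychev_bound(3), 3))   # guaranteed minimum for k = 3

sample = [5, 6, 6, 7, 7, 8, 8, 8, 9, 10]   # data from Example 5
print(fraction_within(sample, 2))
```

The actual fraction is never below the bound, and for most data sets it is considerably higher.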
Example 6
Describing a Distribution
The table shows the numbers of nurses (per 100,000 people) in each state. Find the
mean and standard deviation of the data. What percent of the numbers lie within two
standard deviations of the mean? (Source: Bureau of Labor Statistics)
AL 899     AK 777     AZ 581     AR 802     CA 657
CO 799     CT 1010    DE 1034    FL 793     GA 669
HI 680     ID 710     IL 847     IN 884     IA 1008
KS 894     KY 958     LA 890     ME 1065    MD 897
MA 1218    MI 866     MN 1065    MS 930     MO 1009
MT 773     NE 1062    NV 610     NH 992     NJ 873
NM 599     NY 867     NC 911     ND 988     OH 997
OK 734     OR 792     PA 1027    RI 1078    SC 819
SD 1244    TN 987     TX 676     UT 632     VT 950
VA 770     WA 792     WV 932     WI 919     WY 807
SOLUTION Begin by entering the numbers into a graphing utility that has a standard
deviation program. After running the program, you should obtain
x̄ ≈ 875.3 and σ = 152.3.
The interval that contains all numbers that lie within two standard deviations of the
mean is
[875.3 − 2(152.3), 875.3 + 2(152.3)] or [570.7, 1179.9].
From the table, you can see that all but two of the numbers (96%) lie in this interval—
all but the numbers that correspond to the number of nurses (per 100,000 people) in
Massachusetts and South Dakota.
Exercises C.2
Finding Measures of Central Tendency In Exercises
1– 6, find the mean, median, and mode of the set of
measurements. See Example 1.
1. 5, 12, 7, 14, 8, 9, 7
2. 30, 37, 32, 39, 33, 34, 32
3. 5, 12, 7, 24, 8, 9, 7
4. 20, 37, 32, 39, 33, 34, 32
5. 5, 12, 7, 14, 9, 7
6. 30, 37, 32, 39, 34, 32
7. Reasoning Compare your answers for Exercises 1
and 3 with those for Exercises 2 and 4. Which of the
measures of central tendency is sensitive to extreme
measurements? Explain your reasoning.
8. Reasoning
(a) Add 6 to each measurement in Exercise 1 and
calculate the mean, median, and mode of the revised
measurements. How are the measures of central
tendency changed?
(b) If a constant k is added to each measurement in a
set of data, how will the measures of central tendency change?
9. Cost of Electricity A person’s monthly electricity
bills are shown for one year. What are the mean and
median of the collection of bills?
January    $67.92    February  $59.84    March      $52.00
April      $52.50    May       $57.99    June       $65.35
July       $81.76    August    $74.98    September  $87.82
October    $83.18    November  $65.35    December   $57.00
10. Car Rental The numbers of miles of travel for a
rental car are shown for six consecutive days. What are
the mean, median, and mode of these data?
Monday    410    Tuesday  260    Wednesday  320
Thursday  320    Friday   460    Saturday   150
11. Six-Child Families A study was done on families
having six children. The table shows the numbers of
families in the study with the indicated number of girls.
Determine the mean, median, and mode of this set of
data.
Number of girls   0   1    2    3    4    5    6
Frequency         1   24   45   54   50   19   7
12. Baseball A fan examined the records of a baseball
player’s performance during his last 50 games. The
table shows the numbers of games in which the player
had 0, 1, 2, 3, and 4 hits.
Number of hits   0    1    2   3   4
Frequency        14   26   7   2   1
(a) Determine the average number of hits per game.
(b) The player had 200 at-bats in the 50 games.
Determine the player’s batting average for these
games.
13. Think About It Construct a collection of numbers
that has the following properties. If this is not possible,
explain why.
Mean = 6, median = 4, mode = 4
14. Think About It Construct a collection of numbers
that has the following properties. If this is not possible,
explain why.
Mean = 6, median = 6, mode = 4
15. Test Scores A philosophy professor records the
following scores for a 100-point exam.
99, 64, 80, 77, 59, 72, 87, 79, 92, 88, 90, 42, 20, 89, 42,
100, 98, 84, 78, 91
Which measure of central tendency best describes these
test scores?
16. Shoe Sales A salesman sold eight pairs of men’s
dress shoes. The sizes of the eight pairs were 10½, 8, 12,
10½, 10, 9½, 11, and 10½. Which measure (or measures) of
central tendency best describes the typical shoe size for
these data?
Finding Standard Deviation In Exercises 17–24, find
the mean x̄, variance v, and standard deviation σ of the
set. See Example 4.
17. 4, 10, 8, 2
18. 3, 15, 6, 9, 2
19. 0, 1, 1, 2, 2, 2, 3, 3, 4
20. 2, 2, 2, 2, 2, 2
21. 1, 2, 3, 4, 5, 6, 7
22. 1, 1, 1, 5, 5, 5
23. 49, 62, 40, 29, 32, 70
24. 1.5, 0.4, 2.1, 0.7, 0.8
Using the Alternative Formula In Exercises 25–30, use
the alternative formula to find the standard deviation of
the data set. See Example 5.
25. 2, 4, 6, 6, 13, 5
26. 10, 25, 50, 26, 15, 33, 29, 4
27. 246, 336, 473, 167, 219, 359
28. 6.0, 9.1, 4.4, 8.7, 10.4
29. 8.1, 6.9, 3.7, 4.2, 6.1
30. 9.0, 7.5, 3.3, 7.4, 6.0
31. Reasoning Without calculating the standard deviation, explain why the set {4, 4, 20, 20} has a standard
deviation of 8.
32. Reasoning When the standard deviation of a set of
numbers is 0, what does this imply about the set?
33. Test Scores An instructor adds five points to each
student’s exam score. Will this change the mean or
standard deviation of the exam scores? Explain.
34. Think About It Consider the four sets of data
represented by the histograms. Order the sets from the
smallest to the largest standard deviation.
Four histograms: Set A, Set B, Set C, Set D (vertical axis: frequency; horizontal axis: number)
35. Test Scores The scores of a mathematics exam given
to 600 science and engineering students at a college had
a mean and standard deviation of 235 and 28, respectively. Use Chebychev’s Theorem to determine the
intervals containing at least 3/4 and at least 8/9 of the scores.
36. Price of Gold The data represent the average prices
of gold (in dollars per troy ounce) for the years 1985
through 2009. Use a graphing utility or computer to find
the mean, variance, and standard deviation of the data.
What percent of the data lie within two standard deviations of the mean? (Source: U.S. Bureau of Mines
and U.S. Geological Survey)
318, 368, 448, 438, 383,
385, 363, 345, 361, 385,
386, 389, 332, 295, 280,
280, 272, 311, 365, 411,
446, 606, 699, 768, 950
37. Price of Silver The data represent the average prices
of silver (in dollars per troy ounce) for the years 1990
through 2009. Use a graphing utility or computer to find
the mean, variance, and standard deviation of the data.
What percent of the data lie within one standard deviation of the mean? (Source: U.S. Bureau of Mines and
U.S. Geological Survey)
4.82, 4.04, 3.94, 4.30, 5.29,
5.15, 5.19, 4.89, 5.54, 5.25,
5.00, 4.39, 4.62, 4.91, 6.69,
7.34, 11.61, 13.43, 15.02, 13.37