Chapter 4
ST 101
Reiland
Displaying and Summarizing Quantitative Data
Chapter Objectives:
At the end of this chapter you should be able to:
1) Create appropriate displays to graphically depict quantitative data (frequency tables,
histograms, stem-and-leaf displays, dotplots, timeplots; the use of software will be
emphasized)
2) Describe the important features of the distribution of a quantitative variable: shape, center,
spread, and any unusual features such as outliers, gaps, or clusters.
Throughout the course we will emphasize the paradigm "Think, Show, Tell". The above objectives fit
into this paradigm as follows:
"Think" about what graphical display is appropriate for the data at hand; create the display to
"show" the data (objective 1).
"Tell" what characteristics of the data are conveyed by the graphical display (objective 2).
Reading Assignment:
Text: Chapter 4.
Histograms
A histogram shows three general types of information:
It provides visual indication of where the approximate center of the data is.
We can gain an understanding of the degree of spread, or variation, in the data.
We can observe the shape of the distribution.
Construction of a histogram (automate!):
i) identify the smallest and largest measurements in the data set
ii) divide the interval between the smallest and largest measurements into between 5
and 20 subintervals (called bins in Excel)
iii) count the number of data values that are in each bin (the bins and the count in each bin
give the distribution of the quantitative variable)
iv) plot the bin counts as bars over the bins; the height of the bar over a bin indicates the
count for that bin
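The four construction steps can be sketched in a few lines of Python. This is only an illustration, not the procedure used by Excel or StatCrunch: the function names are made up for this sketch, the bins are assumed to be equal width, and the sample is just the first ten values of the employee-absence example.

```python
def histogram_counts(values, num_bins):
    """Return (bin edges, bin counts) for num_bins equal-width bins.

    Assumes the values are not all equal (otherwise the bin width is zero).
    """
    lo, hi = min(values), max(values)            # step i: smallest and largest
    width = (hi - lo) / num_bins                 # step ii: equal-width bins
    counts = [0] * num_bins
    for v in values:                             # step iii: count values per bin
        # The largest value sits on the right edge; put it in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


def print_histogram(edges, counts):
    """Step iv: draw one text bar per bin, one star per observation."""
    for i, c in enumerate(counts):
        print(f"{edges[i]:6.1f} to {edges[i + 1]:6.1f} | {'*' * c} ({c})")


# First ten absence counts from the example data set, for illustration only.
sample = [146, 144, 140, 140, 138, 140, 148, 140, 129, 153]
edges, counts = histogram_counts(sample, 5)
print_histogram(edges, counts)
```

With only ten values, five bins are plenty; for the full 106 observations a larger number of bins within the 5 to 20 guideline would be more informative.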
EXAMPLE: (Number of daily employee absences from a large corporation; 106 days)
106 obs.; approx. # of classes =
146 144 140 140 138 140 148 140 129 153
143 141 140 140 143 136 148 142 139 143
148 143 139 138 141 143 138 140 133 158
148 144 148 140 139 143 149 144 140 140
135 138 138 141 145 147 134 136 136 139
141 132 149 150 145 141 139 146 141 145
139 145 148 146 148 141 142 141 134 143
143 144 148 142 141 138 131 137 142 143
137 138 139 145 142 145 142 141 133 141
142 146 136 145 144 145 140 132 149 140
146 153 141 121 137 142
[Figure: StatCrunch histogram "Histogram of Employee Absences". x-axis: Absences from Work; y-axis: Frequency.]
Heights of students in ST101
EXCEL
[Figure: histogram "Student Heights ST 101". x-axis: Height (inches), bins 59 to 75 and More; y-axis: Frequency.]
DATADESK
Stem-and-Leaf Displays
Partition each number in the data set into a "stem" and a "leaf".
Constructing a stem-and-leaf display:
i) determine the stem and leaf you want to use (5 - 20 stems);
ii) write the stems in a column with the smallest stem at top; include all stems in the range of the data, even
those without leaves;
iii) include only 1 digit in the leaves; drop digits after the first digit or round off;
iv) record the leaf for each measurement in the row corresponding to its stem;
ordering the leaves in each row is optional, but it makes the display more informative.
EXAMPLE: Below is a list of the number of home runs that Roger Maris hit during his
10 years in the American League. Make a stemplot of the data.
8 13 14 16 23 26 28 33 39 61
EXAMPLE: Number of touchdown passes thrown by each of the 31 teams in the NFL during the 2000
season.
37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20, 19, 19, 18, 18, 18, 18, 16, 15, 14, 14,
14, 12, 12, 9, 6
STEMS ARE 10'S DIGIT
stem | leaf
3 | 7
3 | 233
2 | 889
2 | 001112223
1 | 56888899
1 | 22444
0 | 69
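The touchdown display above can be reproduced with a short Python sketch. It assumes nonnegative whole-number data, uses the tens digit as the stem, and splits each stem into two rows (leaves 5-9 and leaves 0-4), listing the largest stem first as the display does. The function name is made up for this illustration.

```python
def split_stem_and_leaf(values):
    """Return the rows of a split-stem display, largest values first.

    Each row covers five consecutive values: leaves 0-4 or leaves 5-9 of one stem.
    """
    rows = {}
    for v in sorted(values):
        # v // 5 numbers the half-stem rows; v % 10 is the leaf (ones digit).
        rows.setdefault(v // 5, []).append(v % 10)
    lines = []
    # Walk from the row holding the maximum down to the row holding the minimum,
    # keeping empty rows so that gaps in the data stay visible.
    for half in range(max(values) // 5, min(values) // 5 - 1, -1):
        stem = half // 2
        leaves = "".join(str(leaf) for leaf in rows.get(half, []))
        lines.append(f"{stem} | {leaves}")
    return lines


touchdowns = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21, 20, 20,
              19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]
for line in split_stem_and_leaf(touchdowns):
    print(line)
```

Sorting before assigning leaves gives the ordered leaves that the construction steps call optional but more informative.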
EXAMPLE: Nielsen ratings for week of Aug. 8 - Aug. 14, 2005.
Rank  Program                        Network  Time     Day  Rating  Share  Households
1     CSI                            CBS      9:00PM   Thu  9.3     16     10,225,000
2     WITHOUT A TRACE                CBS      10:01PM  Thu  8       14     8,742,000
3     CSI: MIAMI                     CBS      10:00PM  Mon  7.9     13     8,668,000
4     60 MINUTES                     CBS      7:00PM   Sun  7.6     14     8,368,000
5     TWO AND A HALF MEN 930P        CBS      9:30PM   Mon  7       11     7,659,000
6     TWO AND A HALF MEN             CBS      9:00PM   Mon  6.9     11     7,540,000
7     EXTREME MAKEOVER:HM ED-8P      ABC      8:00PM   Sun  6.8     12     7,423,000
8     NCIS                           CBS      8:00PM   Tue  6.5     12     7,175,000
9     AFC-NFC HALL OF FAME GAME(S)   ABC      8:08PM   Mon  6.2     11     6,846,000
10    LAW AND ORDER:CRIM INTENT      NBC      9:00PM   Sun  6       10     6,625,000
11    AFC-NFC HALL-FME SHOWCASE(S)   ABC      8:00PM   Mon  5.9     11     6,478,000
12    EVERYBODY LOVES RAYMOND        CBS      8:30PM   Mon  5.8     10     6,390,000
13    LAW AND ORDER:SVU              NBC      10:00PM  Tue  5.8     10     6,409,000
14    MT&R: UNFORGET MOMNTS TV(S)    NBC      8:30PM   Wed  5.8     10     6,331,000
15    COLD CASE                      CBS      8:00PM   Sun  5.6     10     6,174,000
16    CSI: NY                        CBS      10:00PM  Wed  5.6     10     6,104,000
17    LAW AND ORDER                  NBC      10:00PM  Wed  5.6     10     6,121,000
18    BIG BROTHER 6-TUE              CBS      9:00PM   Tue  5.5     9      5,981,000
19    CROSSING JORDAN                NBC      10:00PM  Sun  5.5     9      6,065,000
20    DATELINE FRI                   NBC      8:00PM   Fri  5.1     10     5,592,000
*There are an estimated 105.5 million television households in the USA.
A single ratings point represents 1%, or 1,055,000 households, for the
2005-06 season. Share is the percentage of television sets in use
tuned to a specific program.
Stem-and-leaf for Shares
stems are 10's
0   | 9 9
1*  | 0 0 0 0 0 0 0 1 1 1 1
1t  | 2 2 3
1f  | 4
1s  | 6
1** |
Stem-and-Leaf for Rating
stems are 1's
5* | 1
5. | 556668889
6* | 02
6. | 589
7* | 0
7. | 69
8* | 0
8. |
9* | 3
EXAMPLE: (beginning of class pulses)
BPULSE
Unit = 1.000000
n = 138. missing = 0.
  #  Stem   Leaves
---  ----   -----------------------
  .   4*  |
  3   4.  | 588
  9   5*  | 001233444
 10   5.  | 5556788899
 23   6*  | 00011111122233333344444
 23   6.  | 55556666667777788888888
 16   7*  | 0000011222233444
 23   7.  | 55555666666777888888999
 10   8*  | 0000112224
 10   8.  | 5555667789
  4   9*  | 0012
  2   9.  | 58
  4  10*  | 0223
  .  10.  |
  1  11*  | 1
Advantages of stem and leaf displays:
i) each measurement displayed
ii) ascending order
iii) relatively simple (if data set not too large)
Disadvantage:
i) display becomes unwieldy for large data sets
EXAMPLE Population of 185 US cities with between 100,000 and 500,000 residents.
Since a stem and leaf plot shows only two-place accuracy, we had to round the numbers to the
nearest 10,000. For example the largest number (493,559) was rounded to 490,000 and then
plotted with a stem of 4 and a leaf of 9. The fourth highest number (463,201) was rounded to
460,000 and plotted with a stem of 4 and a leaf of 6. Thus, the stems represent units of 100,000
and the leaves represent units of 10,000. Notice that each stem value is split into five parts: 0-1,
2-3, 4-5, 6-7, and 8-9.
Dotplots
A dotplot is a simple display: it places a dot along an axis for each case in the data.
It is similar to a stem-and-leaf display, with dots in place of the leaf digits.
Example: Kentucky Derby winning times, plotting each race as its own dot.
Timeplots
[Figure: timeplot "Winning Times in Olympic 100m Dash"; winning time plotted against year.]
[Figure: "Histogram". x-axis: Bin (9.84, 10.38, 10.92, 11.46, More); y-axis: Frequency.]
The Shape of a Distribution
skewness
skewed to the right (positively skewed)
[Figure: histogram "2006 Baseball Salaries". x-axis: 2006 Salary ($1,000's); y-axis: Frequency.]
skewed to the left (negatively skewed)
[Figure: "Histogram of Exam Scores". x-axis: Exam Scores (20 to 100); y-axis: Frequency.]
symmetric
[Figure: histogram "Bank Customers: 10:00-11:00 am". x-axis: Number of Customers; y-axis: Frequency.]
outliers
[Figure: histogram "200 m Races 20.2 secs or less (approx. 700)". x-axis: TIMES; y-axis: Frequency. Two times are labeled: Usain Bolt, 2008, 19.30; Michael Johnson, 1996, 19.32.]
BIMODAL DISTRIBUTIONS (two peaks)
(frequently results from measurements on two populations, such as heights of male and
female adults).
[Figure: "Histogram". x-axis: Bin (51 to 73.5, More); y-axis: Frequency.]
Describing Distributions Numerically
Section Objectives:
At the end of this section you should be able to:
1) Calculate appropriate numerical summaries of quantitative data to describe center (median,
mean, quartiles) and spread (range, interquartile range, standard deviation) [the use of
software will be emphasized!]
2) Describe the characteristics of various numerical summaries with emphasis on the effects of
outliers
3) Interpret the values of the numerical summaries for a particular data set.
4) Match graphical displays of quantitative data to the values of the summary statistics.
5) Apply graphical and numerical procedures to compare 2 or more sets of data
Throughout the course we will emphasize the paradigm "Think, Show, Tell". The above objectives fit
into this paradigm as follows:
"Think" about what numerical summaries of center and spread are appropriate for the data at
hand; calculate the values of the numerical summaries to "show" the center and spread.
"Tell" what characteristics of the data are conveyed by the values of the numerical
summaries.
Finding Center and Spread
Would like to numerically summarize two characteristics of quantitative data:
i) center
ii) spread
Finding the center: the median
median: the value that falls in the middle when the data are arranged in order of magnitude
Calculating the Median
Given a set of n data values arranged in order of magnitude:
Median = the middle value, if n is odd
Median = the mean of the two middle values, if n is even
graphically, the median splits the histogram of the data into two halves of equal area.
EXAMPLES:
1) Below is a list of the home runs hit by Babe Ruth in each of his seasons as a Yankee:
54 59 35 41 46 25 47 60 54 46 49 46 41 34 22
median =
2) student pulse rates - ordered values: 38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70,
70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76, 77, 77, 77, 77, 78, 78, 79, 79,
80, 80, 80, 84, 84, 85, 85, 87, 90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103
median =
3) Year 2002 baseball salaries: n = 805
median = $900,000;
maximum = $25,000,000 (Alex Rodriguez)
minimum = $200,000
4) Median fan age:
MLB: 45; NFL: 43; NBA: 41
NHL: 39 (Scarborough Research)
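The odd/even rule for the median can be written as a short Python sketch. The function name and the four-value list are made up for illustration; the Babe Ruth data are taken from example 1.

```python
def median(values):
    """Middle value if n is odd; mean of the two middle values if n is even."""
    ordered = sorted(values)          # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
print(median(ruth))                   # n = 15 is odd: the 8th ordered value

even_case = [10, 20, 30, 40]          # hypothetical values, n = 4 is even
print(median(even_case))              # mean of the 2nd and 3rd ordered values
```

Python's standard library offers `statistics.median`, which follows the same rule; the hand-written version is shown only to make the two cases explicit.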
Measuring spread: home on the range
range = max - min
EXAMPLE:
Year 2002 baseball salaries: range = $25,000,000 - $200,000 = $24,800,000
disadvantage of range: it is crude and sensitive; a single extreme value can make the range very
large.
Measuring spread: the interquartile range (IQR)
focus on the middle of the data instead of the extremes of the data
find the range of the middle half of the data:
i) divide the data in half at the median
ii) now divide both halves in half again, cutting the data into quarters
1/4 of the data lies below the lower quartile
half the data lies between the lower quartile and the upper quartile (the interquartile range)
1/4 of the data lies above the upper quartile
IQR = upper quartile - lower quartile
quartiles are NOT well-defined; different software packages give different answers
FINDING QUARTILES BY HAND
when n is odd, include the overall median in both halves
when n is even, do NOT include the overall median in either half
EXAMPLES:
1) odd number of observations in data set
Below is a list of the home runs hit by Babe Ruth in each of his seasons as a Yankee:
54 59 35 41 46 25 47 60 54 46 49 46 41 34 22
ordered values:
22 25 34 35 41 41 46 46 46 47 49 54 54 59 60
median = 46
lower half (including median): 22 25 34 35 41 41 46 46
Q1 = lower quartile = (35 + 41)/2 = 38
upper half (including median): 46 46 47 49 54 54 59 60
Q3 = upper quartile = (49 + 54)/2 = 51.5
IQR = 51.5 - 38 = 13.5
software
Excel: Q1 = 38; Q3 = 51.5; IQR =
DataDesk: Q1 = 36.5; Q3 = 52.75; IQR = 16.25
2) even number of observations in data set
ten "distance of hometown from NCSU campus" values:
300 500 65 180 200 120 270 10 100 10
ordered values:
10 10 65 100 120 180 200 270 300 500
median = (120 + 180)/2 = 150
lower half: 10 10 65 100 120
Q1 = lower quartile = 65
upper half: 180 200 270 300 500
Q3 = upper quartile = 270
IQR = 270 - 65 = 205
software
Excel: Q1 = 73.75; Q3 = 252.5; IQR = 252.5 - 73.75 = 178.75
DataDesk: Q1 = 65; Q3 = 270; IQR =
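The by-hand rule for quartiles can be sketched in Python as follows. The function name is made up for this illustration. For comparison, the sketch also calls `statistics.quantiles` with `method="inclusive"`, an interpolation rule from Python's standard library; for these two data sets it happens to give the same quartiles as the Excel values quoted above, which is an observation about these examples and not a claim about how Excel computes quartiles.

```python
import statistics


def quartiles_by_hand(values):
    """Q1 and Q3 as medians of the lower and upper halves.

    When n is odd the overall median is included in both halves;
    when n is even it is included in neither.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        lower, upper = ordered[:half], ordered[half:]
    return statistics.median(lower), statistics.median(upper)


ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
distances = [300, 500, 65, 180, 200, 120, 270, 10, 100, 10]

for name, data in (("home runs", ruth), ("distances", distances)):
    q1, q3 = quartiles_by_hand(data)
    print(f"{name}: by hand Q1 = {q1}, Q3 = {q3}, IQR = {q3 - q1}")
    # A different, interpolating definition of the quartiles.
    i1, _, i3 = statistics.quantiles(data, n=4, method="inclusive")
    print(f"{name}: interpolated Q1 = {i1}, Q3 = {i3}, IQR = {i3 - i1}")
```

The two definitions agree for the home run data but not for the distance data, which illustrates the warning that quartiles are not well-defined.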
3) median, quartiles from stem and leaf plot
class beginning pulse rates
BPULSE
Unit = 1.000000
n = 138. missing = 0.
  #  Stem   Leaves
---  ----   -----------------------
  .   4*  |
  3   4.  | 588
  9   5*  | 001233444
 10   5.  | 5556788899
 23   6*  | 00011111122233333344444
 23   6.  | 55556666667777788888888
 16   7*  | 0000011222233444
 23   7.  | 55555666666777888888999
 10   8*  | 0000112224
 10   8.  | 5555667789
  4   9*  | 0012
  2   9.  | 58
  4  10*  | 0223
  .  10.  |
  1  11*  | 1
median =
lower quartile =
upper quartile =
5-Number Summary
minimum Q1 median Q3 maximum
5-number summary for the above 138 student pulses
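One way to check a hand-computed 5-number summary is to rebuild the 138 pulse values from the stem-and-leaf display and apply the by-hand quartile rule. The Python sketch below does this; the variable and function names are made up for the illustration, and other software may report slightly different quartiles.

```python
import statistics

# (stem, leaves) rows copied from the BPULSE display; unit = 1, so the
# value is 10 * stem + leaf. Empty rows are omitted.
rows = [
    (4, "588"),
    (5, "001233444"),
    (5, "5556788899"),
    (6, "00011111122233333344444"),
    (6, "55556666667777788888888"),
    (7, "0000011222233444"),
    (7, "55555666666777888888999"),
    (8, "0000112224"),
    (8, "5555667789"),
    (9, "0012"),
    (9, "58"),
    (10, "0223"),
    (11, "1"),
]
pulses = sorted(10 * stem + int(leaf) for stem, leaves in rows for leaf in leaves)


def five_number_summary(ordered):
    """min, Q1, median, Q3, max using the by-hand quartile rule."""
    n = len(ordered)
    half = n // 2
    # n odd: the overall median goes in both halves; n even: in neither.
    lower = ordered[:half + 1] if n % 2 == 1 else ordered[:half]
    upper = ordered[half:]
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])


print(len(pulses))                    # 138, matching n in the display
print(five_number_summary(pulses))    # (45, 63, 70.0, 78, 111)
```

Because the display lists the leaves in order, the same answers can be found by counting along the rows using the cumulative counts in the # column.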
Summarizing Symmetric Distributions
EXAMPLE (body temperature of 93 adults)
median = 98.2 degrees Fahrenheit
mean = 98.12 degrees Fahrenheit
Finding the center: the mean
median: determined by counting the data; it doesn't care how large or how small the data
values are (except the middle one or two data values).
Often we do care about the actual data values; would like a measure of center that uses
each data value.
NOTATION
\(y\) represents an observation in a data set
\(n\) number of observations in the data set
\(\bar{y}\) denotes the sample mean
consider any set of data values represented by \(y\)'s; then
\[ \bar{y} = \frac{\sum y}{n} = \frac{\text{sum of } y\text{'s}}{n} \]
IMPORTANT: the mean is an appropriate measure of the middle only when the shape is
approximately symmetric and there are no outliers.
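A small Python sketch shows why an outlier matters for the mean but not for the median. It uses the ten hometown-distance values from the quartile example; the second list, in which 500 is replaced by 5000, is hypothetical and introduced only for illustration.

```python
import statistics

distances = [300, 500, 65, 180, 200, 120, 270, 10, 100, 10]
# Hypothetical change: the largest distance becomes ten times larger.
with_outlier = [5000 if d == 500 else d for d in distances]

for label, data in (("original", distances), ("with outlier", with_outlier)):
    mean = sum(data) / len(data)              # uses every data value
    med = statistics.median(data)             # uses only the middle values
    print(f"{label}: mean = {mean}, median = {med}")
```

The mean jumps from 175.5 to 625.5 while the median stays at 150, because changing the largest value does not change which values sit in the middle.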
Connection to histogram
A histogram balances when supported at
the mean
median = 57.7 years; mean = 55.26
years
Mean or median? It makes a difference (sometimes)
EXAMPLE: 2004 major league baseball salaries, n = 826
\(\bar{y}\) = $2,482,530; median = $787,500
min = $300,000
max = $21,726,881
[Figure: histogram "2004 Major League Baseball Salaries". x-axis: Salary ($100,000's); y-axis: Frequency.]
[Figure: timeplot "Mean, Median, and Maximum Baseball Salaries", 1976 to 2002. x-axis: Year; left axis: Mean, Median Salary; right axis: Maximum Salary.]
Finding spread: the standard deviation
IQR: uses only Q1 and Q3 to measure spread
standard deviation: takes into account how far each observation is from the mean
\[ \sum (y - \bar{y}) = \ ? \]
variance
\[ s^2 = \frac{\sum (y - \bar{y})^2}{n - 1} \]
units: square gallons, square dollars
standard deviation
\[ s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}} \]
automate this calculation!
IMPORTANT: 1) the standard deviation is an appropriate measure of spread only when the
shape is approximately symmetric and there are no outliers.
2) Always (always!) report a spread along with any summary of the center.
EXAMPLE 1 3 5 9
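For the four values in this example, the variance and standard deviation formulas can be worked through in Python. This is a minimal sketch with made-up function names; in practice a library routine such as `statistics.stdev` does the same job.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)


def sample_sd(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [1, 3, 5, 9]
mean = sum(data) / len(data)                  # 4.5
deviations = [y - mean for y in data]         # [-3.5, -1.5, 0.5, 4.5]
print(sum(deviations))                        # 0.0: the deviations cancel out
print(sample_variance(data))                  # 35 / 3, about 11.67
print(round(sample_sd(data), 3))              # 3.416
```

The squared deviations are 12.25, 2.25, 0.25, and 20.25, which sum to 35; dividing by n - 1 = 3 gives the variance.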
Thinking about the standard deviation:
1) Note that \(s\) is always nonnegative, that is, \(s \ge 0\). When does \(s = 0\)?
2) The larger the value of \(s\), the greater the spread of the data. Given two data sets, the
standard deviation is useful as a relative measure of spread.
3) The standard deviation is the most commonly used measure of risk in many areas
such as finance, business, education, social sciences, etc.
4) Why divide by \(n - 1\) instead of \(n\) when computing the sample standard deviation?
i) to drive you crazy.
ii) dividing by \(n\) to find the standard deviation of a small group would underestimate the
variability present in the larger group it represents.
iii) the above formula for \(s\) includes the sample mean \(\bar{y}\). Since \(\sum_{i=1}^{n} (y_i - \bar{y}) = 0\), only \(n - 1\) of
the data values are free to vary.
example:
Reporting shape, center, and spread of quantitative data
1) when telling about a quantitative variable, always report shape, along with a center and a
spread
2) if the shape is skewed, report the median and IQR; the mean and standard deviation are
sensitive to outliers (you can include the mean and standard deviation, but you should
point out why the mean and median differ)
3) if the shape is symmetric, report the mean and standard deviation.
4) if there are obvious outliers and you are reporting the mean and standard deviation,
report them both with the outliers included and with the outliers removed (the median and IQR
are resistant, so the outliers will change them little, if at all).
SUMMARY
We can now summarize distributions of quantitative variables numerically.
• The 5-number summary displays the min, Q1, median, Q3, and max.
• Measures of center include the mean and median.
• Measures of spread include the range, IQR, and standard deviation.
We know which measures to use for symmetric distributions and skewed distributions.
We can also display distributions with boxplots.
• While histograms better show the shape of the distribution, boxplots reveal the
center, middle 50%, and any outliers in the distribution.
• Boxplots are useful for comparing groups.