Displaying and Summarizing
Quantitative Data
CHAPTER OBJECTIVES
At the conclusion of this chapter you should be able to:
1) Construct graphs that appropriately describe quantitative data.
2) Calculate and interpret numerical summaries of quantitative data.
3) Combine numerical methods with graphical methods to analyze a data set.
4) Apply graphical methods of summarizing data to choose appropriate numerical summaries.
5) Apply software and/or calculators to automate graphical and numerical summary procedures.
Displaying Quantitative Data
Histograms
Stem and Leaf Displays
[Figure: Relative Frequency Histogram of Exam Grades. Horizontal axis: Grade (40 to 100); vertical axis: Relative frequency (0 to .30).]
Frequency Histogram
Histograms
A histogram shows three general types of
information:
 It provides visual indication of where
the approximate center of the data is.
 We can gain an understanding of the
degree of spread, or variation, in the
data.
 We can observe the shape of the
distribution.
[Figure: histogram of all 200 m races run in 20.2 secs or less (approx. 700 races). Horizontal axis: TIMES (19.2 to 20.19 secs); vertical axis: Frequency (0 to 60). Marked on the plot: Usain Bolt, 2008, 19.30; Michael Johnson, 1996, 19.32.]
Histograms Showing Different Centers (football head coach salaries)
Histograms: Same Center, Different Spread (football head coach salaries)
Excel Example: 2012-13 NFL Salaries
[Figure: Excel histogram. Horizontal axis: Bin (369480 to 17547935.38, then "More"); vertical axis: Frequency (0 to 1000).]
Statcrunch Example: 2012-13 NFL Salaries
Grades on a statistics exam
Data:
75 66 77 66 64 73 91 65 59 86 61 86 61
58 70 77 80 58 94 78 62 79 83 54 52 45
82 48 67 55
Frequency Distribution of Grades
Class Limits     Frequency
40 up to 50          2
50 up to 60          6
60 up to 70          8
70 up to 80          7
80 up to 90          5
90 up to 100         2
Total               30
Relative Frequency Distribution of Grades
Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
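The two tables above can be reproduced with a few lines of code. The following is a minimal Python sketch, not the method used in the slides (which rely on Excel and StatCrunch); the function name `frequency_table` is made up for illustration, and each class is read as "lower limit up to, but not including, upper limit".

```python
# Exam grades copied from the "Grades on a statistics exam" data.
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

def frequency_table(data, start, stop, width):
    """Count the values in each class [lo, lo + width), e.g. "40 up to 50"."""
    table = []
    for lo in range(start, stop, width):
        hi = lo + width
        count = sum(1 for x in data if lo <= x < hi)
        # Relative frequency = class count divided by the number of observations.
        table.append((lo, hi, count, count / len(data)))
    return table

table = frequency_table(grades, 40, 100, 10)
for lo, hi, count, rel in table:
    print(f"{lo:3d} up to {hi:3d}   {count:2d}   {rel:.3f}")
```

The relative frequencies always add up to 1, which is a quick check that no observation was dropped or counted twice.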
[Figure: Relative Frequency Histogram of Grades. Horizontal axis: Grade (40 to 100); vertical axis: Relative frequency (0 to .30).]
Based on the histogram, about what percent of the values are between 47.5 and 52.5?
1. 50%
2. 5%
3. 17%
4. 30%
Stem and leaf displays
Have the following general appearance:
stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
Stem and Leaf Displays
Partition each number in the data into a “stem” and a “leaf”.
Constructing a stem and leaf display:
1) Determine the stem and leaf partition (aim for 5-20 stems).
2) Write the stems in a column with the smallest stem at the top; include all stems in the range of the data.
3) Use only 1 digit in the leaves; drop digits or round off.
4) Record the leaf for each number in the corresponding stem row; ordering the leaves in each row helps.

Example: employee ages at a small company
18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39
stem: 10’s digit; leaf: 1’s digit
18: stem = 1; leaf = 8; 18 = 1 | 8
stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
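The construction steps can be sketched in Python for the common case of whole numbers with the 10's digit as stem and the 1's digit as leaf. This is an illustrative sketch only; the function name `stem_and_leaf` is an assumption, not something from the slides.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Build display rows for whole numbers: stem = 10's digit(s), leaf = 1's digit."""
    leaves = defaultdict(list)
    for v in sorted(values):            # sorting orders the leaves in each row
        leaves[v // 10].append(v % 10)
    rows = []
    # Include every stem in the range of the data, even stems with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem} | " + " ".join(str(d) for d in leaves.get(stem, []))
        rows.append(row.rstrip())
    return rows

ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
for row in stem_and_leaf(ages):
    print(row)
```

Calling `stem_and_leaf(ages + [95])` adds the empty stems 7 and 8 automatically, because every stem between the smallest and largest is listed.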
Suppose a 95 yr. old is hired
stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
7 |
8 |
9 | 5
Number of TD passes by NFL teams: 2012-2013 season
(stems are 10’s digit)
stem | leaf
4 | 03
3 | 247
2 | 6677789
2 | 01222233444
1 | 13467889
0 | 8
Pulse Rates n = 138
  #   Stem   Leaves
       4*
  3    4.    588
  9    5*    001233444
 10    5.    5556788899
 23    6*    00011111122233333344444
 23    6.    55556666667777788888888
 16    7*    00000112222334444
 23    7.    55555666666777888888999
 10    8*    0000112224
 10    8.    5555667789
  4    9*    0012
  2    9.    58
  4   10*    0223
      10.
  1   11*    1
Advantages/Disadvantages of Stem-and-Leaf Displays
Advantages
1) each measurement is displayed
2) ascending order in each stem row
3) relatively simple (when the data set is not too large)
Disadvantages
1) display becomes unwieldy for large data sets

Populations of 185 US cities with between 100,000 and 500,000 people
(multiply stems by 100,000)
Back-to-back stem-and-leaf displays. TD passes by NFL teams: 1999-2000, 2012-13
(multiply stems by 10)
  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8
Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77? (Stems are 10’s digits.)
1. 4
2. 6
3. 8
4. 10
5. 12
Interpreting Graphical Displays: Shape
Symmetric distribution: a distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.
Skewed distribution: a distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.
Complex, multimodal distribution: not all distributions have a simple overall shape, especially when there are few observations.
Heights of Students in Recent Stats Class
Shape (cont.): Female heart attack patients in New York state
Age: left-skewed
Cost: right-skewed
Shape (cont.): Outliers
An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them.
The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.
A large gap in the distribution is typically a sign of an outlier.
[Figure: histogram with Alaska and Florida labeled as outliers]
Center: typical value of frozen
personal pizza? ~$2.65
Spread: fuel efficiency 4, 8
cylinders
4 cylinders: more spread
8 cylinders: less spread
Other Graphical Methods for Quantitative Data
Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis.
Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)
Unemployment Rate, by Educational
Attainment
Water Use During Super Bowl
Winning Times 100 M Dash
Numerical Summaries of Quantitative Data
Numerical and More Graphical Methods to Describe Univariate Data
2 characteristics of a data set to measure:
center: measures where the “middle” of the data is located
variability: measures how “spread out” the data is
The median: a measure of center
Given a set of n measurements arranged in order of magnitude,
Median = middle value, if n is odd
Median = mean of the 2 middle values, if n is even
Ex. 2, 4, 6, 8, 10; n=5; median=6
Ex. 2, 4, 6, 8; n=4; median=(4+6)/2=5
Student Pulse Rates (n=62)
38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70,
70, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75,
75, 75, 75, 76, 77, 77, 77, 77, 78, 78, 79, 79, 80, 80,
80, 84, 84, 85, 85, 87, 90, 90, 91, 92, 93, 94, 94, 95,
96, 96, 96, 98, 98, 103
Median = (75+76)/2 = 75.5
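The odd/even rule for the median can be written as a short Python function. This is a sketch for illustration; Python's standard library already provides `statistics.median`, which follows the same rule.

```python
def median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    ordered = sorted(values)      # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Student pulse rates (n=62) from the slide above.
pulse = [38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70,
         70, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75,
         75, 75, 75, 76, 77, 77, 77, 77, 78, 78, 79, 79, 80, 80,
         80, 84, 84, 85, 85, 87, 90, 90, 91, 92, 93, 94, 94, 95,
         96, 96, 96, 98, 98, 103]

print(median([2, 4, 6, 8, 10]))   # odd n
print(median([2, 4, 6, 8]))       # even n
print(median(pulse))              # the 2 middle values are 75 and 76
```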
Medians are used often
Year 2011 baseball salaries
Median $1,450,000 (max=$32,000,000
Alex Rodriguez; min=$414,000)
 Median fan age: MLB 45; NFL 43; NBA
41; NHL 39
 Median existing home sales price: May
2011 $166,500; May 2010 $174,600
 Median household income (2008
dollars) 2009 $50,221; 2008 $52,029

The median splits the histogram
into 2 halves of equal area
Examples
Example: n = 7
17.5 2.8 3.2 13.9 14.1 25.3 45.8
 Example n = 7 (ordered): m = 14.1
 2.8 3.2 13.9 14.1 17.5 25.3 45.8
 Example: n = 8
17.5 2.8 3.2 13.9 14.1 25.3 35.7 45.8
 Example n =8 (ordered) m = (14.1+17.5)/2 = 15.8
2.8 3.2 13.9 14.1 17.5 25.3 35.7 45.8

Below are the annual tuition charges at 7 public universities. What is the median tuition?
4429 4960 4960 4971 5245 5546 7586
1. 5245
2. 4965.5
3. 4960
4. 4971
Below are the annual tuition charges at 7 public universities. What is the median tuition?
4429 4960 5245 5546 4971 5587 7586
1. 5245
2. 4965.5
3. 5546
4. 4971
Measures of Spread

The range and interquartile
range
Ways to measure variability
range=largest-smallest
 OK sometimes; in general, too crude;
sensitive to one large or small data
value
 The range measures spread by
examining the ends of the data
 A better way to measure spread is to
examine the middle portion of the data
Quartiles: Measuring spread by
examining the middle
The first quartile, Q1, is the value in the
sample that has 25% of the data at or
below it (Q1 is the median of the lower
half of the sorted data).
The third quartile, Q3, is the value in the
sample that has 75% of the data at or
below it (Q3 is the median of the upper
half of the sorted data).
Data in increasing order (n = 25):
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
Q1 = first quartile = 2.3 (position 7)
m = median = 3.4 (position 13)
Q3 = third quartile = 4.2 (position 19)
Quartiles and median divide data into 4 pieces
1/4 | Q1 | 1/4 | M | 1/4 | Q3 | 1/4
Quartiles are common measures of spread
http://www2.acs.ncsu.edu/UPA/admissions/fresprof.htm
http://www2.acs.ncsu.edu/UPA/peers/current/ncsu_peers/sat.htm
University of Southern California
Rules for Calculating Quartiles
Step 1: find the median of all the data (the median divides the data in half).
Step 2a: find the median of the lower half; this median is Q1.
Step 2b: find the median of the upper half; this median is Q3.
Important:
when n is odd, include the overall median in both halves;
when n is even, do not include the overall median in either half.
Example
2 4 6 8 10 12 14 16 18 20
n = 10
Median m = (10+12)/2 = 22/2 = 11
Q1: median of lower half 2 4 6 8 10; Q1 = 6
Q3: median of upper half 12 14 16 18 20; Q3 = 16
Quartile example: odd number of data values
HR’s hit by Babe Ruth in each season as a Yankee:
54 59 35 41 46 25 47 60 54 46 49 46 41 34 22
Ordered values:
22 25 34 35 41 41 46 46 46 47 49 54 54 59 60
Median: value in ordered position 8; median = 46
Lower half (including overall median): 22 25 34 35 41 41 46 46
Q1 = lower quartile = (35 + 41)/2 = 38
Upper half (including overall median): 46 46 47 49 54 54 59 60
Q3 = upper quartile = (49 + 54)/2 = 51.5
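The quartile rule stated in these slides (include the overall median in both halves when n is odd, in neither half when n is even) can be sketched in Python as follows. The function name `quartiles` is an assumption for illustration. Software packages use several different quartile conventions, so a calculator, Excel, or a library routine may return slightly different values for Q1 and Q3 on the same data.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the rule given in these slides."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves.
        lower = ordered[:half + 1]
        upper = ordered[half:]
    else:
        # n even: the data splits cleanly into two halves.
        lower = ordered[:half]
        upper = ordered[half:]
    return median(lower), median(ordered), median(upper)

even_example = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
ruth_hrs = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

print(quartiles(even_example))
print(quartiles(ruth_hrs))
```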
Pulse Rates n = 138 (same stem-and-leaf display as shown earlier)
Median: mean of pulses in locations 69 & 70: median = (70+70)/2 = 70
Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63
Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78
Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?
1. 287
2. 257.5
3. 263.5
4. 262.5
  #    stem | leaf
  2     22  | 55
  4     23  | 57
  6     24  | 26
  7     25  | 7
 10     26  | 257
 12     27  | 59
 (4)    28  | 1567
 15     29  | 35599
 10     30  | 333
  7     31  | 45
  5     32  | 155
  2     33  | 6
  1     34  | 0
Interquartile range
lower quartile: Q1
middle quartile: median
upper quartile: Q3
interquartile range (IQR): IQR = Q3 - Q1; measures the spread of the middle 50% of the data
Example: beginning pulse rates
Q3 = 78; Q1 = 63
IQR = 78 - 63 = 15
Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?
1. 23.5
2. 39.5
3. 46
4. 69.5
  #    stem | leaf
  2     22  | 55
  4     23  | 57
  6     24  | 26
  7     25  | 7
 10     26  | 257
 12     27  | 59
 (4)    28  | 1567
 15     29  | 35599
 10     30  | 333
  7     31  | 45
  5     32  | 155
  2     33  | 6
  1     34  | 0
5-number summary of data
Minimum, Q1, median, Q3, maximum
Pulse data: 45, 63, 70, 78, 111
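A 5-number summary and the IQR can be computed together. The sketch below is an illustration in Python, using the quartile rule from these slides and the Babe Ruth home run data from the earlier example; the function name `five_number_summary` is an assumption, not part of any library.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    # The overall median joins both halves only when n is odd.
    lower = ordered[:(n + 1) // 2]
    upper = ordered[n // 2:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

ruth_hrs = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
minimum, q1, med, q3, maximum = five_number_summary(ruth_hrs)
iqr = q3 - q1                      # spread of the middle 50% of the data
data_range = maximum - minimum     # spread measured from the ends of the data

print(minimum, q1, med, q3, maximum)
print("IQR =", iqr, " range =", data_range)
```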
End of General Numerical Summaries
Next: Numerical Summaries of
Symmetric Data
Numerical Summaries of Quantitative Data (cont.)
Symmetric Data
Measure of Center: Mean
Measure of Variability: Standard Deviation
Symmetric Data: body temp. of 93 adults
Recall: 2 characteristics of a data set to measure:
center: measures where the “middle” of the data is located
variability: measures how “spread out” the data is
Measure of Center When Data Approx. Symmetric
mean (arithmetic mean)
notation:
$x_i$: $i$th measurement in a set of observations $x_1, x_2, x_3, \ldots, x_n$
$n$: number of measurements in data set; sample size
$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n$
Sample mean $\bar{x}$:
$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
Population mean $\mu$ (value typically not known); $N$ = population size:
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Connection Between Mean and Histogram
A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6
[Figure: histogram of Absences from Work. Horizontal axis: bins 118.5 to 160.5, then "More"; vertical axis: Frequency (0 to 70).]
Mean: balance point
Median: 50% area each half
right histogram: mean 55.26 yrs, median 57.7 yrs
Properties of Mean, Median
1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal).
2. The mean uses the value of every number in the data set; the median does not.
Ex. 2, 4, 6, 8. $\bar{x} = \frac{20}{4} = 5$; $m = \frac{4+6}{2} = 5$
Ex. 2, 4, 6, 9. $\bar{x} = \frac{21}{4} = 5\frac{1}{4}$; $m = \frac{4+6}{2} = 5$
Example: class pulse rates
53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140
$n = 23$
$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$
$m$: location: 12th obs.; $m = 85$
2010, 2014 baseball salaries
2010: n = 845; mean = $3,297,828; median = $1,330,000; max = $33,000,000
2014: n = 848; mean = $3,932,912; median = $1,456,250; max = $28,000,000
Disadvantage of the mean

Can be greatly influenced by just a few
observations that are much greater or
much smaller than the rest of the data
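To see this sensitivity on the class pulse rates example, the sketch below compares the mean and median with and without the single largest value, 140. It uses Python's standard `statistics` module; dropping the value is only for illustration and is not a recommendation to discard outliers.

```python
from statistics import mean, median

# Class pulse rates from the earlier example (n = 23).
pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85,
          89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

# Same data with the one unusually large observation removed.
without_max = [x for x in pulses if x != 140]

print(f"all 23 values: mean = {mean(pulses):.2f}, median = {median(pulses)}")
print(f"without 140:   mean = {mean(without_max):.2f}, median = {median(without_max)}")
```

Removing one observation moves the mean by about 2.5 beats per minute, while the median moves by only half a beat.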
Mean, Median, Maximum Baseball Salaries 1985 - 2014
[Figure: time plot, "Baseball Salaries: Mean, Median and Maximum 1985-2014". Horizontal axis: Year (1985 to 2013); left vertical axis: Mean, Median Salary (200,000 to 3,700,000); right vertical axis: Maximum Salary (0 to 35,000,000).]
Skewness: comparing the mean and median
Skewed to the right (positively skewed): mean > median
[Figure: histogram of 2011 Baseball Salaries. Horizontal axis: Salary ($1,000's); vertical axis: Frequency (0 to 600).]
Skewed to the left (negatively skewed): mean < median
mean = 78; median = 87
[Figure: Histogram of Exam Scores. Horizontal axis: Exam Scores (20 to 100); vertical axis: Frequency (0 to 30).]
Symmetric data: mean, median approx. equal
[Figure: histogram, Bank Customers: 10:00-11:00 am. Horizontal axis: Number of Customers; vertical axis: Frequency (0 to 20).]
DESCRIBING VARIABILITY OF SYMMETRIC DATA
Describing Symmetric Data (cont.)
Measure of center for symmetric data: sample mean $\bar{x}$
$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
Measure of variability for symmetric data?
Example
2 data sets:
$x_1 = 49$, $x_2 = 51$; $\bar{x} = 50$
$y_1 = 0$, $y_2 = 100$; $\bar{y} = 50$
On average, they’re both comfortable
[Figure: the two data sets, 49 and 51 versus 0 and 100]
Ways to measure variability
1. range = largest - smallest; ok sometimes; in general, too crude; sensitive to one large or small observation.
2. measure spread from the middle, where the middle is the mean $\bar{x}$;
deviation of $x_i$ from the mean: $x_i - \bar{x}$
$\sum_{i=1}^{n} (x_i - \bar{x})$: sum the deviations of all the $x_i$'s from $\bar{x}$;
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$ always; tells us nothing
Previous Example
sum of deviations from mean:
$x_1 = 49$, $x_2 = 51$; $\bar{x} = 50$
$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$
$y_1 = 0$, $y_2 = 100$; $\bar{y} = 50$
$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$
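The reason the deviations from the mean always sum to zero follows directly from the definition of the sample mean. A short derivation, added here as a supplement to the slides:

```latex
\[
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\ \text{ implies } \ \sum_{i=1}^{n} x_i = n\bar{x}.
\]
```

Positive and negative deviations cancel exactly, which is why the deviations are squared before averaging.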
The Sample Standard Deviation, a measure of spread around the mean
Square the deviation of each observation from the mean; find the square root of the “average” of these squared deviations:
$(x_i - \bar{x})^2$; $\sum_{i=1}^{n} (x_i - \bar{x})^2$ and find the "average", then take the square root of the average
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$, called the sample standard deviation
Calculations …
Women height (inches)
  i    $x_i$   $\bar{x}$   $(x_i-\bar{x})$   $(x_i-\bar{x})^2$
  1    59     63.4      -4.4        19.0
  2    60     63.4      -3.4        11.3
  3    61     63.4      -2.4         5.6
  4    62     63.4      -1.4         1.8
  5    62     63.4      -1.4         1.8
  6    63     63.4      -0.4         0.1
  7    63     63.4      -0.4         0.1
  8    63     63.4      -0.4         0.1
  9    64     63.4       0.6         0.4
 10    64     63.4       0.6         0.4
 11    65     63.4       1.6         2.7
 12    66     63.4       2.6         7.0
 13    67     63.4       3.6        13.3
 14    68     63.4       4.6        21.6
 Sum                     0.0        85.2
Mean $\bar{x}$ = 63.4
Sum of squared deviations from mean = 85.2
(n − 1) = 13; (n − 1) is called degrees of freedom
We’ll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
[Figure: the women's heights with Mean ± 1 s.d. marked]
1. First calculate the variance $s^2$:
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
2. Then take the square root to get the standard deviation $s$:
$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
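The two steps (variance first, then its square root) can be traced on the women's height data with a short Python sketch. The variable names are chosen for illustration; the intermediate lists mirror the columns of the calculation table.

```python
from math import sqrt

# Women's heights in inches, from the calculation table.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = sum(heights) / n                       # sample mean
deviations = [x - x_bar for x in heights]      # column (x_i - x_bar)
squared = [d ** 2 for d in deviations]         # column (x_i - x_bar)^2
sum_sq = sum(squared)                          # sum of squared deviations

variance = sum_sq / (n - 1)                    # step 1: s^2, dividing by n - 1
s = sqrt(variance)                             # step 2: s

print(f"mean = {x_bar:.1f}, sum of squared deviations = {sum_sq:.1f}")
print(f"s^2 = {variance:.2f}, s = {s:.2f} inches")
```

Because the table rounds the mean to 63.4 before subtracting, its individual entries can differ from the unrounded computation in the last digit.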
Population Standard Deviation
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
value of $\sigma$, the population standard deviation, typically not known;
use $s$ to estimate value of $\sigma$
Remarks
1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.
Remarks (cont.)
2. Note that $s$ and $\sigma$ are always greater than or equal to zero.
3. The larger the value of $s$ (or $\sigma$), the greater the spread of the data.
When does $s = 0$? When does $\sigma = 0$? When all data values are the same.
Remarks (cont.)
4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.).
5. Variance
$s^2$: sample variance
$\sigma^2$: population variance
Units are squared units of the original data: square dollars, square gallons ??
Remarks (cont.): Why divide by n-1 instead of n?
degrees of freedom:
each observation has 1 degree of freedom
however, when we estimate an unknown population parameter like $\mu$, we lose 1 degree of freedom
In the formula for $s$, we use $\bar{x}$ to estimate the unknown value of $\mu$:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
Remarks (cont.): Why divide by n-1 instead of n? Example
Suppose we have 3 numbers whose average is 9.
Choose ANY values for x1 and x2: x1 = ___, x2 = ___
Since the average (mean) is 9, x1 + x2 + x3 must equal 9*3 = 27, so x3 = 27 - (x1 + x2); then x3 must be ___.
Once we selected x1 and x2, x3 was determined since the average was 9.
3 numbers but only 2 “degrees of freedom”
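The same idea in code: once the mean is fixed, the last value is forced by the others. In this small Python sketch the function name `last_value` and the choices x1 = 4, x2 = 10 are hypothetical, picked only to fill in the blanks of the example.

```python
def last_value(free_values, mean, n):
    """Value forced on the n-th number once the other n - 1 have been chosen."""
    # The n values must add up to n * mean, so the last one is whatever is left.
    return n * mean - sum(free_values)

x1, x2 = 4, 10                          # hypothetical free choices
x3 = last_value([x1, x2], mean=9, n=3)  # no freedom left for the third number

print(x3, (x1 + x2 + x3) / 3)
```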
Computational Example
observations 1, 3, 5, 9; $\bar{x} = \frac{18}{4} = 4.5$
$s = \sqrt{\frac{(1-4.5)^2 + (3-4.5)^2 + (5-4.5)^2 + (9-4.5)^2}{4-1}}$
$= \sqrt{\frac{(-3.5)^2 + (-1.5)^2 + (.5)^2 + (4.5)^2}{3}}$
$= \sqrt{\frac{12.25 + 2.25 + .25 + 20.25}{3}} = \sqrt{\frac{35}{3}} = \sqrt{11.67} = 3.42$;
$s^2 = 11.67$
class pulse rates
53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140
$n = 23$; $\bar{x} = 84.48$; $m = 85$
$s^2 = 290.26$ (beats per minute)$^2$
$s = 17.037$ beats per minute
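Both worked examples can be checked with software. The sketch below uses Python's standard `statistics` module, whose `variance` and `stdev` functions divide by n - 1, matching the sample formulas in these slides (`pvariance` and `pstdev` divide by N instead).

```python
from statistics import mean, variance, stdev

obs = [1, 3, 5, 9]
pulses = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85,
          89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

# Computational example: observations 1, 3, 5, 9.
print(f"mean = {mean(obs)}, s^2 = {variance(obs):.2f}, s = {stdev(obs):.2f}")

# Class pulse rates, n = 23.
print(f"mean = {mean(pulses):.2f}, s^2 = {variance(pulses):.2f}, s = {stdev(pulses):.3f}")
```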
Review: Properties of $s$ and $\sigma$
$s$ and $\sigma$ are always greater than or equal to 0
when does $s = 0$? $\sigma = 0$?
The larger the value of $s$ (or $\sigma$), the greater the spread of the data
the standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement
Summary of Notation
SAMPLE                           POPULATION
$\bar{y}$  sample mean           $\mu$  population mean
$m$  sample median               $m$  population median
$s^2$  sample variance           $\sigma^2$  population variance
$s$  sample stand. dev.          $\sigma$  population stand. dev.
TA-DAAA! The End