Chapter 3
Displaying and Summarizing
Quantitative Data
Display: Histograms, Stem and Leaf Plots
Numerical Summaries: Median, Mean,
Quartiles, Standard Deviation
[Figure: Relative Frequency Histogram of Exam Grades. x-axis: Grade, 40 to 100; y-axis: relative frequency, 0 to .30]
Frequency Histograms
[Figure: BAKER CITY HOSPITAL - LENGTH OF STAY DISTRIBUTION. x-axis classes: 0<2, 2<4, 4<6, 6<8, 8<10, 10<12, 12<14, 14<16, 16<18; y-axis: frequency, 0 to 70]
Frequency Histograms
A histogram shows three general types of
information:
 It provides visual indication of where
the approximate center of the data is.
 We can gain an understanding of the
degree of spread, or variation, in the
data.
 We can observe the shape of the
distribution.
[Figure: histogram of all 200 m races of 20.2 secs or less (approx. 700 races). x-axis: TIMES, 19.2 to 20.19 secs; y-axis: Frequency, 0 to 60. Marked times: Usain Bolt, 2008, 19.30; Michael Johnson, 1996, 19.32]
Histograms Showing Different
Centers
[Figure: two frequency histograms with different centers. x-axis classes: 0<2 through 16<18; y-axis: frequency, 0 to 70]
Histograms - Same Center,
Different Spread
[Figure: two frequency histograms with the same center and different spread. x-axis classes: 0<2 through 16<18; y-axis: frequency, 0 to 70]
Frequency and Relative
Frequency Histograms
identify smallest and largest values in
data set
 divide interval between largest and
smallest values into between 5 and 20
subintervals called classes
* each data value in one and only one
class
* no data value is on a boundary

How Many Classes?
Can choose from two formulas
First formula: $(2n)^{1/3}$
Sturges' Rule: $1 + \frac{\log(n)}{\log(2)}$
n is the sample size
Histogram Construction (cont.)
* compute frequency or relative
frequency of observations in each class
* x-axis: class boundaries;
y-axis: frequency or relative frequency
scale
* over each class draw a rectangle with
height corresponding to the frequency
or relative frequency in that class
Example. Number of daily
employee absences from work
106 obs; approx. no. of classes =
$(2 \cdot 106)^{1/3} = (212)^{1/3} = 5.96$
$1 + \log(106)/\log(2) = 1 + 6.73 = 7.73$
 There is no single “correct” answer for
the number of classes
 For example, you can choose 6, 7, 8, or
9 classes; don’t choose 15 classes
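
The two class-count formulas can be checked with a short Python sketch. The function names `classes_cube_root` and `classes_sturges` are illustrative labels chosen here, not names from the slides; the results are only guidelines for picking a whole number of classes.

```python
import math

def classes_cube_root(n: int) -> float:
    """First formula: (2n)^(1/3)."""
    return (2 * n) ** (1 / 3)

def classes_sturges(n: int) -> float:
    """Sturges' rule: 1 + log(n)/log(2)."""
    return 1 + math.log(n) / math.log(2)

n = 106  # number of observations in the employee-absences example
print(round(classes_cube_root(n), 2))  # about 5.96
print(round(classes_sturges(n), 2))    # about 7.73
```

Both values point to roughly 6 to 8 classes, which agrees with the advice that several choices are acceptable.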

EXCEL Histogram
[Figure: Histogram of Employee Absences. x-axis: Absences from Work; y-axis: Frequency, 0 to 45]
Absences from Work (cont.)
6 classes
 class width: (158-121)/6 = 37/6 = 6.17, rounded up to 7
 6 classes, each of width 7; classes span 6(7)=42 units
 data spans 158-121=37 units
 classes extend beyond the span of the actual data values by 42-37=5 units
 lower boundary of 1st class: (1/2)(5) units below 121 = 121-2.5 = 118.5
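
The boundary arithmetic for the absences data can be sketched in Python as follows. The helper name `class_boundaries` is hypothetical, and the sketch assumes the raw width (range divided by number of classes) is not already a whole number, so rounding up leaves some slack to split between the two ends.

```python
import math

def class_boundaries(smallest, largest, num_classes):
    """Return the class boundaries, centering the classes on the data."""
    # Round the raw class width up to the next whole number (6.17 -> 7).
    width = math.ceil((largest - smallest) / num_classes)
    # Amount by which the classes extend beyond the data span (42 - 37 = 5).
    overlap = num_classes * width - (largest - smallest)
    # Put half of that slack below the smallest value.
    lower = smallest - overlap / 2
    return [lower + k * width for k in range(num_classes + 1)]

boundaries = class_boundaries(121, 158, 6)
print(boundaries)  # [118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5]
```

Because the data are whole numbers and every boundary ends in .5, no data value can fall on a boundary.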

EXCEL histogram
[Figure: Histogram of Employee Absences. x-axis: Absences from Work, class boundaries 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5; y-axis: Frequency, 0 to 70]
Grades on a statistics exam
Data:
75 66 77 66 64 73 91 65 59 86 61 86 61
58 70 77 80 58 94 78 62 79 83 54 52 45
82 48 67 55
Frequency Distribution of Grades
Class Limits | Frequency
40 up to 50 | 2
50 up to 60 | 6
60 up to 70 | 8
70 up to 80 | 7
80 up to 90 | 5
90 up to 100 | 2
Total | 30
Relative Frequency Distribution of Grades
Class Limits | Relative Frequency
40 up to 50 | 2/30 = .067
50 up to 60 | 6/30 = .200
60 up to 70 | 8/30 = .267
70 up to 80 | 7/30 = .233
80 up to 90 | 5/30 = .167
90 up to 100 | 2/30 = .067
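
The frequency and relative frequency distributions of the 30 exam grades can be reproduced with a small Python sketch. The function name `frequency_table` and its parameters are assumptions made for illustration; each class includes its lower limit and excludes its upper limit ("40 up to 50").

```python
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

def frequency_table(data, start, width, num_classes):
    """Count values per class [start + k*width, start + (k+1)*width)."""
    counts = [0] * num_classes
    for value in data:
        index = (value - start) // width
        if 0 <= index < num_classes:
            counts[index] += 1
    n = len(data)
    # Each row: (lower limit, upper limit, frequency, relative frequency)
    return [(start + k * width, start + (k + 1) * width,
             counts[k], counts[k] / n) for k in range(num_classes)]

table = frequency_table(grades, start=40, width=10, num_classes=6)
for low, high, freq, rel in table:
    print(f"{low} up to {high}: {freq:2d}  {rel:.3f}")
```

The frequencies sum to 30 and the relative frequencies sum to 1, which is a quick check that every grade landed in exactly one class.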
[Figure: Relative Frequency Histogram of Grades. x-axis: Grade, 40 to 100; y-axis: relative frequency, 0 to .30]
Based on the histogram, about what percent of the values are between 47.5 and 52.5?
1. 50%
2. 5%
3. 17%
4. 30%
Stem and leaf displays

Have the following general appearance
stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
Stem and Leaf Displays
Partition each number in the data into a “stem” and a “leaf”
 Constructing a stem and leaf display
1) determine the stem and leaf partition (5-20 stems)
2) write stems in a column with smallest stem at top; include all stems in range of data
3) only 1 digit in leaves; drop digits or round off
4) record the leaf for each number in the corresponding stem row; ordering the leaves in each row helps

Example: employee ages at a small company
18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39
stem: 10’s digit; leaf: 1’s digit
 18: stem=1; leaf=8; 18 = 1 | 8
stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
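
The construction steps can be sketched in Python for two-digit data such as the employee ages. The function name `stem_and_leaf` is illustrative; the sketch assumes whole-number data with the 10's digit as stem and the 1's digit as leaf, and it prints every stem in the range even when a row has no leaves.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display rows 'stem | leaves' with ordered leaves."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)   # stem = 10's digit, leaf = 1's digit
    lines = []
    # Include all stems between the smallest and largest, even empty ones.
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
display = stem_and_leaf(ages)
print("\n".join(display))
```

Adding one unusually large value (for example an age of 95) would create empty rows for stems 7 and 8, which makes the gap in the data visible.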
Suppose a 95 yr. old is hired
stem | leaf
1 | 8 9
2 | 1 2 8 9 9
3 | 2 3 8 9
4 | 0 1
5 | 6 7
6 | 4
7 |
8 |
9 | 5
Number of TD passes by NFL teams: 2010 season
(stems are 10’s digit)
stem | leaf
3 | 011337
2 | 5566667889
2 | 0123444
1 | 03447889
0 | 9
Pulse Rates n = 138
  #  Stem | Leaves
     4*   |
  3  4.   | 588
  9  5*   | 001233444
 10  5.   | 5556788899
 23  6*   | 00011111122233333344444
 23  6.   | 55556666667777788888888
 16  7*   | 00000112222334444
 23  7.   | 55555666666777888888999
 10  8*   | 0000112224
 10  8.   | 5555667789
  4  9*   | 0012
  2  9.   | 58
  4  10*  | 0223
     10.  |
  1  11*  | 1
Advantages/Disadvantages of
Stem-and-Leaf Displays
Advantages
1) each measurement displayed
2) ascending order in each stem row
3) relatively simple (data set not too large)
 Disadvantages
display becomes unwieldy for large data
sets

Population of 185 US cities with
between 100,000 and 500,000

Multiply stems by 100,000
Back-to-back stem-and-leaf displays.
TD passes by NFL teams: 1999, 2009
multiply stems by 10
       1999 | stem | 2009
          2 | 4 |
          6 | 3 |
          2 | 3 | 0444
       6655 | 2 | 6677788899
43322221100 | 2 | 011113
 9998887666 | 1 | 55666788
        421 | 1 | 0122
Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77?
[Figure: stem-and-leaf display of the 24 pulse rates; stems are 10’s digits]
1. 4
2. 6
3. 8
4. 10
5. 12
Interpreting Graphical Displays: Shape
A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.
[Figure: symmetric distribution]

A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.
[Figure: skewed distribution]

Not all distributions have a simple overall shape, especially when there are few observations.
[Figure: complex, multimodal distribution]
Shape (cont.): Female heart attack patients in New York state
Age: left-skewed
Cost: right-skewed
Shape (cont.): Outliers
An important kind of deviation is an outlier. Outliers are observations
that lie outside the overall pattern of a distribution. Always look for
outliers and try to explain them.
The overall pattern is fairly
symmetrical except for 2
states clearly not belonging
to the main trend. Alaska
and Florida have unusual
representation of the
elderly in their population.
A large gap in the
distribution is typically a
sign of an outlier.
Alaska
Florida
Center: typical value of frozen
personal pizza? ~$2.65
Spread: fuel efficiency 4, 8
cylinders
4 cylinders: more spread
8 cylinders: less spread
Other Graphical Methods for
Economic Data

Time plots
plot observations in time order, with
time on the horizontal axis and the variable on the vertical axis
** Time series
measurements are taken at regular
intervals (monthly unemployment,
quarterly GDP, weather records,
electricity demand, etc.)
Unemployment Rate, by Educational
Attainment
Water Use During Super Bowl
Winning Times 100 M Dash
Annual Mean Temperature
End of Histograms, Stem and
Leaf plots
Describing Distributions
Numerically:
Medians and Quartiles
2 characteristics of a data set
to measure
center
measures where the “middle” of the
data is located
 variability
measures how “spread out” the data is

The median: a measure of
center
Given a set of n measurements arranged in order of magnitude,
Median = middle value, if n is odd
Median = mean of the 2 middle values, if n is even
 Ex. 2, 4, 6, 8, 10; n=5; median=6
 Ex. 2, 4, 6, 8; n=4; median=(4+6)/2=5
Student Pulse Rates (n=62)
38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70, 70, 70, 70,
70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76, 77, 77, 77, 77, 78, 78,
79, 79, 80, 80, 80, 84, 84, 85, 85, 87, 90, 90, 91, 92, 93, 94, 94, 95, 96,
96, 96, 98, 98, 103
Median = (75+76)/2 = 75.5
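
The median rule (middle value for odd n, mean of the two middle values for even n) can be written as a short Python sketch. The function name `median` is chosen here for illustration; Python's standard library also provides `statistics.median`, which follows the same rule.

```python
def median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    ordered = sorted(values)          # arrange in order of magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 4, 6, 8, 10]))  # 6
print(median([2, 4, 6, 8]))      # 5.0
```

For the 62 student pulse rates, n is even, so the function would average the values in ordered positions 31 and 32, which are 75 and 76.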
Medians are used often
Year 2011 baseball salaries
Median $1,450,000 (max=$32,000,000
Alex Rodriguez; min=$414,000)
 Median fan age: MLB 45; NFL 43; NBA
41; NHL 39
 Median existing home sales price: May
2011 $166,500; May 2010 $174,600
 Median household income (2008
dollars) 2009 $50,221; 2008 $52,029

The median splits the histogram
into 2 halves of equal area
Examples
Example: n = 7
17.5 2.8 3.2 13.9 14.1 25.3 45.8
 Example n = 7 (ordered): m = 14.1
 2.8 3.2 13.9 14.1 17.5 25.3 45.8
 Example: n = 8
17.5 2.8 3.2 13.9 14.1 25.3 35.7 45.8
 Example n =8 (ordered) m = (14.1+17.5)/2 = 15.8
2.8 3.2 13.9 14.1 17.5 25.3 35.7 45.8

Below are the annual tuition charges at 7 public universities. What is the median tuition?
4429 4960 4960 4971 5245 5546 7586
1. 5245
2. 4965.5
3. 4960
4. 4971
Below are the annual tuition charges at 7 public universities. What is the median tuition?
4429 4960 5245 5546 4971 5587 7586
1. 5245
2. 4965.5
3. 5546
4. 4971
Measures of Spread

The range and interquartile
range
Ways to measure variability
range=largest-smallest
 OK sometimes; in general, too crude;
sensitive to one large or small data
value
 The range measures spread by
examining the ends of the data
 A better way to measure spread is to
examine the middle portion of the data
Quartiles: Measuring spread by
examining the middle
The first quartile, Q1, is the value in the
sample that has 25% of the data at or
below it (Q1 is the median of the lower
half of the sorted data).
The third quartile, Q3, is the value in the
sample that has 75% of the data at or
below it (Q3 is the median of the upper
half of the sorted data).
[Figure: 25 ordered observations with the positions of the quartiles and median marked]
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
Q1 = first quartile = 2.3
m = median = 3.4
Q3 = third quartile = 4.2
Quartiles and median divide data into 4 pieces
1/4 | Q1 | 1/4 | M | 1/4 | Q3 | 1/4
Quartiles are common
measures of spread

http://www2.acs.ncsu.edu/UPA/admissi
ons/fresprof.htm

http://www2.acs.ncsu.edu/UPA/peers/cu
rrent/ncsu_peers/sat.htm

University of Southern California

UNC-CH
Rules for Calculating Quartiles
Step 1: find the median of all the data (the median divides the data in
half)
Step 2a: find the median of the lower half; this median is Q1;
Step 2b: find the median of the upper half; this median is Q3.
Important:
when n is odd include the overall median in both halves;
when n is even do not include the overall median in either half.

Example
2 4 6 8 10 12 14 16 18 20; n = 10
Median m = (10+12)/2 = 22/2 = 11
Q1: median of lower half 2 4 6 8 10; Q1 = 6
Q3: median of upper half 12 14 16 18 20; Q3 = 16
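
The quartile rules above can be sketched in Python. The function names are illustrative. Note that this follows the rule stated on these slides (include the overall median in both halves when n is odd); calculators and software packages often use slightly different quartile conventions, so their answers may differ a little.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the rule from the slides."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the overall median belongs to both halves
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # n even: the data split cleanly into two halves
        lower, upper = ordered[:half], ordered[half:]
    return median(lower), median(ordered), median(upper)

print(quartiles([2, 4, 6, 8, 10, 12, 14, 16, 18, 20]))  # (6, 11.0, 16)
```

The interquartile range is then simply the third returned value minus the first.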
Pulse Rates n = 138
[Figure: the same stem-and-leaf display of the 138 pulse rates shown earlier, with the median and quartiles marked]
Median: mean of pulses in locations 69 & 70: median = (70+70)/2 = 70
Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63
Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78
Below are the weights of 31 linemen on the NCSU football team. What is the value of the first quartile Q1?
1. 287
2. 257.5
3. 263.5
4. 262.5
  #   stem | leaf
  2   22 | 55
  4   23 | 57
  6   24 | 26
  7   25 | 7
 10   26 | 257
 12   27 | 59
 (4)  28 | 1567
 15   29 | 35599
 10   30 | 333
  7   31 | 45
  5   32 | 155
  2   33 | 6
  1   34 | 0
Interquartile range
lower quartile Q1
 middle quartile: median
 upper quartile Q3
 interquartile range (IQR)
IQR = Q3 – Q1
measures spread of middle 50% of the
data

Example: beginning pulse
rates

Q3 = 78; Q1 = 63

IQR = 78 – 63 = 15
Below are the weights of 31 linemen on the NCSU football team. The first quartile Q1 is 263.5. What is the value of the IQR?
1. 23.5
2. 39.5
3. 46
4. 69.5
[Figure: the same stem-and-leaf display of the 31 weights as in the previous question]
5-number summary of data

Minimum  Q1  median  Q3  maximum

Pulse data: 45  63  70  78  111
End of Medians and Quartiles
Numerical Summaries of
Symmetric Data.
Measure of Center: Mean
Measure of Variability: Standard
Deviation
Symmetric Data
Body temp. of 93 adults
Recall: 2 characteristics of a
data set to measure
center
measures where the “middle” of the
data is located
 variability
measures how “spread out” the data is

Measure of Center When Data
Approx. Symmetric
mean (arithmetic mean)
 notation
$x_i$: ith measurement in a set of observations $x_1, x_2, x_3, \ldots, x_n$
n: number of measurements in data set; sample size

$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n$
Sample mean $\bar{x}$:
$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
Population mean $\mu$ (value typically not known)
N = population size
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Connection Between Mean
and Histogram
A histogram balances when supported at the mean. Mean $\bar{x}$ = 140.6
[Figure: Histogram of Absences from Work. x-axis: class boundaries 118.5 to 160.5; y-axis: Frequency, 0 to 70]

Mean: balance point
Median: 50% area each half
right histo: mean 55.26 yrs, median 57.7yrs
Properties of Mean, Median
1. The mean and median are unique; that is, a
data set has only 1 mean and 1 median (the
mean and median are not necessarily equal).
2. The mean uses the value of every number in
the data set; the median does not.
Ex. 2, 4, 6, 8. $\bar{x} = \frac{20}{4} = 5$; $m = \frac{4+6}{2} = 5$
Ex. 2, 4, 6, 9. $\bar{x} = \frac{21}{4} = 5\tfrac{1}{4}$; $m = \frac{4+6}{2} = 5$
Example: class pulse rates

53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140
$n = 23$
$\bar{x} = \frac{\sum_{i=1}^{23} x_i}{23} = 84.48$;
m: location: 12th obs.; $m = 85$
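
The class pulse-rate example can be checked with Python's standard `statistics` module. As an added illustration that is not part of the slides, the sketch also recomputes both summaries after dropping the largest value (140) to show how differently the mean and median react to one unusual observation.

```python
import statistics

pulse = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85,
         89, 90, 90, 90, 90, 91, 96, 98, 103, 140]

mean_all = statistics.mean(pulse)
median_all = statistics.median(pulse)
print(round(mean_all, 2), median_all)        # 84.48 85

# Illustration only: remove the largest pulse rate (140) and recompute.
without_max = pulse[:-1]
mean_trim = statistics.mean(without_max)
median_trim = statistics.median(without_max)
print(round(mean_trim, 2), median_trim)      # 81.95 84.5
```

Removing one value moves the mean by about 2.5 beats per minute but moves the median by only 0.5.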
2010, 2011 baseball salaries
2010
n = 845
mean = $3,297,828
median = $1,330,000
max = $33,000,000

2011
n = 848
mean = $3,305,393
median = $1,450,000
max = $32,000,000

Disadvantage of the mean

Can be greatly influenced by just a few
observations that are much greater or
much smaller than the rest of the data
Mean, Median, Maximum BB
Salaries
[Figure: Baseball Salaries: Mean, Median and Maximum 1985-2006. x-axis: Year; vertical axes: Mean, Median Salary and Maximum Salary; series: Mean, Median, Maximum]
Skewness: comparing the
mean, and median
Skewed to the right (positively skewed)
 mean>median

[Figure: histogram of 2011 Baseball Salaries. x-axis: Salary ($1,000's); y-axis: Frequency, 0 to 600; the tallest bar (490) is at the low end, with a long tail to the right]
Skewed to the left; negatively
skewed
Mean < median
 mean=78; median=87;

[Figure: Histogram of Exam Scores. x-axis: Exam Scores, 20 to 100; y-axis: Frequency, 0 to 30]
Symmetric data
mean, median approx. equal
[Figure: histogram, Bank Customers: 10:00-11:00 am. x-axis: Number of Customers; y-axis: Frequency, 0 to 20]
DESCRIBING VARIABILITY OF
SYMMETRIC DATA
Describing Symmetric Data
(cont.)

Measure of center for symmetric data:
Sample mean $\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
Measure of variability for symmetric data?
Example

2 data sets:
$x_1 = 49$, $x_2 = 51$; $\bar{x} = 50$
$y_1 = 0$, $y_2 = 100$; $\bar{y} = 50$
On average, they’re both comfortable
[Figure labels: 49, 51; 0, 100]
Ways to measure variability
1. range = largest - smallest
ok sometimes; in general, too crude; sensitive to one large or small obs.
2. measure spread from the middle, where the middle is the mean $\bar{x}$;
 deviation of $x_i$ from the mean: $x_i - \bar{x}$
 $\sum_{i=1}^{n} (x_i - \bar{x})$: sum the deviations of all the $x_i$'s from $\bar{x}$;
 $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$ always; tells us nothing
Previous Example
sum of deviations from mean:
$x_1 = 49,\ x_2 = 51;\ \bar{x} = 50$
$(x_1 - \bar{x}) + (x_2 - \bar{x}) = (49 - 50) + (51 - 50) = -1 + 1 = 0$;
$y_1 = 0,\ y_2 = 100;\ \bar{y} = 50$
$(y_1 - \bar{y}) + (y_2 - \bar{y}) = (0 - 50) + (100 - 50) = -50 + 50 = 0$
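
The reason the deviations from the mean always sum to zero follows directly from the definition of the sample mean. A short derivation, using only the notation already introduced for $x_i$, $n$ and $\bar{x}$:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\ \Rightarrow\ \sum_{i=1}^{n} x_i = n\bar{x}.
```

This is why the deviations are squared before they are combined into a measure of spread.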
The Sample Standard Deviation, a
measure of spread around the mean

Square the deviation of each observation from the mean; find the square root of the “average” of these squared deviations
$(x_i - \bar{x})^2$; $\sum_{i=1}^{n} (x_i - \bar{x})^2$ and find the "average", then take the square root of the average
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$, called the sample standard deviation
Calculations …
Women height (inches)
 i | $x_i$ | $\bar{x}$ | $(x_i - \bar{x})$ | $(x_i - \bar{x})^2$
 1 | 59 | 63.4 | -4.4 | 19.0
 2 | 60 | 63.4 | -3.4 | 11.3
 3 | 61 | 63.4 | -2.4 | 5.6
 4 | 62 | 63.4 | -1.4 | 1.8
 5 | 62 | 63.4 | -1.4 | 1.8
 6 | 63 | 63.4 | -0.4 | 0.1
 7 | 63 | 63.4 | -0.4 | 0.1
 8 | 63 | 63.4 | -0.4 | 0.1
 9 | 64 | 63.4 | 0.6 | 0.4
10 | 64 | 63.4 | 0.6 | 0.4
11 | 65 | 63.4 | 1.6 | 2.7
12 | 66 | 63.4 | 2.6 | 7.0
13 | 67 | 63.4 | 3.6 | 13.3
14 | 68 | 63.4 | 4.6 | 21.6
Mean = 63.4 | Sum of deviations = 0.0 | Sum of squared deviations = 85.2
Sum of squared deviations from mean = 85.2
(n − 1) = 13; (n − 1) is called the degrees of freedom (df)
$s^2$ = variance = 85.2/13 = 6.55 inches squared
s = standard deviation = √6.55 = 2.56 inches
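
The women's height calculation can be reproduced step by step in Python. The variable names are chosen here for illustration; the last line cross-checks the hand-style computation against `statistics.stdev` from the standard library, which also divides by n − 1.

```python
import statistics

heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean = sum(heights) / n
squared_deviations = [(x - mean) ** 2 for x in heights]
sum_sq = sum(squared_deviations)
variance = sum_sq / (n - 1)        # divide by degrees of freedom, n - 1 = 13
std_dev = variance ** 0.5

print(round(mean, 1), round(sum_sq, 1), round(variance, 2), round(std_dev, 2))
# 63.4 85.2 6.55 2.56
print(abs(std_dev - statistics.stdev(heights)) < 1e-9)  # True
```

Small differences from the table entries can appear because the table rounds each squared deviation to one decimal place.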
We’ll never calculate these by hand, so make sure to know how to get the standard deviation using your calculator, Excel, or other software.
1. First calculate the variance $s^2$:
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
2. Then take the square root to get the standard deviation s:
$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
[Figure labels: mean $\bar{x}$; mean ± 1 s.d.]
Population Standard Deviation
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
value of the population standard deviation $\sigma$ typically not known;
use s to estimate value of $\sigma$
Remarks
1. The standard deviation of a set of
measurements is an estimate of the
likely size of the chance error in a
single measurement
Remarks (cont.)
2. Note that s and $\sigma$ are always greater than or equal to zero.
3. The larger the value of s (or $\sigma$), the greater the spread of the data.
When does s=0? When does $\sigma$=0?
When all data values are the same.
Remarks (cont.)
4. The standard deviation is the most
commonly used measure of risk in
finance and business
– Stocks, Mutual Funds, etc.
5. Variance
$s^2$ sample variance
$\sigma^2$ population variance
Units are squared units of the original data
square dollars, square gallons ??
Remarks 6): Why divide by n-1 instead of n?
degrees of freedom
 each observation has 1 degree of freedom
 however, when you estimate an unknown population parameter like $\mu$, you lose 1 degree of freedom

In the formula for s, we use $\bar{x}$ to estimate the unknown value of $\mu$:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
Remarks 6) (cont.): Why divide by n-1 instead of n? Example
Suppose we have 3 numbers whose average is 9
 Choose ANY values for $x_1$ and $x_2$: $x_1$ = ___, $x_2$ = ___
 Since the average (mean) is 9, $x_1 + x_2 + x_3$ must equal 9*3 = 27, so $x_3 = 27 - (x_1 + x_2)$
 once we selected $x_1$ and $x_2$, $x_3$ was determined since the average was 9
 3 numbers but only 2 “degrees of freedom”
Computational Example
observations 1, 3, 5, 9; $\bar{x} = \frac{18}{4} = 4.5$
$s = \sqrt{\frac{(1 - 4.5)^2 + (3 - 4.5)^2 + (5 - 4.5)^2 + (9 - 4.5)^2}{4 - 1}}$
$= \sqrt{\frac{(-3.5)^2 + (-1.5)^2 + (.5)^2 + (4.5)^2}{3}}$
$= \sqrt{\frac{12.25 + 2.25 + .25 + 20.25}{3}} = \sqrt{\frac{35}{3}} = \sqrt{11.67} = 3.42$;
$s^2 = 11.67$
class pulse rates
53 64 67 67 70 76 77 77 78 83 84 85 85 89 90 90 90 90 91 96 98 103 140
$n = 23$; $\bar{x} = 84.48$; $m = 85$
$s^2 = 290.26$ (beats per minute)$^2$
$s = 17.037$ beats per minute
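
The sample variance and standard deviation formulas can also be written from scratch in Python and applied to both the small computational example and the class pulse rates. The function names `sample_variance` and `sample_std` are illustrative, not from the slides.

```python
import math

def sample_variance(values):
    """s^2: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """s: square root of the sample variance."""
    return math.sqrt(sample_variance(values))

print(round(sample_variance([1, 3, 5, 9]), 2), round(sample_std([1, 3, 5, 9]), 2))
# 11.67 3.42

pulse = [53, 64, 67, 67, 70, 76, 77, 77, 78, 83, 84, 85, 85,
         89, 90, 90, 90, 90, 91, 96, 98, 103, 140]
print(round(sample_variance(pulse), 2), round(sample_std(pulse), 3))
# 290.26 17.037
```

The variance is in squared units (beats per minute squared), while the standard deviation is back in the original units.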
Review: Properties of s and $\sigma$
 s and $\sigma$ are always greater than or equal to 0
 when does s = 0? $\sigma$ = 0?
 The larger the value of s (or $\sigma$), the greater the spread of the data
 the standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement

Summary of Notation
SAMPLE | POPULATION
$\bar{y}$ sample mean | $\mu$ population mean
m sample median | m population median
$s^2$ sample variance | $\sigma^2$ population variance
s sample stand. dev. | $\sigma$ population stand. dev.
End of Chapter 3