Summary Statistics: Mean, Median, Standard Deviation, and More
“Seek simplicity and then distrust it.”
(Dr. Monticino)
Assignment Sheet

Read Chapter 4
Homework #3: Due Wednesday Feb. 9th
Chapter 4
  exercise set A: 1 - 6, 8, 9
  exercise set C: 1, 2, 3
  exercise set D: 1 - 4, 8,
  exercise set E: 4, 5, 7, 8, 11, 12
Quiz #2 will be over Chapter 2
Quiz #3 on basic summary statistic calculations
  - mean, median, standard deviation, IQR, SD units
If you’d like a copy of the notes, email me
Overview

Measures of central tendency
  - Mean (average)
  - Median
  - Outliers
Measures of dispersion
  - Standard deviation
  - Standard deviation units
  - Range
  - IQR
Review and applications
Central Tendency

Measures of central tendency - mean and median - are useful in obtaining a single-number summary of a data set
  - Mean is the arithmetic average
  - Median is a value such that at least 50% of the data is less than or equal to it and at least 50% is greater than or equal to it

Example

Calculate mean and median for the following data sets
  Data set 1: 37, 44, 55, 78, 100, 111, 125, 151, 161
  Data set 2: 37, 44, 55, 69, 90, 120, 125, 152, 157, 161
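One way to check the example by machine is sketched below. It assumes Python's standard `statistics` module; the variable names `data_1` and `data_2` are labels chosen here, not part of the slides.

```python
from statistics import mean, median

# Data sets copied from the example above
data_1 = [37, 44, 55, 78, 100, 111, 125, 151, 161]
data_2 = [37, 44, 55, 69, 90, 120, 125, 152, 157, 161]

for name, data in (("data set 1", data_1), ("data set 2", data_2)):
    # median() sorts the data and, for an even count, averages the two middle values
    print(f"{name}: mean = {mean(data):.2f}, median = {median(data)}")
```

The first set has an odd number of points, so its median is the middle value; the second has an even number, so its median is the average of the two middle values.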
Outliers and Robustness

Mean can be sensitive to outliers in data set
  - Not robust to data collection errors or a single unusual measurement
  - Blind calculation can give misleading results

Data: 162, 166, 158, 154, 147, 150, 141, 233, 278, 288, 148, 152, 149, 265, 212, 154, 148, 158, 150, 137, 142, 149, 148, 145, 143, 152

mean = 170.35
median = 151
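The sensitivity of the mean can be illustrated with a short sketch. The 26 values are the ones on the slide; the "recording error" (152 typed as 1520) is a made-up illustration, not something reported in the notes.

```python
from statistics import mean, median

# The 26 values shown on the slide
values = [162, 166, 158, 154, 147, 150, 141, 233, 278, 288, 148, 152, 149,
          265, 212, 154, 148, 158, 150, 137, 142, 149, 148, 145, 143, 152]

print(round(mean(values), 2), median(values))

# Hypothetical data collection error: the last value, 152, is recorded as 1520
corrupted = values[:-1] + [1520]
print(round(mean(corrupted), 2), median(corrupted))
```

In this sketch a single bad entry pulls the mean up by more than 50 units while the median does not move.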
Outliers and Robustness

Always a good idea to plot data in the order that it was collected
  - Spot outliers
  - Identify possible data collection errors
[Figure: Value (0 to 350) plotted against Data (0 to 30)]
mean without outliers = 150.14
median without outliers = 149
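The slide does not say which points were treated as outliers. The sketch below assumes a cutoff of 200 (the name `CUTOFF` and its value are assumptions made here), which happens to reproduce the reported figures.

```python
from statistics import mean, median

values = [162, 166, 158, 154, 147, 150, 141, 233, 278, 288, 148, 152, 149,
          265, 212, 154, 148, 158, 150, 137, 142, 149, 148, 145, 143, 152]

CUTOFF = 200  # assumed threshold, chosen by eye; not stated in the notes

kept = [v for v in values if v < CUTOFF]
flagged = [v for v in values if v >= CUTOFF]

print("flagged as outliers:", flagged)
print("mean without outliers:", round(mean(kept), 2))
print("median without outliers:", median(kept))
```

Removing five points moves the mean by about 20 units and the median by only 2, which is the sense in which the median is the more robust summary.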
Outliers and Robustness

Median can be a more robust measure of central tendency than mean
  - Life expectancy
      U.S. males: mean = 80.1, median = 83
      U.S. females: mean = 84.3, median = 87
  - Household income
      Mean = $51,855, median = $38,885
      0.3% account for 12% of income
  - Net worth
      Mean = $282,500, median = $71,600
Which Central Tendency Measure?

Calculate mean, median and mode
Plot data
Create histogram to inspect mode(s)
Do not delete data points
  - If you analyze the data without outliers, report and explain the outliers
Many statistical studies involve studying the difference between population means
  - Reporting the mean may be dictated by the objective of the study
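A minimal sketch of the first steps in this checklist, reusing the 26 values from the outlier slides. The bin width of 25 and the text-based histogram are choices made here for illustration; `multimode` requires Python 3.8 or later.

```python
from collections import Counter
from statistics import mean, median, multimode

values = [162, 166, 158, 154, 147, 150, 141, 233, 278, 288, 148, 152, 149,
          265, 212, 154, 148, 158, 150, 137, 142, 149, 148, 145, 143, 152]

BIN_WIDTH = 25  # assumed bin width, not taken from the notes

modes = multimode(values)  # list of the most frequent value(s)
# Label each bin by its lower edge, e.g. 137 falls in the bin starting at 125
counts = Counter((v // BIN_WIDTH) * BIN_WIDTH for v in values)

print("mean:", round(mean(values), 2), "median:", median(values), "mode(s):", modes)
for start in sorted(counts):
    print(f"{start:3d}-{start + BIN_WIDTH - 1:3d} | {'*' * counts[start]}")
```

Every data point stays in the histogram; the isolated bins at the high end are what would then be reported and explained.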
Which Central Tendency Measure?
If data is
  - Unimodal
  - Fairly symmetric
  - Mean is approximately equal to median
Then mean is a reasonable measure of central tendency
[Figure: histogram (Frequency by Bin) and plot of Value against Data Points]
Which Central Tendency Measure?
If data is
  - Unimodal
  - Asymmetric
Then report both median and mean
  - Difference between mean and median indicates asymmetry
  - Median will usually be the more reasonable summary of central tendency
[Figure: histogram (Frequency by Bin) and plot of Value against Data Points]
Which Central Tendency Measure?
If data is
  - Not unimodal
Then report modes and, cautiously, mean and median
  - Analyze data for differences in groups around the modes
[Figure: histogram (Frequency by Bin) and plot of Value against Data Points]
Limitations of Central Tendency

Any single number summary may not adequately represent data and may hide differences between data sets

Example
  Data set 1: 2, 50, 100, 150, 198
  Data set 2: 98, 99, 100, 101, 102
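A quick check of why this example is chosen, sketched with Python's `statistics` module (the names `data_1` and `data_2` are labels used here):

```python
from statistics import mean, median

data_1 = [2, 50, 100, 150, 198]
data_2 = [98, 99, 100, 101, 102]

for name, data in (("data set 1", data_1), ("data set 2", data_2)):
    print(f"{name}: mean = {mean(data)}, median = {median(data)}")
```

Both sets share the same mean and the same median even though one is spread over almost 200 units and the other over 4.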
Measures of Dispersion

Including an additional statistic - a measure of dispersion - can help distinguish between data sets which have similar central tendencies
  - Range: max - min
  - Standard deviation: root mean square difference from the mean

\[ s = \sqrt{\frac{(x_1 - m)^2 + (x_2 - m)^2 + \dots + (x_{n-1} - m)^2 + (x_n - m)^2}{n}} \]
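The two definitions can be sketched directly in code. The function names `value_range` and `sd` are chosen here. Because the formula divides by n, it corresponds to `statistics.pstdev` (population SD) rather than `statistics.stdev`, which divides by n - 1.

```python
from math import sqrt
from statistics import pstdev

def value_range(data):
    """Range: largest value minus smallest value."""
    return max(data) - min(data)

def sd(data):
    """Root mean square difference from the mean (divides by n)."""
    m = sum(data) / len(data)
    return sqrt(sum((x - m) ** 2 for x in data) / len(data))

data_1 = [2, 50, 100, 150, 198]
data_2 = [98, 99, 100, 101, 102]

for data in (data_1, data_2):
    print(value_range(data), round(sd(data), 2), round(pstdev(data), 2))
```

The hand-written `sd` and the library `pstdev` should agree up to rounding, which is a useful check that the formula was transcribed correctly.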
Measures of Dispersion

Examples
  - Range
      Data set 1 (2, 50, 100, 150, 198): 198 - 2 = 196
      Data set 2 (98, 99, 100, 101, 102): 102 - 98 = 4
Measures of Dispersion

Examples
  - Standard deviation
      Data set 1 (2, 50, 100, 150, 198): m = 100
      Data set 2 (98, 99, 100, 101, 102): m = 100

\[ SD = \sqrt{\frac{(2 - 100)^2 + (50 - 100)^2 + (100 - 100)^2 + (150 - 100)^2 + (198 - 100)^2}{5}} \approx 69.6 \]
Measures of Dispersion
Both range and standard deviation can be sensitive to outliers
However, many data sets can be characterized by mean and SD
  - If the values of the data set are distributed in an approximately bell shape, then ~68% of the data will be within 1 SD unit of the mean, ~95% will be within 2 SD units and nearly all will be within 3 SD units
[Figure: bell-shaped histogram, Count (50 to 250) against SD units (-3.00 to 3.00)]
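The percentages quoted above can be compared with an exact normal curve. This is a sketch assuming Python 3.8+ (`statistics.NormalDist`); the helper name `share_within` is chosen here. Real data that are only approximately bell shaped will match these figures only roughly.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1, so values are already in SD units

def share_within(k):
    """Fraction of a normal distribution lying within k SD units of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```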
Measures of Dispersion

Example
Suppose data set has mean = 35 and SD = 7
  - How many SD units away from the mean is 42?
      (42 - 35) / 7 = 1
  - How many SD units away from the mean is 38?
      (38 - 35) / 7 = 0.43
  - How many SD units away from the mean is 30?
      (30 - 35) / 7 = -0.714
  - Assuming bell shape distribution, ~95% are between what two values?
      between (35 - 2 * 7) = 21 and (35 + 2 * 7) = 49
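The same arithmetic as a small sketch. The function names `sd_units` and `two_sd_interval` are chosen here for illustration.

```python
def sd_units(value, mean, sd):
    """Signed distance of value from the mean, measured in SD units."""
    return (value - mean) / sd

def two_sd_interval(mean, sd):
    """Interval expected to hold about 95% of bell-shaped data."""
    return (mean - 2 * sd, mean + 2 * sd)

for value in (42, 38, 30):
    print(value, round(sd_units(value, mean=35, sd=7), 3))

print(two_sd_interval(mean=35, sd=7))
```

A negative result means the value lies below the mean, as with 30 in this example.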
Measures of Dispersion

A robust measure of dispersion is the interquartile range
  - Q1: value such that 25% of the data is less than it and 75% is greater
  - Q3: value such that 75% of the data is less than it and 25% is greater
  - IQR = Q3 - Q1

Example

Calculate range, standard deviation and interquartile range for the following data sets
  Data set 1: 1, 98, 99, 100, 100, 100, 102, 102, 104, 107
  Data set 2: 95, 98, 99, 100, 100, 100, 102, 102, 104, 107
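A sketch of the comparison. Textbooks differ on how quartiles are computed for small data sets; the code below assumes the interpolating "inclusive" method of `statistics.quantiles` (Python 3.8+), which may not be the convention used in the course, so the IQR value itself should be read with that caveat. The helper name `summarize` is chosen here.

```python
from statistics import pstdev, quantiles

def summarize(data):
    """Range, SD (dividing by n) and IQR for one data set."""
    # Assumed quartile convention: linear interpolation, endpoints included
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return {"range": max(data) - min(data), "sd": pstdev(data), "iqr": q3 - q1}

data_1 = [1, 98, 99, 100, 100, 100, 102, 102, 104, 107]
data_2 = [95, 98, 99, 100, 100, 100, 102, 102, 104, 107]

for name, data in (("data set 1", data_1), ("data set 2", data_2)):
    s = summarize(data)
    print(f"{name}: range = {s['range']}, SD = {s['sd']:.2f}, IQR = {s['iqr']}")
```

The two sets differ only in their smallest value. Under this convention the range and SD change a great deal between them while the IQR is identical, which is what makes the IQR a robust measure.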
Assignment, Discussion, Evaluation

Read Chapter 4
Discussion problems
Chapter 4
  exercise set A: 1 - 6, 8, 9
  exercise set C: 1, 2, 3
  exercise set D: 1 - 4, 8,
  exercise set E: 4, 5, 7, 8, 11, 12
Quiz #3 on basic summary statistic calculations
  - mean, median, standard deviation, IQR, SD units
Review of Definitions

Measures of central tendency
  - Mean (average):
\[ \frac{x_1 + x_2 + \dots + x_n}{n} \]
  - Median (data sorted from smallest to largest)
      If odd number of data points, “middle” value
      If even number of data points, average of two “middle” values
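The two definitions can be written out by hand, as in the sketch below. The function names and the small test lists are made up for illustration; in practice `statistics.mean` and `statistics.median` do the same job.

```python
def mean(data):
    """Arithmetic average: sum of the values divided by how many there are."""
    return sum(data) / len(data)

def median(data):
    """Middle value of the sorted data, or average of the two middle values."""
    ordered = sorted(data)  # the "middle" only makes sense after sorting
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average of two

print(median([5, 1, 3]))     # odd number of points
print(median([4, 1, 3, 2]))  # even number of points
print(mean([1, 2, 3, 4]))
```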
Questions and Examples

Can mean be larger than median? Can median be larger than mean?
  - Give examples
Can mean be a negative number? Can the median?
The average height of three men is 69 inches. Two other men enter the room of heights 73 and 70 inches. What is the average height of all five men?
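One way to work the height question is to recover the total from the average first. The steps below are a worked sketch of that approach, not a solution taken from the notes.

```latex
\[ \text{total height of the first three men} = 3 \times 69 = 207 \text{ inches} \]
\[ \text{average of all five men} = \frac{207 + 73 + 70}{5} = \frac{350}{5} = 70 \text{ inches} \]
```

The individual heights of the first three men are never needed; the average and the count determine the sum.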
Questions and Examples

The average of a data set is 30.
  - A value of 8 is added to each element in the data set. What is the new average?
  - Each element of the data set is increased by 5%. What is the new average?
Suppose that data consists of only 1’s and 0’s
  - What does the average represent?
  - Application: an experiment is performed and only two outcomes can occur
  - Label one type of outcome 1 and the other 0
For the data set 31, 45, 72, 86, 62, 78, 50, find the median, Q1 (25th percentile) and Q3 (75th percentile)
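The questions about shifting, scaling and 0/1 data can be explored numerically. The data below are made up for illustration (chosen so the average is 30); only the pattern in the results matters.

```python
from statistics import mean

data = [10, 20, 30, 40, 50]        # made-up data with average 30
shifted = [x + 8 for x in data]    # add 8 to every element
scaled = [x * 1.05 for x in data]  # increase every element by 5%

print(mean(data), mean(shifted), round(mean(scaled), 2))

# Made-up 0/1 outcomes, e.g. 1 = success and 0 = failure
outcomes = [1, 0, 0, 1, 1, 0, 1, 1]
print(mean(outcomes), outcomes.count(1) / len(outcomes))
```

In this sketch the average moves by the same shift or scale factor as the data, and the average of 0/1 data equals the fraction of 1's.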
Review of Definitions

Measures of dispersion
  - Standard deviation:
\[ SD = \sqrt{\frac{(x_1 - m)^2 + (x_2 - m)^2 + \dots + (x_{n-1} - m)^2 + (x_n - m)^2}{n}} \]
  - Range = max - min
  - IQR = Q3 - Q1
Questions and Examples




Can the SD be negative? Can the range? Can the IQR?
Can the SD equal 0?
For the data set 3, 1, 5, 2, 1, 6 find the SD, range and IQR
The average weight for U.S. men is 175 lbs and the standard deviation is 20 lbs
  - If a man weighs 190 lbs., how many standard deviation units away from the mean weight is he?
  - Assuming a normal (bell-shaped) distribution for weight, ninety-five percent of U.S. men weigh between what two values?
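A worked sketch of the arithmetic for the data set and the weight questions, using the SD formula that divides by n and the ~95% within 2 SD units rule. The IQR is left out because its value depends on the quartile convention used.

```latex
\[ m = \frac{3 + 1 + 5 + 2 + 1 + 6}{6} = 3, \qquad
   SD = \sqrt{\frac{0^2 + (-2)^2 + 2^2 + (-1)^2 + (-2)^2 + 3^2}{6}} = \sqrt{\frac{22}{6}} \approx 1.91 \]
\[ \text{range} = 6 - 1 = 5 \]
\[ \frac{190 - 175}{20} = 0.75 \text{ SD units above the mean} \]
\[ 175 - 2 \times 20 = 135 \text{ lbs}, \qquad 175 + 2 \times 20 = 215 \text{ lbs} \]
```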

Questions and Examples

The average of a data set is 23 and the standard deviation is 5
  - A value of 8 is added to each element in the data set. What is the new standard deviation?
  - Each element of the data set is increased by 5%. What is the new standard deviation?
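These two questions can be explored with a small sketch. The data set is made up, chosen so that its mean is 23 and its SD (dividing by n) is 5; `pstdev` is the standard library's population SD.

```python
from statistics import mean, pstdev

data = [18, 28, 18, 28]            # made-up data: mean 23, SD 5
shifted = [x + 8 for x in data]    # add 8 to every element
scaled = [x * 1.05 for x in data]  # increase every element by 5%

print(mean(data), pstdev(data))
print(round(pstdev(shifted), 2))   # spread after shifting
print(round(pstdev(scaled), 2))    # spread after scaling
```

Shifting moves every point and the mean together, so the differences from the mean are unchanged; scaling stretches every difference by the same factor.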