Chapter 2
Describing Distributions
with Numbers
5/24/2017
Chapter 2
1
Numerical Summaries of:
• Central location
– mean
– median
• Spread
– Range
– Quartiles
– Standard Deviation / variance
• Shape measures not covered
Arithmetic Mean
• Most common measure of central location
• Notation (“xbar”): $\bar{x}$
$$\bar{x} = \frac{1}{n}\left(x_1 + x_2 + \cdots + x_n\right) = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Where
n is the sample size
∑ is the summation symbol
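A minimal Python sketch of the mean formula, applied to the seven metabolic rates used in this chapter's sample mean example. The function name `sample_mean` is an illustrative choice, not something from the slides.

```python
# Metabolic rates (calories/day) from the chapter's sample mean example
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]


def sample_mean(values):
    """Arithmetic mean: the sum of the values divided by the sample size n."""
    n = len(values)           # n is the sample size
    return sum(values) / n    # (1/n) * sum of x_i


xbar = sample_mean(rates)
print(sum(rates))  # 11200
print(xbar)        # 1600.0
```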
Example: Sample Mean
Data: Metabolic rates, calories / day:
1792 1666 1362 1614 1460 1867 1439
$$\bar{x} = \frac{1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439}{7} = \frac{11{,}200}{7} = 1600$$
Median (M)
• Half the values are less than the median,
half are greater
• If n is odd, the median is the middle
ordered value
• If n is even, the median is the average of
the two middle ordered values
Examples: Median
• Example 1: 2 4 6
Median = 4
• Example 2: 2 4 6 8
Median = 5 (average of 4 and 6)
• Example 3: 6 2 4
Median ≠ 2
(Values must be ordered first: 2 4 6, so Median = 4)
Example: Median
The location of the median in ordered array:
L(M) = (n + 1) / 2
Data = metabolic rates in slide 4 (n = 7)
Ordered array:
1362 1439 1460 1614 1666 1792 1867
(the median is the 4th ordered value, since L(M) = (7 + 1) / 2 = 4)
Value of median = 1614
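The location rule can be sketched in Python as follows. This is one possible way to code it; the function name `median` and the variable names are illustrative. The checks reuse the small examples and the metabolic data from this chapter.

```python
def median(values):
    """Median via the location rule L(M) = (n + 1) / 2 on the ordered values."""
    ordered = sorted(values)      # values must be ordered first
    n = len(ordered)
    location = (n + 1) / 2        # 1-based position of the median
    if n % 2 == 1:
        # n odd: L(M) is a whole number, so take that ordered value
        return ordered[int(location) - 1]
    # n even: L(M) ends in .5, so average the two neighbouring ordered values
    below = ordered[int(location) - 1]
    above = ordered[int(location)]
    return (below + above) / 2


rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]
print(median([2, 4, 6]))     # 4
print(median([2, 4, 6, 8]))  # 5.0
print(median([6, 2, 4]))     # 4
print(median(rates))         # 1614
```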
The Median is robust to outliers
This data set:
1362 1439 1460 1614 1666 1792 1867
has median 1614 and mean 1600
This similar data with high outlier:
1362 1439 1460 1614 1666 1792 9867
still has median 1614 but now has mean 2742.9
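The comparison can be checked with Python's standard `statistics` module. This is only a quick sketch; the list names are illustrative.

```python
from statistics import mean, median

original = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
with_outlier = [1362, 1439, 1460, 1614, 1666, 1792, 9867]  # 1867 replaced by 9867

# The median depends only on the middle ordered value, so it does not move.
print(median(original), median(with_outlier))  # 1614 1614

# The mean uses every value, so one extreme value pulls it upward.
print(mean(original))                # 1600
print(round(mean(with_outlier), 1))  # 2742.9
```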
The skew pulls the mean
• The average salary at a high tech firm is
$250K / year
• The median salary is $60K
• What does this tell you?
• Answer: There are some very highly paid
executives, but most of the workers make
modest salaries, i.e., there is a positive
skew to the distribution
Spread = Variability
• Amount of spread around the center!
• Statistical measures of spread
–Range
–Inter-Quartile Range
–Standard deviation
Range and IQR
• Range = maximum – minimum
• Easy, but NOT as good as the…
• Quartiles & Inter-Quartile Range (IQR)
– Quartile 1 (Q1) cuts off bottom 25% of data
(“25th percentile”)
– Quartile 2 (Q2) cuts off two-quarters of data
– same as the Median!
– Quartile 3 (Q3) cuts off three-quarters of the
data (“75th percentile”)
Obtaining Quartiles
• Order data
• Find the median
• Look at the lower half of data set
– Find “median” of this lower half
– This is Q1
• Look at the upper half of the data set.
– Find “median” of this upper half
– This is Q3
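The steps above can be sketched in Python. One assumption is made explicit here: when n is odd, the middle value is left out of both halves, which matches the n = 53 example in this chapter (each half has 26 values). Other textbooks and software use different quartile rules and may give slightly different answers. The function names are illustrative, and the data are the ten ages from the quartiles example.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)             # order the data
    n = len(ordered)
    lower = ordered[: n // 2]            # lower half of the data set
    upper = ordered[n // 2 + n % 2 :]    # upper half (skips the middle value if n is odd)
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)


ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
q1, m, q3 = quartiles(ages)
print(q1, m, q3)  # 21 27.5 42
```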
Example: Quartiles
Consider these 10 ages:
05 11 21 24 27 | 28 30 42 50 52
(the median falls between 27 and 28)
The median of the bottom half (05 11 21 24 27) is Q1 = 21
The median of the top half (28 30 42 50 52) is Q3 = 42
Example 2: Quartiles, n = 53
100 101 106 106 110 110 119 120 120 123
124 125 127 128 130 130 133 135 139 140
148 150 150 152 155 157 165 165 165 170
170 170 172 175 175 180 180 180 180 185
185 185 186 187 192 194 195 203 210 212
215 220 260
L(M) = (53 + 1) / 2 = 27
Median = 165
Example 2: Quartiles, n = 53
Bottom half has n* = 26 → L(Q1) = (26 + 1) / 2 = 13.5 from the bottom
Q1 = avg(127, 128) = 127.5
Example 2: Quartiles, n = 53
Top half has n* = 26 → L(Q3) = (26 + 1) / 2 = 13.5 from the top!
Q3 = avg(185, 185) = 185
Example 2
Quartiles
Q1 = 127.5
Q2 = 165
Q3 = 185
"5 point summary"
= {Min, Q1, Median, Q3, Max}
= {100, 127.5, 165, 185, 260}
Stemplot of the 53 values (stem = hundreds and tens, leaf = ones):
10 | 0166
11 | 009
12 | 0034578
13 | 00359
14 | 08
15 | 00257
16 | 555
17 | 000255
18 | 000055567
19 | 245
20 | 3
21 | 025
22 | 0
23 |
24 |
25 |
26 | 0
Inter-quartile Range (IQR)
• Q1 = 127.5
• Q3 = 185
Inter-Quartile Range (IQR)
= Q3 – Q1
= 185 – 127.5
= 57.5
“spread of middle 50%”
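A short Python sketch that recomputes the 5-point summary, the IQR, and the range for the 53 values in Example 2. It assumes the median-of-halves quartile rule used in this chapter, with the middle value excluded from both halves when n is odd. The function names are illustrative.

```python
weights = [
    100, 101, 106, 106, 110, 110, 119, 120, 120, 123,
    124, 125, 127, 128, 130, 130, 133, 135, 139, 140,
    148, 150, 150, 152, 155, 157, 165, 165, 165, 170,
    170, 170, 172, 175, 175, 180, 180, 180, 180, 185,
    185, 185, 186, 187, 192, 194, 195, 203, 210, 212,
    215, 220, 260,
]


def median_of_sorted(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def five_point_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]            # bottom half
    upper = ordered[n // 2 + n % 2 :]    # top half (middle value skipped if n is odd)
    return (ordered[0], median_of_sorted(lower), median_of_sorted(ordered),
            median_of_sorted(upper), ordered[-1])


minimum, q1, m, q3, maximum = five_point_summary(weights)
iqr = q3 - q1                    # spread of the middle 50%
data_range = maximum - minimum   # range = maximum - minimum
print(minimum, q1, m, q3, maximum)  # 100 127.5 165 185.0 260
print(iqr, data_range)              # 57.5 160
```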
Simple Box
5-point summary graphically
[Figure: boxplot with min, Q1, M, Q3 and max marked; horizontal axis: Weight, 100 to 275]
Boxplots are useful for comparing groups
Standard Deviation &
Variance
• Most popular measures of spread
• Each data value has a deviation, defined as:
$$x_i - \bar{x}$$
Example: Deviations
Metabolic data (n = 7)
$x_7 - \bar{x} = 1439 - 1600 = -161$
$x_1 - \bar{x} = 1792 - 1600 = 192$
Variance
• Find the mean
• Find the deviation of each value
• Square the deviations
• Sum the squared deviations
• Divide by (n − 1)
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
Data
Data: Metabolic rates, n = 7
1792 1666 1362 1614 1460 1867 1439
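For these data, the variance recipe (mean, deviations, squares, sum, divide by n − 1) can be sketched step by step in Python. The function names are illustrative choices, not from the slides.

```python
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]


def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    xbar = sum(values) / len(values)            # find the mean
    deviations = [x - xbar for x in values]     # deviation of each value
    squared = [d ** 2 for d in deviations]      # square the deviations
    return sum(squared)                         # sum the squared deviations


def sample_variance(values):
    """Sample variance s^2: sum of squares divided by (n - 1)."""
    return sum_of_squares(values) / (len(values) - 1)


ss = sum_of_squares(rates)
s2 = sample_variance(rates)
print(ss)            # 214870.0
print(round(s2, 2))  # 35811.67
```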
“Sum of Squares”
Obs $x_i$      Deviations $x_i - \bar{x}$      Squared deviations $(x_i - \bar{x})^2$
1792           1792 - 1600 = 192               (192)^2 = 36,864
1666           1666 - 1600 = 66                (66)^2 = 4,356
1362           1362 - 1600 = -238              (-238)^2 = 56,644
1614           1614 - 1600 = 14                (14)^2 = 196
1460           1460 - 1600 = -140              (-140)^2 = 19,600
1867           1867 - 1600 = 267               (267)^2 = 71,289
1439           1439 - 1600 = -161              (-161)^2 = 25,921
SUMS 11,200    0                               214,870
$$\bar{x} = \frac{11{,}200}{7} = 1600$$
$$\sum (x_i - \bar{x})^2 = 214{,}870 \quad \text{("Sum of Squares")}$$
Variance
$$s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2 = \frac{1}{7-1} \times 214{,}870 = 35{,}811.67$$
where $\sum (x_i - \bar{x})^2$ is the Sum of Squares
Standard Deviation
Square root of variance
$$s = \sqrt{s^2}$$
$$s = \sqrt{35{,}811.67} = 189.24$$
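As a check, the standard deviation of the metabolic data can be computed by hand in Python and compared with `statistics.stdev` from the standard library, which also divides by n − 1. This is a sketch and the variable names are illustrative.

```python
import math
from statistics import stdev

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

n = len(rates)
xbar = sum(rates) / n
variance = sum((x - xbar) ** 2 for x in rates) / (n - 1)  # sample variance s^2
s = math.sqrt(variance)                                   # s = square root of s^2

print(round(s, 2))                    # 189.24
print(math.isclose(s, stdev(rates)))  # True
```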
Standard Deviation
Direct Formula
$$s = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2} = \sqrt{\frac{1}{7-1} \times 214{,}870} \approx 189$$
Use calculator to check work!
I’m supporting the TI-30XIIS only
TI-30XIIS sequence:
• On > CLEAR > 2nd > STAT > Scroll > Clear Data > Enter
• 2nd > STAT > 1-VAR or 2-VAR
• DATA > enter data
• STATVAR key
Choosing Summary Statistics
• Use the mean and standard deviation to
describe symmetrical distributions &
distributions free of outliers
• Use the median and quartiles (IQR) to
describe distributions that are skewed or
have outliers
Example: Number of Books Read
n = 52
0 0 0 0 0 0 0 0 0 1 1 1 1
1 2 2 2 2 2 2 2 2 2 3 3 3
3 4 4 4 4 4 4 5 5 5 5 5 5
6 10 10 12 13 14 14 15 15 20 20 30 99
L(M) = (52 + 1) / 2 = 26.5
M = 3 (average of the 26th and 27th ordered values, both 3)
Example: Books read, n = 52
5-point summary: 0, 1, 3, 5.5, 99
Highly asymmetric distribution
[Figure: plot of the distribution; horizontal axis: Number of books, 0 to 100]
The mean (“xbar” = 7.06) and standard deviation (s = 14.43) give false
impressions of location and spread for this distribution and are considered
inappropriate. Use the median and 5-point summary instead.
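The contrast can be reproduced with a short Python sketch. The `books` list is the n = 52 data set from this example as recovered from the slide, and the quartiles assume the chapter's median-of-halves rule. The function names are illustrative.

```python
from statistics import mean, stdev

# Number of books read, n = 52
books = ([0] * 9 + [1] * 5 + [2] * 9 + [3] * 4 + [4] * 6 + [5] * 6
         + [6, 10, 10, 12, 13, 14, 14, 15, 15, 20, 20, 30, 99])


def median_of_sorted(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def five_point_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]
    upper = ordered[n // 2 + n % 2 :]
    return (ordered[0], median_of_sorted(lower), median_of_sorted(ordered),
            median_of_sorted(upper), ordered[-1])


summary = five_point_summary(books)
print(summary)                 # (0, 1.0, 3.0, 5.5, 99)
print(round(mean(books), 2))   # 7.06, well above the median of 3
print(round(stdev(books), 2))  # 14.43, inflated by the single value 99
```

The median of 3 sits where most readers are, while the mean of 7.06 is larger than about three quarters of the values (Q3 = 5.5).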