Download Chapter 4: Summary Statistics

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Chapter 4:
Summary Statistics
Chapter 4
1
In Chapter 3…
… we used stemplots to look at
shape, central location, and spread of
a distribution.
In this chapter we use numerical
summaries to look at central location
and spread.
Chapter 4
2
Summary Statistics
• Central location statistics
– Mean
– Median
– Mode
• Spread statistics
– Range
– Interquartile range (IQR)
– Variance and standard deviation
• Shape statistics exist but are seldom
used in practice (not covered)
Chapter 4
3
Notation
•
•
•
•
•
n  sample size
X  variable (e.g., ages of subjects)
xi  value for individual i
  sum all values (“capital sigma”)
Example: Let X = AGE (n = 10):
21
42
5
11
30
50
28
27
24
52
x1= 21, x2= 42, …, x10= 52
xi = x1 + x2 + … + x10= 21 + 42 + … + 52 = 290
Chapter 4
4
Central Location: Sample Mean
Most common measure of central location
1
1 n
x  x1  x 2    xn    xi
n
n i 1
For the data on the previous slide:
1
1
x   xi  (290)
n
10
 29.0
Chapter 4
5
Example: Sample Mean
The mean is the gravitational center of a batch of numbers
0
10
20
30
40
50
60
Mean = 29
Chapter 4
6
Gravitational Center
A skew tips the distribution causing the
mean to shift toward the tail
Chapter 4
7
Uses of the Sample Mean
• Predicts value of an
observation drawn at
random from the sample
• Predicts value of an
observation drawn at
random from the
population
• Predicts population
mean µ
Chapter 4
8
Population Mean
x


1

N
N
i
x
i
• Same operation as sample mean applied
to entire population (N ≡ population size)
• Not readily (never?) available in practice
but conceptually important
Chapter 4
9
Central Location: Median
The value with a depth of (n+1) / 2
When n is even → median is obvious
When n is even → average the two middle values
Example (below): Depth (M) = (10+1) / 2 = 5.5
Median = Average (27 and 28) = 27.5
05
11
21
24
27
28
30
42
50

median
Average the adjacent
Chapter 4 values: M = 27.5
52
10
More Examples
• Example A: 2 4 6
Median = 4
• Example B: 2 4 6 8
Median = 5 (average of 4 and 6)
• Example C: 6 2 4
Median  2
(Values must be ordered first)
Chapter 4
11
The Median is Robust
The median is resistant to skews and outlier
This data set has x-bar = 1636:
1362 1439 1460 1614 1666 1792 1867
Same data set with a data entry error highlighted.
1362 1439 1460 1614 1666 1792
9867
This data has x-bar = 2743
The median is 1614 in both instances,
Chapter 4
12
Mode
• Most frequent value in the dataset
• This data set has a mode of 7
{4, 7, 7, 7, 8, 8, 9}
• This data set has no mode
{4, 6, 7, 8}
(each point appears once)
• The mode is useful only in large data sets
with repeating values
Chapter 4
13
Comparison of Mean, Median,
Mode
Mean gets “pulled” by
tail:
mean = median → symmetrical
mean > median → positive skew
mean < median → negative skew
Chapter 4
14
Spread ≡ extent to which data vary
around middle point
Sites have similar central locations
medians = 36 and 38, respectively
Site 1 exhibits much greater
spread (visually)
Chapter 4
Site 1| |Site 2
--------------42|2|
8|2|
2|3|234
86|3|6689
2|4|0
|4|
|5|
|5|
|6|
8|6|
×10
particulates in air (μg/m315
)
Spread: Range
• Range = maximum – minimum
• Illustrative example:
Site 1 range = 68 – 22 = 46
Site 2 range = 40 – 32 = 8
• The sample range is not a
good measure of spread: tends
to underestimate population
range
• Always supplement the range
with at least one addition
measure of spread
Chapter 4
Site 1| |Site 2
---------------42|2|
8|2|
2|3|234
86|3|6689
2|4|0
|4|
|5|
|5|
|6|
8|6|
×10
16
Spread: Interquartile Range
• Quartile 1 (Q1): marks bottom quarter of data
= middle of the lower half of the data set
• Quartile 3 (Q3): marks top quarter of data
= middle of the top half of data set
• Interquartile Range (IQR) = Q3 – Q1
covers middle 50% of \distribution
05
11
21

Q1
24
27
28

median
30
42
50
52

Q3
Q1 = 21, Q3 = 42, and IQR = 42 – 21 = 21
Chapter 4
17
Five-Point Summary
•
•
•
•
•
Q0 (the minimum)
Q1 (25th percentile)
Q2 (median)
Q3 (75th percentile)
Q4 (the maximum)
05
11
21

Q1
24
27
28

median
30
42
50
52

Q3
5 point summary = 5, 21, 27.5, 42, 52
Chapter 4
18
Quartiles: Tukey’s Hinges
Data: metabolic rates (cal/day), n = 7
1362 1439 1460 1614 1666 1792 1867

median
When n is odd, include the median in both “halves”
Q1 = 1449.5
Bottom half:
1362 1439 1460 1614
Top half:
1614
1666
1792
Q3 = 1729
Chapter 4
1867
19
§4.6 Boxplots
1. Draw box from Q1 to Q3
2. Draw line for median.
3. Calculate fences:
FenceLower = Q1 – 1.5(IQR)
FenceUpper = Q3 + 1.5(IQR)
4. Do not draw fences
5. Any values outside the fences (outside
values). are plotted separately.
6. Determine most extreme values still inside
the fences (inside values)
7. Draw whiskers quartiles
to
inside
values
Chapter 4
20
Example 1: Boxplot
Data: 05 11 21 24 27 28 30 42 50 52
1. 5 pt summary: {5, 21, 27.5, 42, 52};
Box from 21 to 42 with line @ 27.5
2. IQR = 42 – 21 = 21.
FU = Q3 + 1.5(IQR) = 42 + (1.5)(21) = 73.5
FL = Q1 – 1.5(IQR) = 21 – (1.5)(21) = –10.5
3. No values above upper fence
None values below lower fence
4. Upper inside value = 52
Lower inside value = 5
Draws whiskers
Chapter 4
60
50
40
Upper inside = 52
Q3 = 42
30
Q2 = 27.5
20
Q1 = 21
10
Lower inside = 5
0
21
Example 2: Boxplot
Data: 3 21 22 24 25 26 28 29 31 51
1. 5-point summary: 3, 22, 25.5,
29, 51; hinges at 22 and 29
2. IQR = 29 – 22 = 7
60
50
FU = Q3 + 1.5(IQR) = 29 + (1.5)(7) = 39.5
FL = Q1 – 1.5(IQR) = 22 – (1.5)(7) = 11.6
3. One upper outside value (51)
One lower outside value (3)
Outside value (51)
40
Inside value (31)
30
20
Upper hinge (29)
Median (25.5)
Lower hinge (22)
Inside value (21)
4. Upper inside value is 31
Lower inside value is 21
Draw whiskers
Chapter 4
10
Outside value (3)
0
22
Example 3: Boxplot
Seven metabolic rates (cal / day):
1362 1439 1460 1614 1666 1792 1867
1. 5-point summary: 1362, 1449.5,
1614, 1729, 1867
2000
2. IQR = 1729 – 1449.5 = 279.5
FU = Q3 + 1.5(IQR)
= 1729 + 1.5(279.5)
= 2148.25
FL = Q1 – 1.5(IQR)
= 1449.5 –1.5(279.5)
= 1030.25
3. None outside
4. Inside values: 1867 and 1362
Chapter 4
1900
1800
1700
1600
1500
1400
1300
N=
7
Data source: Moore,
23
Boxplots: Interpretation
• Central location: position
of median and box (IQR)
• Spread: Hinge-spread
(IQR), whisker spread,
range
• Shape: symmetry of
median within box and
box within whiskers, tail
length (kurtosis), outside
values
Chapter 4
24
Spread: Standard Deviation
• The standard
deviation is the most
popular measure of
spread
• σ ≡ population
standard deviation
• s = sample standard
deviation
• Based on deviations
around the mean
1
2
s
(
x

x
)

i
n 1
Chapter 4
25
Deviations
Deviation = distance from the mean = xi  x
This example shows a deviation of −3 for the data point 33
It show a deviation of 4 for data point 40
Chapter 4
26
“Sum of squares” SS   xi  x 
2
Observation
xi
Deviations
Sq. deviations
xi  x 
xi  x
2
36
38
39
36  36 = 0
38  36 = 2
39  36 = 3
02 = 0
22 = 4
32 = 9
40
36
34
40  36 = 4
36  36 = 0
34  36 = 2
42 = 16
02 = 0
22 = 4
33
32
33  36 = 3
32  36 = 4
32 = 9
42 = 16
SUMS 
Chapter 4
0
SS = 58 27
Sum of Squares (SS), variance (s2),
Standard Deviation (s)
SS    xi  x 
2
1
1
2
s 
( xi  x ) 
 SS

n 1
n 1
2
1
s s 
 SS
n 1
2
Chapter 4
28
Variance & Standard Deviation
Sample variance
1
2
s 
( xi  x )

n 1
2
1

 58
8 1
 8.286
Standard deviation
s s
2
 8.286  2.88
Chapter 4
29
Interpretation of Sample
Standard Deviation s
•
•
•
•
Measure spread
Estimator of population standard deviation 
68-95-99.7 rule (Normal distributions)
Chebychev’s rule (all distributions)
Chapter 4
30
68-95-99.7 Rule
Applies to Normal
distributions only!
• 68% of values within
μ±σ
• 95% within μ ± 2σ
• 99.7% within μ ± 3σ
Example: Normal distribution with
μ = 30 and σ = 10 :
68% of values in 30 ± 10 =
20 to 40
95% in 30 ± (2)(10) = 10 to
50
99.7% in 30 ± (3)(10) = 0 to
60
Chapter 4
31
Chebychev’s Rule
• Applies to all
distributions
• At least 75% of the
values within μ ± 2σ
• Example: Distribution
with μ = 30 and σ =
10 has at least 75%
of the values within
30 ± (2)(10)
= 30 ± 20
= 10 to 50
Chapter 4
32
Rounding
• There is no single rule for rounding.
• The number of significant digits should
reflect the precision of the measurement
• Use judgment and “be kind to your
reader”
• Rough guide: carry at least four significant
digits during calculations & round as final
step
Chapter 4
33
Choosing Summary Statistics
• Always report a measure of central
location, a measure of spread, and the
sample size
• Symmetrical distributions  report the
mean and standard deviation
• Asymmetrical distributions  report the 5point summaries (or median and IQR)
Chapter 4
34
Software and Calculators
Use ‘em
Chapter 4
35
Related documents