Chapter 4:
Summary Statistics
May 17
In Chapter 4:
The prior chapter used stemplots and
histograms to look at the shape, location,
and spread of a distribution.
This chapter uses numerical summaries for
similar purposes.
Summary Statistics
• Central location
– Mean
– Median
– Mode
• Spread
– Range and interquartile range (IQR)
– Variance and standard deviation
• Shape summaries
– seldom used in practice
– not covered
Notation
• n ≡ sample size
• X ≡ the variable (e.g., ages of subjects)
• $x_i$ ≡ the value of individual i for variable X
• Σ ≡ sum all values (capital sigma)
• Illustrative data (ages of participants):
21 42 5 11 30 50 28 27 24 52
n = 10
X = AGE variable
$x_1 = 21,\ x_2 = 42,\ \ldots,\ x_{10} = 52$
$\sum x_i = x_1 + x_2 + \cdots + x_{10} = 21 + 42 + \cdots + 52 = 290$
§4.1: Central Location: Sample Mean
• “Arithmetic average”
• Traditional measure of central location
• Sum the values and divide by n
• “xbar” refers to the sample mean
$$\bar{x} = \frac{1}{n}\left(x_1 + x_2 + \cdots + x_n\right) = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Example: Sample Mean
Ten individuals selected at random have the
following ages:
21 42 5 11 30 50 28 27 24 52
Note that n = 10, $\sum x_i$ = 21 + 42 + … + 52 = 290,
and
$$\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{10}(290) = 29.0$$
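The arithmetic above can be checked with a short Python sketch. The names `ages` and `sample_mean` are illustrative choices, not part of the original slides.

```python
# Ages of the ten individuals from the example.
ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]


def sample_mean(values):
    """Sum the values and divide by n."""
    return sum(values) / len(values)


n = len(ages)            # sample size
total = sum(ages)        # sum of all x_i
xbar = sample_mean(ages)

print(n, total, xbar)    # prints: 10 290 29.0
```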
The sample mean is the gravitational center of a distribution
[Figure: number line from 0 to 60 with the mean marked at 29]
Uses of the Sample Mean
The sample mean can be used to predict:
• The value of an observation drawn at
random from the sample
• The value of an observation drawn at
random from the population
• The population mean
Population Mean
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
• Same operation as sample mean except
based on entire population (N ≡ population
size)
• Conceptually important
• Usually not available in practice
• Sometimes referred to as the expected value
§4.2 Central Location: Median
The median is the value with a depth of (n+1)/2
When n is even, average the two values that
straddle a depth of (n+1)/2
For the 10 values listed below, the median has
depth (10+1) / 2 = 5.5, placing it between 27 and 28.
Average these two values to get median = 27.5
05 11 21 24 27 ↑ 28 30 42 50 52   (↑ marks the median)
Average the adjacent values: M = 27.5
More Examples of Medians
• Example A: 2 4 6
Median = 4
• Example B: 2 4 6 8
Median = 5 (average of 4 and 6)
• Example C: 6 2 4
Median ≠ 2
(Values must be ordered first: 2 4 6, so the median is 4)
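The depth rule can be sketched in Python as follows. The function name `median_by_depth` is an illustrative choice, and the sketch assumes a non-empty list of numbers.

```python
import math


def median_by_depth(values):
    """Median as the value at depth (n + 1) / 2 in the ordered data.

    When the depth is fractional (n even), the two values that
    straddle it are averaged.
    """
    ordered = sorted(values)           # values must be ordered first
    depth = (len(ordered) + 1) / 2     # 1-based position of the median
    below = ordered[math.floor(depth) - 1]
    above = ordered[math.ceil(depth) - 1]
    return (below + above) / 2


print(median_by_depth([21, 42, 5, 11, 30, 50, 28, 27, 24, 52]))  # 27.5
print(median_by_depth([2, 4, 6]))     # 4.0
print(median_by_depth([2, 4, 6, 8]))  # 5.0
print(median_by_depth([6, 2, 4]))     # 4.0
```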
The Median is Robust
The median is more resistant to skews and
outliers than the mean; it is more robust.
This data set has a mean of 1600:
1362 1439 1460 1614 1666 1792 1867
Here’s the same data set with a data entry error “outlier”
(the last value). This data set has a mean of 2743:
1362 1439 1460 1614 1666 1792 9867
The median is 1614 in both instances,
demonstrating its robustness in the face of outliers.
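A quick way to see this robustness is to recompute both statistics with and without the data entry error. This sketch uses Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

rates = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
with_error = rates[:-1] + [9867]   # last value mistyped as 9867

# The mean shifts substantially when the outlier is introduced...
print(mean(rates), round(mean(with_error)))

# ...while the median is unchanged.
print(median(rates), median(with_error))   # 1614 1614
```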
§4.3: Mode
• The mode is the most commonly
encountered value in the dataset
• This data set has a mode of 7
{4, 7, 7, 7, 8, 8, 9}
• This data set has no mode
{4, 6, 7, 8}
(each point appears only once)
• The mode is useful only in large data sets
with repeating values
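The two cases above can be reproduced with a small counting sketch. The helper `modes` is hypothetical. Returning an empty list when no value repeats follows the slide's "no mode" convention, while returning all tied values is an assumption of this sketch.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                       # each point appears only once
        return []
    return [v for v, c in counts.items() if c == top]


print(modes([4, 7, 7, 7, 8, 8, 9]))   # [7]
print(modes([4, 6, 7, 8]))            # []
```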
§4.4: Comparison of Mean, Median, Mode
Note how the mean gets pulled toward
the longer tail more than the median
mean = median → symmetrical distribution
mean > median → positive skew
mean < median → negative skew
§4.5 Spread: Quartiles
• Two distributions can be quite
different yet can have the same
mean
• This data compares particulate
matter in air samples (μg/m3) at
two sites. Both sites have a mean
of 36, but Site 1 exhibits much
greater variability. We would
miss the high pollution days if we
relied solely on the mean.
Site 1 |   | Site 2
-------+---+-------
    42 | 2 |
     8 | 2 |
     2 | 3 | 234
    86 | 3 | 6689
     2 | 4 | 0
       | 4 |
       | 5 |
       | 5 |
       | 6 |
     8 | 6 |
(stems ×10)
Spread: Range
• Range = maximum – minimum
• Illustrative example:
Site 1 range = 68 – 22 = 46
Site 2 range = 40 – 32 = 8
• Beware: the sample range will
tend to underestimate the
population range.
• Always supplement the range
with at least one additional
measure of spread
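As a rough check, the sketch below computes the mean and range for both sites. The values are read off the back-to-back stemplot (stems in tens, leaves in ones), so the two lists are a transcription of the plot rather than data given explicitly in the slides. The helper `value_range` is an illustrative name.

```python
# Values transcribed from the back-to-back stemplot (μg/m3).
site1 = [22, 24, 28, 32, 36, 38, 42, 68]
site2 = [32, 33, 34, 36, 36, 38, 39, 40]


def value_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)


for name, data in (("Site 1", site1), ("Site 2", site2)):
    mean = sum(data) / len(data)
    print(name, "mean =", mean, "range =", value_range(data))
```

The means are nearly identical, yet the ranges differ widely, which is the point of the comparison.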
Spread: Quartiles
• Quartile 1 (Q1): cuts off bottom quarter of data
= median of the lower half of the data set
• Quartile 3 (Q3): cuts off top quarter of data
= median of the top half of the data set
• Interquartile Range (IQR) = Q3 – Q1
covers the middle 50% of the distribution
05 11 21(Q1) 24 27 | 28 30 42(Q3) 50 52   (| marks the median)
Q1 = 21, Q3 = 42, and IQR = 42 – 21 = 21
Quartiles (Tukey’s Hinges) – Example 2
Data are metabolic rates (cal/day), n = 7
1362 1439 1460 1614 1666 1792 1867   (median = 1614)
When n is odd, include the median in both halves
of the data set.
Bottom half: 1362 1439 1460 1614
which has a median of 1449.5 (Q1)
Top half: 1614 1666 1792 1867
which has a median of 1729 (Q3)
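The hinge rule (split the ordered data in half, including the median in both halves when n is odd) can be sketched in Python. The function name `tukey_hinges` is illustrative. Note that library quartile routines often use other interpolation rules and may give slightly different answers.

```python
from statistics import median


def tukey_hinges(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    When n is odd, the overall median is included in both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2            # size of each half
    lower = ordered[:half]
    upper = ordered[n - half:]
    return median(lower), median(upper)


rates = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
print(tukey_hinges(rates))         # (1449.5, 1729.0)

ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
q1, q3 = tukey_hinges(ages)
print(q1, q3, q3 - q1)             # 21 42 21
```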
Five-Point Summary
• Q0 (the minimum)
• Q1 (25th percentile)
• Q2 (median)
• Q3 (75th percentile)
• Q4 (the maximum)
§4.6 Boxplots
1. Calculate 5-point summary. Draw box from Q1 to
Q3 w/ line at median
2. Calculate IQR and fences as follows:
FenceLower = Q1 – 1.5(IQR)
FenceUpper = Q3 + 1.5(IQR)
Do not draw fences
3. Determine if any values lie outside the fences
(outside values). If so, plot these separately.
4. Determine values inside the fences (inside values)
Draw whisker from Q3 to upper inside value.
Draw whisker from Q1 to lower inside value
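The four steps can be sketched as a single Python function that returns the numbers needed to draw the plot. The name `boxplot_stats` and the dictionary keys are illustrative. Quartiles are computed as Tukey's hinges, as in §4.5.

```python
from statistics import median


def boxplot_stats(values):
    """Compute the quantities needed to draw a boxplot."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2                      # median joins both halves if n is odd
    q1 = median(ordered[:half])
    q3 = median(ordered[n - half:])
    iqr = q3 - q1
    fence_lower = q1 - 1.5 * iqr             # fences are computed, not drawn
    fence_upper = q3 + 1.5 * iqr
    inside = [x for x in ordered if fence_lower <= x <= fence_upper]
    outside = [x for x in ordered if x < fence_lower or x > fence_upper]
    return {
        "five_point": (ordered[0], q1, median(ordered), q3, ordered[-1]),
        "fences": (fence_lower, fence_upper),
        "outside": outside,                  # plotted as separate points
        "whiskers": (min(inside), max(inside)),
    }


stats = boxplot_stats([3, 21, 22, 24, 25, 26, 28, 29, 31, 51])
print(stats["fences"])     # (11.5, 39.5)
print(stats["outside"])    # [3, 51]
print(stats["whiskers"])   # (21, 31)
```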
Illustrative Example: Boxplot
Data: 05 11 21 24 27 28 30 42 50 52
1. 5 pt summary: {5, 21, 27.5, 42, 52};
box from 21 to 42 with line @ 27.5
2. IQR = 42 – 21 = 21.
FU = Q3 + 1.5(IQR) = 42 + (1.5)(21) = 73.5
FL = Q1 – 1.5(IQR) = 21 – (1.5)(21) = –10.5
3. No values above upper fence
No values below lower fence
4. Upper inside value = 52
Lower inside value = 5
Draw whiskers
[Figure: boxplot on a 0 to 60 axis; lower inside = 5, Q1 = 21, Q2 = 27.5, Q3 = 42, upper inside = 52]
Illustrative Example: Boxplot 2
Data: 3 21 22 24 25 26 28 29 31 51
1. 5-point summary: 3, 22, 25.5,
29, 51: draw box
2. IQR = 29 – 22 = 7
FU = Q3 + 1.5(IQR) = 29 + (1.5)(7) = 39.5
FL = Q1 – 1.5(IQR) = 22 – (1.5)(7) = 11.5
3. One above top fence (51)
One below bottom fence (3)
4. Upper inside value is 31
Lower inside value is 21
Draw whiskers
[Figure: boxplot on a 0 to 60 axis; outside value (3), inside value (21), lower hinge (22), median (25.5), upper hinge (29), inside value (31), outside value (51)]
Illustrative Example: Boxplot 3
Seven metabolic rates:
1362 1439 1460 1614 1666 1792 1867
1. 5-point summary: 1362, 1449.5,
1614, 1729, 1867
2. IQR = 1729 – 1449.5 = 279.5
FU = Q3 + 1.5(IQR) = 1729 + (1.5)(279.5) = 2148.25
FL = Q1 – 1.5(IQR) = 1449.5 – (1.5)(279.5) = 1030.25
3. None outside
4. Whiskers end @ 1867 and 1362
[Figure: boxplot of the seven metabolic rates (N = 7) on a 1300 to 2000 axis]
Data source: Moore,
Boxplots: Interpretation
• Location
– Position of median
– Position of box
• Spread
– Hinge-spread (IQR)
– Whisker-to-whisker spread
– Range
• Shape
– Symmetry or direction of skew
– Long whiskers (tails) indicate leptokurtosis
Side-by-side boxplots
Boxplots are especially useful when comparing groups
§4.7 Spread: Standard Deviation
• Most common descriptive measure of spread
• Based on deviations around the mean.
• This figure demonstrates the deviations of two of its values
This data set has a mean of
36.
The data point 33 has a
deviation of 33 – 36 = −3.
The data point 40 has a
deviation of 40 – 36 = 4.
Variance and Standard Deviation
Deviation = $x_i - \bar{x}$
Sum of squared deviations = $SS = \sum (x_i - \bar{x})^2$
Sample variance = $s^2 = \frac{SS}{n-1}$
Sample standard deviation = $s = \sqrt{s^2}$
Standard deviation (formula)
$$s = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$$
where $\sum (x_i - \bar{x})^2$ is the sum of squares.
Sample standard deviation s is the estimator of
population standard deviation σ.
See “Facts About the Standard Deviation” p. 80.
Illustrative Example: Standard Deviation (p. 79)
Observation   Deviation          Squared deviation
$x_i$         $x_i - \bar{x}$    $(x_i - \bar{x})^2$
36            36 − 36 = 0        0² = 0
38            38 − 36 = 2        2² = 4
39            39 − 36 = 3        3² = 9
40            40 − 36 = 4        4² = 16
36            36 − 36 = 0        0² = 0
34            34 − 36 = −2       (−2)² = 4
33            33 − 36 = −3       (−3)² = 9
32            32 − 36 = −4       (−4)² = 16
SUMS          0*                 SS = 58
* Sum of deviations always equals zero
Illustrative Example (cont.)
Sample variance (s²)
$$s^2 = \frac{SS}{n-1} = \frac{58}{8-1} = 8.286\ (\mu\text{g/m}^3)^2$$
Standard deviation (s)
$$s = \sqrt{s^2} = \sqrt{8.286} = 2.88\ \mu\text{g/m}^3$$
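The same calculation can be followed step by step in Python. This is a minimal sketch using the eight observations from the illustrative example; the variable names are illustrative.

```python
import math

# The eight observations from the illustrative example (μg/m3).
data = [36, 38, 39, 40, 36, 34, 33, 32]

n = len(data)
xbar = sum(data) / n                      # sample mean
deviations = [x - xbar for x in data]     # x_i - xbar
ss = sum(d ** 2 for d in deviations)      # sum of squared deviations
variance = ss / (n - 1)                   # sample variance, s^2
sd = math.sqrt(variance)                  # sample standard deviation, s

print(sum(deviations))                    # 0.0 (deviations always sum to zero)
print(ss, round(variance, 3), round(sd, 2))   # 58.0 8.286 2.88
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n − 1 divisor and can serve as a cross-check.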
Interpretation of Standard Deviation
• Measure spread (e.g., if group 1 has s1 = 15
and group 2 has s2 = 10, group 1 has
more spread, i.e., variability)
• 68-95-99.7 rule (next slide)
• Chebychev’s rule (two slides hence)
68-95-99.7 Rule
Normal Distributions Only!
• 68% of data in the range μ ± σ
• 95% of data in the range μ ± 2σ
• 99.7% of data in the range μ ± 3σ
Example. Suppose a variable has a Normal
distribution with μ = 30 and σ = 10. Then:
68% of values are between 30 ± 10 = 20 to 40
95% are between 30 ± (2)(10) = 30 ± 20 = 10 to 50
99.7% are between 30 ± (3)(10) = 30 ± 30 = 0 to 60
Chebychev’s Rule
All Distributions
• Chebychev’s rule says that at least 75% of
the values will fall in the range μ ± 2σ (for
any shaped distribution)
• Example: A distribution with μ = 30 and σ
= 10 has at least 75% of the values in the
range 30 ± (2)(10) = 10 to 50
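The two rules can be compared numerically. The sketch below assumes the general form of Chebychev's inequality, at least 1 − 1/k² of values within μ ± kσ, of which the 75% figure is the k = 2 case; the slides state only that case. The Normal proportions come from the standard library's `NormalDist`. The function names are illustrative.

```python
from statistics import NormalDist


def chebychev_lower_bound(k):
    """Minimum proportion within mu ± k*sigma for any distribution (k > 1)."""
    return 1 - 1 / k ** 2


def normal_within(k):
    """Proportion within mu ± k*sigma for a Normal distribution."""
    z = NormalDist()                  # standard Normal: mu = 0, sigma = 1
    return z.cdf(k) - z.cdf(-k)


def interval(mu, sigma, k):
    """The range mu ± k*sigma as a (low, high) pair."""
    return mu - k * sigma, mu + k * sigma


print(interval(30, 10, 2))            # (10, 50)
print(chebychev_lower_bound(2))       # 0.75
for k in (1, 2, 3):
    print(k, round(normal_within(k), 4))
```

The Normal figure for k = 2 is far above the Chebychev bound, because the bound must hold for distributions of any shape.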
Rules for Rounding
• Carry at least four significant digits during
calculations.
• Round as last step of operation
• Avoid pseudo-precision
• When in doubt, use the APA Publication
Manual
• Always report units
Always use common sense and good judgment.
Choosing Summary Statistics
• Always report a measure of central
location, a measure of spread, and the
sample size
• Symmetrical mound-shaped distributions
→ report mean and standard deviation
• Odd shaped distributions → report 5-point
summaries (or median and IQR)
Software and Calculators
Use software and calculators to check work.