Looking at Data—Distributions
1.2 Describing distributions with numbers
© 2012 W.H. Freeman and Company
Objectives
1.2 Describing distributions with numbers
• Measures of center: mean, median
• Mean versus median
• Measures of spread: quartiles, standard deviation
• Five-number summary and boxplot
• Choosing among summary statistics
Measure of center: the mean
The mean or arithmetic average
To calculate the average, or mean, add all values, then divide by the number of cases. It is the “center of mass” or the “balance point” of the distribution of data.
Sum of heights is 1598.3; divided by 25 women, the mean is 63.9 inches.
Heights of the 25 women (n = 25):

woman (i)   height (x)     woman (i)   height (x)
i = 1       x1 = 58.2      i = 14      x14 = 64.0
i = 2       x2 = 59.5      i = 15      x15 = 64.5
i = 3       x3 = 60.7      i = 16      x16 = 64.1
i = 4       x4 = 60.9      i = 17      x17 = 64.8
i = 5       x5 = 61.9      i = 18      x18 = 65.2
i = 6       x6 = 61.9      i = 19      x19 = 65.7
i = 7       x7 = 62.2      i = 20      x20 = 66.2
i = 8       x8 = 62.2      i = 21      x21 = 66.7
i = 9       x9 = 62.4      i = 22      x22 = 67.1
i = 10      x10 = 62.9     i = 23      x23 = 67.8
i = 11      x11 = 63.9     i = 24      x24 = 68.9
i = 12      x12 = 63.1     i = 25      x25 = 69.6
i = 13      x13 = 63.9

Mathematical notation:
$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
$\sum_{i=1}^{25} x_i = 1598.3$
$\bar{x} = \frac{1598.3}{25} = 63.9$
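The calculation above can be checked with a few lines of Python. This is only an illustrative sketch: the course itself uses a calculator and JMP, and the names `heights` and `mean` are chosen here for the example.

```python
import math

# Heights (inches) of the 25 women, in the order listed on the slide.
heights = [
    58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
    63.9, 63.1, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
    66.7, 67.1, 67.8, 68.9, 69.6,
]


def mean(values):
    """Arithmetic mean: add all values, then divide by the number of cases."""
    if not values:
        raise ValueError("mean of an empty list is undefined")
    # math.fsum limits floating-point rounding error in the sum.
    return math.fsum(values) / len(values)


n = len(heights)
total = math.fsum(heights)
x_bar = mean(heights)
print(f"n = {n}, sum = {total:.1f}, mean = {x_bar:.1f}")
# prints: n = 25, sum = 1598.3, mean = 63.9
```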
Learn right away how to get the mean using your calculator & JMP.
Your numerical summary must be meaningful.
[Figure: Height of 25 women in a class, $\bar{x} = 63.9$.]
The distribution of women’s heights appears coherent and symmetrical. The mean is a good numerical summary.

Here the shape of the distribution is wildly irregular. Why? Could we have more than one plant species or phenotype?
[Figure: Height of Plants by Color. Number of plants against height in centimeters for red, pink, and blue plants. Overall mean $\bar{x} = 69.6$; group means $\bar{x} = 63.9$, $70.5$, and $78.3$.]
A single numerical summary here would not make sense.
Measure of center: the median
The median is the midpoint of a distribution—the number such
that half of the observations are smaller and half are larger.
Example data (n = 25, sorted by size):
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
1. Sort observations by size.
   n = number of observations
2.a. If n is odd, the median is observation (n+1)/2 down the list.
   n = 25: (n+1)/2 = 26/2 = 13, so Median = 3.4
2.b. If n is even, the median is the mean of the two middle observations.
   n = 24: n/2 = 12, so the middle observations are numbers 12 and 13, and Median = (3.3+3.4)/2 = 3.35
Example data for the even case (n = 24, sorted by size):
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6
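The two-step rule for the median can be written out as a short Python function. This is a minimal sketch for illustration. The function name `median` and the variable `years` are chosen here, and the data are the 25 example values from the slide.

```python
def median(values):
    """Median by the rule on the slide: sort, then take the middle."""
    ordered = sorted(values)  # step 1: sort observations by size
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        # step 2.a: n odd -> observation (n + 1) / 2 in 1-based counting,
        # which is index n // 2 in Python's 0-based indexing
        return ordered[mid]
    # step 2.b: n even -> mean of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2


years = [
    0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
    3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1,
]

print(f"n = 25: median = {median(years):.2f}")       # observation 13
print(f"n = 24: median = {median(years[:-1]):.2f}")  # mean of observations 12 and 13
```

The input does not need to be sorted beforehand, because the function sorts a copy itself.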
Comparing the mean and the median
The mean and the median are the same if the distribution is exactly symmetrical, and close together if it is roughly symmetrical. The median is a measure of center that is resistant to skew and outliers. The mean is not.
[Figure: Mean and median for a symmetric distribution, and mean and median for skewed distributions (left skew and right skew).]
Mean and median of a distribution with outliers
[Figure: Percent of people dying, shown without the outliers ($\bar{x} = 3.4$) and with the outliers ($\bar{x} = 4.2$).]
The mean is pulled to the right a lot by the outliers (from 3.4 to 4.2).
The median, on the other hand, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).
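A small experiment shows the same effect. The data set behind the figure is not reproduced in this transcript, so the sketch below instead takes the 25 example values used for the median and replaces the largest one with a hypothetical extreme value (25.0, chosen only for illustration). It uses `mean` and `median` from Python's standard `statistics` module.

```python
from statistics import mean, median

years = [
    0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
    3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1,
]

# Hypothetical outlier: swap the largest observation for a much larger one.
with_outlier = years[:-1] + [25.0]

print(f"mean:   {mean(years):.2f} -> {mean(with_outlier):.2f}")
print(f"median: {median(years):.2f} -> {median(with_outlier):.2f}")
```

In this sketch the median stays at 3.4, because the middle observation does not change, while the mean moves to the right.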
Impact of skewed data
Symmetric distribution: Disease X, $\bar{x} = 3.4$, $M = 3.4$. Mean and median are the same.
Right-skewed distribution: Multiple myeloma, $\bar{x} = 3.4$, $M = 2.5$. The mean is pulled toward the skew.
Measure of spread: the quartiles
The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (it is the median of the lower half of the sorted data, excluding M).
The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (it is the median of the upper half of the sorted data, excluding M).
M = median = 3.4
Sorted data (n = 25):
Lower half (observations 1 to 12): 0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3
M (observation 13): 3.4
Upper half (observations 14 to 25): 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
Q1 = first quartile = (2.1 + 2.3)/2 = 2.2
Q3 = third quartile = (4.2 + 4.5)/2 = 4.35
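The quartile rule described above (medians of the lower and upper halves, leaving out M when n is odd) can be sketched in Python as follows. The function names are chosen for this example. Statistical software, including JMP, may use a different quantile convention and so can report slightly different quartiles for the same data.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, M, Q3), with Q1 and Q3 as medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]       # observations below the median position
    upper = ordered[n - half:]   # observations above it (M skipped if n is odd)
    return (
        median_of_sorted(lower),
        median_of_sorted(ordered),
        median_of_sorted(upper),
    )


years = [
    0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
    3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1,
]

q1, m, q3 = quartiles(years)
print(f"Q1 = {q1:.2f}, M = {m:.2f}, Q3 = {q3:.2f}")
```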
Five-number summary and boxplot
[Figure: Boxplot of years until death for Disease X, drawn beside the 25 sorted observations.]
Five-number summary: min, Q1, M, Q3, max
Smallest = min = 0.6
Q1 = first quartile = 2.2
M = median = 3.4
Q3 = third quartile = 4.35
Largest = max = 6.1
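As a rough sketch, the five-number summary can be assembled in Python from the sorted data. The helper name `five_number_summary` is chosen here for illustration. It relies on `statistics.median` from the standard library and on the quartile convention used in these slides (medians of the two halves, excluding M when n is odd).

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, M, Q3, max) for a list of observations."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]      # lower half of the sorted data
    upper = ordered[n - half:]  # upper half (M itself is skipped if n is odd)
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


years = [
    0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
    3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1,
]

summary = five_number_summary(years)
labels = ("min", "Q1", "M", "Q3", "max")
for label, value in zip(labels, summary):
    print(f"{label:>3} = {value:.2f}")
```

These five values are exactly what a boxplot draws: the ends of the whiskers, the edges of the box, and the line inside it.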
Boxplots for skewed data
Comparing boxplots for a normal and a right-skewed distribution
[Figure: Side-by-side boxplots of years until death (axis 0 to 15) for Disease X and Multiple Myeloma.]
Boxplots remain true to the data and clearly depict symmetry or skew.
Suspected outliers
Outliers are troublesome data points, and it is important to be able to identify them.
One way to raise the flag for a suspected outlier is to measure the distance from the suspicious data point to the nearest quartile (Q1 or Q3), and then compare this distance to the interquartile range (the distance between Q1 and Q3).
We call an observation a suspected outlier if it falls more than 1.5 times the size of the interquartile range (IQR) above the third quartile or below the first quartile. This is called the “1.5 * IQR rule for outliers.”
[Figure: Boxplot of years until death for Disease X, with the largest of the 25 observations equal to 7.9.]
Q1 = 2.2
Q3 = 4.35
Interquartile range: Q3 − Q1 = 4.35 − 2.2 = 2.15
Distance to Q3: 7.9 − 4.35 = 3.55
Individual #25 has a value of 7.9 years, which is 3.55 years above the third quartile. This is more than 1.5 * IQR = 1.5 * 2.15 = 3.225 years. Thus, individual #25 is a suspected outlier.
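The 1.5 * IQR rule translates directly into code. The sketch below is illustrative only. The function name `suspected_outliers` and its parameter `k` are chosen for this example, and the quartiles are taken from the slide rather than recomputed.

```python
def suspected_outliers(values, q1, q3, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [x for x in values if x < low_fence or x > high_fence]


# The 25 observations from the slide, where the largest value is 7.9.
years = [
    0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
    3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 6.1, 7.9,
]
q1, q3 = 2.2, 4.35  # quartiles as given on the slide

print(suspected_outliers(years, q1, q3))  # prints: [7.9]
```

Here the upper fence is 4.35 + 3.225 = 7.575, so 7.9 is flagged while 6.1 is not.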
Measure of spread: the standard deviation
The standard deviation “s” is used to describe the variation around the
mean. Like the mean, it is not resistant to skew or outliers.
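1. First calculate the variance $s^2$:
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
[Figure: distribution with the mean $\bar{x}$ and the interval mean ± 1 s.d. marked.]
2. Then take the square root to get the standard deviation s:
$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
Calculations …
$s = \sqrt{\frac{1}{df} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
Women’s height (inches)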
Mean = 63.4
Sum of squared deviations from mean = 85.2
Degrees of freedom (df) = (n − 1) = 13
$s^2$ = variance = 85.2/13 = 6.55 inches squared
s = standard deviation = √6.55 = 2.56 inches
We’ll never calculate these by hand, so make sure to know how to
get the standard deviation using your calculator & JMP software.
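For readers who want to see the two steps spelled out, here is a minimal Python sketch. The function names are chosen for this example. The individual heights behind the worked numbers are not listed in this transcript, so the last lines simply reuse the sum of squared deviations (85.2) and the degrees of freedom (13) given above.

```python
import math


def sample_variance(values):
    """Step 1: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = math.fsum(values) / n
    sum_sq = math.fsum((x - x_bar) ** 2 for x in values)
    return sum_sq / (n - 1)


def sample_sd(values):
    """Step 2: the standard deviation is the square root of the variance."""
    return math.sqrt(sample_variance(values))


# Worked numbers from the slide.
sum_sq_dev = 85.2
df = 13
s2 = sum_sq_dev / df
s = math.sqrt(s2)
print(f"s^2 = {s2:.2f} inches squared, s = {s:.2f} inches")
# prints: s^2 = 6.55 inches squared, s = 2.56 inches
```

Python's standard library offers `statistics.variance` and `statistics.stdev`, which also divide by n − 1.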
Variance and Standard Deviation
• Why do we square the deviations?
  - The sum of the squared deviations of any set of observations from their mean is the smallest that the sum of squared deviations from any number can possibly be.
  - The sum of the deviations of any set of observations from their mean is always zero.
• Why do we emphasize the standard deviation rather than the variance?
  - s, not $s^2$, is the natural measure of spread for Normal distributions.
  - s has the same unit of measurement as the original observations.
• Why do we average by dividing by n − 1 rather than n in calculating the variance?
  - The sum of the deviations is always zero, so only n − 1 of the squared deviations can vary freely.
  - The number n − 1 is called the degrees of freedom.
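Two of these facts are easy to check numerically. The sketch below uses the 25 example values from the median slides; the helper name `sum_sq_dev` and the trial centers are chosen only for illustration. It confirms that the deviations from the mean sum to zero (up to rounding) and that moving the center away from the mean always increases the sum of squared deviations. The second fact follows from the identity $\sum (x_i - c)^2 = \sum (x_i - \bar{x})^2 + n(\bar{x} - c)^2$.

```python
import math

years = [
    0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
    3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1,
]
n = len(years)
x_bar = math.fsum(years) / n


def sum_sq_dev(values, center):
    """Sum of squared deviations of the values from a chosen center."""
    return math.fsum((x - center) ** 2 for x in values)


# Fact 1: deviations from the mean sum to zero (up to rounding error).
sum_dev = math.fsum(x - x_bar for x in years)
print(f"sum of deviations = {sum_dev:.1e}")

# Fact 2: any center other than the mean gives a larger sum of squares.
at_mean = sum_sq_dev(years, x_bar)
for shift in (-1.0, -0.1, 0.1, 1.0):
    shifted = sum_sq_dev(years, x_bar + shift)
    # The excess equals n * shift^2, as the identity predicts.
    print(f"shift {shift:+.1f}: excess = {shifted - at_mean:.3f}")
```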
Properties of Standard Deviation
• s measures spread about the mean and should be used only when the mean is the measure of center.
• s = 0 only when all observations have the same value and there is no spread. Otherwise, s > 0.
• s is not resistant to outliers.
• s has the same units of measurement as the original observations.
Choosing among summary statistics
• Because the mean is not resistant to outliers or skew, use it to describe distributions that are fairly symmetrical and don’t have outliers.
  Plot the mean and use the standard deviation for error bars.
• Otherwise use the median in the five-number summary, which can be plotted as a boxplot.
[Figure: Height of 30 Women. Height in inches (axis 58 to 69) shown two ways: as a boxplot, and as the mean ± SD.]
What should you use, when, and why?
Arithmetic mean or median?
• Middletown is considering imposing an income tax on citizens. City hall wants a numerical summary of its citizens’ income to estimate the total tax base.
  Mean: Although income is likely to be right-skewed, the city government wants to know about the total tax base, and the total is the mean times the number of citizens.
• In a study of standard of living of typical families in Middletown, a sociologist makes a numerical summary of family income in that city.
  Median: The sociologist is interested in a “typical” family and wants to lessen the impact of extreme incomes.
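The difference between the two answers can be seen with a toy example. The incomes below are hypothetical, invented only to give a right-skewed list. The sketch shows that the mean multiplied by the number of citizens recovers the total, while the median does not, and that the median stays near the incomes of most of the list.

```python
from statistics import mean, median

# Hypothetical incomes in dollars, right-skewed by one very large value.
incomes = [28_000, 35_000, 41_000, 47_000, 52_000, 60_000, 75_000, 900_000]

n = len(incomes)
total = sum(incomes)

# City hall: mean * n gives back the total tax base.
print(f"total = {total}, n * mean = {n * mean(incomes):.0f}")

# Sociologist: the median describes a typical income and ignores the extreme one.
print(f"median = {median(incomes):.0f}, n * median = {n * median(incomes):.0f}")
```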
Homework:
• Read section 1.2; pay careful attention to the examples, especially 1.24 (quartiles) and 1.29 (sd).
• Be sure to look over the Summary of each section. Notice the words in bold print; these are the important terms that you should be familiar with as we go forward.
• Do # 1.51, 1.52, 1.54, 1.56, 1.57, 1.62-1.68, 1.74, 1.75, 1.78 (use JMP), 1.83-1.87, 1.89
• Use JMP in as many of the above problems as you can… Practice!