2. Numerical descriptors
The Practice of Statistics in the Life Sciences
Third Edition
© 2014 W.H. Freeman and Company
Objectives (PSLS Chapter 2)
Describing distributions with numbers

Measure of center: mean and median

Measure of spread: quartiles and standard deviation

The five-number summary and boxplots

IQR and outliers

Dealing with outliers

Choosing among summary statistics

Organizing a statistical problem
Measure of center: the mean
The mean, or arithmetic average
To calculate the average (mean) of a data set, add all values, then
divide by the number of individuals. It is the “center of mass.”
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
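As a minimal sketch of the formula above, the mean can be computed in Python by adding the values and dividing by n. The function name `mean` is chosen here for illustration, and the sample input is the newt healing-rate data that appears later in this chapter.

```python
def mean(values):
    """Arithmetic mean: add all values, then divide by the number of values."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Skin healing rates of 18 newts (micrometers per hour), taken from this chapter.
newt_rates = [28, 12, 23, 14, 40, 18, 22, 33, 26,
              27, 29, 11, 35, 30, 34, 22, 23, 35]

print(round(mean(newt_rates), 2))
```

Python's standard library also provides `statistics.mean`, which could be used instead of a hand-written function.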
Measure of center: the median
The median is the midpoint of a distribution—the number such that
half of the observations are smaller, and half are larger.
Sorted data (n = 25):
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
1) Sort observations from smallest to largest. n = number of observations.
2) The location of the median is (n + 1)/2 in the sorted list.
If n is odd, the median is the value of the center observation.
n = 25: (n + 1)/2 = 13, so Median = 3.4
If n is even, the median is the mean of the two center observations.
n = 24: (n + 1)/2 = 12.5, so Median = (3.3 + 3.4)/2 = 3.35
Sorted data (n = 24):
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6
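The two-step recipe for the median can be sketched in Python as follows. The function name `median` and the variable names are illustrative choices; the data are the 25 and 24 values used in the example. Note that position (n + 1)/2 counts from 1, while Python lists count from 0.

```python
def median(values):
    """Median by the (n + 1)/2 location rule."""
    ordered = sorted(values)          # step 1: sort from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: position (n + 1)/2 counted from 1 is index n // 2 counted from 0
        return ordered[mid]
    # n even: mean of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


survival_25 = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
               3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]
survival_24 = survival_25[:-1]        # same data without the largest value

print(median(survival_25))            # 3.4
print(round(median(survival_24), 2))  # 3.35
```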
Comparing the mean and the median
The median is a measure of center that is resistant to skew and
outliers. The mean is not.
[Figure: position of the mean and the median in a symmetric distribution, a left-skewed distribution, and a right-skewed distribution]
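A short Python sketch can illustrate what "resistant" means here. It uses the 25 survival times from the median example, then replaces the largest value (6.1) with 7.9, the value used in this chapter's outlier example. The variable names are illustrative.

```python
from statistics import mean, median

survival = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
            3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

# Same data, with the largest value replaced by a more extreme one.
with_outlier = survival[:-1] + [7.9]

print(median(survival), median(with_outlier))                  # 3.4 3.4
print(round(mean(survival), 3), round(mean(with_outlier), 3))  # 3.312 3.384
```

The median does not move, while the mean is pulled toward the extreme value.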
A study of freely forming groups in
bars all over Europe recorded the
group size (number of individuals in
the group) of all 501 groups in the
study that were naturally laughing.
The median laughter group size is
A) 2
B) 2.5
C) 3
D) 3.5
E) 4
The average laughter group size is
A) smaller than the median.
B) about the same as the median.
C) larger than the median.
Measure of spread: quartiles
The first quartile, Q1, is the median
of the values below the median in the
sorted data set.
The third quartile, Q3, is the median
of the values above the median in the
sorted data set.
Sorted data (n = 25):
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
Q1= first quartile = 2.2
M = median = 3.4
Q3= third quartile = 4.35
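The quartile definition above (median of the values below and above the median) can be sketched in Python as follows. The helper names `_median` and `quartiles` are illustrative. When n is odd, this sketch leaves the median itself out of both halves, which reproduces the values quoted above. Statistical software may use slightly different quartile conventions and return slightly different numbers.

```python
def _median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]                                   # values below the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # values above the median
    return _median(lower), _median(ordered), _median(upper)


survival = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
            3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

print([round(q, 2) for q in quartiles(survival)])  # [2.2, 3.4, 4.35]
```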
How fast do skin wounds heal?
Here are the skin healing rate data from 18 newts measured
in micrometers per hour:
28 12 23 14 40 18 22 33 26 27 29 11 35 30 34 22 23 35
Sorted data:
11 12 14 18 22 22 23 23 26 27 28 29 30 33 34 35 35 40
Median = ???
Quartiles = ???
Measure of spread: standard deviation
The standard deviation is used to describe the variation around the mean.
To get the standard deviation of a SAMPLE of data:
1) Calculate the variance s²
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
2) Take the square root to get the standard deviation s
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
Learn how to obtain the standard deviation of a sample using technology.
A person’s metabolic rate is the rate at which the body consumes energy.
Find the mean and standard deviation for the metabolic rates of a sample of 7 men
(in kilocalories, Cal, per 24 hours).
$$\bar{x} = \frac{1}{n}\sum x_i = 1600$$
$$\sum (x_i - \bar{x})^2 = 214{,}870$$
$$df = n - 1 = 6$$
$$s^2 = \frac{1}{df}\sum (x_i - \bar{x})^2 = \frac{214{,}870}{6} \approx 35{,}811.7$$
$$s = \sqrt{35{,}811.7} \approx 189.2$$
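The two steps for the sample standard deviation can be sketched in Python as follows. The function names `sample_variance` and `sample_sd` are illustrative. Because the seven individual metabolic rates are not listed here, the sketch only re-checks the last two lines of the worked example from its stated sum of squared deviations (214,870) and degrees of freedom (6).

```python
import math


def sample_variance(values):
    """Sample variance s^2: sum of squared deviations divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    sum_sq = sum((x - x_bar) ** 2 for x in values)
    return sum_sq / (n - 1)


def sample_sd(values):
    """Sample standard deviation s: square root of the sample variance."""
    return math.sqrt(sample_variance(values))


# Re-check the arithmetic of the metabolic rate example.
sum_sq, df = 214_870, 6
print(round(sum_sq / df, 1))             # 35811.7
print(round(math.sqrt(sum_sq / df), 1))  # 189.2
```

In practice, `statistics.stdev` from Python's standard library computes the same sample standard deviation (with the n − 1 divisor).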
Center and spread in boxplots
[Figure: boxplot of years until death for 25 individuals with Disease X, drawn next to the sorted data]
“Five-number summary”
max = 6.1
Q3 = 4.35
median = 3.4
Q1 = 2.2
min = 0.6
IQR and suspected outliers
The interquartile range (IQR) is the distance between the first and third
quartiles (the length of the box in the boxplot)
IQR = Q3 – Q1
An outlier is an individual value that falls outside the overall pattern.
How far outside the overall pattern does a value have to fall to be
considered a suspected outlier?

Suspected low outlier: any value < Q1 – 1.5 × IQR
Suspected high outlier: any value > Q3 + 1.5 × IQR
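The 1.5 × IQR rule can be sketched in Python as follows. The function names `outlier_fences` and `suspected_outliers` are illustrative. The quartiles are the Disease X values quoted in this chapter (Q1 = 2.2, Q3 = 4.35), and the short list of test values mixes survival times from the chapter's examples.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (low, high) cutoffs Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def suspected_outliers(values, q1, q3):
    """Values falling below the low cutoff or above the high cutoff."""
    low, high = outlier_fences(q1, q3)
    return [x for x in values if x < low or x > high]


q1, q3 = 2.2, 4.35
low, high = outlier_fences(q1, q3)
print(round(low, 3), round(high, 3))                     # -1.025 7.575
print(suspected_outliers([0.6, 3.4, 6.1, 7.9], q1, q3))  # [7.9]
```

A survival time of 6.1 years lies inside the cutoffs, while 7.9 years lies above the high cutoff and is flagged.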
[Figure: boxplot of years until death for 25 individuals with Disease X, drawn next to the sorted data; the largest value, 7.9, is marked with an asterisk]
Q1 = 2.2
Q3 = 4.35
Interquartile range: Q3 – Q1 = 4.35 – 2.2 = 2.15
Distance to Q3: 7.9 – 4.35 = 3.55
Individual #25 has a survival of 7.9 years, which is 3.55 years
above the third quartile. This is more than 1.5 × IQR = 3.225 years.
So individual #25 is a suspected outlier.
Anonymous class survey:
weight (lbs) and height (in) were used to compute BMI.
[Figure: BMI values for the class, on an axis from 15 to 43]
Height  Weight  Sex   BMI
60      230     Male  44.9
Unusual individual or typo?
height of 60 in is the shortest for men
weight of 230 lbs is almost the heaviest
Dealing with outliers
What should you do if you find outliers in your data? It depends in part
on what kind of outliers they are:

Human error in recording information

Human error in experimentation or data collection

Unexplainable but apparently legitimate wild observations
    Are you interested in ALL individuals?
    Are you interested only in typical individuals?
Don’t discard outliers just to make your data look better, and don’t
act as if they did not exist.
Choosing among summary statistics

Because the mean is not resistant to outliers or skew, use it to describe
distributions that are fairly symmetrical and don’t have outliers.
Plot the mean and use the standard deviation for error bars.
Otherwise, use the median and the five-number summary, which can be
plotted as a boxplot.
[Figure: height of 30 women, in inches (axis from 58 to 69), shown as a boxplot and as mean ± s.d.]
Deep-sea sediments.
Phytopigment concentrations in deep-sea sediments
collected worldwide show a very strong right-skew.

Which of these two values is the mean and which is the median?
0.015 and 0.009 grams per square meter of bottom surface

Which would be a better summary statistic for these data?
Researchers grafted human cancerous cells onto 20 healthy adult mice. Then 10
of the mice were injected with tumor-specific antibodies (anti-CD47) while the
other 10 mice were not (IgG). Here is what a table of the raw data would look like.
Mouse  Treatment  Presence of metastases  Number of metastases
1      IgG        yes                     1
2      IgG        yes                     1
3      IgG        yes                     2
4      IgG        yes                     2
5      IgG        yes                     2
6      IgG        yes                     3
7      IgG        yes                     3
8      IgG        yes                     3
9      IgG        yes                     3
10     IgG        yes                     4
11     anti-CD47  no                      0
12     anti-CD47  no                      0
13     anti-CD47  no                      0
14     anti-CD47  no                      0
15     anti-CD47  no                      0
16     anti-CD47  no                      0
17     anti-CD47  no                      0
18     anti-CD47  no                      0
19     anti-CD47  no                      0
20     anti-CD47  yes                     1
What summary statistics would
you use for each of these two
variables?
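One possible way to explore this question is sketched below in Python, using the 20 rows of the table. It reports, for each treatment group, the count of mice with metastases (a categorical variable) and the median number of metastases (a quantitative variable). These choices are an assumption made for illustration, not the answer given in the course, and the function name `summarize` is hypothetical.

```python
from statistics import median

# Raw data from the table, in mouse order 1..20.
treatment = ["IgG"] * 10 + ["anti-CD47"] * 10
presence = ["yes"] * 10 + ["no"] * 9 + ["yes"]
number = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4] + [0] * 9 + [1]


def summarize(group):
    """Return (mice with metastases, group size, median number of metastases)."""
    rows = [i for i, t in enumerate(treatment) if t == group]
    with_metastases = sum(presence[i] == "yes" for i in rows)
    counts = [number[i] for i in rows]
    return with_metastases, len(rows), median(counts)


for group in ("IgG", "anti-CD47"):
    yes, size, med = summarize(group)
    print(f"{group}: {yes}/{size} mice with metastases, median number = {med}")
```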
Organizing a statistical problem
1. State:
What is the practical question, in the context of a real-world setting?
2. Plan:
What specific statistical operations does this problem call for?
3. Solve:
Make the graphs and carry out the calculations needed for this problem.
4. Conclude:
Give your practical conclusion in the real-world setting.