Dual Tragedies in the B-ham Paper
Module 2
Simple Descriptive Statistics and
Univariate Displays of Data
A Tale of Three Cities
George Howard, DrPH
A Tale of Three Cities
Background
• There were substantial differences in
cancer rates between regions of Alabama
– Birmingham 143/100,000
– Mobile 110/100,000
– Montgomery 94/100,000
• Could these differences be due to the
horrible air pollution largely caused by
highway 280 in Birmingham?
• The suspect agent is suspended
particulate matter
A Tale of Three Cities
Collection of Data
• Sampled suspended particulate matter (ppm) in the three cities
on randomly selected days.
• What are the patterns here?
• What are the differences between these cities?
• Describe the variables in this analysis
Birmingham (n=15)
150, 131, 136, 149, 126, 141, 122, 135, 110, 123, 87, 116, 128, 130, 127
Mobile (n=25)
139, 160, 126, 168, 140, 142, 211, 152, 170, 103, 170, 141, 139,
121, 178, 165, 123, 178, 219, 131, 174, 112, 160, 168, 162
Montgomery (n=28)
113, 155, 100, 94, 146, 111, 145, 92, 173, 100, 105, 110, 106, 114,
136, 151, 98, 94, 118, 137, 123, 159, 96, 128, 127, 120, 80, 230
Types of Statistical Tests and
Approaches
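Rows: Type of Dependent Data. Columns: Type of Independent Data. Cells are numbered 1 to 21.
• Dependent data: Categorical (dichotomous)
– 1. One sample (focus usually on estimation): Estimate proportion (and confidence limits)
– 2. Two samples, independent: Chi-Square Test
– 3. Two samples, matched: McNemar Test
– 4. Multiple samples, independent: Chi-Square Test
– 5. Multiple samples, repeated measures: Generalized Estimating Equations (GEE)
– 6. Continuous independent data, single: Logistic Regression
– 7. Continuous independent data, multiple: Logistic Regression
• Dependent data: Continuous
– 8. One sample (focus usually on estimation): Estimate mean (and confidence limit)
– 9. Two samples, independent: Independent t-test
– 10. Two samples, matched: Paired t-test
– 11. Multiple samples, independent: Analysis of Variance
– 12. Multiple samples, repeated measures: Multivariate Analysis of Variance
– 13. Continuous independent data, single: Simple linear regression & correlation coefficient
– 14. Continuous independent data, multiple: Multiple Regression
• Dependent data: Right Censored (survival)
– 15. One sample (focus usually on estimation): Kaplan-Meier Survival
– 16. Two samples, independent: Kaplan-Meier Survival for both curves, with tests of difference by Wilcoxon or log-rank test
– 17. Two samples, matched: Very unusual
– 18. Multiple samples, independent: Kaplan-Meier Survival for each group, with tests by generalized Wilcoxon or generalized log-rank
– 19. Multiple samples, repeated measures: Very unusual
– 20. Continuous independent data, single: Proportional Hazards analysis
– 21. Continuous independent data, multiple: Proportional Hazards analysis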
Consider the Birmingham Data
• Place the data in equally spaced categories
Interval          Mid    #     %
82.5<X<97.5        90    1    6.7
97.5<X<112.5      105    1    6.7
112.5<X<127.5     120    5   33.3
127.5<X<142.5     135    6   40.0
142.5<X<157.5     150    2   13.3
Birmingham (n=15)
150, 131, 136, 149, 126, 141, 122, 135, 110, 123, 87, 116, 128, 130, 127
• Clustering of points around 112-142
categories, with fewer points on either side
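The frequency table above can be reproduced with a few lines of code. This is a minimal sketch, not the author's method: the helper name `bin_counts` is made up for illustration, and it assumes each interval includes its lower edge (no reading falls exactly on a cutpoint here, so the choice does not matter).

```python
# Birmingham SPM readings (ppm), as listed on the slide.
birmingham = [150, 131, 136, 149, 126, 141, 122, 135, 110, 123, 87, 116, 128, 130, 127]

def bin_counts(data, start, width, n_bins):
    """Count observations in n_bins equal-width intervals beginning at start."""
    counts = [0] * n_bins
    for x in data:
        index = int((x - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Five 15-ppm intervals starting at 82.5, as in the table.
counts = bin_counts(birmingham, start=82.5, width=15, n_bins=5)
midpoints = [82.5 + 15 * i + 7.5 for i in range(5)]
percents = [round(100 * c / len(birmingham), 1) for c in counts]

# Print a text version of the table with a crude bar for each interval.
for mid, c, p in zip(midpoints, counts, percents):
    print(f"{mid:6.1f}  {c:2d}  {p:5.1f}%  {'#' * c}")
```

The row of `#` marks is a rough text histogram: most of the points sit in the 120 and 135 categories.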
A Tale of Three Cities
Description of Birmingham SPM
[Figure: histogram of the Birmingham data; x-axis SPM (ppm) with categories centered at 90, 105, 120, 135, and 150; y-axis frequency from 0 to 7]
A Tale of Three Cities
Description of Birmingham SPM
• How do you choose how many intervals
to have in a histogram?
– Rule of thumb: 3+ observations per category
• Remember that where you make the cutpoints
is also an arbitrary decision, and it
changes how the histogram looks
[Figure: two histograms of the Birmingham data drawn with different intervals; one has x-axis ticks at 90, 105, 120, 135, 150 and the other at 90, 100, 110, 120, 130, 140, 150; x-axis SPM (ppm), y-axis frequency]
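To see how much the choice of cutpoints matters, the same data can be counted two ways. This is only a sketch: the second set of intervals (10 ppm wide, starting at 85) is a hypothetical choice made up for illustration, and reading the rule of thumb as "average count per category of 3 or more" is one possible interpretation.

```python
birmingham = [150, 131, 136, 149, 126, 141, 122, 135, 110, 123, 87, 116, 128, 130, 127]

def bin_counts(data, start, width, n_bins):
    """Count observations in n_bins equal-width intervals beginning at start."""
    counts = [0] * n_bins
    for x in data:
        index = int((x - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Choice A: the 15-ppm intervals starting at 82.5 used earlier in the slides.
wide = bin_counts(birmingham, start=82.5, width=15, n_bins=5)

# Choice B (hypothetical): 10-ppm intervals starting at 85.
narrow = bin_counts(birmingham, start=85, width=10, n_bins=7)

# One reading of the rule of thumb: average number of observations per category.
avg_wide = len(birmingham) / len(wide)
avg_narrow = len(birmingham) / len(narrow)

print("15-ppm intervals:", wide, "average per category:", round(avg_wide, 2))
print("10-ppm intervals:", narrow, "average per category:", round(avg_narrow, 2))
```

With 15 observations, five categories average exactly 3 per category, while seven categories fall below that and leave an empty interval.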
A Tale of Three Cities
Comparison of the three cities
(what’s wrong with this picture?)
[Figure: three frequency histograms (Mobile, Birmingham, Montgomery), each drawn with its own x-axis (SPM, ppm) and y-axis (frequency) scale]
A Tale of Three Cities
Comparison of the three cities
(now drawn on same scales)
[Figure: histograms for Mobile, Birmingham, and Montgomery on common axes; x-axis SPM (ppm) from 80 to 230, y-axis % of days from 0 to 40]
How do we describe these cities
with a few simple numbers?
• Where is the middle of the data (that is
an “average” value)?
• How spread out are the numbers?
• Are there other measures that may be
important to describe these data?
Gee, what do we mean by
“average” anyway
• Measures of “central tendency”
• There are MANY ways to calculate an
average
• Two most common ways
– The arithmetic mean
– The median
• There are other approaches
The Arithmetic Mean
• Step 1: Add up the numbers
• Step 2: Divide the sum by the
number of observations
$$\bar{X} = \frac{\sum_i X_i}{n} = \frac{150 + 131 + 136 + \dots + 127}{15} = \frac{1911}{15} = 127.4$$
Birmingham (n=15)
150, 131, 136, 149, 126, 141, 122, 135, 110, 123, 87, 116, 128, 130, 127
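The two steps translate directly into code. A minimal sketch using only built-in functions:

```python
# Birmingham SPM readings (ppm), as listed on the slide.
birmingham = [150, 131, 136, 149, 126, 141, 122, 135, 110, 123, 87, 116, 128, 130, 127]

# Step 1: add up the numbers.
total = sum(birmingham)

# Step 2: divide the sum by the number of observations.
mean = total / len(birmingham)

print(total, len(birmingham), round(mean, 1))
```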
The Median
• The point where half the data are bigger
(and half less)
• There are at least 4 rules to find the
median (and other percentiles)
• The rules differ if there are an odd or
even number of data points
– If odd, then the “middle” data point
– If even, then the average of the “two middle”
data points
The Median
(continued)
• Step 1: Sort the data
• Step 2: Pick the median
• Consider Birmingham
data (note that there
are an odd number of
data points)
• Median is 128
Birmingham (n=15), sorted
87, 110, 116, 122, 123, 126, 127, 128, 130, 131, 135, 136, 141, 149, 150
8th of 15 data points ==> 128
The Median
(continued)
• Suppose we only had 14
data points in
Birmingham
• Step 1: Find the middle
two data points
• Step 2: Take the average
of these two observations
• Median = 127.5
Birmingham (now with n=14 points), sorted
87, 110, 116, 122, 123, 126, 127, 128, 130, 131, 135, 136, 141, 149
7th of 14 data points ==> 127
8th of 14 data points ==> 128
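The odd and even cases can be handled by one small function. This is a sketch of the "middle point, or average of the two middle points" rule described above; the function name `median` is just a label for this example, and the 14-point list is made by dropping the largest value (150), as on the slide.

```python
def median(data):
    """Middle value if n is odd, average of the two middle values if n is even."""
    ordered = sorted(data)          # Step 1: sort the data
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]      # odd: the single middle point
    return (ordered[middle - 1] + ordered[middle]) / 2   # even: average the two middle points

birmingham_15 = [150, 131, 136, 149, 126, 141, 122, 135, 110, 123, 87, 116, 128, 130, 127]
birmingham_14 = sorted(birmingham_15)[:-1]   # drop the largest value (150)

print(median(birmingham_15))   # odd case
print(median(birmingham_14))   # even case
```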
A Tale of Three Cities
Measures of Central Tendency
[Figure: histograms for the three cities on common axes; x-axis SPM (ppm) from 80 to 230, y-axis % of days from 0 to 40; each panel is annotated with the city's mean and median]
Mobile: Mean = 154.0, Median = 154
Birmingham: Mean = 127.4, Median = 128
Montgomery: Mean = 123.6, Median = 116
Measures of Central Tendency
• Birmingham and Montgomery have lower
measures of central tendency than Mobile
• For Birmingham and Mobile, the mean and
median are almost the same value
– This happens when distributions are symmetric
• For Montgomery, the mean is quite a bit
higher than the median
– The mean is “pulled up” by outliers
– The median is not sensitive to outliers
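The different sensitivity of the mean and the median can be checked with the Montgomery data. In this sketch the "stretched" data set is hypothetical: it doubles the largest reading (230 becomes 460) purely to show what an even more extreme outlier would do. Only functions from Python's standard `statistics` module are used.

```python
import statistics

# Montgomery SPM readings (ppm), as listed on the data slide.
montgomery = [113, 155, 100, 94, 146, 111, 145, 92, 173, 100, 105, 110, 106, 114,
              136, 151, 98, 94, 118, 137, 123, 159, 96, 128, 127, 120, 80, 230]

# Hypothetical variant: make the largest reading twice as extreme.
stretched = [460 if x == 230 else x for x in montgomery]

mean_original = statistics.mean(montgomery)
median_original = statistics.median(montgomery)
mean_stretched = statistics.mean(stretched)
median_stretched = statistics.median(stretched)

print("original :", round(mean_original, 1), median_original)
print("stretched:", round(mean_stretched, 1), median_stretched)
```

The mean moves up by several ppm when the single largest point is stretched, while the median does not move at all.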
How “spread out” are the
measures
• Measures of “dispersion”
• The range is the simplest measure
– Birmingham: 150 - 87 = 63
– Mobile: 219 - 103 = 116
– Montgomery: 230 - 80 = 150
• It appears that data from Montgomery are
very spread out, Mobile is not as spread
out, and Birmingham is very “compact”
• Range is influenced by the outliers
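The three ranges can be recomputed from the raw readings. A minimal sketch, with the city lists transcribed from the data slide:

```python
# SPM readings (ppm) for the three cities, transcribed from the data slide.
cities = {
    "Birmingham": [150, 131, 136, 149, 126, 141, 122, 135, 110, 123, 87, 116, 128, 130, 127],
    "Mobile": [139, 160, 126, 168, 140, 142, 211, 152, 170, 103, 170, 141, 139,
               121, 178, 165, 123, 178, 219, 131, 174, 112, 160, 168, 162],
    "Montgomery": [113, 155, 100, 94, 146, 111, 145, 92, 173, 100, 105, 110, 106, 114,
                   136, 151, 98, 94, 118, 137, 123, 159, 96, 128, 127, 120, 80, 230],
}

# Range = largest value minus smallest value.
ranges = {city: max(values) - min(values) for city, values in cities.items()}

for city, spread in ranges.items():
    print(f"{city}: {max(cities[city])} - {min(cities[city])} = {spread}")
```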
How “spread out” are the
measures (continued)
• The range is influenced by outliers (just
like the mean)
– But the median is not influenced by the
outliers
– Is there some measure of dispersion that will
not be so affected by 1 (or 2) points?
Measures of Dispersion
Percentiles
• The kth percentile is that place in the data
where k% of the data are below the
cutpoint
• There are many alternative approaches
to define percentiles
• In one approach, they are determined by
the position k*(n+1) in the sorted data, with k written
as a proportion (for example, 0.25 for the 25th percentile)
– If integer, then pick that data point
– If non-integer, then average the two data
points around that position
Measures of Dispersion
Percentiles (continued)
• For example, consider the 25%-tile from
Birmingham
– Step 1: calculate k*(n+1) = 0.25*(15+1) = 4
– Step 2: since this is an integer, pick the 4th
data point
– 25%-tile is 122
• Consider the 33%-tile from Birmingham
– Step 1: calculate k*(n+1) = 0.33*(15+1) = 5.3
– Step 2: average the 5th and 6th data points
– 33%-tile is 1/2 way between 123 and 126 or 124.5
Birmingham (n=15), sorted
87, 110, 116, 122, 123, 126, 127, 128, 130, 131, 135, 136, 141, 149, 150
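The k*(n+1) rule can be written as a short function. This is a sketch of that one approach only; other percentile definitions, including those built into statistical software, can give different answers, especially near the ends of the data. The function name `percentile` and the clamping of positions below 1 or above n are assumptions made for this example, since the slides do not cover those cases.

```python
import math

def percentile(data, k):
    """Percentile by the k*(n+1) rule; k is a proportion such as 0.25."""
    ordered = sorted(data)
    n = len(ordered)
    position = k * (n + 1)
    # Assumption: keep the position inside the data at the extremes.
    position = min(max(position, 1), n)
    nearest = round(position)
    if abs(position - nearest) < 1e-9:
        # Integer position: pick that data point (positions count from 1).
        return ordered[nearest - 1]
    # Non-integer position: average the two data points around it.
    lower = math.floor(position)
    return (ordered[lower - 1] + ordered[lower]) / 2

birmingham = [150, 131, 136, 149, 126, 141, 122, 135, 110, 123, 87, 116, 128, 130, 127]

for k in (0.25, 0.33, 0.50, 0.75):
    print(k, percentile(birmingham, k))
```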
Percentiles from the 3 Cities
Percentile    Birmingham    Mobile    Montgomery
10th          110           121       94
25th          122           139       100
50th          128           160       116
75th          136           170       141
90th          150           178       159
Measures of Dispersion
Percentiles (continued)
• Special names for percentiles
– The 50th percentile is called the median
– The 25th, 50th and 75th percentiles are called the quartiles
– The 33rd and 67th percentiles are called the tertiles
– The 10th, 20th, … and 90th are called the deciles
• The percentile rule picks the 8th data point for the
median (0.5*(15+1) = 8), so we get the “right answer”
• Is there a way to use these percentiles as a simple
measure of dispersion?
Percentiles from the 3 Cities
Percentile             Birmingham        Mobile            Montgomery
10th                   110               121               94
25th                   122               139               100
50th                   128               160               116
75th                   136               170               141
90th                   150               178               159
Interquartile range    136 - 122 = 14    170 - 139 = 31    141 - 100 = 41
Interdecile range      150 - 110 = 40    178 - 121 = 57    159 - 94 = 65
Percentiles from the 3 Cities
• Percentiles are relatively insensitive to
“outliers”
• How do we define outliers?
– Rule of thumb: a data point is an “outlier” if it is
• Above 1.5 interquartile ranges over the 75th percentile
• Below 1.5 interquartile ranges under the 25th percentile
– Consider Montgomery data
• Interquartile range is 41
• 75th percentile is 141
• Outliers are above 141+1.5*41=202.5
• The value at 230 is an “outlier”
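The rule of thumb can be applied mechanically. In this sketch the quartiles (100 and 141) are taken from the percentile table rather than recomputed, and the helper name `outlier_fences` is made up for illustration.

```python
def outlier_fences(q1, q3, multiplier=1.5):
    """Return the (low, high) cutoffs: 1.5 interquartile ranges beyond the quartiles."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

# Montgomery SPM readings (ppm), as listed on the data slide.
montgomery = [113, 155, 100, 94, 146, 111, 145, 92, 173, 100, 105, 110, 106, 114,
              136, 151, 98, 94, 118, 137, 123, 159, 96, 128, 127, 120, 80, 230]

# 25th and 75th percentiles for Montgomery, from the percentile table.
low, high = outlier_fences(q1=100, q3=141)
outliers = [x for x in montgomery if x < low or x > high]

print("cutoffs:", low, high)
print("outliers:", outliers)
```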
Percentiles from the 3 Cities
• So, percentiles are “neat”
– But with even 3 cities we have to think about
21 or more numbers
• 10th, 25th, 50th, 75th, 90th, percentiles
• interquartile range, interdecile range
• Isn’t there some way to look at these
graphically and to see the outliers
• Box and whisker plots
Percentiles from the 3 Cities
Box and Whisker Plots
• Draw box
– Top of box is the 75th percentile (136)
– Bottom of box is the 25th percentile (122)
– Line is the 50th percentile (median=128)
• Find outliers
– Below 122-1.5*14=101
– Above 136+1.5*14=157
– Plot outlier(s) as a point (87)
• Draw “whiskers” to the highest
non-outlier (150) and lowest non-outlier (110) points
• Plot outliers as single data points
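Birmingham (n=15), sorted
87, 110, 116, 122, 123, 126, 127, 128, 130, 131, 135, 136, 141, 149, 150
[Figure: box and whisker plot of Birmingham SPM (N=15); y-axis from 80 to 160; observation 11 (the value 87) is plotted as a single outlier point]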
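The numbers needed to draw the Birmingham box plot can be collected in one pass. A minimal sketch, assuming the quartiles from the percentile table (122, 128, 136) and using all 15 readings, including 150; the variable names are illustrative only.

```python
birmingham = [150, 131, 136, 149, 126, 141, 122, 135, 110, 123, 87, 116, 128, 130, 127]

# Box: 25th percentile, median, and 75th percentile from the percentile table.
q1, med, q3 = 122, 128, 136
iqr = q3 - q1

# Outlier cutoffs: 1.5 interquartile ranges beyond the box.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Points beyond the cutoffs are plotted individually.
outliers = [x for x in birmingham if x < low_fence or x > high_fence]

# Whiskers run to the most extreme points that are not outliers.
inside = [x for x in birmingham if low_fence <= x <= high_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("box:", q1, med, q3)
print("cutoffs:", low_fence, high_fence)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```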
Percentiles from the 3 Cities
Box and Whisker Plots
• Box and Whisker plots
make for easy
comparison of groups
– B-ham doesn’t have
much spread
– Mobile is considerably
above B-ham or
Montgomery
– B-ham and Mobile are
fairly symmetric
[Figure: side-by-side box and whisker plots of SPM by city: Birmingham (N=15), Mobile (N=25), Montgomery (N=28); y-axis SPM from 0 to 300]
Measures of Dispersion
Standard Deviation (and Variance)
• So far we have two measures of dispersion
– Range
– Percentiles (and differences between percentiles)
• Is there another single number that
summarizes how spread out the data are?
• Consider measures of how far the data are
from the mean
– If data are far from the mean, then they are
really spread out
– This is the idea for the Standard Deviation
Measures of Dispersion
Standard Deviation (and Variance)
• Idea #1 (a logical but dumb one)
– Calculate the distance each data
point is from the mean (absolute value)
– Take the average of these numbers
– Mean absolute deviation
$$\mathrm{MAD} = \frac{\sum_i |\bar{X} - X_i|}{n} = \frac{|127.4 - 87| + |127.4 - 110| + \dots + |127.4 - 150|}{15} = \frac{40.4 + 17.4 + \dots + 22.6}{15} = \frac{161.6}{15} = 10.8$$
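The mean absolute deviation for Birmingham can be checked directly. A minimal sketch with illustrative variable names:

```python
birmingham = [150, 131, 136, 149, 126, 141, 122, 135, 110, 123, 87, 116, 128, 130, 127]

n = len(birmingham)
mean = sum(birmingham) / n

# Distance of each data point from the mean, ignoring sign.
absolute_deviations = [abs(mean - x) for x in birmingham]

# Average of those distances.
mad = sum(absolute_deviations) / n

print(round(sum(absolute_deviations), 1), round(mad, 1))
```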
Measures of Dispersion
Standard Deviation (and Variance)
• Idea #2 (a great one --- although it seems
illogical)
• Take the square root of the sum of the
squared deviations divided by n-1
$$SD = \sqrt{\frac{\sum_i (\bar{X} - X_i)^2}{n-1}} = \sqrt{\frac{(127.4 - 87)^2 + (127.4 - 110)^2 + \dots + (127.4 - 150)^2}{15 - 1}} = \sqrt{\frac{1632 + 303 + \dots + 511}{14}} = \sqrt{\frac{3430}{14}} \approx 15.6$$
• The variance is the standard deviation
squared: $3430/14 = 245.0$
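The same calculation in code, as a sketch. It follows the steps above (squared deviations, divide by n-1, square root) and then cross-checks the result against `statistics.stdev` from the Python standard library, which also divides by n-1.

```python
import math
import statistics

birmingham = [150, 131, 136, 149, 126, 141, 122, 135, 110, 123, 87, 116, 128, 130, 127]

n = len(birmingham)
mean = sum(birmingham) / n

# Squared deviation of each data point from the mean.
squared_deviations = [(mean - x) ** 2 for x in birmingham]
sum_of_squares = sum(squared_deviations)

# Variance divides by n-1; the standard deviation is its square root.
variance = sum_of_squares / (n - 1)
sd = math.sqrt(variance)

print(round(sum_of_squares), round(variance, 1), round(sd, 2))

# Cross-check against the standard library (sample standard deviation).
library_sd = statistics.stdev(birmingham)
```

To two decimals the standard deviation is 15.65, which the slides report as 15.6.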
A Tale of Three Cities
Descriptive Statistics
[Figure: histograms for the three cities on common axes; x-axis SPM (ppm) from 80 to 230, y-axis % of days from 0 to 40; each panel is annotated with the statistics below]
Statistic    Birmingham    Mobile    Montgomery
Mean         127.4         154.0     123.6
Median       128           154       116
Range        63            116       150
IQR          14            31        41
SD           15.6          28.0      31.3
Summary: Descriptive
Statistics and Simple Graphs
• What we have talked about
– Histogram
– Measures of Central Tendency
• Mean
• Median
– Measures of Dispersion
• Range
• Percentiles
– Interquartile range
– Interdecile range
• Standard deviation
– Box and Whisker plots
Summary: Descriptive
Statistics and Simple Graphs
• What we have not talked
about
– Simple descriptive statistics
to describe skew
– Simple descriptive statistics
to describe kurtosis
• There are many other kinds
of graphs not discussed
[Figure: histogram of an example variable labeled NEW; x-axis from 0 to 400, y-axis frequency from 0 to 10; Std. Dev = 91.44, Mean = 112.4, N = 50]
Summary: Descriptive
Statistics and Simple Graphs
• Don’t be fooled by simple looks at the data
• Consider two populations
– Box plots (see figure)
– Descriptive Stats
Statistic          Group 1    Group 2
Mean               10.0       9.9
SD                 5.8        5.5
25th percentile    4.3        5.1
Median             10.5       9.8
75th percentile    15.3       15.0
[Figure: box plots of VAR00002 for the two groups (VAR00001 = 1.00 and 2.00), N = 40 each; y-axis from -10 to 30]
• These two groups sure look alike!!!
But... here are the two distributions
[Figure: histograms of VAR00002 for each group. VAR00001 = 1.00: Std. Dev = 5.83, Mean = 10.0, N = 40. VAR00001 = 2.00: Std. Dev = 5.49, Mean = 9.9, N = 40.]
A Tale of 3 Cities
Conclusions
• B-ham appeared to have consistently
lower levels of SPM than either Mobile or
Montgomery
– Lower measures of central tendency
– Less dispersion
• It would seem hard to argue that high
levels of SPM are the cause of the higher
cancer rates
Dual Tragedies in the B-ham Paper