Dual Tragedies in the B-ham Paper

Module 2: Simple Descriptive Statistics and Univariate Displays of Data
A Tale of Three Cities
George Howard, DrPH

A Tale of Three Cities: Background
• There were substantial differences in cancer rates between regions of Alabama
  – Birmingham 143/100,000
  – Mobile 110/100,000
  – Montgomery 94/100,000
• Could these differences be due to the horrible air pollution largely caused by highway 280 in Birmingham?
• The suspect agent is suspended particulate matter (SPM)

A Tale of Three Cities: Collection of Data
• Sampled suspended particulate matter (ppm) in the three cities on randomly selected days.
• What are the patterns here?
• What are the differences between these cities?
• Describe the variables in this analysis

Birmingham (n=15)
150 131 136 149 126 141 122 135 110 123 87 116 128 130 127

Mobile (n=25)
139 160 126 168 140 142 211 152 170 103 170 141 139 121 178 165 123 178 219 131 174 112 160 168 162

Montgomery (n=28)
113 155 100 94 146 111 145 92 173 100 105 110 106 114 136 151 98 94 118 137 123 159 96 128 127 120 80 230

Types of Statistical Tests and Approaches
(Rows: type of dependent data. Columns: type of independent data. Cell numbers are those used on the slide.)

Dependent data: Categorical (dichotomous)
  1. One sample (focus usually on estimation): Estimate proportion (and confidence limits)
  2. Two samples, independent: Chi-Square Test
  3. Two samples, matched: McNemar Test
  4. Multiple samples, independent: Chi-Square Test
  5. Multiple samples, repeated measures: Generalized Estimating Equations (GEE)
  6. Continuous independent variable, single: Logistic Regression
  7. Continuous independent variables, multiple: Logistic Regression

Dependent data: Continuous
  8. One sample (focus usually on estimation): Estimate mean (and confidence limit)
  9. Two samples, independent: Independent t-test
  10. Two samples, matched: Paired t-test
  11. Multiple samples, independent: Analysis of Variance
  12. Multiple samples, repeated measures: Multivariate Analysis of Variance
  13. Continuous independent variable, single: Simple linear regression & correlation coefficient
  14. Continuous independent variables, multiple: Multiple Regression

Dependent data: Right Censored (survival)
  15. One sample (focus usually on estimation): Kaplan-Meier Survival
  16. Two samples, independent: Kaplan-Meier Survival for both curves, with tests of difference by Wilcoxon or log-rank test
  17. Two samples, matched: Very unusual
  18. Multiple samples, independent: Kaplan-Meier Survival for each group, with tests by generalized Wilcoxon or generalized log-rank
  19. Multiple samples, repeated measures: Very unusual
  20. Continuous independent variable, single: Proportional Hazards analysis
  21. Continuous independent variables, multiple: Proportional Hazards analysis

Consider the Birmingham Data
• Place the data in equally spaced categories

Interval           Mid     #      %
82.5<X<97.5         90     1     6.7
97.5<X<112.5       105     1     6.7
112.5<X<127.5      120     5    33.3
127.5<X<142.5      135     6    40.0
142.5<X<157.5      150     2    13.3

• Clustering of points around the 112-142 categories, with fewer points on either side

A Tale of Three Cities: Description of Birmingham SPM
[Figure: histogram of Birmingham SPM. x-axis: SPM (ppm), bins centered at 90, 105, 120, 135, 150; y-axis: Frequency]

A Tale of Three Cities: Description of Birmingham SPM
• How do you choose how many intervals to have in a histogram?
  – Rule of thumb: 3+ observations per category
• Remember that where you make the cutpoints is also an arbitrary decision, and it changes how the histogram looks
[Figure: two histograms of Birmingham SPM, one with bins centered at 90, 105, ..., 150 and one with bins centered at 90, 100, ..., 150. x-axis: SPM (ppm); y-axis: Frequency]

A Tale of Three Cities: Comparison of the three cities (what's wrong with this picture?)
[Figure: frequency histograms for Birmingham, Mobile, and Montgomery, each drawn with its own x-axis and y-axis scale. x-axis: SPM (ppm); y-axis: Frequency]

A Tale of Three Cities: Comparison of the three cities (now drawn on same scales)
[Figure: histograms for Birmingham, Mobile, and Montgomery on common axes. x-axis: SPM (ppm), 80 to 230; y-axis: % of Days, 0 to 40]

How do we describe these cities with a few simple numbers?
• Where is the middle of the data (that is, an "average" value)?
• How spread out are the numbers?
• Are there other measures that may be important to describe these data?
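As a rough illustration of the Birmingham frequency table and of the point that cutpoints are arbitrary, the sketch below bins the 15 Birmingham readings twice. The helper name `bin_counts`, the half-open interval convention, and the second set of cutpoints (width 10, starting at 85) are assumptions made for this example, not something stated on the slides.

```python
# Birmingham SPM readings (ppm), copied from the slides.
birmingham = [150, 131, 136, 149, 126, 141, 122, 135, 110, 123, 87, 116, 128, 130, 127]


def bin_counts(data, start, width, n_bins):
    """Count observations in n_bins equal-width intervals [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        idx = int((x - start) // width)
        if 0 <= idx < n_bins:  # values outside the binned range are ignored
            counts[idx] += 1
    return counts


def print_table(data, start, width, n_bins):
    """Print midpoint, count, and percent for each interval."""
    for i, c in enumerate(bin_counts(data, start, width, n_bins)):
        mid = start + width * i + width / 2
        print(f"{mid:6.1f} {c:3d} {100 * c / len(data):6.1f}%")


# Cutpoints used in the slide table: 82.5, 97.5, ..., 157.5.
print_table(birmingham, start=82.5, width=15, n_bins=5)

# A hypothetical alternative: narrower bins change the picture.
print_table(birmingham, start=85, width=10, n_bins=7)
```

With the narrower bins several categories fall below the "3+ observations per category" rule of thumb, which is one way to see why the wider intervals were a reasonable choice for only 15 points.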
Gee, what do we mean by "average" anyway?
• Measures of "central tendency"
• There are MANY ways to calculate an average
• Two most common ways
  – The arithmetic mean
  – The median
• There are other approaches

The Arithmetic Mean
• Step 1: Add up the numbers
• Step 2: Divide the sum by the number of observations

$$\bar{X} = \frac{\sum_i X_i}{n} = \frac{150 + 131 + 136 + \dots + 127}{15} = \frac{1911}{15} = 127.4$$

Birmingham (n=15)
150 131 136 149 126 141 122 135 110 123 87 116 128 130 127

The Median
• The point where half the data are bigger (and half less)
• There are at least 4 rules to find the median (and other percentiles)
• The rules differ if there are an odd or even number of data points
  – If odd, then the "middle" data point
  – If even, then the average of the "two middle" data points

The Median (continued)
• Step 1: Sort the data
• Step 2: Pick the median
• Consider Birmingham data (note that there are an odd number of data points)
• Median is 128

Birmingham (n=15), sorted
87 110 116 122 123 126 127 [128] 130 131 135 136 141 149 150
(the 8th of 15 data points is 128)

The Median (continued)
• Suppose we only had 14 data points in Birmingham
• Step 1: Find the middle two data points
• Step 2: Take the average of these two observations
• Median = (127 + 128)/2 = 127.5

Birmingham (now with 14 points), sorted
87 110 116 122 123 126 [127] [128] 130 131 135 136 141 149
(the 7th of 14 data points is 127; the 8th is 128)

A Tale of Three Cities: Measures of Central Tendency
[Figure: histograms for the three cities on common axes. x-axis: SPM (ppm), 80 to 230; y-axis: % of Days, 0 to 40]

City          Mean    Median
Birmingham    127.4    128
Mobile        154.0    154
Montgomery    123.6    116

Measures of Central Tendency
• Birmingham and Montgomery have lower measures of central tendency than Mobile
• For Birmingham and Mobile, the mean and median are almost the same value
  – This happens when distributions are symmetric
• For Montgomery, the mean is quite a bit higher than the median
  – The mean is "pulled up" by outliers
  – The median is not sensitive to outliers

How "spread out" are the measures?
• Measures of "dispersion"
• The range is the simplest measure
  – Birmingham: 150 - 87 = 63
  – Mobile: 219 - 103 = 116
  – Montgomery: 230 - 80 = 150
• It appears that data from Montgomery are very spread out, Mobile is not as spread out, and Birmingham is very "compact"
• Range is influenced by the outliers

How "spread out" are the measures? (continued)
• The range is influenced by outliers (just like the mean)
  – But the median is not influenced by the outliers
  – Is there some measure of dispersion that will not be so affected by 1 (or 2) points?

Measures of Dispersion: Percentiles
• The kth percentile is that place in the data where k% of the data are below the cutpoint
• There are many alternative approaches to define percentiles
• In one approach, the position in the sorted data is given by k*(n+1), with k written as a proportion
  – If integer, then pick that data point
  – If non-integer, then average the two data points around that position

Measures of Dispersion: Percentiles (continued)
• For example, consider the 25th percentile from Birmingham
  – Step 1: calculate k*(n+1) = 0.25*(15+1) = 4
  – Step 2: since this is an integer, pick the 4th data point
  – The 25th percentile is 122
• Consider the 33rd percentile from Birmingham
  – Step 1: calculate k*(n+1) = 0.33*(15+1) = 5.3
  – Step 2: average the 5th and 6th data points
  – The 33rd percentile is halfway between 123 and 126, or 124.5

Birmingham (n=15), sorted
87 110 116 122 123 126 127 128 130 131 135 136 141 149 150

Percentiles from the 3 Cities

Percentile    Birmingham    Mobile    Montgomery
10th             110          121         94
25th             122          139        100
50th             128          160        116
75th             136          170        141
90th             150          178        159

Measures of Dispersion: Percentiles (continued)
• Special names for percentiles
  – The 50th percentile is called the median
  – The 25th, 50th and 75th percentiles are called the quartiles
  – The 33rd and 67th percentiles are called the tertiles
  – The 10th, 20th, … and 90th are called the deciles
• The percentile rule picks the 8th data point for the median (0.5*(15+1) = 8), so we get the "right answer"
• Is there a way to use these percentiles as a simple measure of dispersion?

Percentiles from the 3 Cities

Measure                              Birmingham        Mobile            Montgomery
Interquartile range (75th - 25th)    136 - 122 = 14    170 - 139 = 31    141 - 100 = 41
Interdecile range (90th - 10th)      150 - 110 = 40    178 - 121 = 57    159 - 94 = 65

Percentiles from the 3 Cities
• Percentiles are relatively insensitive to "outliers"
• How do we define outliers?
  – Rule of thumb: a data point is an "outlier" if it is
    • Above the 75th percentile by more than 1.5 interquartile ranges
    • Below the 25th percentile by more than 1.5 interquartile ranges
  – Consider Montgomery data
    • Interquartile range is 41
    • 75th percentile is 141
    • Outliers are above 141 + 1.5*41 = 202.5
    • The value at 230 is an "outlier"

Percentiles from the 3 Cities
• So, percentiles are "neat"
  – But with even 3 cities we have to think about 21 or more numbers
    • 10th, 25th, 50th, 75th, 90th percentiles
    • Interquartile range, interdecile range
• Isn't there some way to look at these graphically and to see the outliers?
• Box and whisker plots

Percentiles from the 3 Cities: Box and Whisker Plots
• Draw box
  – Top of box is the 75th percentile (136)
  – Bottom of box is the 25th percentile (122)
  – Line is the 50th percentile (median = 128)
• Find outliers
  – Below 122 - 1.5*14 = 101
  – Above 136 + 1.5*14 = 157
• Draw "whiskers" to the highest non-outlier (150) and lowest non-outlier (110) points
• Plot outliers as single data points (87)

Birmingham (n=15), sorted
87 110 116 122 123 126 127 128 130 131 135 136 141 149 150

[Figure: box and whisker plot of Birmingham SPM, N = 15; y-axis: SPM, 80 to 160; one outlier plotted as a point]

Percentiles from the 3 Cities: Box and Whisker Plots
• Box and whisker plots make for easy comparison of groups
  – B-ham doesn't have much spread
  – Mobile is considerably above B-ham or Montgomery
  – B-ham and Mobile are fairly symmetric

[Figure: side-by-side box and whisker plots of SPM by City: Birmingham (N = 15), Mobile (N = 25), Montgomery (N = 28); y-axis: SPM, 0 to 300; outliers plotted as points]

Measures of Dispersion: Standard Deviation (and Variance)
• So far we have two measures of dispersion
  – Range
  – Percentiles (and differences between percentiles)
• Is there another single number that summarizes how spread out the data are?
• Consider measures of how far the data are from the mean
  – If data are far from the mean, then they are really spread out
  – This is the idea for the Standard Deviation

Measures of Dispersion: Standard Deviation (and Variance)
• Idea #1 (a logical but dumb one)
  – Calculate the distance of each data point from the mean (absolute value)
  – Take the average of these numbers
  – Mean absolute deviation

$$\mathrm{MAD} = \frac{\sum_i |\bar{X} - X_i|}{n} = \frac{|127.4 - 87| + |127.4 - 110| + \dots + |127.4 - 150|}{15} = \frac{40.4 + 17.4 + \dots + 22.6}{15} = \frac{161.6}{15} = 10.8$$

Measures of Dispersion: Standard Deviation (and Variance)
• Idea #2 (a great one, although it seems illogical)
• Take the square root of the sum of the squared deviations divided by n-1

$$\mathrm{SD} = \sqrt{\frac{\sum_i (\bar{X} - X_i)^2}{n-1}} = \sqrt{\frac{(127.4 - 87)^2 + (127.4 - 110)^2 + \dots + (127.4 - 150)^2}{15 - 1}} = \sqrt{\frac{1632 + 303 + \dots + 511}{14}} = \sqrt{\frac{3430}{14}} = 15.6$$

• The variance is the standard deviation squared: 3430/14 = 245.0

A Tale of Three Cities: Descriptive Statistics
[Figure: histograms for the three cities on common axes. x-axis: SPM (ppm), 80 to 230; y-axis: % of Days, 0 to 40]

City          Mean    Median    Range    IQR    SD
Birmingham    127.4    128        63      14    15.6
Mobile        154.0    154       116      31    28.0
Montgomery    123.6    116       150      41    31.3

Summary: Descriptive Statistics and Simple Graphs
• What we have talked about
  – Histogram
  – Measures of Central Tendency
    • Mean
    • Median
  – Measures of Dispersion
    • Range
    • Percentiles
      – Interquartile range
      – Interdecile range
    • Standard deviation
  – Box and Whisker plots

Summary: Descriptive Statistics and Simple Graphs
• What we have not talked about
  – Simple descriptive statistics to describe skew
  – Simple descriptive statistics to describe kurtosis
• There are many other kinds of graphs not discussed

[Figure: histogram of the variable NEW; Std. Dev = 91.44, Mean = 112.4, N = 50; x-axis from 0 to 400]

Summary: Descriptive Statistics and Simple Graphs
• Don't be fooled by simple looks at the data
• Consider two populations
  – Box plots
  – Descriptive stats

Statistic          Group 1    Group 2
Mean                10.0        9.9
SD                   5.8        5.5
25th percentile      4.3        5.1
Median              10.5        9.8
75th percentile     15.3       15.0

[Figure: side-by-side box plots of VAR00002 for groups 1.00 and 2.00 of VAR00001, N = 40 in each group]

• These two groups sure look alike!!! But here are the two distributions:

[Figure: histograms of VAR00002 for each group. Group 1.00: Std. Dev = 5.83, Mean = 10.0, N = 40. Group 2.00: Std. Dev = 5.49, Mean = 9.9, N = 40]

A Tale of 3 Cities: Conclusions
• B-ham did not show higher levels of SPM than the other two cities
  – Measures of central tendency well below Mobile and close to Montgomery
  – Less dispersion than either city
• It would seem hard to argue that high levels of SPM are the cause of the higher cancer rates
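
The summary measures used in this module can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `percentile` function implements the k*(n+1) rule as described on the slides (one of several possible rules, so other software may give slightly different percentiles), and the function names and the clamping of extreme positions are choices made for this example. The mean, median, and standard deviation come from Python's standard `statistics` module.

```python
import math
import statistics

# SPM readings (ppm), copied from the slides.
birmingham = [150, 131, 136, 149, 126, 141, 122, 135, 110, 123, 87, 116, 128, 130, 127]
montgomery = [113, 155, 100, 94, 146, 111, 145, 92, 173, 100, 105, 110, 106, 114,
              136, 151, 98, 94, 118, 137, 123, 159, 96, 128, 127, 120, 80, 230]


def percentile(data, k):
    """Percentile by the k*(n+1) rule; k is a proportion such as 0.25."""
    xs = sorted(data)
    n = len(xs)
    pos = k * (n + 1)              # 1-based position in the sorted data
    lo = math.floor(pos)
    hi = lo if pos == lo else lo + 1
    # Assumption: clamp positions that fall off either end of the data.
    lo = min(max(lo, 1), n)
    hi = min(max(hi, 1), n)
    return (xs[lo - 1] + xs[hi - 1]) / 2


def iqr(data):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return percentile(data, 0.75) - percentile(data, 0.25)


def outliers(data):
    """Points more than 1.5 interquartile ranges outside the quartiles."""
    q1, q3 = percentile(data, 0.25), percentile(data, 0.75)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [x for x in data if x < low or x > high]


def mean_abs_dev(data):
    """Mean absolute deviation from the mean (Idea #1)."""
    m = statistics.mean(data)
    return sum(abs(x - m) for x in data) / len(data)


for name, city in [("Birmingham", birmingham), ("Montgomery", montgomery)]:
    print(name)
    print("  mean    ", round(statistics.mean(city), 1))
    print("  median  ", statistics.median(city))
    print("  range   ", max(city) - min(city))
    print("  IQR     ", iqr(city))
    print("  MAD     ", round(mean_abs_dev(city), 2))
    print("  SD      ", round(statistics.stdev(city), 2))  # divides by n-1 (Idea #2)
    print("  outliers", outliers(city))
```

For Birmingham, `statistics.stdev` returns about 15.65, which the slides report as 15.6. The quartiles and medians from `percentile` agree with the slide table for both cities, while the 10th and 90th percentiles can differ from the slide values because those positions are non-integer and depend on which rule is used.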