MD 5108 Summarizing data: statistical indices
(Modified from the notes of Prof. A. Kuk)

Topic 4: Summary index (summary statistic)

A summary index uses a single value to summarize some characteristic of a dataset. For example, the arithmetic mean (or average) is a summary statistic because it gives the average value of a dataset, such as the average of a set of blood pressure readings.

4.1 Indices of Central Tendency (or location)

(Arithmetic) Mean: average of a set of values

Blood Pressure Readings

  Xi     Value
  X1       95
  X2       98
  X3      101
  X4       87
  X5      105
  ------------
  Sum     486

Arithmetic mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = 486 / 5 = 97.2$ mm Hg

Approx 4 million singleton births, 1991:

  Variable           Mean
  Mother’s age       26.4 years
  Gestational age    39.15 weeks
  Birth weight       3358.6 grams
  Weight gain*       30.4 lbs

For those who die (31,417 of them) in the first year:

  Survival           49.4 days

4.2 Robust Measure of Location

The mean is very sensitive (not robust) to extreme values.

Blood Pressure Readings

  X1    X2    X3    X4    X5
  87    95    98    101   105.0     Mean = 97.2
  87    95    98    101   1050      Decimal overlooked, Mean = 286.2

Robust measure of location

The median (the middle value of an ordered data set) is less sensitive (robust) to extreme values in the data.

Blood Pressure Readings

  X1    X2    X3    X4    X5
  87    95    98    101   105       Median = 98
  87    95    98    101   1050      Median = 98 (unchanged)

The trimmed mean (e.g. the 10% trimmed mean is the average after deleting 10% of the data at both ends) is also less affected by extreme values.

Intervals between failures of an air conditioner (in operating hours)

413, 14, 58, 37, 100, 65, 9, 169, 447, 184, 36, 201, 118, 34, 31, 18, 19, 67, 57, 62, 7, 22, 34, 90, 10

Mean = ?   8% trimmed mean = ?   Median = ?
Ordered values

7, 9, 10, 14, 18, 19, 22, 31, 34, 34, 36, 37, 57, 58, 62, 65, 67, 90, 100, 118, 169, 184, 201, 413, 447

Measures of location

Sample size = 25
Mean = 2302/25 = 92.1 hrs
8% of 25 = 2, so leave out 2 obs at both ends
8% trimmed mean = 1426/21 = 67.9 hrs
Median = 13th ordered value = 57 hrs
Median (57) < 8% trimmed mean (67.9) < mean (92.1)

Desirable properties of the median

• Not sensitive to extreme values in data
• More suitable for describing skewed distributions (e.g., median income vs average income)
• The relative positions of the data points are unchanged when log-transformed. As a result, the median of the log-transformed data is just the log of the median of the original data (exactly so when the sample size is odd, so that the median is a single observation)
• Not so for the mean: the mean of log X is not obtainable from the mean of X

  87 < 95 < 98 < 101 < 105, so ln 87 < ln 95 < ln 98 < ln 101 < ln 105
  Median of X = 98; median of ln X = ln 98 = 4.585

Relative positions of median and mean for skewed distributions

Positively-skewed, or skewed to the right (where the longer tail is): Mean > Median
Negatively-skewed, or skewed to the left (where the longer tail is): Mean < Median

Singleton births, 1991:

  Variable             Mean    Median
  Mom’s age (yrs)      26.4    25
  Gest. age (wks)      39.2    39
  Birth weight (gms)   3359    3374
  Weight gain (lbs)    30.4    30
  Survival (days)      49.4    7

[Figure: distribution of gestational age; Mean = 39.2, Median = 39 weeks]
[Figure: distribution of birth weight; Mean = 3359, Median = 3374]
Note: the histogram fluctuates strongly because narrow class intervals are used.
[Figure: distribution of weight gain; Mean = 30.4, Median = 30]
[Figure: Mortality in the first year of baby's life (for those who die in their first year); proportion against survival days]
[Figure: the same mortality distribution with proportion on a logarithmic scale; Mean = 49.4, Median = 7]
By M. Pagano

When to use mean or median: use both, by all means. The mean performs best when we have a normal or symmetric distribution with thin tails. If the distribution is skewed, or when we want to limit the influence of outliers, use the median.
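The three location measures for the air-conditioner data can be recomputed with a short Python sketch. This is only an illustration: the helper name `trimmed_mean` and the variable names are assumptions introduced here, not part of the notes, and the trimming rule (drop the same whole number of observations at each end) follows the "8% of 25 = 2" step used above.

```python
import math
import statistics

# Intervals between air-conditioner failures (operating hours), from the notes.
intervals = [413, 14, 58, 37, 100, 65, 9, 169, 447, 184, 36, 201, 118,
             34, 31, 18, 19, 67, 57, 62, 7, 22, 34, 90, 10]


def trimmed_mean(values, proportion):
    """Average after dropping `proportion` of the observations at each end.

    Hypothetical helper: the number dropped per end is n * proportion,
    rounded down (the small epsilon guards against float round-off).
    """
    ordered = sorted(values)
    k = math.floor(len(ordered) * proportion + 1e-9)
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("proportion too large: nothing left to average")
    return sum(kept) / len(kept)


mean = statistics.mean(intervals)
trimmed = trimmed_mean(intervals, 0.08)
median = statistics.median(intervals)

# With an odd sample size the median is a single observation, so taking
# logs first and then the median gives the log of the original median.
log_median = statistics.median(math.log(x) for x in intervals)

print(f"mean = {mean:.1f}, 8% trimmed mean = {trimmed:.1f}, median = {median}")
print(math.isclose(log_median, math.log(median)))
```

Because the two largest intervals (413 and 447) are far from the rest, the mean is pulled well above the trimmed mean and the median, which matches the ordering reported above.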
4.3 Indices of Dispersion / Spread

Besides indices of central tendency (location), it is also useful to have indices that summarise the spread (or dispersion) of values in a dataset. These indices give a measure of variation.

Indices of Dispersion or Spread

Range: difference between the largest and the smallest value
Problem: it does not consider how the values in between are scattered.

In the following, for both sets of data, the numbers of observations, means, medians and ranges are all equal. Which one has more scatter?

Datasets with the same range but different scatter of values:

  10, 12, 13, 14, 15, 16, 17, 18, 20
  10, 15, 15, 15, 15, 15, 15, 15, 20

Spread: singleton births, 1991

  Variable        Min    Max     Range
  Mom’s age       10     49      39
  Gest. age       17     47      30
  Birth weight    227    8164    7937
  Weight gain     0      98      98
  Survival        0      363     363

Indices of Dispersion

A good index of dispersion should be one that summarises the dispersion of individual values from some central value like the mean.

[Figure: individual values (X) scattered on either side of the mean]

The problem with averaging the deviations of individual values from the mean is that the result is always 0:

  $X_i - \bar{X}$
  87 - 97.2  = -10.2
  95 - 97.2  =  -2.2
  98 - 97.2  =   0.8
  101 - 97.2 =   3.8
  105 - 97.2 =   7.8
  ------------------
  Sum        =   0

where 97.2 is the mean of the values 87, 95, 98, 101, 105.

Usual approach: consider the squared deviations from the mean and take their average.

  $X_i - \bar{X}$              $(X_i - \bar{X})^2$
  87 - 97.2  = -10.2           104.04
  95 - 97.2  =  -2.2             4.84
  98 - 97.2  =   0.8             0.64
  101 - 97.2 =   3.8            14.44
  105 - 97.2 =   7.8            60.84
  ------------------           ------
  Sum        =   0             184.80   (sum of squares of deviations from the mean)

Variance calculation from a sample: it is customary to divide by n - 1 (the default option in most software) rather than by n.

  $s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = 184.8 / 4 = 46.2$

The divisor n - 1 is the effective sample size, also called the degrees of freedom.

Variance of a sample

It can be shown mathematically that

  $\frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{\sum X_i^2 - \left(\sum X_i\right)^2 / n}{n - 1}$

Why subtract 1?
• It results in a better estimator of the population variance
• It acknowledges the fact that the population mean is unknown and has to be estimated by the sample mean (the effective sample size is decreased by 1 for every parameter estimated)
• There is no need to subtract 1 if we calculate the variance using deviations from the population mean

Variance of a sample

• The problem with the variance is its awkward unit of measurement, as the values have been squared
• The problem is overcome by taking the square root of the variance, which reverts to the original unit of measurement

The square root of the variance gives the standard deviation.

Sample Standard Deviation

The sample standard deviation (S or SD):

  $S = \sqrt{\frac{\sum X_i^2 - \left(\sum X_i\right)^2 / n}{n - 1}}$

Singleton births, 1991:

  Variable             Mean    Std dev
  Mom’s age (yrs)      26.4    5.84
  Gest. age (wks)      39.2    2.61
  Birth weight (gms)   3359    227
  Weight gain (lbs)    30.4    12.13
  Survival (days)      49.4    76.1

Empirical Rule: if dealing with a unimodal and symmetric distribution, then

  Mean ± 1 sd covers approx 67% of obs
  Mean ± 2 sd covers approx 95% of obs
  Mean ± 3 sd covers approx all obs

Exact for the normal distribution.

Mother’s age: mean = 26.4 yrs, s.d. = 5.84 yrs

Table of $\bar{x} \pm k$ s.d.s

  k    left limit    right limit    Emp.    Actual
  1    20.56         32.24          67%     64.75%
  2    14.72         38.08          95%     96.3%
  3    8.88          43.92          all     99.89%

By M. Pagano

[Figures: distribution of mother's age with Area = 0.6475 between 20.56 and 32.24 years, and Area = 0.963 between 14.72 and 38.08 years]

Chebychev’s Inequality

The proportion of observations within $\bar{x} \pm k$ s.d.s is at least $1 - 1/k^2$ (true for any distribution).

Table of $\bar{x} \pm k$ s.d.s

  k    $1/k^2$    $1 - 1/k^2$    Emp.    Actual
  1    1          0              67%     64.75%
  2    0.25       0.75           95%     96.3%
  3    0.11       0.89           all     99.89%

By M. Pagano

When not to use the standard deviation?

• Heavy-tailed distribution
• Presence of outliers
• Skewed distribution
• Comparing variables with vastly different magnitudes or different units of measurement

Weights of newborn elephants (kg): 929 878 895 937 801 853 939 972 841 826
  n = 10, $\bar{X}$ = 887.1, SD = 56.50

Weights of newborn mice (kg): 0.72 0.63 0.59 0.79 1.06 0.42 0.31 0.38 0.96 0.89
  n = 10, $\bar{X}$ = 0.68, SD = 0.255

The difference in the magnitude of the measurements has contributed to the large difference in the SD. The solution is to use the Coefficient of Variation (CV), which is given by:

  $CV = SD / \bar{X}$

The CV expresses the standard deviation (s) relative to its mean. It is also known as relative dispersion. As a ratio, it is unit-free.

Indices of Dispersion

  Newborn elephants: $\bar{X}$ = 887.1, SD = 56.50, CV = 0.0637
  Newborn mice:      $\bar{X}$ = 0.68,  SD = 0.255, CV = 0.375

Newborn mice show greater birthweight variation!

When to use the Coefficient of Variation (CV):

• when the means of the comparison groups differ greatly (the CV is suitable as it expresses the std dev relative to its corresponding mean)
• when different units of measurement are involved, i.e.
group 1 units are mm and group 2 units are gm (the CV is suitable for comparison because, being a ratio, it is unit-free)

4.4 Robust Measure of Dispersion

• The variance is defined as the mean of the squared deviations and as such is even more non-robust to extreme values than the mean (an extreme deviation becomes even more extreme after squaring)
• A robust measure of dispersion is IQR/1.35, where IQR = 3rd quartile - 1st quartile = inter-quartile range

The reason for dividing the IQR by 1.35 is to make it comparable with the standard deviation when the underlying distribution is normal.

Intervals between failures of an air conditioner (in operating hours)

413, 14, 58, 37, 100, 65, 9, 169, 447, 184, 36, 201, 118, 34, 31, 18, 19, 67, 57, 62, 7, 22, 34, 90, 10

Mean = ?   8% trimmed mean = ?   Median = ?   SD = ?   IQR/1.35 = ?

Ordered values

7, 9, 10, 14, 18, 19, 22, 31, 34, 34, 36, 37, 57, 58, 62, 65, 67, 90, 100, 118, 169, 184, 201, 413, 447

Measures of location

Sample size = 25
Mean = 2302/25 = 92.1 hrs
8% of 25 = 2, so leave out 2 obs at both ends
8% trimmed mean = 1426/21 = 67.9 hrs
Median = 13th ordered value = 57 hrs
Median (57) < 8% trimmed mean (67.9) < mean (92.1)

Measures of dispersion

SD = 115.5 hrs
1st quartile = 7th ordered value = 22 hrs
3rd quartile = 19th ordered value = 100 hrs
IQR/1.35 = 78/1.35 = 57.8 hrs

5-Number Summary of a data set

• Min
• 1st quartile
• Median
• 3rd quartile
• Max

Represent graphically by a box plot.
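The dispersion figures quoted in sections 4.3 and 4.4 can be checked with a short Python sketch. It is an illustration under stated assumptions: the function name `coefficient_of_variation` and the variable names are introduced here, and the quartiles are taken as the 7th and 19th ordered values, which is the convention the notes use for n = 25 (other software may use different quartile rules).

```python
import statistics

# Data sets taken from the notes.
bp = [87, 95, 98, 101, 105]
elephants = [929, 878, 895, 937, 801, 853, 939, 972, 841, 826]
mice = [0.72, 0.63, 0.59, 0.79, 1.06, 0.42, 0.31, 0.38, 0.96, 0.89]
intervals = [413, 14, 58, 37, 100, 65, 9, 169, 447, 184, 36, 201, 118,
             34, 31, 18, 19, 67, 57, 62, 7, 22, 34, 90, 10]


def coefficient_of_variation(values):
    """Sample SD (n - 1 divisor) divided by the mean; hypothetical helper."""
    return statistics.stdev(values) / statistics.mean(values)


# Sample variance of the blood pressure readings (divides by n - 1).
bp_variance = statistics.variance(bp)

# Unit-free comparison of spread for the two weight samples.
cv_elephants = coefficient_of_variation(elephants)
cv_mice = coefficient_of_variation(mice)

# Classical and robust spread for the air-conditioner data.
ordered = sorted(intervals)
sd = statistics.stdev(intervals)
q1 = ordered[7 - 1]    # 7th ordered value, as in the notes
q3 = ordered[19 - 1]   # 19th ordered value, as in the notes
robust_sd = (q3 - q1) / 1.35

# Five-number summary: min, Q1, median, Q3, max.
five_number = (ordered[0], q1, statistics.median(ordered), q3, ordered[-1])

print(f"BP variance = {bp_variance:.1f}")
print(f"CV elephants = {cv_elephants:.4f}, CV mice = {cv_mice:.3f}")
print(f"SD = {sd:.1f}, IQR/1.35 = {robust_sd:.1f}")
print("five-number summary:", five_number)
```

The CV for the mice computed from the unrounded mean comes out slightly above the 0.375 quoted in the notes, which appears to use the rounded mean 0.68; the conclusion that the mice show the greater relative variation is unaffected. For the air-conditioner data the SD is about twice IQR/1.35, which reflects how strongly the two largest intervals inflate the SD.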