Statistical Methods in Computer Science
Data 2: Central Tendency & Variability
Ido Dagan

Frequency Distributions and Scales
Scales: Nominal, Ordinal, Numerical, Numerical/Continuous
Topics: f distribution, grouped f distribution, apparent/real limits, relative f, F (cumulative f), percentiles, percentile ranks

Empirical Methods in Computer Science © 2006-now Gal Kaminka / Ido Dagan

Characteristics of Distributions
Shape, Central Tendency, Variability
[Figures: distributions with different central tendency; distributions with different variability]

This Lesson
Examine measures of central tendency:
  Mode (Nominal)
  Median (Ordinal)
  Mean (Numerical)
Examine measures of variability (dispersion):
  Entropy (Nominal)
  Variance (Numerical), Standard Deviation
Standard scores (z-scores)

Centrality/Variability Measures and Scales
Scales: Nominal, Ordinal, Numerical, Numerical/Continuous
Measures: Mode, Median, Mean, Entropy, Variance, Std. Deviation, z-Scores

The Mode (Mo) (Hebrew: השכיח)
The mode of a variable is the value that is most frequent: $Mo = \arg\max_x f(x)$
For a categorical variable: the category that appeared most
For grouped data: the midpoint of the most frequent interval,
  under the assumption that values are evenly distributed within the interval

Finding the Mode: Example 1
The collection of values that a variable X took during the measurement

Student  Grade
X1       60
X2       43
X3       57
X4       82
X5       75
X6       32
X7       82
X8       60

System: Windows, Linux, BSD, MacOS

Algorithm Name: Round-Robin, Round-Robin, Prioritized Scheduling, Preemptive Scheduling

Trial  Run-Time
#1     23.234
#2     15.471
#3     12.220
#4     23.100

Mode?
Depends on grouping.

Finding the Mode: Example 2
The mode of a grouped frequency distribution depends on grouping.

Ungrouped scores (Mo = 86):

Score  f     Score  f
96     1     78     4
95     0     77     2
94     0     76     1
93     0     75     1
92     0     74     0
91     1     73     1
90     1     72     2
89     3     71     0
88     2     70     3
87     2     69     1
86     6     68     2
85     2     67     1
84     2     66     0
83     1     65     0
82     3     64     0
81     2     63     0
80     2     62     1
79     2     61     1

Grouped with i = 5, intervals 61-65 up to 96-100 (Mo = 88):

Score    f
96-100   1
91-95    1
86-90    14
81-85    10
76-80    11
71-75    4
66-70    7
61-65    2
N =      50

Grouped with i = 5, intervals 60-64 up to 95-99 (Mo = 87):

Score    f
95-99    1
90-94    2
85-89    15
80-84    10
75-79    10
70-74    6
65-69    4
60-64    2
N =      50

The Median (Mdn) (Hebrew: החציון)
The median of a variable is its 50th percentile, P50.
The point below which 50% of all measurements fall.
Requires ordering: only the ordinal and numerical scales.
Examples:
  0, 8, 8, 11, 15, 16, 20 ==> Mdn = 11
  12, 14, 15, 18, 19, 20 ==> Mdn = 16.5 (halfway between 15 and 18)
  5, 7, 8, 8, 8, 8 ==> Mdn = ?
    One method: halfway between the first and second 8, Mdn = 8
    Another (using real limits): linear interpolation, as we did in intervals, Mdn = 7.75
      7.75 = 7.5 + (¼ * 1.0)
        7.5: the real limit between 7 and 8
        ¼: 1 of the four 8's (3 of the 6 scores must fall below the median; 2 already lie below 7.5)
        1.0: width of the interval containing the 8's (real limits 7.5 to 8.5)

The Arithmetic Mean (Hebrew: ממוצע חשבוני)
Arithmetic mean (mean, for short): $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$
"Average" is colloquial: it is not precisely defined when used, so we avoid the term.
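The three measures above can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the helper names `modes`, `grouped_mode`, and `interpolated_median` are illustrative, and the interpolation assumes scores are reported in whole units, so a score x has real limits x - 0.5 and x + 0.5. The data are the grades from Example 1, the score frequencies from Example 2, and the median examples.

```python
from collections import Counter
from statistics import mean, median


def modes(values):
    """All values with maximal frequency, i.e. Mo = argmax f(x)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, f in counts.items() if f == top)


def grouped_mode(freq, start, width):
    """Midpoint of the most frequent interval of integer scores.

    `freq` maps score -> frequency. Intervals (apparent limits) are
    start..start+width-1, start+width..start+2*width-1, and so on.
    """
    bins = Counter()
    for score, f in freq.items():
        lower = start + ((score - start) // width) * width
        bins[lower] += f
    lower = max(bins, key=bins.get)
    return lower + (width - 1) / 2


def interpolated_median(values, width=1.0):
    """Median by linear interpolation inside the real limits of a score."""
    counts = Counter(values)
    half = len(values) / 2
    below = 0  # number of scores strictly below the current value
    for x in sorted(counts):
        f = counts[x]
        if below + f >= half:
            lower_real_limit = x - width / 2
            return lower_real_limit + (half - below) / f * width
        below += f
    raise ValueError("values must be non-empty")


grades = [60, 43, 57, 82, 75, 32, 82, 60]  # Example 1
f_high = [1, 0, 0, 0, 0, 1, 1, 3, 2, 2, 6, 2, 2, 1, 3, 2, 2, 2]  # scores 96..79
f_low = [4, 2, 1, 1, 0, 1, 2, 0, 3, 1, 2, 1, 0, 0, 0, 0, 1, 1]   # scores 78..61
score_freq = dict(zip(range(96, 60, -1), f_high + f_low))         # Example 2

print(modes(grades))                            # [60, 82]: two modes
print(max(score_freq, key=score_freq.get))      # 86, ungrouped mode
print(grouped_mode(score_freq, 61, 5))          # 88.0, intervals 61-65, ...
print(grouped_mode(score_freq, 60, 5))          # 87.0, intervals 60-64, ...
print(median([0, 8, 8, 11, 15, 16, 20]))        # 11
print(median([12, 14, 15, 18, 19, 20]))         # 16.5
print(median([5, 7, 8, 8, 8, 8]))               # 8.0, first method
print(interpolated_median([5, 7, 8, 8, 8, 8]))  # 7.75, second method
print(mean(grades))                             # 61.375
```

Shifting the interval boundaries by one point moves the grouped mode from 88 to 87 although the underlying scores are unchanged, which is the sense in which the mode "depends on grouping".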
Properties of Central Tendency Measures
Mo:
  Relatively unstable between samples
  Problematic in grouped distributions
  There can be more than one: distributions with more than one mode are sometimes called multi-modal
  For uniform distributions, all values are possible modes
  Typically used only on nominal data

Mean:
  Responsive to the exact value of each score
  Only for interval and ratio scales
  Takes the total of the scores into account: does not ignore any value
  Sum of deviations from the mean is always zero: $\sum_i (X_i - \bar{X}) = 0$
  Because of this, sensitive to outliers: the presence/absence of scores at extreme values
  Stable between samples, and the basis for many other statistical measures

Median:
  Robust to extreme values
  Only cares about ordering, not the magnitude of intervals
  Often used with skewed distributions
  [Figure: distribution with Mo, Mdn, and Mean marked]

Contrasting Mode, Median, Mean
[Figures: distributions with Mo, Mdn, and Mean marked]

Dispersion and Variability
Mode, Median, Mean: only give central tendencies
We need to measure the spread of the distribution
[Figure: distribution with Mo, Mdn, and Mean marked]

Dispersion as Ranges
Range: max(X) - min(X)
Semi-interquartile range: $Q = \frac{Q_3 - Q_1}{2} = \frac{P_{75} - P_{25}}{2}$
Half the width of the range that holds the middle 50% of the scores

Dispersion as Deviation
Look at dispersion as a function of the central tendency (mean)
We know that the sum
of deviations from the mean is zero: $\sum_i (X_i - \bar{X}) = 0$
But what if we look at the sum of absolute deviations? $\sum_i |X_i - \bar{X}|$
A smaller sum indicates more clustering of the distribution around the mean

Variance
Statisticians prefer a different way of making the deviations non-negative: squaring them
Sum of squares: shorthand for the sum of squared deviations from the mean
  $SS_X = \sum_i (X_i - \bar{X})^2$
And normalizing for the size of the sample:
  $S^2 = \sigma^2 = \frac{SS_X}{N} = \frac{\sum_i (X_i - \bar{X})^2}{N}$
This is called the variance of the distribution

Standard Deviation (std.)
Square root of the variance:
  $\sigma = \sqrt{\frac{\sum_i (X_i - \bar{X})^2}{N}} = \sqrt{\frac{SS_X}{N}}$
Fairly stable under sampling variation: does not change very much with new samples of the population
Perhaps the most common measure of dispersion
As defined here, std applies to a population; the estimate computed from a sample is a bit different (it divides by N - 1 rather than N)
  We ignore this for now and return to it later

Standard Scores
Mean, median, etc. shift predictably under constant translation:
  adding V to each value is the same as adding V to the central tendency measures
We may also need to compare distributions that differ in range
For instance, what's better:
  A score of 50, when the mean is 60
  A score of 60, when the mean is 80
  ....
Can compute z-scores of the raw scores

z Scores
Key idea: express all values in units of standard deviation
  $z = \frac{X - \bar{X}}{\sigma_X}$
This allows comparison of values from different distributions
But only if the shapes of the distributions are similar

Measuring Dispersion in Nominal Scales
Entropy: $H(X) = -\sum_X r_X \log_2 r_X$, where $r_X$ is the relative frequency (rel f) of the value X
Entropy of 0 means that all values X are the same: rel f = 1.0 for some value X
Entropy grows positive when values become more dispersed
  e.g., entropy of 1 means all scores are split evenly between two values
Entropy is maximal when $r_X = 1/n$ for each of the n distinct values X, i.e., a uniform distribution

Normalizing Entropy
Can normalize by dividing by the maximal entropy given n:
  $\max H = -\sum_{i=1}^{n} \frac{1}{n}\log_2\frac{1}{n} = -n \cdot \frac{1}{n}\log_2\frac{1}{n} = -\log_2\frac{1}{n} = \log_2 n$
This allows comparing the entropy of distributions of different size
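The dispersion measures above can be sketched in a few lines of Python. This is an illustrative sketch rather than material from the slides: the function names are arbitrary, the list `xs` is made-up data chosen so the arithmetic is easy to follow, variance divides by N as in the population definition, and entropy uses base-2 logarithms so that an even split between two values gives 1. The `systems` and `algorithms` lists are taken from Example 1.

```python
import math
from collections import Counter


def variance(xs):
    """Population variance: SS_X / N, the sum of squared deviations over N."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def std(xs):
    """Population standard deviation: square root of the variance."""
    return math.sqrt(variance(xs))


def z_scores(xs):
    """Each value in units of standard deviation (requires std > 0)."""
    m = sum(xs) / len(xs)
    s = std(xs)
    return [(x - m) / s for x in xs]


def entropy(values):
    """H(X) = -sum r_x log2 r_x, written as sum r_x log2(1 / r_x)."""
    n = len(values)
    return sum((f / n) * math.log2(n / f) for f in Counter(values).values())


def normalized_entropy(values, n_categories):
    """Entropy divided by its maximum, log2(n_categories); needs n_categories >= 2."""
    return entropy(values) / math.log2(n_categories)


xs = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scores, mean 5
print(variance(xs))            # 4.0
print(std(xs))                 # 2.0
print(z_scores(xs)[-1])        # 2.0: the score 9 is two std. above the mean

systems = ["Windows", "Linux", "BSD", "MacOS"]
algorithms = ["Round-Robin", "Round-Robin",
              "Prioritized Scheduling", "Preemptive Scheduling"]
print(entropy(systems))                    # 2.0: uniform over four values
print(entropy(algorithms))                 # 1.5
print(normalized_entropy(systems, 4))      # 1.0: maximal dispersion
print(normalized_entropy(algorithms, 3))   # below 1: one value dominates
```

Dividing by log2 of the number of categories puts both nominal variables on a 0-to-1 scale, so the four-valued `systems` and the three-valued `algorithms` can be compared directly.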