Statistical Methods in Computer Science
Data 2: Central Tendency & Variability
Ido Dagan

Empirical Methods in Computer Science, © 2006-now Gal Kaminka

Frequency Distributions and Scales

  Scales: Nominal, Ordinal, Numerical, Numerical/Continuous
  Concepts: f distribution, grouped f distribution, apparent/real limits, relative f, F (cumulative f), percentiles, percentile ranks

Characteristics of Distributions

  Shape, central tendency, variability
  [Figure: distributions with different central tendency; distributions with different variability]

This Lesson

  Examine measures of central tendency
    Mode (nominal)
    Median (ordinal)
    Mean (numerical)
  Examine measures of variability (dispersion)
    Entropy (nominal)
    Variance (numerical), standard deviation
  Standard scores (z-scores)

Centrality/Variability Measures and Scales

  Scales: Nominal, Ordinal, Numerical, Numerical/Continuous
  Measures: Mode, Median, Mean, Entropy, Variance, Std. Deviation, z-Scores

The Mode (Mo) (Hebrew: השכיח)

  The mode of a variable is its most frequent value:
    $\mathrm{Mo} = \arg\max_x f(x)$
  For a categorical variable: the category that appeared most often.
  For grouped data: the midpoint of the most frequent interval, under the assumption that values are evenly distributed within the interval.

Finding the Mode: Example 1

  The collection of values that a variable X took during the measurement:

  Student  Grade
  X1       60
  X2       43
  X3       57
  X4       82
  X5       75
  X6       32
  X7       82
  X8       60

  System: Windows, Linux, BSD, MacOS
  Algorithm Name: Round-Robin, Round-Robin, Prioritized Scheduling, Preemptive Scheduling

  Trial  Run-Time
  #1     23.234
  #2     15.471
  #3     12.220
  #4     23.100

  Mode = ?
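As a minimal sketch of the definitions above, the mode of the Example 1 data could be computed in Python roughly as follows. The helper names `modes` and `grouped_mode` are hypothetical (they do not appear in the slides), and the interval widths of 1.0 and 5.0 used for the run-times are illustrative assumptions, since the slides do not fix a grouping.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency, in first-seen order."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]


def grouped_mode(values, width, start=0.0):
    """Return the midpoint(s) of the most frequent interval.

    Assumption: intervals are [start + k*width, start + (k+1)*width).
    """
    bins = Counter(int((v - start) // width) for v in values)
    top = max(bins.values())
    return [start + (k + 0.5) * width for k, c in sorted(bins.items()) if c == top]


grades = [60, 43, 57, 82, 75, 32, 82, 60]
algorithms = ["Round-Robin", "Round-Robin",
              "Prioritized Scheduling", "Preemptive Scheduling"]
run_times = [23.234, 15.471, 12.220, 23.100]

print(modes(grades))                        # [60, 82]  (two modes)
print(modes(algorithms))                    # ['Round-Robin']
print(grouped_mode(run_times, width=1.0))   # [23.5]
print(grouped_mode(run_times, width=5.0))   # [22.5]
```

The raw run-times never repeat, so every value is trivially a mode until the data are grouped, and the grouped answer shifts with the chosen interval width.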
Finding the Mode: Example 2

  The mode of a grouped frequency distribution depends on the grouping.

  Ungrouped scores (Mo = 86):

  Score  f      Score  f
  96     1      78     4
  95     0      77     2
  94     0      76     1
  93     0      75     1
  92     0      74     0
  91     1      73     1
  90     1      72     2
  89     3      71     0
  88     2      70     3
  87     2      69     1
  86     6      68     2
  85     2      67     1
  84     2      66     0
  83     1      65     0
  82     3      64     0
  81     2      63     0
  80     2      62     1
  79     2      61     1

  Two groupings with interval width i = 5:

  Mo = 88 (midpoint of 86-90)     Mo = 87 (midpoint of 85-89)
  Score    f                      Score    f
  96-100   1                      95-99    1
  91-95    1                      90-94    2
  86-90    14                     85-89    15
  81-85    10                     80-84    10
  76-80    11                     75-79    10
  71-75    4                      70-74    6
  66-70    7                      65-69    4
  61-65    2                      60-64    2
  N =      50                     N =      50

The Median (Mdn) (Hebrew: החציון)

  The median of a variable is its 50th percentile, P50: the point below which 50% of all measurements fall.
  Requires ordering: only the ordinal and numerical scales.
  Examples:
    0, 8, 8, 11, 15, 16, 20  ==>  Mdn = 11
    12, 14, 15, 18, 19, 20   ==>  Mdn = 16.5 (halfway between 15 and 18)
    5, 7, 8, 8, 8, 8         ==>  Mdn = ?
      One method: halfway between the first and second 8, Mdn = 8.
      Another: use linear interpolation, as we did with intervals, Mdn = 7.75:
        7.75 = 7.5 + (1/4 * 1.0)
        7.5 is the real limit between 7 and 8
        1/4 is 1 of the four 8's
        1.0 is the width of the interval containing the 8's (real limits)

The Arithmetic Mean (Hebrew: ממוצע חשבוני)

  Arithmetic mean (mean, for short): the sum of the scores divided by their number,
    $\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$
  "Average" is colloquial and not precisely defined when used, so we avoid the term.
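The two answers for the tied median in 5, 7, 8, 8, 8, 8 can be reproduced with a short Python sketch. The function `interpolated_median` is a hypothetical helper written for this illustration; it assumes each score stands for an interval of the given width centered on the score (real limits of x ± width/2), which matches the slide's 7.5 to 8.5 interval for the 8's. The standard-library calls `statistics.median` and `statistics.median_grouped` are included as a cross-check.

```python
import statistics


def interpolated_median(values, width=1.0):
    """Median by linear interpolation inside the interval holding the middle score."""
    data = sorted(values)
    n = len(data)
    x = data[n // 2]                        # score whose interval contains P50
    lower = x - width / 2                   # lower real limit of that interval
    below = sum(1 for v in data if v < x)   # scores below the interval
    inside = data.count(x)                  # scores inside the interval
    return lower + width * (n / 2 - below) / inside


scores = [5, 7, 8, 8, 8, 8]

print(statistics.median(scores))                       # 8.0  (halfway between the two middle 8's)
print(interpolated_median(scores))                     # 7.75 (7.5 + 1/4 * 1.0)
print(statistics.median_grouped(scores, interval=1))   # 7.75
```

For data without ties at the middle, such as 0, 8, 8, 11, 15, 16, 20, both approaches give 11.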
Properties of Central Tendency Measures

  Mode (Mo):
    Relatively unstable between samples.
    Problematic in grouped distributions.
    There can be more than one: distributions with more than one mode are sometimes called multi-modal.
    For uniform distributions, all values are possible modes.
    Typically used only on nominal data.

  Mean:
    Responsive to the exact value of each score.
    Only for interval and ratio scales.
    Takes the total of the scores into account: does not ignore any value.
    The sum of deviations from the mean is always zero: $\sum_{i=1}^{N} (X_i - \bar{X}) = 0$, where $\bar{X}$ is the mean.
    Because of this, it is sensitive to outliers (the presence or absence of scores at extreme values).
    Stable between samples, and the basis for many other statistical measures.

  Median (Mdn):
    Robust to extreme values.
    Only cares about ordering, not about the magnitude of intervals.
    Often used with skewed distributions.

  Contrasting Mode, Median, Mean
    [Figure: distributions with the positions of Mo, Mdn, and Mean marked]
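Two of the properties listed above, the zero sum of deviations from the mean and the contrast between the mean's sensitivity and the median's robustness to extreme values, can be checked numerically. The sketch below reuses the scores 0, 8, 8, 11, 15, 16, 20 from the median examples; replacing 20 with 200 is an assumed outlier chosen only for illustration.

```python
import statistics

scores = [0, 8, 8, 11, 15, 16, 20]
mean = statistics.mean(scores)

# Sum of deviations from the mean is zero (up to floating-point rounding).
deviations = [x - mean for x in scores]
print(abs(sum(deviations)) < 1e-9)          # True

# Replace the largest score with an extreme value (illustrative assumption).
with_outlier = scores[:-1] + [200]

print(statistics.median(scores))            # 11
print(statistics.median(with_outlier))      # 11  (ordering unchanged, so the median is too)
print(statistics.mean(with_outlier) > mean) # True (the mean is pulled toward the outlier)
```

The median depends only on the order of the scores, so moving the top score further out leaves it untouched, while the mean shifts because every score contributes its exact value.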