Descriptive Statistics I: Central Tendency
9/1

Attendance Question
How many courses are you enrolled in this semester?
A – 1
B – 2
C – 3
D – 4
E – 5 (or more)

Histograms
[Figure: histogram of Number of Courses (x-axis, values 1 to 5) against Frequency (students) (y-axis, 0 to 45). Annotations: variable label, values, units, # observations for this value.]

Continuous vs. Discrete Variables
• Discrete variable
  – Can only take certain values (usually integers)
  – Counts: people, classes, stories, …
• Continuous variable
  – Infinite set of values, in principle
  – Height, weight, temp, IQ, …

Histograms of Continuous Variables
• Bins or intervals
  – Ranges for grouping continuous variables
  – Width depends on number of data
[Figures: histograms of Height (Inches) against Frequency (students) at several bin widths, from individual measured values (49.26 … 77.24) through 1-inch bins (58 to 74) to coarse ranges (<60, 60-64, 65-69, 70+).]

Central Tendency
• Descriptive statistics
  – Simplify a large set of data to a single (meaningful) number
• Central tendency
  – One useful kind of summary information
  – Intuitively: typical, average, normal value
• Three statistics for central tendency
  – Mean
  – Median
  – Mode

Mean
• Sum of scores divided by number of scores
• Sample mean: $M = \frac{\sum X}{n}$
  – IQ: X = [94, 108, 145, 121, 88, 133]
  – $\sum X$ = 94 + 108 + 145 + 121 + 88 + 133 = 689
  – M = 689/n = 689/6 = 114.83
• Equal apportionment
  – If everyone had the mean score, the total would be the same
• Balance point, seesaw analogy (Fig 3.3)
• Equal upward and downward distances
[Figure: number line with the scores 88, 94, 108, 121, 133, 145 and M marked as the balance point.]

Population Mean
• Finite population: $\mu = \frac{\sum X}{N}$
  – X = [1,1,2,2,2,2,3,3,3,3,3,4,4,4,5,5,6]
  – x = 1: f(x) = 2, $\sum X$ = 1*2
  – x = 2: f(x) = 4, $\sum X$ = 2*4
  – x = 3: f(x) = 5, $\sum X$ = 3*5
  – $\sum X = \sum_x x \, f(x)$
  – $\mu = \frac{\sum_x x \, f(x)}{N} = \sum_x x \, p(x)$, where $p(x) = f(x)/N$
• Infinite population: $\mu = \sum_x x \, p(x)$
[Figure: Frequency f(x) (0 to 6) against possible values x (1 to 6).]
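The mean formulas above can be checked with a short Python sketch. It uses the IQ sample and the finite population from the slides; the function names (`sample_mean`, `frequency_mean`, `probability_mean`) are illustrative assumptions, not part of the course material.

```python
from collections import Counter
from fractions import Fraction


def sample_mean(scores):
    """Sum of scores divided by number of scores: M = sum(X) / n."""
    return sum(scores) / len(scores)


def frequency_mean(scores):
    """Mean via frequencies: (sum over x of x * f(x)) / N."""
    freq = Counter(scores)  # f(x): how often each value x occurs
    n = len(scores)         # N: number of scores in the population
    return sum(x * f for x, f in freq.items()) / n


def probability_mean(scores):
    """Mean via probabilities: sum over x of x * p(x), with p(x) = f(x) / N."""
    freq = Counter(scores)
    n = len(scores)
    # Fraction keeps p(x) exact, so the result is an exact rational number.
    return sum(x * Fraction(f, n) for x, f in freq.items())


iq = [94, 108, 145, 121, 88, 133]
population = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6]

print(round(sample_mean(iq), 2))              # 114.83
print(round(frequency_mean(population), 2))   # 3.12
print(probability_mean(population))           # 53/17
```

All three routes give the same value for the same data. The frequency and probability forms only regroup the sum by distinct value x, which is why the probability form still works for an infinite population, where N is not available.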
Comprehension Check
• X = [2, 3, 3, 8]
• What is the mean of X?
A – 2
B – 3
C – 4
D – 5
E – 16

Comprehension Check
• X = [2, 3, 3, 8]
• What is N?
A – 3
B – 4
C – 5
D – 6
E – 7

Comprehension Check
• X = [2, 3, 3, 8]
• What is p(3)?
A – 1/2
B – 3/4
C – 2
D – 3
E – 6

Median
• Middle value
  – Higher than half the scores, lower than the other half
• Not the average of minimum and maximum
• Sorting approach
  – X = [4,7,5,8,6,2,1,4,3,5,6,8,7,4,3,6,9]
  – X = [1,2,3,3,4,4,4,5,5,6,6,6,7,7,8,8,9]

Mean vs. Median
• Both based on a notion of balance
• Mean sensitive to each datum's distance from the middle
• Median better for irregular distributions
  – Skew
  – Outliers
[Figure: Height (Inches) against Frequency (students), with mean and median marked.]
[Figure (skew): Household Income against Frequency (.5 M households), with Median ($46k) and Mean ($63k) marked.]
[Figure (outliers): Digit Span against Frequency (subjects), with the mean and the mean excluding the outlier marked.]

Mode
• Most common value
• Peak in the distribution for continuous variables
• Simple and insensitive
• Most useful when mean, median not definable
  – College majors, sex, favorite color
[Figure: Height (Inches) (58 to 74) against Frequency (students).]

Scale types
• Nominal (=)
  – Non-numeric, labels
  – Can't do math except to count
• Ordinal (>)
  – Values are ordered, but differences aren't meaningful
  – Education, contest, preferences, income?
  – (1st – 2nd) ≠ (2nd – 3rd)
• Interval (+,-)
  – Differences meaningful, zero not
  – 2° not twice as hot as 1°
• Ratio (*,/)
  – Zero is meaningful
  – Weight, time, etc.
• Numbers are just a model or analogy for the real world
  – Some properties relevant, some superfluous
• Little effect on statistics
  – (X – M): changes ratio to interval
  – All computations done on differences

Choosing the Right Statistic
[Diagram: Mean, Median, and Mode ordered along two opposing directions. Toward the mean: sophistication, sensitivity, precision. Toward the mode: simplicity, robustness (skew, outliers). Richness of data: Interval (+,-) for the mean, Ordinal (>) for the median, Nominal (=) for the mode.]
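The sorting approach to the median, the mode, and the mean's sensitivity to outliers can be sketched in Python as follows. The list `x` is the one from the Median slide; the digit-span scores and the list of majors are made-up illustrative data, and `median_by_sorting` is a hypothetical helper name. `multimode` requires Python 3.8 or later.

```python
from statistics import mean, median, multimode


def median_by_sorting(scores):
    """Sort the scores and take the middle value.

    For an even number of scores there is no single middle value, so this
    sketch averages the two middle values (a common convention).
    """
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Data from the Median slide: 17 scores, so the 9th sorted value is the median.
x = [4, 7, 5, 8, 6, 2, 1, 4, 3, 5, 6, 8, 7, 4, 3, 6, 9]
print(median_by_sorting(x))   # 5

# Mode: most common value(s). Here 4 and 6 each occur three times.
print(multimode(x))           # [4, 6]

# The mode needs only nominal data (labels that can be counted).
majors = ["psych", "bio", "psych", "art"]   # made-up example
print(multimode(majors))      # ['psych']

# Made-up digit-span scores, then the same scores with one extreme outlier.
spans = [5, 6, 6, 7, 7, 7, 8, 8, 9]
spans_with_outlier = spans + [40]

# The mean shifts with the outlier; the median does not.
print(mean(spans), median(spans))
print(mean(spans_with_outlier), median(spans_with_outlier))
```

In the made-up data a single outlier moves the mean from 7 to 10.3 while the median stays at 7, which matches the slide's point that the median is the better choice for skewed data or data with outliers.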