Descriptive Statistics I:
Central Tendency
9/1
Attendance Question
How many courses are you enrolled in this
semester?
A – 1
B – 2
C – 3
D – 4
E – 5 (or more)
Histograms
[Figure: histogram of Number of Courses (values 1 to 5) against Frequency (students), with callouts for the variable label, the values, the units, and the # observations for each value.]
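A histogram of a discrete variable is just a count of observations per value. The sketch below tallies such counts; the `courses` responses are hypothetical, not the class's actual answers.

```python
from collections import Counter

# Hypothetical responses to the attendance question (not real class data).
courses = [4, 5, 3, 4, 4, 2, 5, 4, 3, 1, 4, 5]

# Frequency: number of observations for each value of the variable.
frequency = Counter(courses)

# Text histogram: one row per value, one mark per student.
for value in range(1, 6):
    print(value, "#" * frequency[value])
```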
Continuous vs. Discrete Variables
• Discrete variable
– Can only take certain values (usually integers)
– Counts: people, classes, stories, …
• Continuous variable
– Infinite set of values, in principle
– Height, weight, temp, IQ, …
Histograms of Continuous Variables
• Bins or intervals
– Ranges for grouping continuous variables
– Width depends on number of data
[Figure: histograms of Height (Inches) against Frequency (students) at several bin widths: individual measurements (49.26 to 77.24), one-inch bins (58 to 74), and wide bins (<60, 60-64, 65-69, 70+).]
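Grouping a continuous variable into bins can be sketched as follows. The bin edges mirror the slide's labels (<60, 60-64, 65-69, 70+); the `heights` list and the helper `bin_label` are illustrative assumptions.

```python
from collections import Counter

# Illustrative heights in inches (an assumption, not the full class data).
heights = [57.65, 59.43, 61.54, 63.23, 64.78, 66.62,
           67.27, 68.90, 70.26, 71.49, 73.56]

def bin_label(height, width=5, start=60, stop=70):
    """Return the label of the interval that contains `height`."""
    if height < start:
        return f"<{start}"
    if height >= stop:
        return f"{stop}+"
    # Lower edge of the bin; "60-64" covers 60 <= height < 65.
    low = start + int((height - start) // width) * width
    return f"{low}-{low + width - 1}"

# Frequency of observations per bin.
bin_counts = Counter(bin_label(h) for h in heights)
print(bin_counts)
```

A smaller `width` gives more, narrower bins, which only makes sense when there are enough data to fill them.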
Central Tendency
• Descriptive statistics – simplify a large set of
data to a single (meaningful) number
• Central Tendency
– One useful kind of summary information
– Intuitively: typical, average, normal value
• Three statistics for central tendency
– Mean
– Median
– Mode
Mean
• Sum of scores divided by number of scores
• Sample mean: $M = \frac{\sum X}{n}$
IQ: X = [94, 108, 145, 121, 88, 133]
ΣX = 94 + 108 + 145 + 121 + 88 + 133 = 689
M = 689/n = 689/6 = 114.83
• Equal apportionment
– If everyone had mean score, total would be the same
• Balance point, seesaw analogy (Fig 3.3)
• Equal upward and downward distances
[Figure: seesaw with the scores 88, 94, 108, 121, 133, 145 balanced at M.]
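The IQ example and the two interpretations of the mean can be checked numerically. This is a minimal sketch using the slide's data; the variable names are my own.

```python
iq = [94, 108, 145, 121, 88, 133]

n = len(iq)
total = sum(iq)      # sum of scores
mean = total / n     # sample mean M

# Equal apportionment: if every score equaled M, the total would be unchanged.
apportioned_total = mean * n

# Balance point: distances above and below M cancel out.
deviations = [x - mean for x in iq]

print(total, round(mean, 2), round(sum(deviations), 10))
```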
Population Mean
• Finite population
X = [1,1,2,2,2,2,3,3,3,3,3,4,4,4,5,5,6]
$\mu = \frac{\sum X}{N}$
Frequency: f(x)
x=1: f(x) = 2, ΣX = 1*2
x=2: f(x) = 4, ΣX = 2*4
x=3: f(x) = 5, ΣX = 3*5
$\sum X = \sum_x x \cdot f(x)$
$\mu = \frac{\sum_x x \cdot f(x)}{N} = \sum_x x \cdot p(x)$, where $p(x) = f(x)/N$
• Infinite population
$\mu = \sum_x x \cdot p(x)$
[Figure: histogram of Frequency f(x), 0 to 6, against Possible values x, 1 to 6.]
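The three ways of writing the population mean give the same number. The sketch below verifies this for the slide's finite population, using exact fractions to avoid rounding; it assumes p(x) is the proportion f(x)/N.

```python
from collections import Counter
from fractions import Fraction

population = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6]
N = len(population)

# Direct definition: sum of all scores divided by N.
mu_direct = Fraction(sum(population), N)

# Frequency form: each possible value x weighted by its frequency f(x).
f = Counter(population)
mu_freq = Fraction(sum(x * f[x] for x in f), N)

# Probability form: p(x) = f(x) / N (assumed definition of p).
p = {x: Fraction(f[x], N) for x in f}
mu_prob = sum(x * p[x] for x in p)

print(mu_direct, mu_freq, mu_prob)
```

Only the last form carries over to an infinite population, where N and f(x) are not finite but p(x) still is.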
Comprehension Check
• X = [2, 3, 3, 8]
• What is the mean of X?
A – 2
B – 3
C – 4
D – 5
E – 16
Comprehension Check
• X = [2, 3, 3, 8]
• What is N?
A – 3
B – 4
C – 5
D – 6
E – 7
Comprehension Check
• X = [2, 3, 3, 8]
• What is p(3)?
A – 1/2
B – 3/4
C – 2
D – 3
E – 6
Median
• Middle value
– Higher than half the scores,
lower than other half
• Not average of minimum and maximum
• Sorting approach
X = [4,7,5,8,6,2,1,4,3,5,6,8,7,4,3,6,9]
X = [1,2,3,3,4,4,4,5,5,6,6,6,7,7,8,8,9]
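The sorting approach can be sketched as below. Averaging the two middle scores when the count is even is a common convention assumed here; the slide itself only shows an odd-length example.

```python
def median(scores):
    """Middle value of the sorted scores."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even count: average the two middle scores (assumed convention).
    return (ordered[mid - 1] + ordered[mid]) / 2

x = [4, 7, 5, 8, 6, 2, 1, 4, 3, 5, 6, 8, 7, 4, 3, 6, 9]
print(sorted(x))
print(median(x))
```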
Mean vs. Median
• Both based on a notion of balance
• Mean sensitive to each datum's distance from middle
• Median better for irregular distributions
[Figures: three histograms with the mean and median marked. Height (Inches), 58 to 74, Frequency (students). Household Income, Frequency (.5 M households), with Mean ($63k) and Median ($46k). Digit Span, Frequency (subjects), with Mean and Mean excluding outlier. Panel labels: Skew, Outliers.]
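A single outlier shows the difference in sensitivity. The digit-span scores below are hypothetical, chosen only to illustrate the point; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical digit-span scores (illustrative only).
digit_span = [5, 6, 6, 7, 7, 7, 8, 8, 9]

# The same scores plus one extreme value.
with_outlier = digit_span + [40]

print("mean:  ", mean(digit_span), "->", mean(with_outlier))
print("median:", median(digit_span), "->", median(with_outlier))
```

The mean is pulled toward the outlier because it depends on each datum's distance from the middle; the median only depends on how many scores lie on each side.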
Mode
• Most common value
• Peak in the distribution for continuous variables
• Simple and insensitive
• Most useful when mean, median not definable
– College majors, sex, favorite color
[Figure: histogram of Height (Inches), 58 to 74, against Frequency (students), 0 to 10.]
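Because the mode only needs counting, it works on labels as well as numbers. A minimal sketch follows; the `colors` data and the helper `modes` are hypothetical, and returning every tied value is a design choice rather than something the slides specify.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often (ties included)."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

# Hypothetical nominal data: no mean or median is definable here.
colors = ["blue", "green", "blue", "red", "blue", "green"]
print(modes(colors))

# Numeric data from the population-mean example.
scores = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6]
print(modes(scores))
```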
Scale types
• Nominal (=)
– Non-numeric, labels
– Can't do math except to count
• Ordinal (>)
– Values are ordered, but differences aren’t meaningful
– Education, contest, preferences, income?
– 1st − 2nd ≠ 2nd − 3rd
• Interval (+,-)
– Differences meaningful, zero not
– 2° not twice as hot as 1°
• Ratio (*,/)
– Zero is meaningful
– Weight, time, etc.
• Numbers are just a model or analogy for real world
– Some properties relevant, some superfluous
• Little effect on statistics
– (X – M): changes ratio to interval
– All computations done on differences
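The point about (X – M) can be illustrated with temperature: Kelvin has a meaningful zero (ratio scale) and Celsius does not (interval scale), yet deviations from the mean are identical on both. The temperature values below are hypothetical.

```python
# Hypothetical temperatures in degrees Celsius (interval scale).
celsius = [10.0, 15.0, 20.0, 25.0]

# The same temperatures in kelvin (ratio scale): only the zero point moves.
kelvin = [c + 273.15 for c in celsius]

def deviations(scores):
    """Differences between each score and the mean, X - M."""
    m = sum(scores) / len(scores)
    return [x - m for x in scores]

print(deviations(celsius))
print([round(d, 6) for d in deviations(kelvin)])
```

Since the zero point cancels in X – M, statistics computed from differences are unaffected by whether the scale was ratio or interval to begin with.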
Choosing the Right Statistic
[Diagram: Mean, Median, and Mode compared along three dimensions.]
• Sophistication vs. Simplicity
• Sensitivity, Precision vs. Robustness (skew, outliers)
• Richness of Data: Interval (+, -), Ordinal (>), Nominal (=)