Statistical Methods in Computer Science
Data 2: Central Tendency & Variability
Ido Dagan
Frequency Distributions and Scales
Scales: Nominal, Ordinal, Numerical, Numerical/Continuous
- f distribution
- Grouped f distribution
- Apparent/real limits
- Relative f
- F (cumulative f)
- Percentiles, percentile ranks
Empirical Methods in Computer Science
© 2006-now Gal Kaminka / Ido Dagan
Characteristics of Distributions
Shape, Central Tendency, Variability
[Figure: distributions with different central tendency; distributions with different variability]
This Lesson

- Examine measures of central tendency
  - Mode (Nominal)
  - Median (Ordinal)
  - Mean (Numerical)
- Examine measures of variability (dispersion)
  - Entropy (Nominal)
  - Variance (Numerical), Standard Deviation
- Standard scores (z-score)
Centrality/Variability Measures
and Scales
Measure                     Scale
Mode                        Nominal
Median                      Ordinal
Mean                        Numerical
Entropy                     Nominal
Variance, Std. Deviation    Numerical
z-Scores                    Numerical
The Mode (Mo)
(Hebrew: השכיח)
- The mode of a variable is its most frequent value: $Mo = \arg\max_x f(x)$
- For a categorical variable: the category that appears most often
- For grouped data: the midpoint of the most frequent interval
  - Under the assumption that values are evenly distributed within the interval
Finding the Mode: Example 1
The collection of values that a variable X took during the measurement

Student  Grade
X1       60
X2       43
X3       57
X4       82
X5       75
X6       32
X7       82
X8       60

System: Windows, Linux, BSD, MacOS
Algorithm Name: Round-Robin, Round-Robin, Prioritized Scheduling, Preemptive Scheduling

Trial  Run-Time
#1     23.234
#2     15.471
#3     12.220
#4     23.100

Mode of the run-times: ? Depends on grouping
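As a rough sketch of the definition Mo = argmax f(x), the following Python counts frequencies and returns every value tied for the highest count. The helper name `modes` is hypothetical and not from the lecture; the data are the grades and algorithm names of Example 1.

```python
from collections import Counter

def modes(values):
    """Return all values that share the highest frequency (there may be several)."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, f in counts.items() if f == top]

# Data from Example 1
grades = [60, 43, 57, 82, 75, 32, 82, 60]
algorithms = ["Round-Robin", "Round-Robin",
              "Prioritized Scheduling", "Preemptive Scheduling"]
run_times = [23.234, 15.471, 12.220, 23.100]

print(modes(grades))       # two grades tie for the highest frequency
print(modes(algorithms))   # a single most frequent category
print(modes(run_times))    # every value occurs once, so all are returned
```

For the run-times every value is unique, so the result says nothing about where the data concentrate. That is one way to see why a continuous variable has to be grouped before a mode is meaningful.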
Finding the Mode: Example 2
The mode of a grouped frequency distribution depends on grouping

Ungrouped scores, Mo = 86:

Score  f     Score  f
96     1     78     4
95     0     77     2
94     0     76     1
93     0     75     1
92     0     74     0
91     1     73     1
90     1     72     2
89     3     71     0
88     2     70     3
87     2     69     1
86     6     68     2
85     2     67     1
84     2     66     0
83     1     65     0
82     3     64     0
81     2     63     0
80     2     62     1
79     2     61     1

Grouped with i = 5, Mo = 88:

Score    f
96-100   1
91-95    1
86-90    14
81-85    10
76-80    11
71-75    4
66-70    7
61-65    2
N =      50

Grouped with i = 5, Mo = 87:

Score    f
95-99    1
90-94    2
85-89    15
80-84    10
75-79    10
70-74    6
65-69    4
60-64    2
N =      50
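The dependence on grouping can be checked mechanically. The sketch below regroups the ungrouped frequency table from this example into intervals of width 5, starting once at 61 and once at 60, and reports the midpoint of the most frequent interval. The function name `grouped_mode` and its parameters are illustrative assumptions, not part of the lecture.

```python
from collections import Counter

# Ungrouped frequency table from the slide (score -> f); scores with f = 0 are omitted.
FREQ = {
    96: 1, 91: 1, 90: 1, 89: 3, 88: 2, 87: 2, 86: 6, 85: 2, 84: 2,
    83: 1, 82: 3, 81: 2, 80: 2, 79: 2, 78: 4, 77: 2, 76: 1, 75: 1,
    73: 1, 72: 2, 70: 3, 69: 1, 68: 2, 67: 1, 62: 1, 61: 1,
}

def grouped_mode(freq, lowest_limit, width=5):
    """Midpoint of the most frequent interval, for intervals starting at lowest_limit."""
    grouped = Counter()
    for score, f in freq.items():
        # Lower apparent limit of the interval that contains this score
        lower = lowest_limit + ((score - lowest_limit) // width) * width
        grouped[lower] += f
    lower, _ = grouped.most_common(1)[0]
    # Apparent limits are lower .. lower + width - 1, so the midpoint is:
    return lower + (width - 1) / 2

print(max(FREQ, key=FREQ.get))               # ungrouped mode
print(grouped_mode(FREQ, lowest_limit=61))   # intervals 61-65, 66-70, ...
print(grouped_mode(FREQ, lowest_limit=60))   # intervals 60-64, 65-69, ...
```

The same 50 scores give three different modes depending on whether and how they are grouped.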
The Median (Mdn)
(Hebrew: החציון)
- The median of a variable is its 50th percentile, P50: the point below which 50% of all measurements fall
- Requires ordering: only ordinal and numerical scales
- Examples:
  - 0,8,8,11,15,16,20 ==> Mdn = 11
  - 12,14,15,18,19,20 ==> Mdn = 16.5 (halfway between 15 and 18)
  - 5,7,8,8,8,8 ==> Mdn = ?
    - One method: halfway between the first and second 8, Mdn = 8
    - Another (for real limits): use linear interpolation as we did in intervals, Mdn = 7.75
      - 7.75 = 7.5 + (¼ * 1.0)
      - 7.5: the real limit between 7 and 8
      - ¼: 1 of the four 8's
      - 1.0: width of the interval containing the 8's (real limits)
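The two methods for the tied example can be compared in a few lines of Python. This is a minimal sketch: `interpolated_median` is a hypothetical helper that assumes a non-empty list of scores measured in units of `width` (so the real limits of 8 are 7.5 and 8.5), and `statistics.median` from the standard library plays the role of the "halfway between the middle values" method.

```python
import math
import statistics

def interpolated_median(values, width=1.0):
    """Median by linear interpolation inside the real limits of the tied value."""
    s = sorted(values)
    half = len(s) / 2                     # number of scores that must fall below Mdn
    v = s[math.ceil(half) - 1]            # value whose interval contains that score
    below = sum(1 for x in s if x < v)    # scores below the lower real limit
    f = s.count(v)                        # scores inside the interval
    return (v - width / 2) + (half - below) / f * width

data = [5, 7, 8, 8, 8, 8]
print(statistics.median(data))      # halfway between the first and second 8
print(interpolated_median(data))    # 7.5 + (1/4) * 1.0
```

Here 2 scores lie below 7.5 and 3 are needed to reach 50%, so the median sits one quarter of the way into the interval that holds the four 8's.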
The Arithmetic Mean
(Hebrew: ממוצע חשבוני)
- Arithmetic mean (mean, for short): $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$
- "Average" is colloquial: it is not precisely defined when used, so we avoid the term.
Properties of Central Tendency Measures
Mo:
- Relatively unstable between samples
- Problematic in grouped distributions
- Can be more than one:
  - Distributions that have more than one mode are sometimes called multi-modal
  - For uniform distributions, all values are possible modes
- Typically used only on nominal data
Properties of Central Tendency Measures
Mean:
- Responsive to the exact value of each score
  - Only interval and ratio scales
  - Takes the total of the scores into account: does not ignore any value
  - Sum of deviations from the mean is always zero: $\sum_i (X_i - \bar{X}) = 0$
- Because of this: sensitive to outliers
  - Presence/absence of scores at extreme values
- Stable between samples, and the basis for many other statistical measures
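The zero-sum property of the deviations follows directly from the definition of the mean as the sum of the N scores divided by N. A short derivation, written in LaTeX:

```latex
\sum_{i=1}^{N} (X_i - \bar{X})
  = \sum_{i=1}^{N} X_i - N\bar{X}
  = N\bar{X} - N\bar{X}
  = 0,
\qquad \text{since } \bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i .
```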
Properties of Central Tendency Measures
Median:
- Robust to extreme values
  - Only cares about ordering, not magnitude of intervals
- Often used with skewed distributions
[Figure: distribution with Mo, Mdn, and Mean marked]

Properties of Central Tendency Measures
Contrasting Mode, Median, Mean
[Figures: distributions with Mo, Mdn, and Mean marked]
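A small numerical contrast may help here. The sketch below takes the grades from Example 1 and appends one extreme score; the value 1000 is a made-up outlier chosen purely for illustration.

```python
import statistics

grades = [60, 43, 57, 82, 75, 32, 82, 60]   # grades from Example 1
with_outlier = grades + [1000]              # hypothetical extreme score, for illustration

for label, data in [("original", grades), ("with outlier", with_outlier)]:
    print(label, statistics.mean(data), statistics.median(data))
```

The mean moves from 61.375 to well above every ordinary grade, while the median stays at 60, because the median depends only on the ordering of the scores.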
Dispersion and Variability
Mode, Median, Mean: Only give central tendencies
We need to measure the spread of the distribution
[Figure: distribution with Mo, Mdn, and Mean marked]
Dispersion as Ranges

- Range: max(X) - min(X)
- Semi-interquartile range: $Q = \frac{Q_3 - Q_1}{2} = \frac{P_{75} - P_{25}}{2}$
  - Half the range that contains the middle 50% of the scores
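Both range measures are easy to sketch in Python. One caveat: there are several conventions for computing percentiles. The `percentile` helper below assumes linear interpolation between sorted values, which can differ slightly from the real-limits interpolation used elsewhere in these slides. All function names are illustrative.

```python
import math

def percentile(values, p):
    """p-th percentile by linear interpolation between sorted values (one of several conventions)."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def value_range(values):
    """max(X) - min(X)"""
    return max(values) - min(values)

def semi_interquartile_range(values):
    """Q = (P75 - P25) / 2"""
    return (percentile(values, 75) - percentile(values, 25)) / 2

grades = [60, 43, 57, 82, 75, 32, 82, 60]   # grades from the mode example
print(value_range(grades))
print(semi_interquartile_range(grades))
```

The range depends only on the two most extreme scores, whereas Q ignores the lowest and highest quarters of the data.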
Dispersion as Deviation

- Look at dispersion as a function of the central tendency (mean)
- We know the sum of deviations from the mean is zero: $\sum_i (X_i - \bar{X}) = 0$
- But what if we look at the sum of absolute deviations? $\sum_i |X_i - \bar{X}|$
  - A smaller sum indicates more clustering of the distribution around the mean
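To make the contrast concrete, here is a minimal sketch with two made-up samples that share the same mean but differ in spread. The data and the helper name `sum_abs_deviations` are hypothetical.

```python
def sum_abs_deviations(values):
    """Sum of absolute deviations from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values)

tight = [9, 10, 10, 11]    # hypothetical scores clustered around 10
spread = [2, 6, 14, 18]    # hypothetical scores with the same mean, more spread

# Signed deviations cancel out; absolute deviations do not.
print(sum(x - 10 for x in tight), sum_abs_deviations(tight))
print(sum(x - 10 for x in spread), sum_abs_deviations(spread))
```

The signed sums are zero for both samples, while the absolute sums separate the tightly clustered sample from the dispersed one.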
Variance


- Statisticians prefer a different way to make the deviations non-negative: squaring them
- Sum of squares
  - Shorthand for: sum of squared deviations from the mean
  - $SS_X = \sum_i (X_i - \bar{X})^2$
- And normalizing for the size of the sample:
  - $S^2 = \sigma^2 = \frac{SS_X}{N} = \frac{\sum_i (X_i - \bar{X})^2}{N}$
- This is called the variance of the distribution
Standard Deviation (std.)
- Square root of the variance: $\sigma = \sqrt{\frac{\sum_i (X_i - \bar{X})^2}{N}} = \sqrt{\frac{SS_X}{N}}$
- Robust to sampling variation:
  - Does not change very much with new samples of the population
- Perhaps the most common measure of dispersion
- As defined here, this is the standard deviation of a population; the estimate computed from a sample is slightly different
  - We ignore this for now; return to this later
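The chain from sum of squares to variance to standard deviation can be sketched as follows. The scores are a made-up sample chosen so the numbers come out round, and the function names are illustrative. The standard library's `statistics.pvariance` and `statistics.pstdev` implement the same population (divide by N) definitions and are shown for comparison.

```python
import math
import statistics

def sum_of_squares(values):
    """SS_X: sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

def variance(values):
    """Population variance: SS_X / N."""
    return sum_of_squares(values) / len(values)

def std_dev(values):
    """Population standard deviation: square root of the variance."""
    return math.sqrt(variance(values))

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical scores with mean 5
print(sum_of_squares(scores), variance(scores), std_dev(scores))

# The standard library's population versions should agree:
print(statistics.pvariance(scores), statistics.pstdev(scores))
```

For these scores the sum of squares is 32, the variance is 4, and the standard deviation is 2.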
Standard Scores

- Mean, median, etc. shift along with constant translations
  - Adding V to each value is the same as adding V to the central tendency measures
- We may also need to compare distributions that differ in range
- For instance, what's better:
  - Score of 50, when mean is 60
  - Score of 60, when mean is 80
  - ....
- Can compute z-scores of the raw scores
z Scores

- Key idea: express all values in units of standard deviation
  - $z = \frac{X - \bar{X}}{\sigma_X}$
- This allows comparison of values from different distributions
  - But only if the shapes of the distributions are similar
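As an illustration, the sketch below applies the z-score formula to the two scores from the "what's better" question. The question gives only the means, so the standard deviations used here (5 and 20) are hypothetical values picked to show that the answer depends on them. The helper names are also assumptions.

```python
import math

def z_score(x, mean, std):
    """Distance of x from the mean, in units of standard deviation."""
    return (x - mean) / std

# Only the means come from the slides; the standard deviations are hypothetical.
print(z_score(50, mean=60, std=5))    # 2 standard deviations below the mean
print(z_score(60, mean=80, std=20))   # 1 standard deviation below the mean

def z_scores(values):
    """Convert every raw score to a z-score using the population std."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    return [(x - mean) / std for x in values]

print(z_scores([2, 4, 4, 4, 5, 5, 7, 9]))   # hypothetical raw scores
```

Under these assumed standard deviations the score of 60 is the better one, even though it is further from its mean in raw points.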
Measuring Dispersion in Nominal Scales
Entropy: $H(X) = -\sum_X r_X \log r_X$, where $r_X$ is the relative frequency (rel f) of the value X
- Entropy of 0 means that all scores have the same value X
  - rel f = 1.0 for some value X
- Entropy grows positive when values become more dispersed
  - e.g., with log base 2, entropy of 1 means all scores split evenly between two values
- Entropy is maximal when $r_X = 1/N$ for all values X, where N is the number of distinct values
  - i.e., uniform distribution
Normalizing Entropy

- Can normalize by dividing by the maximal entropy for n distinct values:
  - $\max H = -\sum_{i=1}^{n} \frac{1}{n}\log\frac{1}{n} = -n\left(\frac{1}{n}\log\frac{1}{n}\right) = -\log\frac{1}{n} = \log n$
- This allows comparing the entropy of distributions with different numbers of values
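A minimal Python sketch of entropy and its normalized form follows. It assumes log base 2 and takes n to be the number of distinct values actually observed in the data; both choices, and the function names, are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(values):
    """H(X) = -sum of r_X * log2(r_X) over the distinct values X."""
    n = len(values)
    return sum(-(f / n) * math.log2(f / n) for f in Counter(values).values())

def normalized_entropy(values):
    """Entropy divided by its maximum, log2(number of distinct values)."""
    k = len(set(values))
    return entropy(values) / math.log2(k) if k > 1 else 0.0

algorithms = ["Round-Robin", "Round-Robin",
              "Prioritized Scheduling", "Preemptive Scheduling"]

print(entropy(["a", "a", "a", "a"]))    # all scores the same
print(entropy(["a", "b", "a", "b"]))    # even split between two values
print(entropy(algorithms), normalized_entropy(algorithms))
```

The normalized value lies between 0 and 1, so nominal variables with different numbers of categories can be compared on the same scale.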