Statistical Methods in Computer Science
Data 2: Central Tendency & Variability
Ido Dagan
Frequency Distributions and Scales
Scales: Nominal, Ordinal, Numerical, Numerical/Continuous
Concepts: frequency distribution (f), grouped frequency distribution, apparent/real limits, relative frequency, cumulative frequency (F), percentiles and percentile ranks
Empirical Methods in Computer Science
© 2006-now Gal Kaminka
Characteristics of Distributions
Shape, Central Tendency, Variability
[Figures: distributions with different central tendency; distributions with different variability]
This Lesson
Examine measures of central tendency
- Mode (Nominal)
- Median (Ordinal)
- Mean (Numerical)
Examine measures of variability (dispersion)
- Entropy (Nominal)
- Variance (Numerical), Standard Deviation
- Standard scores (z-score)
Centrality/Variability Measures and Scales

Scale             Nominal   Ordinal   Numerical / Continuous
Central tendency  Mode      Median    Mean
Variability       Entropy             Variance, Std. Deviation, z-Scores
The Mode (Mo)
(Hebrew: השכיח)

The mode of a variable is its most frequent value:
Mo = argmax_x f(x)

For a categorical variable: the category that appears most often

For grouped data: the midpoint of the most frequent interval
- Under the assumption that values are evenly distributed within the interval
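
The two definitions above can be sketched in Python. This is a minimal illustration, not part of the original slides: the function names `modes` and `grouped_mode` and the sample data are hypothetical. Ties are returned as a list, since a variable can have more than one mode.

```python
from collections import Counter

def modes(values):
    """Return every value that attains the highest frequency f(x)."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, f in counts.items() if f == top]

def grouped_mode(intervals):
    """intervals: list of (low, high, f) tuples.
    Return the midpoint of the most frequent interval
    (the first one, if several intervals tie)."""
    low, high, _ = max(intervals, key=lambda t: t[2])
    return (low + high) / 2

# Hypothetical categorical data
colors = ["red", "blue", "red", "green", "blue", "red"]
print(modes(colors))          # ['red']

# Hypothetical grouped data: (low, high, frequency)
grouped = [(60, 64, 2), (65, 69, 7), (70, 74, 4)]
print(grouped_mode(grouped))  # 67.0
```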
Finding the Mode: Example 1
The collection of values that a variable X took during the measurement

Student   Grade
X1        60
X2        43
X3        57
X4        82
X5        75
X6        32
X7        82
X8        60

System    Algorithm Name
Windows   Round-Robin
Linux     Round-Robin
BSD       Prioritized Scheduling
MacOS     Preemptive Scheduling

Trial     Run-Time
#1        23.234
#2        15.471
#3        12.220
#4        23.100
Mode of Run-Time: ? Depends on grouping
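
As a rough check of Example 1, the sketch below counts frequencies for the three variables. The helper names `modes` and `grouped_modes` are hypothetical, and the half-open intervals [k*width, (k+1)*width) are an assumption made for simplicity; they are not the apparent/real limits convention used elsewhere in these slides.

```python
from collections import Counter

def modes(values):
    """All values with the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, f in counts.items() if f == top)

def grouped_modes(values, width):
    """Midpoints of the most frequent intervals [k*width, (k+1)*width)."""
    bins = Counter(int(v // width) for v in values)
    top = max(bins.values())
    return sorted((k + 0.5) * width for k, f in bins.items() if f == top)

grades = [60, 43, 57, 82, 75, 32, 82, 60]
algorithms = ["Round-Robin", "Round-Robin",
              "Prioritized Scheduling", "Preemptive Scheduling"]
run_times = [23.234, 15.471, 12.220, 23.100]

print(modes(grades))                 # [60, 82]
print(modes(algorithms))             # ['Round-Robin']
print(modes(run_times))              # all four values: each occurs once
print(grouped_modes(run_times, 1))   # [23.5]
print(grouped_modes(run_times, 10))  # [15.0, 25.0]
```

The grades have two modes, the algorithm names have one, and the run-times have no useful mode until they are grouped; the answer then changes with the interval width.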
Finding the Mode: Example 2
The mode of a grouped frequency distribution depends on grouping

Ungrouped distribution (Mo = 86):
Score  f     Score  f
96     1     78     4
95     0     77     2
94     0     76     1
93     0     75     1
92     0     74     0
91     1     73     1
90     1     72     2
89     3     71     0
88     2     70     3
87     2     69     1
86     6     68     2
85     2     67     1
84     2     66     0
83     1     65     0
82     3     64     0
81     2     63     0
80     2     62     1
79     2     61     1

Grouped, i=5, intervals starting at 61 (Mo = 88):
Score    f
96-100   1
91-95    1
86-90    14
81-85    10
76-80    11
71-75    4
66-70    7
61-65    2
N =      50

Grouped, i=5, intervals starting at 60 (Mo = 87):
Score    f
95-99    1
90-94    2
85-89    15
80-84    10
75-79    10
70-74    6
65-69    4
60-64    2
N =      50
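
The dependence on grouping can be reproduced from the ungrouped frequencies. The following is a sketch only: `grouped_distribution` and `mode_midpoint` are hypothetical helper names, and intervals are identified by their apparent limits.

```python
# Frequencies for scores 96 down to 61, in the order of the ungrouped table.
scores = list(range(96, 60, -1))
freqs = [1, 0, 0, 0, 0, 1, 1, 3, 2, 2, 6, 2, 2, 1, 3, 2, 2, 2,
         4, 2, 1, 1, 0, 1, 2, 0, 3, 1, 2, 1, 0, 0, 0, 0, 1, 1]
f = dict(zip(scores, freqs))

def grouped_distribution(f, start, width):
    """Sum frequencies into intervals of the given width,
    with the lowest interval beginning at `start`."""
    grouped = {}
    for score, count in f.items():
        low = start + ((score - start) // width) * width
        key = (low, low + width - 1)   # apparent limits, e.g. (86, 90)
        grouped[key] = grouped.get(key, 0) + count
    return grouped

def mode_midpoint(grouped):
    """Midpoint of the most frequent interval."""
    low, high = max(grouped, key=grouped.get)
    return (low + high) / 2

ungrouped_mode = max(f, key=f.get)
mode_from_61 = mode_midpoint(grouped_distribution(f, 61, 5))
mode_from_60 = mode_midpoint(grouped_distribution(f, 60, 5))

print(ungrouped_mode)  # 86
print(mode_from_61)    # 88.0
print(mode_from_60)    # 87.0
```

Shifting the interval boundaries by one point moves the mode from 88 to 87, while the ungrouped mode is 86.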
The Median (Mdn)
(Hebrew: החציון)

The median of a variable is its 50th percentile, P50: the point below which 50% of all measurements fall

Requires ordering: only ordinal and numerical scales
Examples:
- 0,8,8,11,15,16,20 ==> Mdn = 11
- 12,14,15,18,19,20 ==> Mdn = 16.5 (halfway between 15 and 18)
- 5,7,8,8,8,8 ==> Mdn = ?
  - One method: halfway between the first and second 8, Mdn = 8
  - Another: use linear interpolation as we did in intervals, Mdn = 7.75
    - 7.75 = 7.5 + (¼ * 1.0)
    - 7.5: halfway between 7 and 8
    - ¼: 1 of four 8's
    - 1.0: width of the interval containing the 8's (real limits)
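
Both methods can be sketched in Python. The function names are hypothetical, and `median_interpolated` assumes scores are measured in units of `width`, so that a score v has real limits v - width/2 and v + width/2.

```python
from collections import Counter

def median_halfway(values):
    """Middle value, or halfway between the two middle values for even N."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def median_interpolated(values, width=1.0):
    """P50 by linear interpolation inside the real limits of the
    score that contains the 50% point."""
    counts = Counter(values)
    target = len(values) / 2          # number of scores that must fall below P50
    below = 0
    for v in sorted(counts):
        f = counts[v]
        if below + f >= target:
            lower_real_limit = v - width / 2
            return lower_real_limit + (target - below) / f * width
        below += f

print(median_halfway([0, 8, 8, 11, 15, 16, 20]))   # 11
print(median_halfway([12, 14, 15, 18, 19, 20]))    # 16.5
print(median_halfway([5, 7, 8, 8, 8, 8]))          # 8.0
print(median_interpolated([5, 7, 8, 8, 8, 8]))     # 7.75
```

For the last data set, two scores lie below 7.5, so one more of the four 8's is needed to reach 50%: 7.5 + (1/4) * 1.0 = 7.75. The two conventions do not always agree, so it is worth stating which one is used.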
The Arithmetic Mean
(Hebrew: ממוצע חשבוני)
Arithmetic mean (mean, for short)
"Average" is colloquial: it is not precisely defined when used, so we avoid the term.
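
For reference, the arithmetic mean is the sum of all scores divided by the number of scores N. A minimal sketch, using the grades from Example 1; the function names are hypothetical, and `mean_from_frequencies` assumes the data is given as a score-to-frequency mapping:

```python
def mean(values):
    """Arithmetic mean: sum of the scores divided by their count N."""
    return sum(values) / len(values)

def mean_from_frequencies(f):
    """Same quantity computed from a frequency distribution {score: f}."""
    n = sum(f.values())
    return sum(score * count for score, count in f.items()) / n

grades = [60, 43, 57, 82, 75, 32, 82, 60]
print(mean(grades))  # 61.375

grade_freqs = {60: 2, 82: 2, 43: 1, 57: 1, 75: 1, 32: 1}
print(mean_from_frequencies(grade_freqs))  # 61.375
```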
Properties of Central Tendency Measures
Mo:
- Relatively unstable between samples
- Problematic in grouped distributions
- Can be more than one:
  - Distributions that have more than one mode are sometimes called multi-modal
  - For uniform distributions, all values are possible modes
- Typically used only on nominal data
Properties of Central Tendency Measures
Mean:
- Responsive to the exact value of each score
  - Only interval and ratio scales
  - Takes the total of scores into account: does not ignore any value
- Sum of deviations from the mean is always zero: \sum_i (X_i - \bar{X}) = 0
- Because of this: sensitive to outliers
  - Presence/absence of scores at extreme values
- Stable between samples, and basis for many other statistical measures
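
Two of these properties can be checked numerically. The sketch below uses the grades from Example 1; the extra score of 1000 is a hypothetical outlier (for instance a data-entry error) added only for illustration.

```python
from statistics import mean, median

grades = [60, 43, 57, 82, 75, 32, 82, 60]

# Sum of deviations from the mean is zero (up to floating-point rounding).
m = mean(grades)
deviations = [x - m for x in grades]
print(sum(deviations))

# Hypothetical outlier: one extreme score shifts the mean a lot,
# while the median does not move.
with_outlier = grades + [1000]
print(mean(grades), median(grades))
print(round(mean(with_outlier), 2), median(with_outlier))
```

The positive and negative deviations cancel exactly, so a single extreme score pulls the mean toward itself in order to restore the balance.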
Properties of Central Tendency Measures
Median:
- Robust to extreme values
  - Only cares about ordering, not magnitude of intervals
- Often used with skewed distributions
[Figure: skewed distribution with Mo, Mdn, and Mean marked]
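
To illustrate how the three measures can separate on a skewed distribution, here is a small sketch. The sample is hypothetical (a right-skewed set of values, not taken from the slides), and `multimode` requires Python 3.8 or later.

```python
from statistics import mean, median, multimode

# Hypothetical right-skewed sample: most values are small, a few are large.
sample = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]

mo = multimode(sample)   # list of all modes
mdn = median(sample)
m = mean(sample)

print(mo)                # [2]
print(mdn)               # 3.0
print(mo[0] < mdn < m)   # True
```

In this sample the long right tail pulls the mean above the median, and the median above the mode. This ordering is typical for right-skewed data but is not guaranteed for every data set.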
Properties of Central Tendency Measures
Contrasting Mode, Median, Mean
[Figures: distributions with the positions of Mo, Mdn, and Mean marked]