VARIABILITY
Distributions
Measuring dispersion
Variance and standard deviation
Review: Distribution
An arrangement of cases according to their score or value on one or more variables
• Categorical variable
• Continuous variable
Case no.  Age  Height  M/F
1         23   68      M
2         22   64      F
3         23   69      F
4         25   71      M
5         27   64      F
6         22   72      M
7         24   65      F
8         23   66      M
9         23   66      F
10        25   68      F
11        21   68      M
12        21   62      F
13        24   71      M
14        27   66      F
15        21   62      F
16        25   56      F
17        22   71      M
18        22   70      M
19        25   66      F
20        26   60      F
21        21   52      F
22        31   70      F
23        24   71      M
24        31   61      F
25        23   72      M
26        27   71      F
27        25   71      M
28        26   64      F
29        22   66      F
30        29   69      M
31        24   67      F

Summary statistics
Age: mean = 24
Height: mean = 67
M/F: %M 39, %F 61
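The age and M/F summaries above can be recomputed from the case list. What follows is a minimal Python sketch that is not part of the original slides; the variable names are illustrative, and only the Age and M/F columns are used.

```python
from collections import Counter
from statistics import mean

# Age and M/F columns for the 31 cases, in case-number order
ages = [23, 22, 23, 25, 27, 22, 24, 23, 23, 25,
        21, 21, 24, 27, 21, 25, 22, 22, 25, 26,
        21, 31, 24, 31, 23, 27, 25, 26, 22, 29, 24]
sexes = list("MFFMFMFMFF" "MFMFFFMMFF" "FFMFMFMFFMF")

# Continuous variable: summarize with the mean
mean_age = mean(ages)

# Categorical variable: summarize with the percentage in each category
counts = Counter(sexes)
pct = {category: 100 * count / len(sexes) for category, count in counts.items()}

print(f"mean age = {mean_age:.0f}")
print(f"%M {pct['M']:.0f}  %F {pct['F']:.0f}")
```

A mean only makes sense for the continuous variable; the categorical variable is summarized by counting cases per category.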
Dispersion
How do cases “disperse” (arrange themselves) around the mean?
Three statistics that measure dispersion
• Average deviation: $\frac{\sum |x - \bar{x}|}{n}$
– Average distance between the mean and the values (scores) for each case
– Uses absolute distances (no + or -)
– Affected by extreme scores
– We’ll never use it in class
• Variance (s²): A sample’s cumulative dispersion: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
– In place of n we always use n-1 (our sample sizes are always small)
• Standard deviation (s): A standardized form of variance, comparable between samples: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
– In place of n we always use n-1 (our sample sizes are always small)
– Square root of the variance
– Expresses dispersion in units of equal size for that particular distribution
– Less affected by extreme scores
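The three statistics can be written as short Python functions. This is a sketch rather than anything taken from the slides: the function names and the five scores are hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
from math import sqrt

def average_deviation(scores):
    """Mean absolute distance between each score and the mean (divides by n)."""
    m = sum(scores) / len(scores)
    return sum(abs(x - m) for x in scores) / len(scores)

def variance(scores):
    """Sum of squared differences from the mean, divided by n-1."""
    m = sum(scores) / len(scores)
    return sum((x - m) ** 2 for x in scores) / (len(scores) - 1)

def standard_deviation(scores):
    """Square root of the variance."""
    return sqrt(variance(scores))

# Hypothetical scores (not from the slides); their mean is 5
scores = [2, 4, 4, 6, 9]

print(average_deviation(scores))   # 10 / 5 = 2.0
print(variance(scores))            # 28 / 4 = 7.0
print(round(standard_deviation(scores), 2))
```

Python's standard library offers `statistics.variance` and `statistics.stdev`, which also divide by n-1, so they can be used to check hand calculations.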
Variability exercise
Sample 1 (n=10)
Random sample of patrol officers, each scored 1-5 on a cynicism scale

Officer  Score  Mean  Diff.  Sq.
1        3      2.9   .1     .01
2        3      2.9   .1     .01
3        3      2.9   .1     .01
4        3      2.9   .1     .01
5        3      2.9   .1     .01
6        3      2.9   .1     .01
7        3      2.9   .1     .01
8        1      2.9   -1.9   3.61
9        2      2.9   -.9    .81
10       5      2.9   2.1    4.41
____________________________________________________
Sum                          8.90
Variance (sum of squares / n-1)            s² = .99
Standard deviation (sq. root of variance)  s = .99

This is not an acceptable graph – it’s only to illustrate dispersion
Sample 2 (n=10)
Another random sample of patrol officers, each scored 1-5 on a cynicism scale

Officer  Score  Mean  Diff.  Sq.
1        2      ___   ___    ___
2        1      ___   ___    ___
3        1      ___   ___    ___
4        2      ___   ___    ___
5        3      ___   ___    ___
6        3      ___   ___    ___
7        3      ___   ___    ___
8        3      ___   ___    ___
9        4      ___   ___    ___
10       2      ___   ___    ___
Sum ____
Variance s² ____
Standard deviation s ____
Compute ...
Two random samples of patrol officers, each scored 1-5 on a cynicism scale

Sample 1 (n=10)
Officer  Score  Mean  Diff.  Sq.
1        3      2.9   .1     .01
2        3      2.9   .1     .01
3        3      2.9   .1     .01
4        3      2.9   .1     .01
5        3      2.9   .1     .01
6        3      2.9   .1     .01
7        3      2.9   .1     .01
8        1      2.9   -1.9   3.61
9        2      2.9   -.9    .81
10       5      2.9   2.1    4.41
Sum                          8.90
Variance (sum of squares / n-1)            s² = .99
Standard deviation (sq. root of variance)  s = .99

Sample 2 (n=10)
Officer  Score  Mean  Diff.  Sq.
1        2      2.4   -.4    .16
2        1      2.4   -1.4   1.96
3        1      2.4   -1.4   1.96
4        2      2.4   -.4    .16
5        3      2.4   .6     .36
6        3      2.4   .6     .36
7        3      2.4   .6     .36
8        3      2.4   .6     .36
9        4      2.4   1.6    2.56
10       2      2.4   -.4    .16
Sum                          8.40
Variance (sum of squares / n-1)            s² = .93
Standard deviation (sq. root of variance)  s = .97

These are not acceptable graphs – they’re only used here to illustrate how the scores disperse around the mean
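The worksheet procedure (mean, difference, square, sum, divide by n-1, square root) can be checked with a few lines of Python. This is a sketch for verification only; the function name `worksheet` and the dictionary keys are assumptions made for illustration, while the scores are those of the two cynicism samples.

```python
from math import sqrt

def worksheet(scores):
    """Follow the manual procedure: mean, differences, squares, sum, s2, s."""
    n = len(scores)
    mean = sum(scores) / n
    diffs = [x - mean for x in scores]      # "Diff." column
    squares = [d ** 2 for d in diffs]       # "Sq." column
    sum_sq = sum(squares)                   # "Sum" row
    s2 = sum_sq / (n - 1)                   # variance, using n-1
    return {"mean": mean, "sum_sq": sum_sq, "s2": s2, "s": sqrt(s2)}

sample_1 = [3, 3, 3, 3, 3, 3, 3, 1, 2, 5]
sample_2 = [2, 1, 1, 2, 3, 3, 3, 3, 4, 2]

for name, scores in (("Sample 1", sample_1), ("Sample 2", sample_2)):
    r = worksheet(scores)
    print(f"{name}: mean={r['mean']:.1f} sum={r['sum_sq']:.2f} "
          f"s2={r['s2']:.2f} s={r['s']:.2f}")
# Sample 1: mean=2.9 sum=8.90 s2=0.99 s=0.99
# Sample 2: mean=2.4 sum=8.40 s2=0.93 s=0.97
```

For Sample 1 the variance and standard deviation both round to .99 only because the variance is close to 1; unrounded they are about 0.989 and 0.994.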
VARIABILITY
Shape of distributions
Flat, peaked, normal
“Flat” distributions
• Dispersion (aka, “variability”): How scores or values arrange themselves around the mean
• When scores are more dispersed (i.e., “variability” is greater) a distribution’s shape gets flatter
– Greater distance between most scores and the mean
– Many scores are at a considerable distance from the mean
– The mean loses value as a “summary statistic”
[Figure: distribution of arrests. Mean 3.65: a poor descriptor]
“Peaked” and “normal” distributions
• Dispersion (aka, “variability”): How scores or values arrange themselves around the mean
• Peaked: If most scores cluster about a certain value the shape of the distribution is called “peaked”
• Normal: If the clustering of scores is around the mean the distribution is called “normal”
– In social science research it turns out that scores or values for many variables are normally or near-normally distributed
– This allows use of the mean to describe the underlying datasets
– That’s why means are called a “summary statistic” - they can “summarize” the values of samples or populations
[Figure: distribution of arrests. Mean 2.3: not a good descriptor. Peaked distribution (but not “normal”)]
[Figure: distribution of arrests. Mean 3.0: a good descriptor. Peaked and “normal” distribution]
Characteristics of normal distributions
• Unimodal and symmetrical: shapes on both sides of the mean are identical
– 68.26 percent of the area “under” the curve – meaning 68.26 percent of the cases – falls within one “standard deviation” (+/- 1 SD) from the mean
• The fact that a distribution is “normal” or “near-normal” does NOT imply that the mean is of any particular value. All it implies is that scores distribute themselves around the mean “normally”.
– Means depend on the data. In this distribution the mean could be any value.
– By definition, the standard deviation score that corresponds with the mean of a normal distribution - whatever the mean might be - is zero (the mean sits at 0 SD).
[Figure: normal curve. Axis labels: Mean (whatever it is); Standard deviation (always 0 at the mean)]
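The share of a normal distribution that lies within one standard deviation of the mean can be computed from the error function in Python's `math` module. This is a sketch added for illustration; the function name `share_within` is an assumption, not something from the slides.

```python
from math import erf, sqrt

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

print(f"{100 * share_within(1):.2f} percent within +/- 1 SD")
```

The result rounds to 68.27 percent; the 68.26 figure commonly quoted in textbooks is the same quantity, apparently truncated instead of rounded. The result does not depend on the value of the mean, which is why the mean can be “whatever it is”.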
How well do means represent (summarize) a sample?
13 officers scored on numbers of tickets written in one week
If variable “no. of tickets” were “normally” distributed most cases would fall inside a bell-shaped curve. Here they don’t.
In a normal distribution about 68% of cases would fall within 1 SD of the mean: 13 × .68 ≈ 9 cases.
But here only 7 cases (Officers D-J) do, while nearly as many (6) don’t. Scores are very dispersed, making the distribution mostly flat. So here the mean is NOT a good shortcut for describing how officers performed.
[Figure: frequency of Officers A-M by number of tickets, with markers at -1 SD (2.13), the mean (4.46) and +1 SD (6.79)]
Officer A: 1 ticket
Officers B & C: 2 tickets each
Officers D & E: 3 tickets each
Officers F & G: 4 tickets each
Officers H & I: 5 tickets each
Officer J: 6 tickets
Officers K & L: 7 tickets each
Officer M: 9 tickets
Mean = 4.46
SD = 2.33
13 officers scored on numbers of tickets written in one week
Here most cases do fall inside the bell-shaped curve. Variable “no. of tickets” seems near-normally distributed.
Here, 9 of 13 cases (Officers C-K) do fall within 1 SD of the mean. The distribution is near-normal because most officers wrote close to the same number of tickets. The cases “cluster” around the mean.
So, for this sample the mean is a decent summary statistic - a good shortcut for describing officer performance.
[Figure: frequency of Officers A-M by number of tickets, with markers at -1 SD (2.59), the mean (4.69) and +1 SD (6.79)]
Officer A: 1 ticket
Officer B: 2 tickets
Officer C: 3 tickets
Officers D, E, F: 4 tickets each
Officers G, H, I: 5 tickets each
Officers J & K: 6 tickets each
Officer L: 7 tickets
Officer M: 9 tickets
Mean = 4.69
SD = 2.1
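The “how many cases fall within 1 SD of the mean” check can be reproduced for both ticket samples. The sketch below is illustrative only; the function name `within_one_sd` and the list names are assumptions, while the scores are the ticket counts listed for Officers A-M in each sample. It relies on `statistics.stdev`, which divides by n-1.

```python
from statistics import mean, stdev

def within_one_sd(scores):
    """Return the mean, the sample SD (n-1), and the count of cases within 1 SD."""
    m, s = mean(scores), stdev(scores)
    inside = [x for x in scores if m - s <= x <= m + s]
    return m, s, len(inside)

# Tickets written by Officers A-M in each sample
flat_sample = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 9]
near_normal_sample = [1, 2, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 9]

for name, scores in (("flat", flat_sample), ("near-normal", near_normal_sample)):
    m, s, count = within_one_sd(scores)
    print(f"{name}: mean={m:.2f} SD={s:.2f} "
          f"within 1 SD: {count} of {len(scores)}")
# flat: mean=4.46 SD=2.33 within 1 SD: 7 of 13
# near-normal: mean=4.69 SD=2.10 within 1 SD: 9 of 13
```

The two samples have similar means, yet only the second reaches the roughly 9 of 13 cases expected within 1 SD under a normal distribution.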
Going beyond description…
• When variables are normally or near-normally distributed, the mean, variance and standard deviation can help describe datasets
• But they are also useful in explaining why things change; that is, in testing hypotheses
• You want to test the hypothesis that college-educated cops are more effective: college → greater effectiveness
– Independent variable: college (Y/N)
– Dependent variable: effectiveness (scale 1-5)
• You go to the XYZ police dept., draw two samples of patrol officers - one of college grads, the other of non-college grads - and test each officer for effectiveness.
• On a scale of 1 (ineffective) to 5 (highly effective) this is how they scored:
– 10 college grads (mean 3.7)
– 10 non-college (mean 2.8)
• The difference between means is in the hypothesized direction. But does that “prove” that college grads are more effective? To determine whether the difference in means is “statistically significant,” meaning large enough to prove the value of education, we need to know each sample’s variance. Don’t worry - we’ll cover this later!
[Figure: Are college-educated cops more effective? Score distributions for college grads and non-college grads]
Exam information
• You must bring a regular, non-scientific calculator with no functions beyond a square root key.
• You will be asked to apply concepts including research question, hypothesis and variables to the “college education and police job performance” article.
• You will be given data and asked to create graph(s) depicting the distribution of a single variable.
• You will compute basic statistics, including mean, median, mode and standard deviation. All computations must be shown on the answer sheet.
• You will be given the formula for variance (s²). You must use and display the procedure described in the slides and practiced in class for manually calculating variance (s²) and its square root, known as standard deviation (s).
• This is a relatively brief exam. You will have one hour to complete it. We will then take a break and move on to the next topic.