Distributions
• When comparing two groups of people or
things, we can almost never rely on a single
comparison
• Example: Are men taller than women?
Distributions
• We almost always measure several or many
representative people or things
• We also almost never measure every person
or thing
• Instead, we measure some of them
• The “some of them” that you measure is
called a sample because we have “sampled”
from the entire population
Distributions
• The population is every possible person or
thing that could have been part of the
sample (e.g. all of the men in the world, all
of the women, etc.)
• We can tell a lot about a population by
looking at a sample (e.g. you don’t need to
eat a whole container of ice cream to know
if you like it!)
Distributions
• When you measure several different things
you get (no surprise!) different numbers
• We say that those numbers are distributed
Distributions
• A distribution is a set of numbers.
– Examples: the heights of the men in the room,
the heights of the women in the room, the ages
in the room, the scores on the mid-term, etc.
Distributions
• Looking at distributions:
– We often conceptualize distributions by graphing
them with a probability density function
[Figure: “Age Distribution” histogram; x-axis: Ages (18 to 30); y-axis: How Many? (0 to 60)]
Distributions
• Looking at distributions:
– Here’s an example of a “normal” distribution
[Figure: “Age Distribution” histogram; x-axis: Ages (18 to 30); y-axis: How Many? (0 to 60)]
Distributions
• Looking at distributions:
– Here’s an example of a “rectangular” distribution
[Figure: histogram; x-axis: Birthdays (1 to 31); y-axis: How Many? (0 to 60)]
Distributions
• key insight: The measurements in a sample
are distributed because the population is
distributed
• Ponder this: the more people or things in
your sample, the more your sample is like
the entire population
– It’s like “sampling” ice cream with a really big
spoon
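The “bigger spoon” idea can be tried out in a short Python sketch. The population here is entirely made up (10,000 hypothetical heights generated from a normal distribution), and `sample_mean` is an illustrative helper, not something from the lecture. It prints how far the sample mean lands from the population mean as the sample grows.

```python
import random
import statistics

# Hypothetical population: 10,000 made-up heights in cm (not lecture data).
rng = random.Random(0)  # fixed seed so the run is repeatable
population = [rng.gauss(170, 10) for _ in range(10_000)]
population_mean = statistics.mean(population)

def sample_mean(n):
    """Mean of n measurements drawn without replacement from the population."""
    return statistics.mean(rng.sample(population, n))

# Bigger "spoons": the sample mean tends to land closer to the population mean.
for n in (5, 50, 500, 5000, len(population)):
    gap = abs(sample_mean(n) - population_mean)
    print(n, round(gap, 3))
```

When the sample is the whole population the gap is zero; for smaller samples the gap varies from draw to draw but tends to shrink as n grows.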
Describing Distributions
• It’s no good to just have a pile of numbers,
we need a way of summarizing the
characteristics of the distribution.
What are some ways to describe a distribution?
Describing Distributions
• All distributions have a sum
– We could just add up the samples and talk
about, for example, the total height of the men
and the total height of the women in the room.
– What’s the problem with this approach?
Describing Distributions
• All distributions have a mean (a.k.a.
average)
– The mean is the normalized sum - this means
that it is adjusted for the number in the sample
– How do we do that?
– Divide the sum by the number in the sample
“The” Mean
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
$\bar{x}$ is pronounced “x bar” and means “the mean”
$x_1$ is measurement number 1
$x_n$ is the last measurement in the distribution (of n
measurements)
$x_i$ is any one of the measurements (you can fill in
the i with any number between 1 and n)
$\sum$ means “add these up”
In the formula, the numerator is the sum of the sample, the
denominator is the number of measurements, and the result is
“x bar” (the mean)
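Dividing the sum by the number in the sample can be sketched in a few lines of Python. The heights are made-up numbers, used only to show why the sum alone depends on how many people were measured while the mean does not.

```python
def mean(xs):
    """Sum of the sample divided by the number of measurements."""
    return sum(xs) / len(xs)

heights = [150, 160, 170, 180]  # made-up heights in cm

print(sum(heights))   # 660: the total grows with every extra person
print(mean(heights))  # 165.0: adjusted for the number in the sample
```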
Properties of the Mean
• Every value is some distance from the mean
- this distance is called a “deviation score”
deviation score = $x_i - \bar{x}$
Properties of the Mean
• The mean is the point from which the sum
of deviation scores is zero
• This means that the mean is like a balancing
point: all the scores below the mean are
balanced by the scores above the mean
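A quick check of the balancing-point property, as a minimal Python sketch. The scores are the same small sample used later for the median; any list of numbers would do.

```python
scores = [1, 2, 3, 100, 4]
x_bar = sum(scores) / len(scores)  # 22.0

# Deviation score for each measurement: x_i minus the mean.
deviations = [x - x_bar for x in scores]

print(deviations)       # [-21.0, -20.0, -19.0, 78.0, -18.0]
print(sum(deviations))  # 0.0
```

The four scores below the mean pull down by 78 in total, exactly what the single score above the mean pulls up.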
Properties of the Mean
• The sum of the squared deviations from the
mean is smaller than from any other number
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 < \sum_{i=1}^{n} (x_i - Y)^2$$
where Y is any other number
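The least-squares property can be checked numerically. This is an illustrative sketch, not a proof: `sum_sq_dev` is a hypothetical helper name, and the candidate values of Y are arbitrary.

```python
def sum_sq_dev(xs, anchor):
    """Sum of squared deviations of xs from an arbitrary anchor value."""
    return sum((x - anchor) ** 2 for x in xs)

scores = [1, 2, 3, 4, 5]
x_bar = sum(scores) / len(scores)  # 3.0

# Try the mean and several other candidate anchor values Y.
for y in (1, 2, x_bar, 4, 5.5):
    print(y, sum_sq_dev(scores, y))
```

The smallest total (10.0) appears at the mean; every other anchor gives a larger one.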
Properties of the Mean
• The mean is the number that, when added to
itself n times, gives you the sum of the
numbers in the sample
$$\underbrace{\bar{x} + \bar{x} + \cdots + \bar{x}}_{n \text{ times}} = \sum_{i=1}^{n} x_i$$
“Other” Means
• Sometimes just adding the items in the
sample and dividing by n gives you a
number that doesn’t really describe the n
numbers
– for example: a sine wave
[Figure: a sine wave oscillating between +1 and -1]
$\sum x_i = 0$ !
“Other” Means
• Root-Mean-Square (RMS): first square the
scores, then take the mean of the squares,
then take the square root to undo the squaring.
[Figure: sine wave, vertical axis from -1 to +1]
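The sine-wave example can be reproduced in Python. This sketch assumes RMS as it is usually defined (square each score, average the squares, take the square root) and samples one full cycle at 1000 evenly spaced points; the number of points is an arbitrary choice.

```python
import math

n = 1000
# One full cycle of a sine wave, sampled at n evenly spaced points.
wave = [math.sin(2 * math.pi * i / n) for i in range(n)]

plain_mean = sum(wave) / n                     # positive and negative halves cancel
rms = math.sqrt(sum(x * x for x in wave) / n)  # squaring removes the signs first

print(plain_mean)     # essentially zero (up to floating-point rounding)
print(round(rms, 4))  # about 0.7071, i.e. 1 / sqrt(2)
```

The ordinary mean says nothing about how big the wave is, while the RMS gives a single number that reflects its size.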
Other Descriptions of a
Distribution: the Median
• The mean is sensitive to outliers
– e.g. 1, 2, 3, 100, 4
– mean = 110/5 = 22 … not particularly
representative of the numbers in the sample
Other Descriptions of a
Distribution: the Median
• Another descriptive statistic, the median, is
less sensitive to outliers
– the median is the ordinal middle of the sample:
half of the measurements lie below the median
and half of the measurements lie above it.
– in other words it is the 50th percentile
Other Descriptions of a
Distribution: the Median
• for example:
– 1, 2, 3, 100, 4 put into rank order is…
– 1, 2, 3, 4, 100
– so the middle number (obviously) is 3
(remember that the mean was 22!)
Other Descriptions of a
Distribution: the Median
• if n is even take the average of the two
middle numbers:
– 1, 2, 3, 100, 4, 5 put into rank order is…
– 1, 2, 3, 4, 5, 100
– so the middle number is the average of 3 and 4
= 3.5
Other Descriptions of a
Distribution: the Median
• the median is not sensitive to outliers
– notice the median of 1, 2, 3, 4, 5 = the median
of 1, 2, 3, 4, 100 = 3
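The median examples can be checked with Python's standard `statistics` module, which sorts the data and averages the two middle values when n is even. A minimal sketch using the numbers from the slides:

```python
import statistics

odd_sample = [1, 2, 3, 100, 4]
even_sample = [1, 2, 3, 100, 4, 5]

print(sum(odd_sample) / len(odd_sample))  # 22.0: the mean, pulled up by the outlier
print(statistics.median(odd_sample))      # 3
print(statistics.median(even_sample))     # 3.5

# Replacing the largest value with an outlier leaves the median unchanged.
print(statistics.median([1, 2, 3, 4, 5]) == statistics.median([1, 2, 3, 4, 100]))  # True
```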
Measures of Variability
What’s not so good about using the mean to
describe a distribution?
Measures of Variability
Example: similar mean temperature in Vancouver
and Lethbridge on Sept. 11 2006
Time      Lethbridge     Vancouver
          Temperature    Temperature
5:00       5.5           11.9
6:00       5.1           12.3
7:00       9.6           14.3
8:00      14.6           16.6
9:00      18             18.3
10:00     21             17.7
11:00     23.7           17.7
12:00     25.1           19.3
13:00     26.6           20.5
14:00     27.7           20.2
15:00     28.2           20.2
16:00     29.1           19.8
17:00     28.6           19.1
18:00     26.7           17.7
19:00     19.2           17.1
20:00     17             17.2
21:00     19.9           16.3
mean =    20.3           17.4
Measures of Variability
Example: BUT the distribution of temperatures is
quite different for the two cities
(hourly readings as on the previous slide)
                       Lethbridge     Vancouver
mean =                 20.3           17.4
range =                24.0           8.3
standard deviation =   7.6            2.5
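The three summary statistics can be recomputed from the hourly readings. This sketch assumes the standard deviation divides by n (`statistics.pstdev`), which is a guess about how the slide's figures were produced; `describe` is a hypothetical helper name. Values recomputed from the transcribed readings may differ slightly from the rounded figures on the slide.

```python
import statistics

lethbridge = [5.5, 5.1, 9.6, 14.6, 18, 21, 23.7, 25.1, 26.6,
              27.7, 28.2, 29.1, 28.6, 26.7, 19.2, 17, 19.9]
vancouver = [11.9, 12.3, 14.3, 16.6, 18.3, 17.7, 17.7, 19.3, 20.5,
             20.2, 20.2, 19.8, 19.1, 17.7, 17.1, 17.2, 16.3]

def describe(temps):
    """Return (mean, range, standard deviation) for a list of temperatures."""
    spread = max(temps) - min(temps)
    # pstdev divides by n; assumed here, not confirmed by the slides.
    return statistics.mean(temps), spread, statistics.pstdev(temps)

for city, temps in (("Lethbridge", lethbridge), ("Vancouver", vancouver)):
    m, r, s = describe(temps)
    print(f"{city}: mean={m:.1f} range={r:.1f} sd={s:.1f}")
```

The means are only a few degrees apart, but Lethbridge's range and standard deviation are roughly three times Vancouver's.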
Measures of Variability
• The range is the highest number minus the
lowest number
• e.g. X = {1, 3, 23, 45, 62}
• the range is 62 - 1 = 61
• Notice that the range doesn’t tell you much
about the distribution of numbers.
– it doesn’t tell you where the distribution is
located (the mean)
– it doesn’t tell you how the numbers relate to
each other: e.g. 1, 48,49,50,51, 52, 100 has a
range of 99!
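A short Python sketch of why the range says so little. The first two lists come from the slides; `spread_out` is a made-up list added for contrast, and `value_range` is an illustrative helper name.

```python
def value_range(xs):
    """Highest number minus the lowest number."""
    return max(xs) - min(xs)

clustered = [1, 48, 49, 50, 51, 52, 100]   # from the slide: mostly near 50
spread_out = [1, 17, 34, 50, 67, 83, 100]  # made-up: evenly spaced values

print(value_range([1, 3, 23, 45, 62]))  # 61
print(value_range(clustered))           # 99
print(value_range(spread_out))          # 99
```

Two very differently arranged sets of numbers share the same range, because only the two extreme values enter the calculation.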
Measures of Variability
• What’s needed is a measure of the
“distance” between the numbers in the
distribution - how spread apart are they
from each other
Measures of Variability
Question: How tightly or loosely spaced are the cities?
D²
• One approach would be to calculate the
distances between each pair of cities
Cities: Vancouver, Hope, Cache Creek, Kamloops, Salmon Arm,
Revelstoke, Lake Louise, Banff, Calgary, Medicine Hat, Swift Current
First entries of the pairwise table:
                  to Vancouver    to Hope    to Cache Creek
from Vancouver    0               150        343
from Hope         -150            0          193
D²
• notice that there are n × n = n² pairs
D²
• If you sum up all the differences between
numbers you get…
ZERO
D²
• What does a statistician do when things
sum to zero?
• Square everything first, then sum them,
then square root
D²
• D² is the sum of the squared differences
• D is the square root of D²
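The pairwise approach can be sketched in Python using the three positions that appear in the slides (0, 150 and 343, measured from Vancouver). Treating the cities as points on a single line is an assumption made for illustration.

```python
import math

# Positions taken from the slides' first three entries (distance from
# Vancouver); placing the cities on one line is an assumption.
positions = {"Vancouver": 0, "Hope": 150, "Cache Creek": 343}

# All n * n ordered pairs, including each city paired with itself.
differences = [b - a for a in positions.values() for b in positions.values()]

d_squared = sum(d ** 2 for d in differences)  # sum of the squared differences
d = math.sqrt(d_squared)

print(len(differences))  # 9 pairs for n = 3
print(sum(differences))  # 0: every a-to-b difference is cancelled by b-to-a
print(d_squared, d)
```

The signed differences always cancel because each pair is counted once in each direction; squaring first removes the signs so the total reflects spread.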
D²
• What is the problem with using D or D²?
• if n is “pretty big” n² will be huge!
S²: a better choice
• Select a representative “anchor point” and
just measure distance from that point
• For example, measure distances relative to
Calgary
S²: a better choice
• Notice there are some negative distances
• We don’t care about the sign of the
distances, we just care about the
distances themselves
S²: a better choice
• S² (called the variance) is like D² except it
uses a single “anchor point” (like measuring
distances from Calgary)
• That anchor point is the mean
S: the standard deviation
• The standard deviation of a distribution of
values is the square root of the variance
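A minimal Python sketch of the variance and standard deviation, using the mean as the anchor point. Dividing by n (rather than n - 1) is an assumption here; it matches the 7.6 and 2.5 quoted for the temperature example, but the function names are illustrative only.

```python
import math
import statistics

def variance(xs):
    """Mean squared distance from the anchor point (the mean); divides by n."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / len(xs)

def standard_deviation(xs):
    """Square root of the variance."""
    return math.sqrt(variance(xs))

scores = [1, 2, 3, 4, 5]
print(variance(scores))            # 2.0
print(standard_deviation(scores))  # square root of 2.0

# Cross-check against the standard library's divide-by-n version.
print(statistics.pstdev(scores))
```

Only n distances are needed, one per measurement, instead of the n² pairwise differences.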
S: the standard deviation
• That can be rewritten this way for use with a
calculator:
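One common calculator-friendly form needs only n, the sum of the scores, and the sum of the squared scores. The sketch below assumes that form and a divide-by-n variance; it may not be the exact expression shown on the slide. It checks that the shortcut agrees with the definition.

```python
import math

def sd_definitional(xs):
    """Square root of the mean squared deviation from the mean."""
    n = len(xs)
    x_bar = sum(xs) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)

def sd_computational(xs):
    """Same quantity from running totals: n, sum of x, sum of x squared."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / n)

scores = [1, 3, 23, 45, 62]
print(sd_definitional(scores), sd_computational(scores))
```

The shortcut avoids computing the mean and each deviation first, which is convenient when keying numbers into a calculator one at a time.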
Next Time
• Transforming Scores (chapter 4)
• We begin significance testing (chs. 11, 12,
13, 14)