Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Central Limit Theorem
The Deepest Thought Ever Thunk
Paul E. Johnson1,2
1
2
Department of Political Science
University of Kansas
Center for Research Methods and Data Analysis
University of Kansas
2011
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Outline
CLT I
1
The Difference Between A Sample and The Truth
2
Sampling Distribution
3
Asymptotic Properties
4
The Central Limit Theorem
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Do you remember your friend, the Normal Distribution?
0.030
x ~ Normal(µ = 10.03,σ = 12.58)
1
σ 2π
e
−
1
2
x−µ
σ
2
0.020
Single Peaked
0.015
Symmetric
E [x] = µ
Var [x] = σ 2
0.000
0.005
0.010
Probability Density
0.025
f(x) =
= F(−14.62)
= 0.025
SD[x] = σ
σ = 12.58
Prob(x ≤ −14.62)
= 1 − F(34.68)
−14.62 = µ − 1.96σ
−14.62
−20
= 0.025
34.68 = µ + 1.96σ
µ = 10.03
0
34.68
20
40
x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
0.08
0.04
0.00
Density
0.12
Draw one Normal Sample from N(5.353, 4.552 )
−10
0
10
20
Observations from one sample
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
0.08
0.04
0.00
Density
0.12
The Theoretical PDF Is This:
−10
0
10
20
x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
0.04
0.08
Theoretical (PDF)
Observed Density
0.00
Density
0.12
But the Observed Density Differs
−10
0
10
20
x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Pop. Var.= 20.7
Obs. Var.= 20.17
0.04
0.08
Exp. Value= 5.353
Obs. Mean= 5.18
0.00
Density
0.12
But the Observed Density Differs
−5
0
5
10
15
20
Observations from one sample
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
The Basic Idea
Draw a lot of samples
Collect M samples of size N
Calculate the mean for each sample
What distribution will be observed among all of those means?
Do you expect the distribution of means will be different from the
distribution of x itself?
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Important Term: Sampling Distribution
Definition: Sampling Distribution is the PDF of the “true”
distribution for an estimator like x̄
Drawing 500, or 5000, or 100,000 samples, and then creating a
histogram of the estimates, approximates the sampling distribution.
This histogram (or observed density) will not be exactly the same as
the sampling distribution, but it might get very close!
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
General Claims about the Sampling Distribution of x̄
This is the first set of facts I need to establish
If E [x] = µ, then E [x̄] = µ
If Var [x] = σ 2 , then Var [x̄] =
Var [x]
N
SD[x]
Which implies SD[x̄] = √
(N)
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
In Other Words...
The distribution of x̄
Is Centered on the same spot as xi
But x̄ is clusterd much more “tightly’ than the distribution of xi
itself.
That’s impossibly easy to see
Algebraically.
By simulation.
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Let’s define terms.
The mean of a sample x1 , x2 , x3 , . . . , xN is:
x̄ =
N
1 X
xi
N
(1)
i
If we have data on the frequency of each possible score xj , calculate
proportions
Prop.(xj ) =
Frequency (x = xj )
N
Mean(xi ) = x̄ =
m
X
(2)
Prop(xj )xj
(3)
j=1
where Prop(xj ) is the proportion of observations that have value xj .
(sums across possible values of xj , rather than summing across all
individuals observed).
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
The Expected Value of x, E [x]
EV=Another term for the “population mean” or “true mean”
Recall, population=the random process that generates xi .
discrete distribution makes it easiest to compare formulae for x̄ and
E [x]
f is a “probability mass function”
Expected Value(x) = E [x] =
X
f (xj )xj
(4)
Similar to sample mean formula, except replace the “observed
proportion” (Prop(xj )) with the “theoretical probability” f (xj ).
Similar for a continuous distribution with pdf f (x)
Z
+∞
E [x] =
f (x) x dx.
(5)
−∞
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
One Little Tricky Bit Needs explaining First
Think of a “variable” as one single observation from a distribution
xi
(6)
We were comfortable discussing a variable x as a collection of
observations.
We said x is normally distributed, usually thinking of a collection
Now think of x1 , x2 and so forth as separate variates from the same
distribution.
Appeal to Intuition. E [x] = E [x1 ] = E [x2 ] = . . . E [xN ]
To me, that was the only really surprising idea in all of this.
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Calculate the Expected Value of x̄
E [x̄] =
E
=
1
N
x1 +x2 +x3 +...+xN N
{E [x1 ] + E [x2 ] + E [x3 ] + . . . + E [xN ]}
=
1
N
{N · E [x]}
= E [x]
Conclusion: The expected value of the mean is the same as the expected
value of one draw from a given distribution.
Implication: x̄ is an unbiased estimator of E [x]
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Variance
Recall Variance in a sample is the average of squared errors (aka
“mean square error”)
1 X
(xi − x̄)2
N
Maybe you divide by N − 1 in order to make this a ’consistent’
estimator. Not a huge issue at this point.
With frequency data:
X
Variance(xi ) =
Prop.(xj )(xj − x̄)2
Variance(xi ) =
(7)
(8)
where Prop(xj ) is the proportion of observations that have value xj .
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Population Variance, same as Theoretical Variance
The “population variance” of the random process that generates xi .
For discrete variable, use the PMF in place of Prop.(x):
X
Theoretical Variance(xi ) =
f (xi )(xi − x̄)2
For a continuous variable f , use the PDF instead of proportions:
Z
Theoretical Variance(xi ) = f (xi )(xi − x̄)2 dxi
CLT I
(9)
(10)
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Recall the Variance of A Sum
The variance of a sum of two variables x1 and x2 can be found:
Var [x1 + x2] = Var [x1] + Var [x2] + 2Cov [x1, x2]
(11)
And
Var [ax1 + bx2] = a2 Var [x1] + b2 Var [x2] + 2abCov [x1, x2]
(12)
Here a and b are constants.
We want a simple result, so we often assume the Cov [x1, x2] = 0 on the
grounds that the observations are “statistically independent.”
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Calculate the Variance of the Mean
What is the variance of the mean itself?
1
1
1
x1 + x2 + . . . + xN ]
N
N
N
Invoking the “statistical independence” principle to eliminate the
Covariance terms, we apply the “Variance of a sum” rule
Var [x̄] = Var [
1
1
1
x1 + x2 + . . . + xN ) =
N
N
N
1
1
1
Var (x1 ) + 2 Var (x2 ) + . . . + 2 Var (xN )
N2
N
N
Var (
CLT I
(13)
(14)
(15)
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
If all the observations were drawn from the same random process–the
same population–then they all have the same variance, which is just
Var (xi ). So the previous instantly reduces to this:
Var (x̄) =
=
CLT I
1 NVar (xi )
N2
1
1
Var (xi )
N
(16)
(17)
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
In words, the variance of the mean of xi is the variance of xi divided by
N, the sample size upon which the mean is calculated.
That must mean the standard deviation of the means is
Standard Deviation(x̄) =
CLT I
Standard Deviation(xi )
√
N
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
The Distribution of the Mean is “Spike-ish”
Please observe the illustration of the effect of sample size on the variance
of x̄.
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
0.08
0.04
0.00
density of x
0.12
Distribution of x ∼ Normal(0, 32 )
−10
−5
0
5
10
x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
0.8
0.4
0.0
density of mean of x
1.2
Distribution of Mean, Sample=100 (Normal(0, 32 /100))
−10
−5
0
5
10
mean of x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
4
3
2
0
1
density of mean of x
5
Distribution of Mean, Sample=2000 (Normal(0, 32 /2000))
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Terms
Asymptotic: related to very large (tending to infinite) sample sizes
Consistency: an estimator (formula’s result) ’tends to’ the correct
value as sample size tends to infinity
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Law of Large Numbers
As the Sample Size Increases, x̄ tends to the Expected Value (The True
Mean)
This is the “law of large numbers”.
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
The Basic Idea of the CLT
For ANY DISTRIBUTION (not just the normal) of x, the
distribution of x̄ approaches a normal distribution as the size of the
sample upon which x̄ is calculated tends to infinity.
This one is difficult to prove algebraically, but it is quite easy to
demonstrate with simulation
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Take, for example, the Poisson Distribution
●
0.15
●
●
0.10
0.05
Probability
0.20
●
●
●
●
0.00
●
●
0
2
4
6
8
●
●
10
a Poisson variate with lambda=3
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Poisson(3), SampleSize=10
0.2
●
●
●
●
0.1
Density
0.3
0.4
Poisson Sample N=10
●
●
●
0.0
●
2
4
6
●
8
●
●
10
x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Poisson(3), SampleSize=100
Poisson Sample N=100
●
●
●
0.10
Density
0.20
●
●
●
●
0.00
●
●
2
4
6
8
●
●
10
x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Poisson(3), SampleSize=2000
Poisson Sample N=2000
●
0.15
●
●
0.10
0.05
Density
0.20
●
●
●
●
0.00
●
●
2
4
6
8
●
●
10
x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Poisson(3), SampleSize=10000
Poisson Sample N=10000
●
0.15
●
●
0.10
0.05
Density
0.20
●
●
●
●
0.00
●
●
2
4
6
8
●
●
10
x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Means of 1000 Poisson Samples, Sample Size 10.
0.4
0.0
0.2
Density
0.6
Means with N=10
2
4
6
8
10
x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Means from 1000 Poissons, Sample Size=100
1.5
1.0
0.0
0.5
Density
2.0
Means with N=100
2
4
6
8
10
x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Means from 1000 Poisson samples, Sample Size=2000
6
4
0
2
Density
8
10
Means with N=2000
2
4
6
8
10
x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Means from 1000 Poisson samples, Sample Size=10000
10
0
5
Density
15
20
Means with N=10000
2
4
6
8
10
x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Same thing, bigger picture (N=10000)
Means with N=10000
10
0
5
Density
15
20
Observed
Theoretical
2.94
2.96
2.98
3.00
3.02
3.04
3.06
x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
1.0
0.6
0.8
Probability
1.2
1.4
Consider the Uniform Distribution
0.0
0.2
0.4
0.6
0.8
1.0
A Uniform Variate
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Means from 1000 Uniform samples, Sample Size=30
4
3
0
1
2
Density
5
6
7
Means with N=30
0.0
0.2
0.4
0.6
0.8
1.0
x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Means from 1000 Uniform samples, Sample Size=500
15
0
5
10
Density
20
25
30
Means with N=500
0.0
0.2
0.4
0.6
0.8
1.0
x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
Means from 1000 Uniform samples, Sample Size=2000
Means with N=2000
60
40
0
20
Density
80
100
Observed
Theoretical
0.46
0.48
0.50
0.52
0.54
x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
1.4
1.2
1.0
Probability
1.6
OK, Challenge Me With Your Beta(0.9,0.9)
0.0
0.2
0.4
0.6
0.8
1.0
A Beta Variate
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
60
Means from 1000 Beta Samples, Sample Size=2000
30
0
10
20
Density
40
50
Observed
Theoretical
0.46
0.48
0.50
0.52
0.54
x
CLT I
University of Kansas
The Difference Between A Sample and The Truth
Sampling Distribution
Asymptotic Properties
The Central Limit Theorem
My Mantra
From whatever distribution you pick, the Central Limit Theorem (CLT)
says the “Sampling Distribution of the Mean is Normal”.
CLT I
University of Kansas