Variation in Sample Estimates Lecture Handout

Variation in Estimates of Sample Properties

Samples So Far
Here's what we've talked about with respect to samples of a population:
- Estimating the mean value of a sample property
- Sample design to get a correct, unbiased estimate of that mean
- Variance, and its square root the standard deviation, to describe dispersion
- 67% of the values within a population fall within 1 SD of the mean
- 95% of the values within a population fall within 2 SD of the mean
Sample Properties: Variance
How variable was that population?

s² = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² / (n − 1)

- Sums of squares over n − 1
- n − 1 corrects for both sample size and sample bias
- Written σ² if describing the population
- Units are in the square of the measurement...
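The formula above can be checked by hand. This is a minimal sketch in Python (the lecture's own code is in R; the sample values here are made up for illustration), confirming that dividing the sum of squares by n − 1 matches the library's sample variance:

```python
import statistics

y = [4.0, 6.0, 5.0, 9.0, 6.0]            # hypothetical sample values
ybar = sum(y) / len(y)                    # sample mean
ss = sum((yi - ybar) ** 2 for yi in y)    # sum of squares
s2 = ss / (len(y) - 1)                    # divide by n - 1, not n
print(s2)                                 # 3.5
print(statistics.variance(y))             # 3.5 -- library also uses n - 1
```

In R, `var(y)` uses the same n − 1 denominator.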
Sample Properties: Standard Deviation

s = √(s²)

- Units are the same as the measurement
- If the distribution is normal, 67% of the data fall within 1 SD
- 95% fall within 2 SD
- Written σ if describing the population
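The 1 SD / 2 SD coverage rule can be verified by simulation. A Python sketch (the lecture uses R; here a large normal sample stands in for a population). Note the 67% quoted in the slides is the rough "two-thirds" rule; the exact normal figure is about 68%:

```python
import random, statistics

random.seed(3)
x = [random.gauss(0, 1) for _ in range(100_000)]  # large normal sample
m, s = statistics.mean(x), statistics.stdev(x)
within1 = sum(m - s <= xi <= m + s for xi in x) / len(x)
within2 = sum(m - 2 * s <= xi <= m + 2 * s for xi in x) / len(x)
print(round(within1, 2), round(within2, 2))  # close to 0.68 and 0.95
```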
Populations, Samples, and Repeatability
How good is our estimate of a population parameter?
[Figure: plot of sample estimates; both axes run 0 to 400]
Populations, Samples, and Repeatability
We've seen that we get variation in point estimates at any sample size.
What does that variation look like?

Exercise: Variation in Estimation
- Consider a population with some distribution (rnorm, runif, rgamma)
- Think of the mean of one sample as one individual replicate
- Take many (50) 'replicate' means from this population of means
- What does the distribution of means look like? Use the hist function
- How does it depend on sample size (within replicates) or distribution type?

Extra: Show the change in distributions with sample size in one figure.
Central Limit Theorem Simulation

set.seed(607)
n <- 3
mvec <- rep(NA, times = 100)
# simulate sampling events!
for(i in 1:length(mvec)){
  mvec[i] <- mean(runif(n, 0, 100))
}
hist(mvec, main = "n=3")
Central Limit Theorem
The distribution of means converges on normality.
[Figure: histograms of 'replicate' means for n = 3, 9, 15, and 27]

Central Limit Theorem: the distribution of means of a sufficiently large sample size will be approximately normal.
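The same simulation can show the spread of the replicate means shrinking as sample size grows. A Python sketch of the R simulation above (Uniform(0, 100) population, 100 replicate means per sample size; the seed and sample sizes mirror the slides):

```python
import random, statistics

random.seed(607)
sds = {}
for n in (3, 9, 15, 27):
    # 100 'replicate' means of samples of size n from Uniform(0, 100)
    means = [statistics.mean([random.uniform(0, 100) for _ in range(n)])
             for _ in range(100)]
    sds[n] = statistics.stdev(means)
    print(n, round(sds[n], 1))  # spread of the means shrinks as n grows
```

The printed SDs fall roughly as 1/√n, foreshadowing the standard error formula below.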
Estimating Variation Around a Mean
Great: if we can draw many replicate means from a larger population, we can compute the standard deviation of that estimate!
This standard deviation of the estimate of the mean is the Standard Error.
Variation in a Population v. Variation in Estimates of a Parameter
[Figure, two panels: left, density of the population (Mean = 10, SD = 15); right, histogram of estimates of the mean from samples of n = 5]
Variation in a Population v. Variation in Estimates of a Parameter: 66.7% of your Estimates
[Figure, two panels as before: population (Mean = 10, SD = 15) and estimates of the mean from n = 5; SD of Estimate = 6.951]
Variation in a Population v. Variation in Estimates of a Parameter: 95% of your Estimates
[Figure, two panels as before: population (Mean = 10, SD = 15) and estimates of the mean from n = 5; 2*SD of Estimate = 13.902]
That’s great, but for a single study, we only have one sample...
A Bootstrap Simulation Approach to Standard Error
- Our sample is representative of the entire population
- Therefore, we can resample it with replacement for one simulated sample
- We use our sample size as the new sample size as well
- We set the replace=TRUE argument in the sample function

Try sampling from the bird count data with replacement.
A Bootstrap Simulation Approach to Standard Error

sample(bird$Count, replace=TRUE, size=nrow(bird))
# [1]   2   7  77 148  23 ...
# (output truncated; a resampled vector of counts)

sample(bird$Count, replace=TRUE, size=nrow(bird))
# [1]   1 ...
# (output truncated; each call returns a different resample)
A Bootstrap Simulation Approach to Standard Error

n.sims <- 100
birdMean <- rep(NA, n.sims)
for(i in 1:n.sims){
  birdMean[i] <- mean(sample(bird$Count, replace=TRUE, size=nrow(bird)))
}
sd(birdMean)
# [1] 17.8
But what if we don't have simulation?

SE_Ȳ = s / √n

- Ȳ: sample mean
- s: sample standard deviation
- n: sample size
Bootstrap v. Formula for Standard Error
sd(birdMean)
# [1] 17.8
sd(bird$Count)/sqrt(nrow(bird))
# [1] 18.6
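The bootstrap-versus-formula comparison can be reproduced end to end. A Python sketch (the lecture's bird count data isn't included here, so the counts below are simulated stand-ins; `random.choices` plays the role of R's `sample(..., replace=TRUE)`):

```python
import random, statistics

random.seed(42)
counts = [random.randint(1, 300) for _ in range(40)]  # stand-in for bird$Count

boot_means = []
for _ in range(1000):
    resample = random.choices(counts, k=len(counts))  # sample with replacement
    boot_means.append(statistics.mean(resample))

boot_se = statistics.stdev(boot_means)                # bootstrap SE
formula_se = statistics.stdev(counts) / len(counts) ** 0.5  # s / sqrt(n)
print(round(boot_se, 1), round(formula_se, 1))        # the two should be close
```

As with the bird data (17.8 v. 18.6), the two estimates agree closely but not exactly.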
95% Confidence Interval and SE
- Recall that 95% of the data in a sample is within 2 SD of its mean
- So, 95% of the times we sample a population, the true mean will lie within 2 SE of our estimated mean
- This is the 95% Confidence Interval

Ȳ − 2SE ≤ µ ≤ Ȳ + 2SE
How Often does the 95% Confidence Interval Contain the True Mean?
[Figure: 95% confidence intervals from repeated samples; True Mean = 0 from a Normal Distribution with SD = 1; x-axis (value) runs −1.0 to 1.0]
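The coverage claim can be checked by simulation: draw many samples, build Ȳ ± 2SE for each, and count how often the interval contains the true mean. A Python sketch (true mean 0, SD 1, as in the figure; n = 10 is an arbitrary choice here, and with small n the 2SE rule covers slightly less than 95%):

```python
import random, statistics

random.seed(1)
n, reps, hits = 10, 1000, 0
for _ in range(reps):
    draw = [random.gauss(0, 1) for _ in range(n)]
    m = statistics.mean(draw)
    se = statistics.stdev(draw) / n ** 0.5
    if m - 2 * se <= 0 <= m + 2 * se:  # does the CI contain the true mean (0)?
        hits += 1
print(hits / reps)  # close to 0.95, a bit less for small n
```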
Variation in Other Estimates
- Many SEs and CIs of estimates have formulae and well-understood properties
- For those that do not, we can bootstrap the SE of any estimate, e.g., the median
- Bootstrapped estimates (mean of simulated replicates) can be used to assess bias
- Bootstrapping is not a panacea: it requires a good sample size to start
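The median case from the last slide can be sketched directly: resample, take the median each time, and use the SD of the bootstrapped medians as the SE, with the mean of the bootstrapped medians giving a bias check. A Python sketch with a made-up skewed sample (exponential, mean 50):

```python
import random, statistics

random.seed(7)
data = [random.expovariate(1 / 50) for _ in range(60)]  # skewed hypothetical sample

boot_medians = [
    statistics.median(random.choices(data, k=len(data)))  # resample w/ replacement
    for _ in range(1000)
]

se_median = statistics.stdev(boot_medians)                # bootstrap SE of the median
bias = statistics.mean(boot_medians) - statistics.median(data)  # bias estimate
print(round(se_median, 2), round(bias, 2))
```

No closed-form SE was needed; the same recipe works for any statistic you can compute on a resample.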