Amherst College
Department of Economics
Economics 360
Fall 2012
Wednesday, September 12 Handout: Interval Estimates and the Central Limit
Theorem
Preview
• Review
o Random Variables
o Relative Frequency Interpretation of Probability
• Populations, Samples, Estimation Procedures, and the Estimate’s Probability Distribution
o Mean and Variance of the Estimate’s Probability Distribution for a Sample Size of T
o Why Is the Mean of the Estimate’s Probability Distribution Important?
o Why Is the Variance of the Estimate’s Probability Distribution Important?
• Interval Estimates
• Central Limit Theorem
• Normal Distribution: A Way to Estimate Probabilities
• Clint’s Dilemma and His Opinion Poll
Review
Random Variables: Before the experiment is conducted
• Bad news. What we do not know: We cannot determine the numerical value of the
random variable with certainty.
• Good news. What we do know: On the other hand, we can often calculate the
random variable’s probability distribution telling us how likely it is for the random
variable to equal each of its possible numerical values.
Relative Frequency Interpretation of Probability: After many, many repetitions of the
experiment, the distribution of the numerical values from the experiments mirrors the
random variable’s probability distribution.
Question: How do we describe a distribution? Center (Mean) and Spread (Variance)
Populations, Samples, Estimation Procedures, and the Estimate’s Probability Distribution
Populations and Samples
Question: How can we use sample information to draw inferences about a population?
Opinion Poll: Sample Size of T
• Write the name of every individual in the population on a card.
• Perform the following procedure T times:
o Thoroughly shuffle the cards.
o Randomly draw one card.
o Ask that individual if he/she supports Clint; the individual’s answer
determines the numerical value of vi:
vi equals 1 if the ith individual polled supports Clint; 0 otherwise.
o Replace the card.
• Calculate the fraction of those polled supporting Clint:
EstFrac = (v1 + v2 + … + vT)/T = (1/T)(v1 + v2 + … + vT)
where T = Sample Size
The estimated fraction, EstFrac, is a random variable; we cannot predict its value before
the poll is conducted.
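The polling procedure can be sketched in Python. This is a minimal illustration rather than the course's own simulation software: the function name `est_frac`, the 1,000-person population, and the random seed are all assumptions introduced here.

```python
import random

def est_frac(population, T, rng):
    """Poll T individuals with replacement; return the fraction supporting Clint.

    population: list of 0/1 values (1 = supports Clint). Hypothetical data.
    """
    total = 0
    for _ in range(T):
        v_i = rng.choice(population)  # draw one card at random, then replace it
        total += v_i
    return total / T

# Hypothetical population of 1,000 in which half support Clint (ActFrac = .50).
population = [1] * 500 + [0] * 500
rng = random.Random(0)
estimate = est_frac(population, T=16, rng=rng)
print(estimate)  # the value varies from poll to poll: EstFrac is a random variable
```

Running the last three lines with a different seed generally gives a different estimate, which is what it means for EstFrac to be a random variable.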
Question: What can we say about the random variable EstFrac?
Answer: We can describe the center and spread of EstFrac’s probability distribution by
calculating its mean and variance.
Question: What do we know about the vi’s?
Answer: Recall our discussion of a sample size of 2. Applying the logic we used about v1 and
v2, we know the following:
• Mean[vi] = p for each i; that is, Mean[v1] = Mean[v2] = … = Mean[vT] = p.
• Var[vi] = p(1 − p) for each i; that is, Var[v1] = Var[v2] = … = Var[vT] = p(1 − p).
• The vi’s are independent; hence, their covariances equal 0.
where p = ActFrac = Actual fraction of the population supporting Clint
Distribution Center: Mean of the Estimate’s Probability Distribution
Mean[EstFrac] = Mean[(1/T)(v1 + v2 + … + vT)]
Mean[cx] = cMean[x]
= ________
Mean[x + y] = Mean[x] + Mean[y]
= ________
Mean[v1] = Mean[v2] = … = Mean[vT] = p
= ________
How many p terms are there? A total of ___.
= ________
Simplifying
= ________
Distribution Spread: Variance of the Estimate’s Probability Distribution
Var[EstFrac] = Var[(1/T)(v1 + v2 + … + vT)]
Var[cx] = c^2 Var[x]
= ________
Var[x + y] = Var[x] + Var[y] when x and y are
independent; hence, the covariances are all 0.
= ________
Var[v1] = Var[v2] = … = Var[vT] = p(1 − p)
= ________
How many p(1 − p) terms are there? A total of ____.
= ________
Simplifying
= ________
To summarize: Mean[EstFrac] = ________
Var[EstFrac] = ________
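For reference, one way to complete the two derivations is sketched below in LaTeX. It uses only the rules listed in the handout (Mean[cx] = cMean[x], Mean[x + y] = Mean[x] + Mean[y], Var[cx] = c^2 Var[x], and additivity of variances for independent variables); the subscripted notation v_i is simply a typeset form of the handout's vi.

```latex
\begin{align*}
\mathrm{Mean}[\mathit{EstFrac}]
  &= \mathrm{Mean}\!\left[\tfrac{1}{T}(v_1 + v_2 + \dots + v_T)\right] \\
  &= \tfrac{1}{T}\,\mathrm{Mean}[v_1 + v_2 + \dots + v_T] \\
  &= \tfrac{1}{T}\left(\mathrm{Mean}[v_1] + \mathrm{Mean}[v_2] + \dots + \mathrm{Mean}[v_T]\right) \\
  &= \tfrac{1}{T}\,(p + p + \dots + p) = \tfrac{1}{T}\,(T p) = p \\[1ex]
\mathrm{Var}[\mathit{EstFrac}]
  &= \mathrm{Var}\!\left[\tfrac{1}{T}(v_1 + v_2 + \dots + v_T)\right] \\
  &= \tfrac{1}{T^2}\,\mathrm{Var}[v_1 + v_2 + \dots + v_T] \\
  &= \tfrac{1}{T^2}\left(\mathrm{Var}[v_1] + \mathrm{Var}[v_2] + \dots + \mathrm{Var}[v_T]\right) \\
  &= \tfrac{1}{T^2}\,\bigl(T\,p(1-p)\bigr) = \frac{p(1-p)}{T}
\end{align*}
```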
Simulations: Confirming the Equations for the Mean and Variance
Mean[EstFrac] = p        Var[EstFrac] = p(1 − p)/T
where p = ActFrac
      T = Sample Size
For purposes of illustration, let the actual population fraction, ActFrac, equal 1/2: p = 1/2
Mean[EstFrac] = 1/2 = .50
Var[EstFrac] = [1/2 × (1 − 1/2)]/T = (1/2 × 1/2)/T = (1/4)/T = 1/(4T)

          Equations:                                   Simulations:
          Mean of        Variance of                   Mean (Average) of   Variance of
          EstFrac’s      EstFrac’s                     Numerical Values    Numerical Values
Sample    Probability    Probability     Simulation    of EstFrac from     of EstFrac from
Size      Distribution   Distribution    Repetitions   the Experiments     the Experiments
1         _______        _______         _______       _______             _______
2         _______        _______         _______       _______             _______
25        _______        _______         _______       _______             _______
100       _______        _______         _______       _______             _______
400       _______        _______         _______       _______             _______
Conclusion: Our equations and simulations produce identical results illustrating the relative
frequency interpretation of probability: After many, many repetitions of the experiment, the
distribution of the actual numerical values mirrors the random variable’s probability distribution.
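The comparison between the equations and the simulations can be reproduced with a short Python sketch. It is not the course's simulation software: the function name `simulate_est_frac`, the 10,000 repetitions, and the seed are assumptions chosen for illustration, and the simulated values will match the equations only approximately.

```python
import random
import statistics

def simulate_est_frac(p, T, repetitions, rng):
    """Return the EstFrac value from each of `repetitions` simulated polls of size T."""
    values = []
    for _ in range(repetitions):
        # Each polled individual supports Clint with probability p.
        supporters = sum(1 for _ in range(T) if rng.random() < p)
        values.append(supporters / T)
    return values

p = 0.5  # ActFrac, as in the illustration above
rng = random.Random(12345)
results = {}
for T in (1, 2, 25, 100, 400):
    values = simulate_est_frac(p, T, repetitions=10_000, rng=rng)
    sim_mean = statistics.fmean(values)      # average of the simulated EstFrac values
    sim_var = statistics.pvariance(values)   # variance of the simulated EstFrac values
    results[T] = (sim_mean, sim_var)
    # Columns: T, Mean from equation, Var from equation, simulated mean, simulated variance
    print(T, p, p * (1 - p) / T, round(sim_mean, 4), round(sim_var, 6))
```

With more repetitions, the simulated columns should move closer to the equation columns, in line with the relative frequency interpretation of probability.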
Question: Why is the Mean of the Estimate’s Probability Distribution Important?
Conceptually, an estimation procedure is unbiased when it does not systematically
underestimate or overestimate the actual population fraction.
Formally, an estimation procedure is unbiased whenever the mean of
the estimate’s probability distribution equals the actual value. Clint’s
estimation procedure is unbiased:
Unbiased Estimation Procedure
↓
Mean[EstFrac] = ActFrac
We can apply the relative frequency interpretation of probability to
gain more insight into what it means for an estimation procedure to be unbiased:
Relative Frequency Interpretation of Probability:
Average of the estimate’s numerical values after many, many repetitions = Mean[EstFrac]
Unbiased Estimation Procedure:
Mean[EstFrac] = ActFrac
↓
Average of the estimate’s numerical values after many, many repetitions = ActFrac
If the probability distribution is symmetric, we have even more intuition; in a single poll,
the chances that the estimated fraction is too low __________ the chances that the
estimated fraction is too high.
[Figure: Probability Distribution of EstFrac; horizontal axis: EstFrac]
Question: Why is the Variance of the Estimate’s Probability Distribution Important when the Estimation
Procedure Is Unbiased?
Claim: When the estimation procedure is unbiased, the reliability of the estimate depends on the
variance of the estimate’s probability distribution.
Quantifying Reliability: The Interval Estimates
Interval Estimate Question: What is the probability that the estimated fraction from a single poll
lies “close to” the actual value?
Small probability
↓
Estimate is _______________.
Large probability
↓
Estimate is _______________.
The “close to” criterion: First, we must decide on our “close to” criterion. For purposes
of illustration, choose .05.
Interval Estimate Question: What is the probability that the estimated fraction from a single poll
lies “close to”, within .05 of, the actual value?
Strategy: To answer the interval estimate question we shall use your opinion poll
simulation and then exploit the relative frequency interpretation of probability.
Question: After many, many repetitions, how frequently is the estimated fraction “close
to”, within .05 of, the actual population fraction?
To keep the arithmetic simple, assume that the election is actually a tossup:
Actual Population Fraction = ActFrac = p = 1/2 = .50
⇒ Mean[EstFrac] = 1/2 = .50
         Variance of                            Simulation: Percent of Repetitions
Sample   EstFrac’s Probability   Simulation     in which the Numerical Value of
Size     Distribution            Repetitions    EstFrac Lies between .45 and .55
25       0.01                    __________     _______
100      0.0025                  __________     _______
400      0.000625                __________     _______
Histograms of EstFrac Numerical Values (one panel for each sample size; horizontal axis marked at .45, .50, and .55)
Percent of values between .45 and .55:
Sample size = 25: ___%
Sample size = 100: ___%
Sample size = 400: ___%
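The simulation column can be approximated with a short Python sketch, assuming the tossup case p = .50. The function name `percent_within`, the 5,000 repetitions, and the seed are illustrative assumptions, not part of the course software; the printed percentages will vary slightly with the seed.

```python
import random

def percent_within(p, T, low, high, repetitions, rng):
    """Percent of simulated polls whose EstFrac lies between low and high (inclusive)."""
    hits = 0
    for _ in range(repetitions):
        supporters = sum(1 for _ in range(T) if rng.random() < p)
        if low <= supporters / T <= high:
            hits += 1
    return 100 * hits / repetitions

rng = random.Random(2012)
percents = {}
for T in (25, 100, 400):
    percents[T] = percent_within(0.5, T, 0.45, 0.55, repetitions=5_000, rng=rng)
    print(T, percents[T])
```

The percentage should rise as the sample size grows, because the variance of EstFrac's probability distribution shrinks.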
Now, reconsider the interval estimate question:
Interval Estimate Question: What is the probability that the numerical value of the estimated
fraction from a single poll (one repetition of the experiment) lies “close to”, within .05 of, the
actual population fraction?
Query: How can we use the simulation to answer the interval estimate question?
Answer: We can apply the relative frequency interpretation of probability.
Relative Frequency Interpretation of Probability: After many, many repetitions of
the experiment, the distribution of the numerical values from the experiments mirrors
the random variable’s probability distribution.
Applying the relative frequency interpretation of probability:
The portion of estimates that lie within .05 of the actual value, between .45 and .55,
after many, many repetitions
___________
The probability that the estimate lies within .05 of the actual value, between .45 and .55,
in a single poll (one repetition)
                                    Probability that the Numerical Value
Sample   Variance of EstFrac’s      of EstFrac Lies between .45 and .55
Size     Probability Distribution   in a Single Poll (One Repetition)
25       0.01                       _______
100      0.0025                     _______
400      0.000625                   _______
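As a cross-check on the simulation approach, the single-poll probability can also be computed exactly. The sketch below swaps simulation for a direct count: since each vi is 1 with probability p and the vi's are independent, the number of supporters in a poll of size T follows a binomial distribution. That distribution is not named in the handout, so treat this as an assumption-labeled alternative; the function name `prob_within` is hypothetical.

```python
from fractions import Fraction
from math import comb

def prob_within(T, p, low, high):
    """Exact probability that EstFrac = k/T lies between low and high (inclusive)."""
    total = Fraction(0)
    for k in range(T + 1):
        if low <= Fraction(k, T) <= high:
            # Binomial probability of exactly k supporters among T polled
            total += comb(T, k) * p**k * (1 - p) ** (T - k)
    return total

p = Fraction(1, 2)                                  # tossup: ActFrac = 1/2
low, high = Fraction(45, 100), Fraction(55, 100)    # within .05 of .50
for T in (25, 100, 400):
    print(T, float(prob_within(T, p, low, high)))
```

Exact fractions are used so that endpoints such as 45/100 are compared without floating-point rounding.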
Now, let us generalize:
Variance Large
↓
________ probability that the numerical
value of the estimate from one
repetition of the experiment will be
“close to” the actual value.
↓
Estimate is _______________
Variance Small
↓
________ probability that the numerical
value of the estimate from one
repetition of the experiment will be
“close to” the actual value.
↓
Estimate is _______________
[Figures: probability distribution of EstFrac centered at ActFrac, drawn once for Variance Large and once for Variance Small; horizontal axis: EstFrac]
Summary: When the estimation procedure is unbiased, the variance of the estimate’s
probability distribution tells us how reliable the estimate is.
Central Limit Theorem Motivation: Role of Standard Deviations
Central Limit Theorem: As the sample size becomes larger and larger, the normal
distribution provides better and better approximations of interval estimates.
Strategy for Explaining the Central Limit Theorem: Four Steps
• Step 1: Use the equations to calculate the mean, variance, and standard deviation of
EstFrac’s probability distribution for three sample sizes: 25, 100, and 400.
• Step 2: Use simulations to calculate the percent of repetitions that fall within 1, 2, and
3 standard deviations of Mean[EstFrac], the mean of EstFrac’s probability distribution.
• Step 3: Observe an interesting similarity.
• Step 4: Introduce the normal distribution and use it to calculate the percent of
repetitions that fall within 1, 2, and 3 standard deviations of Mean[EstFrac].
Step 1: To keep the arithmetic simple, assume that the election is a tossup: ActFrac = p = 1/2 = .50
Sample Size = T = 25:
Mean[EstFrac] = p = ______
Var[EstFrac] = p(1 − p)/T = ______ = ______
SD[EstFrac] = √Var[EstFrac] = √______ = ________
Sample Size = T = 100:
Mean[EstFrac] = p = ______
Var[EstFrac] = p(1 − p)/T = ______ = ______
SD[EstFrac] = √Var[EstFrac] = √______ = ________
Sample Size = T = 400:
Mean[EstFrac] = p = ______
Var[EstFrac] = p(1 − p)/T = ______ = ______
SD[EstFrac] = √Var[EstFrac] = √______ = ________
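The Step 1 arithmetic can be checked with a few lines of Python. This is a minimal sketch using the handout's formulas Mean[EstFrac] = p, Var[EstFrac] = p(1 − p)/T, and SD[EstFrac] = √Var[EstFrac]; the function name `mean_var_sd` is an assumption introduced here.

```python
from math import sqrt

def mean_var_sd(p, T):
    """Mean, variance, and standard deviation of EstFrac's probability distribution."""
    mean = p
    var = p * (1 - p) / T
    sd = sqrt(var)
    return mean, var, sd

p = 0.5  # tossup: ActFrac = .50
for T in (25, 100, 400):
    mean, var, sd = mean_var_sd(p, T)
    print(T, mean, var, sd)
```

Note that quadrupling the sample size cuts the variance to one quarter and the standard deviation to one half.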
Step 2: Simulations
                             Sample Sizes
                             25                100               400
Mean[EstFrac]                ______            ______            ______
SD[EstFrac]                  ______            ______            ______
Interval: 1 SD
  From-To Values             ______ - ______   ______ - ______   ______ - ______
  Percent of Repetitions     _____%            _____%            _____%
Interval: 2 SD’s
  From-To Values             ______ - ______   ______ - ______   ______ - ______
  Percent of Repetitions     _____%            _____%            _____%
Interval: 3 SD’s
  From-To Values             ______ - ______   ______ - ______   ______ - ______
  Percent of Repetitions     _____%            _____%            _____%
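Step 2 can be sketched in Python as follows. The function names, the 4,000 repetitions, and the seed are assumptions for illustration, not the course's simulation software. Because EstFrac takes only the discrete values 0/T, 1/T, …, T/T, the percentages depend on whether values falling exactly on an interval endpoint are counted; this sketch counts them, so its output need not match other simulations exactly.

```python
import random
from math import sqrt

def simulate_est_frac(p, T, repetitions, rng):
    """Return the EstFrac value from each simulated poll of size T."""
    return [sum(1 for _ in range(T) if rng.random() < p) / T
            for _ in range(repetitions)]

def percent_within_sds(values, mean, sd, k):
    """Percent of values within k standard deviations of the mean (endpoints included)."""
    eps = 1e-12  # guards against floating-point rounding at the interval endpoints
    hits = sum(1 for v in values if abs(v - mean) <= k * sd + eps)
    return 100 * hits / len(values)

p = 0.5
rng = random.Random(360)
table = {}
for T in (25, 100, 400):
    sd = sqrt(p * (1 - p) / T)
    values = simulate_est_frac(p, T, repetitions=4_000, rng=rng)
    table[T] = {k: percent_within_sds(values, p, sd, k) for k in (1, 2, 3)}
    print(T, table[T])
```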
Step 3: Observe an interesting similarity. Question: What do the results suggest?
Step 4: The Normal Distribution – The Famous Bell-Shaped Curve
z is the “normalized” value of the random variable; z equals the number of standard
deviations the value lies from the distribution mean:
z = (Value of Random Variable – Distribution Mean) / (Standard Deviation of Random Variable)

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.5000  0.4960  0.4920  0.4880  0.4840  0.4801  0.4761  0.4721  0.4681  0.4641
0.1   0.4602  0.4562  0.4522  0.4483  0.4443  0.4404  0.4364  0.4325  0.4286  0.4247
0.2   0.4207  0.4168  0.4129  0.4090  0.4052  0.4013  0.3974  0.3936  0.3897  0.3859
0.3   0.3821  0.3783  0.3745  0.3707  0.3669  0.3632  0.3594  0.3557  0.3520  0.3483
0.4   0.3446  0.3409  0.3372  0.3336  0.3300  0.3264  0.3228  0.3192  0.3156  0.3121
0.5   0.3085  0.3050  0.3015  0.2981  0.2946  0.2912  0.2877  0.2843  0.2810  0.2776
0.6   0.2743  0.2709  0.2676  0.2643  0.2611  0.2578  0.2546  0.2514  0.2483  0.2451
0.7   0.2420  0.2389  0.2358  0.2327  0.2296  0.2266  0.2236  0.2206  0.2177  0.2148
0.8   0.2119  0.2090  0.2061  0.2033  0.2005  0.1977  0.1949  0.1922  0.1894  0.1867
0.9   0.1841  0.1814  0.1788  0.1762  0.1736  0.1711  0.1685  0.1660  0.1635  0.1611
1.0   0.1587  0.1562  0.1539  0.1515  0.1492  0.1469  0.1446  0.1423  0.1401  0.1379
1.1   0.1357  0.1335  0.1314  0.1292  0.1271  0.1251  0.1230  0.1210  0.1190  0.1170
1.2   0.1151  0.1131  0.1112  0.1093  0.1075  0.1056  0.1038  0.1020  0.1003  0.0985
1.3   0.0968  0.0951  0.0934  0.0918  0.0901  0.0885  0.0869  0.0853  0.0838  0.0823
1.4   0.0808  0.0793  0.0778  0.0764  0.0749  0.0735  0.0721  0.0708  0.0694  0.0681
1.5   0.0668  0.0655  0.0643  0.0630  0.0618  0.0606  0.0594  0.0582  0.0571  0.0559
1.6   0.0548  0.0537  0.0526  0.0516  0.0505  0.0495  0.0485  0.0475  0.0465  0.0455
1.7   0.0446  0.0436  0.0427  0.0418  0.0409  0.0401  0.0392  0.0384  0.0375  0.0367
1.8   0.0359  0.0351  0.0344  0.0336  0.0329  0.0322  0.0314  0.0307  0.0301  0.0294
1.9   0.0287  0.0281  0.0274  0.0268  0.0262  0.0256  0.0250  0.0244  0.0239  0.0233
2.0   0.0228  0.0222  0.0217  0.0212  0.0207  0.0202  0.0197  0.0192  0.0188  0.0183
2.1   0.0179  0.0174  0.0170  0.0166  0.0162  0.0158  0.0154  0.0150  0.0146  0.0143
2.2   0.0139  0.0136  0.0132  0.0129  0.0125  0.0122  0.0119  0.0116  0.0113  0.0110
2.3   0.0107  0.0104  0.0102  0.0099  0.0096  0.0094  0.0091  0.0089  0.0087  0.0084
2.4   0.0082  0.0080  0.0078  0.0075  0.0073  0.0071  0.0069  0.0068  0.0066  0.0064
2.5   0.0062  0.0060  0.0059  0.0057  0.0055  0.0054  0.0052  0.0051  0.0049  0.0048
2.6   0.0047  0.0045  0.0044  0.0043  0.0041  0.0040  0.0039  0.0038  0.0037  0.0036
2.7   0.0035  0.0034  0.0033  0.0032  0.0031  0.0030  0.0029  0.0028  0.0027  0.0026
2.8   0.0026  0.0025  0.0024  0.0023  0.0023  0.0022  0.0021  0.0021  0.0020  0.0019

Using the table:
• The row specifies the z value’s whole number and its tenths.
• The column specifies the z value’s hundredths.
The number in the table estimates the probability that the random variable lies more than z
standard deviations above the mean.
Normal Distribution: Three Important Properties
• The normal distribution is bell shaped.
• The normal distribution is symmetric around its mean (center).
• The area beneath the normal distribution equals 1.
[Figure: Probability Distribution; the right-hand tail, beginning z SD’s above the Distribution Mean, is the probability of being more than z standard deviations above the distribution mean]
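The table entries can be reproduced in Python. The sketch below uses the standard library's complementary error function, math.erfc, to compute the probability that a normally distributed random variable lies more than z standard deviations above its mean; the function name `upper_tail` is an assumption introduced for illustration.

```python
from math import erfc, sqrt

def upper_tail(z):
    """Probability of lying more than z standard deviations above the mean (normal distribution)."""
    return 0.5 * erfc(z / sqrt(2))

# Row 1.0, column 0.00 of the table
print(round(upper_tail(1.00), 4))
# Row 1.9, column 0.06 of the table
print(round(upper_tail(1.96), 4))
```

For example, z = 1.96 is looked up in row 1.9 and column 0.06, matching the row and column rule described under "Using the table."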
Central Limit Theorem
Central Limit Theorem: As the sample size becomes larger and larger, the normal
distribution provides better and better approximations of interval estimates.
To justify using the normal distribution to calculate the probabilities, reconsider our
simulations in which we calculated the percent of repetitions that fall within 1, 2, and 3
standard deviations of the mean after many, many repetitions. Now, use the normal
distribution to calculate these percentages.
Interval:                   Simulation: Percent of Repetitions
Standard Deviations from    Within Interval                      Normal
Random Variable’s Mean      Sample Size                          Distribution
                            25         100        400            Percentages
1                           ≈69.2%     ≈68.5%     ≈68.3%         _____%
2                           ≈96.3%     ≈95.6%     ≈95.5%         _____%
3                           ≈99.9%     ≈99.8%     ≈99.7%         _____%
To use the normal distribution to estimate the probability of being within one, two, and three
standard deviations of the mean, begin by reviewing two of the normal distribution’s properties:
• The normal distribution is symmetric around its mean (center).
• The area beneath the normal distribution equals 1.
z      0.00     0.01
0.9    0.1841   0.1814
1.0    0.1587   0.1562
1.1    0.1357   0.1335
Probability within 1 SD = _______________ = _____

z      0.00     0.01
1.9    0.0287   0.0281
2.0    0.0228   0.0222
2.1    0.0179   0.0174
Probability within 2 SD’s = _______________ = _____

z      0.00     0.01
2.9    0.0019   0.0018
3.0    0.0013   0.0013
Probability within 3 SD’s = _______________ = _____

[Figures: two probability distributions, one marking 1 SD on either side of the Distribution Mean and one marking 2 SD’s on either side, each with blanks (______) for the areas]

Normal Distribution Rules of Thumb
Standard Deviations from     Probability of
the Distribution Mean        being within
1                            _______
2                            _______
3                            _______
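One way to fill in these probabilities follows from the two properties just reviewed: because the area beneath the normal distribution equals 1 and the distribution is symmetric, the probability of being within z standard deviations of the mean is 1 minus twice the tail probability for z. The Python sketch below applies that reasoning to the tail values from the table excerpts (0.1587, 0.0228, 0.0013); the names `tail` and `prob_within` are assumptions introduced here.

```python
# Tail probabilities for z = 1.00, 2.00, and 3.00, taken from the table excerpts
tail = {1: 0.1587, 2: 0.0228, 3: 0.0013}

def prob_within(k):
    """Probability of being within k SD's of the mean: 1 minus the two equal tails."""
    return 1 - 2 * tail[k]

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

These values can be compared with the simulation percentages for sample sizes 25, 100, and 400 in the table above.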
Revisiting Clint’s Dilemma
On the eve of the election, Clint must decide whether or not to finance a pre-election party:
• If he is comfortably ahead, he will not hold the party; he will save his campaign
funds for a future political endeavor (or a spring vacation trip to Cancun).
• If he is not comfortably ahead, he will hold the party to try to capture more votes.
There is not enough time to canvass everyone, however. What should he do?
Econometrician’s Philosophy: If you lack the information to determine the value directly, do
the best you can by estimating the value using the information you do have.
Clint’s Estimation Procedure: Use the fraction of those polled, EstFrac, to estimate the actual
population fraction.
• Questionnaire: Are you voting for Clint?
• Procedure: Clint selects 16 students at random and poses the question.
• Results: 12 students report that they will vote for Clint and 4 against Clint.
Fraction of those polled supporting Clint: EstFrac = 12/16 = 3/4 = .75
From the poll, we estimate that seventy-five percent, .75, of the population supports Clint.
The poll suggests that Clint leads.
Question: Should Clint be confident that he has the election in hand or should he fund the
party?