Amherst College
Department of Economics
Economics 360
Fall 2012
Wednesday, September 12

Handout: Interval Estimates and the Central Limit Theorem

Preview
• Review
  o Random Variables
  o Relative Frequency Interpretation of Probability
• Populations, Samples, Estimation Procedures, and the Estimate’s Probability Distribution
  o Mean and Variance of the Estimate’s Probability Distribution for a Sample Size of T
  o Why Is the Mean of the Estimate’s Probability Distribution Important?
  o Why Is the Variance of the Estimate’s Probability Distribution Important?
• Interval Estimates
• Central Limit Theorem
• Normal Distribution: A Way to Estimate Probabilities
• Clint’s Dilemma and His Opinion Poll

Review

Random Variables: Before the experiment is conducted
• Bad news. What we do not know: We cannot determine the numerical value of the random variable with certainty.
• Good news. What we do know: On the other hand, we can often calculate the random variable’s probability distribution telling us how likely it is for the random variable to equal each of its possible numerical values.

Relative Frequency Interpretation of Probability: After many, many repetitions of the experiment the distribution of the numerical values from the experiments mirrors the random variable’s probability distribution.

Question: How do we describe a distribution?
Center (Mean) and Spread (Variance)

Populations, Samples, Estimation Procedures, and the Estimate’s Probability Distribution

Populations and Samples
Question: How can we use sample information to draw inferences about a population?

Opinion Poll: Sample Size of T
Write the names of every individual in the population on a card.
• Perform the following procedure T times:
  o Thoroughly shuffle the cards.
  o Randomly draw one card.
  o Ask that individual if he/she supports Clint; the individual’s answer determines the numerical value of vi: vi equals 1 if the ith individual polled supports Clint; 0 otherwise.
  o Replace the card.
• Calculate the fraction of those polled supporting Clint:
    EstFrac = (v1 + v2 + … + vT)/T = (1/T)(v1 + v2 + … + vT)
    where T = Sample Size

The estimated fraction, EstFrac, is a random variable; we cannot predict its value before the poll is conducted.

Question: What can we say about the random variable EstFrac?
Answer: We can describe the center and spread of EstFrac’s probability distribution by calculating its mean and variance.

Question: What do we know about the vi’s?
Answer: Recall our discussion of a sample size of 2. Applying the logic we used about v1 and v2, we know the following:
• Mean[vi] = p for each i; that is, Mean[v1] = Mean[v2] = … = Mean[vT] = p.
• Var[vi] = p(1 − p) for each i; that is, Var[v1] = Var[v2] = … = Var[vT] = p(1 − p).
• The vi’s are independent; hence, their covariances equal 0.
where p = ActFrac = Actual fraction of the population supporting Clint

Distribution Center: Mean of the Estimate’s Probability Distribution
Mean[EstFrac] = Mean[(1/T)(v1 + v2 + … + vT)]
        Mean[cx] = cMean[x]
    = ________________
        Mean[x + y] = Mean[x] + Mean[y]
    = ________________
        Mean[v1] = Mean[v2] = … = Mean[vT] = p
    = ________________
        How many p terms are there? A total of ___.
    = ________________
        Simplifying
    = ________________

Distribution Spread: Variance of the Estimate’s Probability Distribution
Var[EstFrac] = Var[(1/T)(v1 + v2 + … + vT)]
        Var[cx] = c²Var[x]
    = ________________
        Var[x + y] = Var[x] + Var[y] when x and y are independent; hence, the covariances are all 0.
    = ________________
        Var[v1] = Var[v2] = … = Var[vT] = p(1 − p)
    = ________________
        How many p(1 − p) terms are there? A total of ____.
    = ________________
        Simplifying
    = ________________

To summarize:
    Mean[EstFrac] = ________        Var[EstFrac] = ________

Simulations: Confirming the Equations for the Mean and Variance
    Mean[EstFrac] = p        Var[EstFrac] = p(1 − p)/T
    where p = ActFrac
          T = Sample Size

For purposes of illustration, let the actual population fraction, ActFrac, equal 1/2: p = 1/2
    Mean[EstFrac] = 1/2 = .50
    Var[EstFrac] = (1/2 × (1 − 1/2))/T = (1/2 × 1/2)/T = (1/4)/T = 1/(4T)

          Equations:       Equations:                     Simulations:        Simulations:
          Mean of          Variance of                    Mean (Average) of   Variance of
          EstFrac’s        EstFrac’s                      Numerical Values    Numerical Values
Sample    Probability      Probability      Simulation    of EstFrac from     of EstFrac from
Size      Distribution     Distribution     Repetitions   the Experiments     the Experiments
1         _______          _______          _______       _______             _______
2         _______          _______          _______       _______             _______
25        _______          _______          _______       _______             _______
100       _______          _______          _______       _______             _______
400       _______          _______          _______       _______             _______

Conclusion: Our equations and simulations produce identical results, illustrating the relative frequency interpretation of probability: After many, many repetitions of the experiment, the distribution of the actual numerical values mirrors the random variable’s probability distribution.

Question: Why Is the Mean of the Estimate’s Probability Distribution Important?

Conceptually, an estimation procedure is unbiased when it does not systematically underestimate or overestimate the actual population fraction.

[Figure: Probability Distribution of EstFrac]

Formally, an estimation procedure is unbiased whenever the mean of the estimate’s probability distribution equals the actual value.
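The comparison between the equations and the simulations can be sketched in a few lines of Python. This is only an illustration, not the course's simulation software: the function names `simulate_est_frac` and `summarize`, the default number of repetitions, and the seed are all assumptions made for this sketch. Each repetition draws T independent 0/1 values with probability p of a 1, which mimics drawing a card with replacement.

```python
import random
import statistics


def simulate_est_frac(p, sample_size, repetitions, seed=0):
    """Return one EstFrac value per repetition of the opinion poll."""
    rng = random.Random(seed)
    est_fracs = []
    for _ in range(repetitions):
        # Each v_i equals 1 with probability p (supports Clint), 0 otherwise.
        # Replacing the card after each draw makes the v_i independent.
        supporters = sum(1 for _ in range(sample_size) if rng.random() < p)
        est_fracs.append(supporters / sample_size)
    return est_fracs


def summarize(p, sample_size, repetitions=5000, seed=0):
    """Compare the equations with the simulated mean and variance."""
    values = simulate_est_frac(p, sample_size, repetitions, seed)
    return {
        "equation_mean": p,                          # Mean[EstFrac] = p
        "equation_var": p * (1 - p) / sample_size,   # Var[EstFrac] = p(1 - p)/T
        "simulated_mean": statistics.fmean(values),
        "simulated_var": statistics.pvariance(values),
    }


if __name__ == "__main__":
    for T in (1, 2, 25, 100, 400):
        print(T, summarize(0.5, T))
```

With a finite number of repetitions the simulated values should come close to, but not exactly equal, the values from the equations; the gap tends to shrink as the number of repetitions grows.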
Clint’s estimation procedure is unbiased:
    Unbiased Estimation Procedure
        ↓
    Mean[EstFrac] = ActFrac

We can apply the relative frequency interpretation of probability to gain more insight into what it means for an estimation procedure to be unbiased:
    Relative Frequency Interpretation of Probability
        ↓
    Average of the estimate’s numerical values after many, many repetitions = Mean[EstFrac]

    Unbiased Estimation Procedure
        ↓
    Mean[EstFrac] = ActFrac

    Combining the two:
    Average of the estimate’s numerical values after many, many repetitions = ActFrac

If the probability distribution is symmetric, we have even more intuition; in a single poll,
    the chances that the estimated fraction is too low __________ the chances that the estimated fraction is too high.

[Figure: symmetric probability distribution; horizontal axis: EstFrac]

Question: Why Is the Variance of the Estimate’s Probability Distribution Important when the Estimation Procedure Is Unbiased?

Claim: When the estimation procedure is unbiased, the reliability of the estimate depends on the variance of the estimate’s probability distribution.

Quantifying Reliability: The Interval Estimates

Interval Estimate Question: What is the probability that the estimated fraction from a single poll lies “close to” the actual value?
    Small probability
        ↓
    Estimate is _______________.

    Large probability
        ↓
    Estimate is _______________.

The “close to” criterion: First, we must decide on our “close to” criterion. For purposes of illustration, choose .05.

Interval Estimate Question: What is the probability that the estimated fraction from a single poll lies “close to,” that is, within .05 of, the actual value?

Strategy: To answer the interval estimate question, we shall use your opinion poll simulation and then exploit the relative frequency interpretation of probability.

Question: After many, many repetitions, how frequently is the estimated fraction “close to,” that is, within .05 of, the actual population fraction?
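One way to count how frequently the estimate is "close to" the actual value is sketched below. It is a rough illustration rather than the opinion poll simulation used in class: it assumes a tossup (ActFrac = .50), treats the endpoints .45 and .55 as inside the interval, and the function name `percent_within` and its parameters are hypothetical. The bounds are passed as whole percents so that the comparison is done in integer arithmetic and avoids floating-point edge cases at the endpoints.

```python
import random


def percent_within(sample_size, repetitions=5000, p=0.5,
                   lower_pct=45, upper_pct=55, seed=0):
    """Percent of simulated polls whose EstFrac lies between the two bounds."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        # One poll: count the supporters among sample_size independent draws.
        supporters = sum(1 for _ in range(sample_size) if rng.random() < p)
        # Same test as lower_pct/100 <= supporters/sample_size <= upper_pct/100,
        # written with integers. Including the endpoints is an assumption.
        if lower_pct * sample_size <= 100 * supporters <= upper_pct * sample_size:
            hits += 1
    return 100 * hits / repetitions


if __name__ == "__main__":
    for T in (25, 100, 400):
        print(f"T = {T}: {percent_within(T):.1f}% of repetitions between .45 and .55")
```

Because EstFrac can take only the values 0/T, 1/T, …, T/T, the percentage depends somewhat on whether the endpoints are counted, so the output need not match any particular classroom simulation exactly.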
To keep the arithmetic simple, assume that the election is actually a tossup:
    Actual Population Fraction = ActFrac = p = 1/2 = .50  ⇒  Mean[EstFrac] = 1/2 = .50

                                                     Simulation: Percent of Repetitions in
Sample    Variance of EstFrac’s       Simulation     which the Numerical Value of EstFrac
Size      Probability Distribution    Repetitions    Lies between .45 and .55
25        0.01                        __________     _______
100       0.0025                      __________     _______
400       0.000625                    __________     _______

[Figure: Histograms of EstFrac Numerical Values for Sample size = 25, Sample size = 100, and Sample size = 400; each horizontal axis is marked at .45, .50, and .55, with a blank (___%) for the percent of values between .45 and .55]

Now, reconsider the interval estimate question:

Interval Estimate Question: What is the probability that the numerical value of the estimated fraction from a single poll (one repetition of the experiment) lies “close to,” that is, within .05 of, the actual population fraction?

Query: How can we use the simulation to answer the interval estimate question?
Answer: We can apply the relative frequency interpretation of probability.

Relative Frequency Interpretation of Probability: After many, many repetitions of the experiment, the distribution of the numerical values from the experiments mirrors the random variable’s probability distribution.

Applying the relative frequency interpretation of probability:
    The portion of estimates that lie within .05 of the actual value, between .45 and .55, after many, many repetitions
        ___________
    The probability that the estimate lies within .05 of the actual value, between .45 and .55, in a single poll (one repetition)

                                      Probability that the Numerical Value of
Sample    Variance of EstFrac’s       EstFrac Lies between .45 and .55 in a
Size      Probability Distribution    Single Poll (One Repetition)
25        0.01                        _______
100       0.0025                      _______
400       0.000625                    _______

Now, let us generalize:

Variance Large
    ↓
________ probability that the numerical value of the estimate from one repetition of the experiment will be “close to” the actual value.
    ↓
Estimate is _______________

Variance Small
    ↓
________ probability that the numerical value of the estimate from one repetition of the experiment will be “close to” the actual value.
    ↓
Estimate is _______________

[Figure: two panels, Variance Large and Variance Small; each horizontal axis is EstFrac with ActFrac marked]

Summary: When the estimation procedure is unbiased, the variance of the estimate’s probability distribution tells us how reliable the estimate is.

Central Limit Theorem Motivation: Role of Standard Deviations

Central Limit Theorem: As the sample size becomes larger and larger, the normal distribution provides better and better approximations of interval estimates.

Strategy for Explaining the Central Limit Theorem: Four Steps
• Step 1: Use the equations to calculate the mean, variance, and standard deviation of EstFrac’s probability distribution for three sample sizes: 25, 100, and 400.
• Step 2: Use simulations to calculate the percent of repetitions that fall within 1, 2, and 3 standard deviations of Mean[EstFrac], the mean of EstFrac’s probability distribution.
• Step 3: Observe an interesting similarity.
• Step 4: Introduce the normal distribution and use it to calculate the percent of repetitions that fall within 1, 2, and 3 standard deviations of Mean[EstFrac].
Step 1: To keep the arithmetic simple, assume that the election is a tossup: ActFrac = p = 1/2 = .50

Sample Size = T = 25:
    Mean[EstFrac] = p = ________
    Var[EstFrac] = p(1 − p)/T = ________ = ________
    SD[EstFrac] = √Var[EstFrac] = √________ = ________

Sample Size = T = 100:
    Mean[EstFrac] = p = ________
    Var[EstFrac] = p(1 − p)/T = ________ = ________
    SD[EstFrac] = √Var[EstFrac] = √________ = ________

Sample Size = T = 400:
    Mean[EstFrac] = p = ________
    Var[EstFrac] = p(1 − p)/T = ________ = ________
    SD[EstFrac] = √Var[EstFrac] = √________ = ________

                                           Sample Sizes
                              25                100               400
Mean[EstFrac]                 ______            ______            ______
SD[EstFrac]                   ______            ______            ______

Step 2: Simulations
Interval: 1 SD
    From-To Values            ______ - ______   ______ - ______   ______ - ______
    Percent of Repetitions    _____%            _____%            _____%
Interval: 2 SD’s
    From-To Values            ______ - ______   ______ - ______   ______ - ______
    Percent of Repetitions    _____%            _____%            _____%
Interval: 3 SD’s
    From-To Values            ______ - ______   ______ - ______   ______ - ______
    Percent of Repetitions    _____%            _____%            _____%

Step 3: Observe an interesting similarity.
Question: What do the results suggest?
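Step 1 and the From-To values of Step 2 are plain arithmetic with the equations Mean[EstFrac] = p and Var[EstFrac] = p(1 − p)/T, so they can be checked with a short sketch. The function name `est_frac_summary` is an assumption made for this illustration, and the tossup value p = .50 follows the handout's simplifying assumption.

```python
import math


def est_frac_summary(p, sample_size):
    """Mean, variance, and SD of EstFrac from the handout's equations."""
    mean = p
    variance = p * (1 - p) / sample_size
    sd = math.sqrt(variance)
    # From-To values for the 1, 2, and 3 SD intervals around the mean.
    intervals = {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}
    return mean, variance, sd, intervals


if __name__ == "__main__":
    for T in (25, 100, 400):
        mean, variance, sd, intervals = est_frac_summary(0.5, T)
        print(f"T = {T}: Mean = {mean:.2f}  Var = {variance:.6f}  SD = {sd:.3f}")
        for k, (low, high) in intervals.items():
            print(f"    {k} SD: {low:.3f} - {high:.3f}")
```

The sketch covers only the equation side of the table; the Percent of Repetitions rows still have to come from a simulation.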
Step 4: The Normal Distribution – The Famous Bell-Shaped Curve

z is the “normalized” value of the random variable; z equals the number of standard deviations the value lies from the distribution mean:

    z = (Value of Random Variable − Distribution Mean) / (Standard Deviation of Random Variable)

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.5000  0.4960  0.4920  0.4880  0.4840  0.4801  0.4761  0.4721  0.4681  0.4641
0.1   0.4602  0.4562  0.4522  0.4483  0.4443  0.4404  0.4364  0.4325  0.4286  0.4247
0.2   0.4207  0.4168  0.4129  0.4090  0.4052  0.4013  0.3974  0.3936  0.3897  0.3859
0.3   0.3821  0.3783  0.3745  0.3707  0.3669  0.3632  0.3594  0.3557  0.3520  0.3483
0.4   0.3446  0.3409  0.3372  0.3336  0.3300  0.3264  0.3228  0.3192  0.3156  0.3121
0.5   0.3085  0.3050  0.3015  0.2981  0.2946  0.2912  0.2877  0.2843  0.2810  0.2776
0.6   0.2743  0.2709  0.2676  0.2643  0.2611  0.2578  0.2546  0.2514  0.2483  0.2451
0.7   0.2420  0.2389  0.2358  0.2327  0.2296  0.2266  0.2236  0.2206  0.2177  0.2148
0.8   0.2119  0.2090  0.2061  0.2033  0.2005  0.1977  0.1949  0.1922  0.1894  0.1867
0.9   0.1841  0.1814  0.1788  0.1762  0.1736  0.1711  0.1685  0.1660  0.1635  0.1611
1.0   0.1587  0.1562  0.1539  0.1515  0.1492  0.1469  0.1446  0.1423  0.1401  0.1379
1.1   0.1357  0.1335  0.1314  0.1292  0.1271  0.1251  0.1230  0.1210  0.1190  0.1170
1.2   0.1151  0.1131  0.1112  0.1093  0.1075  0.1056  0.1038  0.1020  0.1003  0.0985
1.3   0.0968  0.0951  0.0934  0.0918  0.0901  0.0885  0.0869  0.0853  0.0838  0.0823
1.4   0.0808  0.0793  0.0778  0.0764  0.0749  0.0735  0.0721  0.0708  0.0694  0.0681
1.5   0.0668  0.0655  0.0643  0.0630  0.0618  0.0606  0.0594  0.0582  0.0571  0.0559
1.6   0.0548  0.0537  0.0526  0.0516  0.0505  0.0495  0.0485  0.0475  0.0465  0.0455
1.7   0.0446  0.0436  0.0427  0.0418  0.0409  0.0401  0.0392  0.0384  0.0375  0.0367
1.8   0.0359  0.0351  0.0344  0.0336  0.0329  0.0322  0.0314  0.0307  0.0301  0.0294
1.9   0.0287  0.0281  0.0274  0.0268  0.0262  0.0256  0.0250  0.0244  0.0239  0.0233
2.0   0.0228  0.0222  0.0217  0.0212  0.0207  0.0202  0.0197  0.0192  0.0188  0.0183
2.1   0.0179  0.0174  0.0170  0.0166  0.0162  0.0158  0.0154  0.0150  0.0146  0.0143
2.2   0.0139  0.0136  0.0132  0.0129  0.0125  0.0122  0.0119  0.0116  0.0113  0.0110
2.3   0.0107  0.0104  0.0102  0.0099  0.0096  0.0094  0.0091  0.0089  0.0087  0.0084
2.4   0.0082  0.0080  0.0078  0.0075  0.0073  0.0071  0.0069  0.0068  0.0066  0.0064
2.5   0.0062  0.0060  0.0059  0.0057  0.0055  0.0054  0.0052  0.0051  0.0049  0.0048
2.6   0.0047  0.0045  0.0044  0.0043  0.0041  0.0040  0.0039  0.0038  0.0037  0.0036
2.7   0.0035  0.0034  0.0033  0.0032  0.0031  0.0030  0.0029  0.0028  0.0027  0.0026
2.8   0.0026  0.0025  0.0024  0.0023  0.0023  0.0022  0.0021  0.0021  0.0020  0.0019

Using the table:
• The row specifies the z value’s whole number and its tenths.
• The column specifies the z value’s hundredths.
The number in the table estimates the probability that the random variable lies more than z standard deviations above the mean.

[Figure: Probability Distribution, showing the probability of being more than z standard deviations above the distribution mean; axis labels: Distribution Mean, z SD’s]

Normal Distribution: Three Important Properties
• The normal distribution is bell shaped.
• The normal distribution is symmetric around its mean (center).
• The area beneath the normal distribution equals 1.

Central Limit Theorem

Central Limit Theorem: As the sample size becomes larger and larger, the normal distribution provides better and better approximations of interval estimates.

To justify using the normal distribution to calculate the probabilities, reconsider our simulations in which we calculated the percent of repetitions that fall within 1, 2, and 3 standard deviations of the mean after many, many repetitions. Now, use the normal distribution to calculate these percentages.
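The table lookup can also be reproduced numerically. The sketch below is an illustration only, with the hypothetical function names `upper_tail` and `within`; it relies on the standard identity that the upper tail of the normal distribution equals 0.5 · erfc(z/√2), together with the symmetry and total-area properties listed above.

```python
import math


def upper_tail(z):
    """Probability of lying more than z standard deviations above the mean."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def within(k):
    """Probability of lying within k standard deviations of the mean."""
    # Symmetry: the lower tail equals the upper tail.
    # Total area equals 1, so subtract both tails.
    return 1 - 2 * upper_tail(k)


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"within {k} SD: tail = {upper_tail(k):.4f}, "
              f"probability = {100 * within(k):.1f}%")
```

The tail values agree with the table to within rounding in the fourth decimal place, for example the entry for z = 1.00.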
Interval: Standard        Simulation: Percent of Repetitions
Deviations from Random    Within Interval, by Sample Size        Normal Distribution
Variable’s Mean           25          100         400            Percentages
1                         ≈69.2%      ≈68.5%      ≈68.3%         _____%
2                         ≈96.3%      ≈95.6%      ≈95.5%         _____%
3                         ≈99.9%      ≈99.8%      ≈99.7%         _____%

To use the normal distribution to estimate the probability of being within one, two, and three standard deviations of the mean, begin by reviewing two of the normal distribution’s properties:
• The normal distribution is symmetric around its mean (center).
• The area beneath the normal distribution equals 1.

z     0.00     0.01
0.9   0.1841   0.1814
1.0   0.1587   0.1562
1.1   0.1357   0.1335
Probability within 1 SD = _______________ = _____

[Figure: Probability Distribution with 1 SD marked on each side of the Distribution Mean; areas left blank: ______, ______]

z     0.00     0.01
1.9   0.0287   0.0281
2.0   0.0228   0.0222
2.1   0.0179   0.0174
Probability within 2 SD’s = _______________ = _____

[Figure: Probability Distribution with 2 SD’s marked on each side of the Distribution Mean; areas left blank: ______, ______]

z     0.00     0.01
2.9   0.0019   0.0018
3.0   0.0013   0.0013
Probability within 3 SD’s = _______________ = _____

Normal Distribution Rules of Thumb
Standard Deviations from      Probability of
the Distribution Mean         being within
1                             _______
2                             _______
3                             _______

Revisiting Clint’s Dilemma

On the eve of the election, Clint must decide whether or not to finance a pre-election party.
• If he is comfortably ahead, he will not hold the party; he will save his campaign funds for a future political endeavor (or a spring vacation trip to Cancun).
• If he is not comfortably ahead, he will hold the party, trying to capture more votes.
There is not enough time to canvass everyone, however. What should he do?

Econometrician’s Philosophy: If you lack the information to determine the value directly, do the best you can by estimating the value using the information you do have.

Clint’s Estimation Procedure: Use the fraction of those polled, EstFrac, to estimate the actual population fraction.
• Questionnaire: Are you voting for Clint?
• Procedure: Clint selects 16 students at random and poses the question.
• Results: 12 students report that they will vote for Clint and 4 against Clint.

Fraction of those polled supporting Clint:
    EstFrac = 12/16 = 3/4 = .75

From the poll, we estimate that seventy-five percent, .75, of the population supports Clint. The poll suggests that Clint leads.

Question: Should Clint be confident that he has the election in hand or should he fund the party?
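As a starting point for thinking about this question, the handout's own equations can be applied to a poll of 16. The sketch below is only an illustration and does not answer the question: the actual fraction p is unknown, so the value p = .50 used here is a purely hypothetical tossup, and the variable names are assumptions of this sketch.

```python
import math

polled = 16        # sample size T from Clint's poll
supporters = 12    # students who report they will vote for Clint

est_frac = supporters / polled  # EstFrac = 12/16

# Hypothetical assumption: the election is a tossup, p = .50.
# The true p is unknown; this value only illustrates the formulas.
p_assumed = 0.5
variance = p_assumed * (1 - p_assumed) / polled  # Var[EstFrac] = p(1 - p)/T
sd = math.sqrt(variance)                         # SD[EstFrac]

print(f"EstFrac = {est_frac:.2f}")
print(f"Under the assumed tossup: Var[EstFrac] = {variance:.6f}, SD[EstFrac] = {sd:.3f}")
```

Under that assumed tossup the standard deviation for a sample of 16 is much larger than for the samples of 25, 100, and 400 considered earlier, which is a reminder that the reliability of the estimate depends on the variance of its probability distribution.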