PHP 2510
Central limit theorem, confidence intervals
PHP 2510 – October 20, 2008

Distribution of the sample mean

Case 1: Population distribution is normal

For an individual in the population, $X_i \sim N(\mu, \sigma^2)$ for $i = 1, 2, \ldots, n$. Then, for a sample of size $n$, the sample mean also has a normal distribution:

$$\bar{X} \sim N(\mu, \sigma^2/n)$$

Case 2: Population distribution is not normal (e.g. Poisson, binomial)

Then, for large samples, the sample mean is approximately normally distributed, with mean equal to $E(X)$ and variance equal to $\mathrm{var}(X)/n$. This is known as the central limit theorem.

Central Limit Theorem

Characterizes the distribution of $\bar{X}$ in large samples.

Suppose a sample $X_1, \ldots, X_n$ comes from a distribution with mean $E(X)$ and variance $\mathrm{var}(X)$. This can be almost any distribution (binomial, Poisson, etc.).

When $n$ is large, the sample mean $\bar{X}$ is approximately normally distributed. Its mean is equal to the population mean, and its variance is $\mathrm{var}(X)/n$. We can write

$$\bar{X} \sim N\left(E(X), \frac{\mathrm{var}(X)}{n}\right)$$

Example 1: Throw a fair coin. Use the sample mean to estimate the probability of getting a head.

Let $X$ be the outcome of throwing a fair coin once. $X \sim$ ____

I. Throw a fair coin $n$ times. Let $X_1, X_2, \ldots, X_n$ be the outcomes.
II. Compute the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$.
III. The CLT says $\bar{X}$ is approximately normally distributed for large $n$.

To illustrate this, repeat Steps I and II 1000 times and plot $\bar{X}$ versus its relative frequency.

[Figure: relative frequency of the sample mean; panels for n = 5, n = 15, n = 40, n = 100]

Example 2: Throw a fair die. Use the sample mean to estimate the probability of getting a six.

Let $X$ be the outcome of throwing a fair die once. $X \sim$ ____

I. Throw a fair die $n$ times. Let $X_1, X_2, \ldots, X_n$ be the outcomes.
II. Compute the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$.
III. The CLT says $\bar{X}$ is approximately normally distributed for large $n$.

To illustrate this, repeat Steps I and II 1000 times and plot $\bar{X}$ versus its relative frequency.

[Figure: relative frequency of the sample mean; panels for n = 5, n = 15, n = 40, n = 100]

Confidence intervals

Confidence intervals can be used to convey uncertainty about the estimate of any parameter.

A confidence interval consists of two random variables (a lower and an upper bound) and covers the true mean with some pre-specified probability.

The confidence interval boundaries themselves are random variables.

What question does a CI answer?

Example 1: Incidence of pre-eclampsia. A random sample of 1249 women is selected and followed through pregnancy; 250 get pre-eclampsia. Estimate the incidence by the sample mean: 250/1249 = 20%.

We would like to find an interval that contains, with 95% probability, the true incidence of pre-eclampsia.

Example 2: Hospitalization rate of HIV-infected women during a 6-month period. A sample of 787 women is followed for 6 months, resulting in the following data. We want to construct an interval that contains the true rate of hospitalization with 90% probability.

        numhosp |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        508       64.55       64.55
              1 |        176       22.36       86.91
              2 |         61        7.75       94.66
              3 |         20        2.54       97.20
              4 |         13        1.65       98.86
              5 |          5        0.64       99.49
              6 |          1        0.13       99.62
              7 |          3        0.38      100.00
    ------------+-----------------------------------
          Total |        787      100.00

       Variable |   Obs       Mean   Std. Dev.   Min   Max
    ------------+------------------------------------------
        numhosp |   787   .5870394    1.036723     0     7

Constructing a confidence interval

Central limit theorem: the sample mean is approximately normally distributed in large samples,

$$\bar{X} \sim N\left(E(X), \mathrm{var}(X)/n\right)$$

Writing it this way is a little tedious, so we use $\mu$ and $\sigma^2$ to denote $E(X)$ and $\mathrm{var}(X)$ generically; i.e.

$$\bar{X} \sim N(\mu, \sigma^2/n)$$

This implies that the sample mean can be rescaled to a standard normal:

$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$

Applying the CLT to form confidence intervals

To form a 90% confidence interval, we want an interval that contains the true mean with probability 0.90.

Logic: For a large sample size, the sample mean is a normally distributed random variable. Find an interval that contains a standard normal random variable with some pre-specified probability. Center it using the sample mean, and scale it using the standard error.

Step 1. Determine which two values contain 90% of the area under the standard normal curve. Ans: −1.65 and 1.65.

Step 2. Then, with 90% probability, the standardized mean will fall between −1.65 and 1.65:

$$-1.65 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.65$$

In other words,

$$\Pr\left(-1.65 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.65\right) = 0.90$$

Step 3. Rearrange the inequality to isolate $\mu$:

$$\Pr\left(-1.65 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.65\right) = 0.90 \;\Rightarrow\; \Pr\left(\bar{X} - 1.65\,(\sigma/\sqrt{n}) < \mu < \bar{X} + 1.65\,(\sigma/\sqrt{n})\right) = 0.90$$

In words: start with $\bar{X}$, then add and subtract 1.65 standard errors.

$$\bar{X} \pm 1.65 \times (\sigma/\sqrt{n})$$

In large samples, $\sigma$ can be replaced with the sample SD $S$.

Properties of the confidence interval

Covers the true mean with pre-specified probability.

Increase this probability by increasing the number of standard errors to add and subtract.

For 95% coverage, add and subtract 1.96 std. errors.
$$\bar{X} \pm 1.96 \times (\sigma/\sqrt{n})$$

Width of an interval is determined by
• Population variance $\sigma^2$
• Sample size $n$
• Nominal coverage probability

[Figure: two panels, 95% CI and 90% CI]

Example 1. Incidence of pre-eclampsia

Sample 1249 women; 250 get pre-eclampsia. Find an interval that contains the true incidence with 95% probability.

Step 1. Let $X$ be the pre-eclampsia status. $X \sim \mathrm{Bernoulli}(p)$, where $E(X) = p$ and $\sigma^2 = \mathrm{var}(X) = p(1 - p)$.

Sample mean: $\bar{X} = 250/1249 = 0.20$.

We estimate $p$ by $\hat{p} = \bar{X}$, and $\sigma^2$ by $\hat{\sigma}^2 = \hat{p}(1 - \hat{p}) = (0.2)(0.8) = 0.16$. The standard error of the sample mean is then $\sqrt{\hat{\sigma}^2/n} = \sqrt{0.16/1249} \approx 0.011$.

Step 2. Find the number of std. errors needed for 95% coverage ⇒ 1.96

Step 3. Add and subtract 1.96 std. errors from the sample mean

Lower limit = 0.20 − (1.96)(0.011) = 0.18
Upper limit = 0.20 + (1.96)(0.011) = 0.22

Confidence interval: (0.18, 0.22)

How to make this a 90% interval?

Example 2: Hospitalization data

Find a 90% confidence interval for the mean number of hospitalizations.

       Variable |   Obs       Mean   Std. Dev.   Min   Max
    ------------+------------------------------------------
        numhosp |   787   .5870394    1.036723     0     7

Step 1: Use summary statistics to obtain key values

Sample mean = 0.59
Sample SD = 1.04
Std error of sample mean = $1.04/\sqrt{787}$ = 0.037

Step 2: Coverage probability is 90%. Add and subtract 1.65 SEs.

Step 3: Compute interval

0.59 ± (1.65)(0.037) ⇒ (0.53, 0.65)
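
The two worked examples can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the helper name `normal_ci` is hypothetical, and the multiplier is taken from the standard library's `statistics.NormalDist` (about 1.645 for 90% and 1.96 for 95%) rather than from a table.

```python
from math import sqrt
from statistics import NormalDist


def normal_ci(mean, sd, n, coverage):
    """Large-sample interval: mean +/- z * sd / sqrt(n) (hypothetical helper)."""
    # z is the standard normal quantile that leaves (1 - coverage) / 2
    # in each tail, i.e. the number of standard errors to add and subtract.
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    se = sd / sqrt(n)  # standard error of the sample mean
    return mean - z * se, mean + z * se


# Example 1: pre-eclampsia. The data are Bernoulli, so the SD is
# estimated by sqrt(p_hat * (1 - p_hat)).
p_hat = 250 / 1249
preeclampsia_ci = normal_ci(p_hat, sqrt(p_hat * (1 - p_hat)), 1249, 0.95)

# Example 2: hospitalizations, using the reported sample mean and SD.
hosp_ci = normal_ci(0.5870394, 1.036723, 787, 0.90)

print("95% CI for incidence: ({:.2f}, {:.2f})".format(*preeclampsia_ci))
print("90% CI for mean hospitalizations: ({:.2f}, {:.2f})".format(*hosp_ci))
```

The sketch works with unrounded values throughout (for the hospitalization data the standard error is about 0.037), so its limits can differ in the second decimal from a hand calculation that rounds intermediate steps. Changing the `coverage` argument to 0.90 in the first call answers the question of how to turn the pre-eclampsia interval into a 90% interval.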