PHP 2510
Central limit theorem, confidence intervals
PHP 2510 – October 20, 2008
Distribution of the sample mean
Case 1: Population distribution is normal
For each individual in the population, $X_i \sim N(\mu, \sigma^2)$ for
$i = 1, 2, \ldots, n$. Then, for a sample of size $n$, the sample mean
also has a normal distribution:
$\bar{X} \sim N(\mu, \sigma^2/n)$
Case 2: Population distribution is not normal (e.g. Poisson,
binomial). Then, for large samples, the sample mean is approximately
normally distributed, with mean equal to $E(X)$ and variance
equal to $\mathrm{var}(X)/n$.
This is known as the central limit theorem.
Central Limit Theorem
Characterizes the distribution of $\bar{X}$ in large samples
Suppose a sample $X_1, \ldots, X_n$ comes from a distribution with
mean $E(X)$ and variance $\mathrm{var}(X)$. This can be almost any
distribution (binomial, Poisson, etc.)
When $n$ is large, the sample mean $\bar{X}$ is approximately normally distributed. Its
mean is equal to the population mean, and its variance is
$\mathrm{var}(X)/n$. We can write
$\bar{X} \sim N\left(E(X), \frac{\mathrm{var}(X)}{n}\right)$
Example 1: Throw a fair coin.
Use the sample mean to estimate the probability of getting a head.
Let $X$ be the outcome of throwing a fair coin once ($X = 1$ for a head, $X = 0$ for a tail).
$X \sim \mathrm{Bernoulli}(1/2)$
I Throw a fair coin $n$ times. Let $X_1, X_2, \ldots, X_n$ be the outcomes.
II Compute the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$.
III The CLT says $\bar{X}$ is approximately normally distributed for large $n$. To illustrate
this, repeat Steps I and II 1000 times. Plot $\bar{X}$ versus its
relative frequency.
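Steps I to III can be tried out with a short simulation. The sketch below is illustrative only and is not the code behind the course plots: the function name `simulate_sample_means`, the seed, and the coding of a head as 1 are assumptions made here. It compares the mean and variance of the simulated sample means with the CLT values $E(X) = 0.5$ and $\mathrm{var}(X)/n = 0.25/n$.

```python
import random
import statistics

def simulate_sample_means(n, reps=1000, p=0.5, seed=0):
    """Repeat Steps I and II `reps` times and return the list of sample means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    means = []
    for _ in range(reps):
        # Step I: throw the coin n times (1 = head, 0 = tail).
        throws = [1 if rng.random() < p else 0 for _ in range(n)]
        # Step II: compute the sample mean.
        means.append(sum(throws) / n)
    return means

# Step III: compare the simulated sample means with the CLT prediction.
for n in (5, 15, 40, 100):
    means = simulate_sample_means(n)
    print(
        f"n={n:3d}  mean of sample means={statistics.mean(means):.3f}  "
        f"variance={statistics.pvariance(means):.5f}  var(X)/n={0.25 / n:.5f}"
    )
```

As $n$ grows, the simulated variance should track $0.25/n$ and a histogram of `means` should look increasingly bell-shaped.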
[Figure: relative frequency of the sample mean in four panels, n = 5, n = 15, n = 40 and n = 100. Horizontal axis: sample mean; vertical axis: relative frequency.]
Example 2: Throw a fair die.
Use the sample mean to estimate the probability of getting a six. Let
$X$ be the outcome of throwing a fair die once ($X = 1$ for a six, $X = 0$ otherwise).
$X \sim \mathrm{Bernoulli}(1/6)$
I Throw a fair die $n$ times. Let $X_1, X_2, \ldots, X_n$ be the outcomes.
II Compute the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$.
III The CLT says $\bar{X}$ is approximately normally distributed for large $n$. To illustrate
this, repeat Steps I and II 1000 times. Plot $\bar{X}$ versus its
relative frequency.
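The "plot $\bar{X}$ versus its relative frequency" step can be sketched as a text histogram. This is a rough illustration, not the original plotting code. It assumes that $X$ is coded as 1 when the die shows a six and 0 otherwise, so that the sample mean is the proportion of sixes; the function name `relative_frequencies` and the seed are made up for this example.

```python
import random
from collections import Counter

def relative_frequencies(n, reps=1000, seed=0):
    """Map each observed sample mean to its relative frequency over `reps` repetitions."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(reps):
        # Step I: throw the die n times and count the sixes.
        sixes = sum(1 for _ in range(n) if rng.randint(1, 6) == 6)
        # Step II: the sample mean of the 0/1 outcomes is the proportion of sixes.
        counts[sixes / n] += 1
    return {mean: count / reps for mean, count in sorted(counts.items())}

# Step III: crude text histogram for n = 15 (one '#' per percentage point).
for mean, freq in relative_frequencies(n=15).items():
    print(f"{mean:.3f}  {freq:.3f}  " + "#" * round(freq * 100))
```

For small $n$ the histogram is skewed to the right because $p = 1/6$ is far from $1/2$; rerunning with `n=100` should give a much more symmetric shape.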
[Figure: relative frequency of the sample mean in four panels, n = 5, n = 15, n = 40 and n = 100. Horizontal axis: sample mean; vertical axis: relative frequency.]
Confidence intervals
Confidence intervals can be used to convey uncertainty about the
estimate of any parameter.
A confidence interval consists of two random variables (a lower
and an upper bound) and covers the true mean with some pre-specified
probability.
Because the boundaries are computed from the sample, they are themselves random
variables; the true mean is fixed.
What question does a CI answer?
Example 1: Incidence of pre-eclampsia.
A random sample of 1249 women is selected and followed through
pregnancy. 250 get pre-eclampsia.
Estimate the incidence by the sample mean: $250/1249 \approx 0.20 = 20\%$
We would like to find an interval that contains, with 95%
probability, the true incidence of pre-eclampsia.
Example 2: Hospitalization rate of HIV-infected women during a
6-month period.
A sample of 787 women is followed for 6 months, resulting in the
following data.
We would like to construct an interval that contains the true
rate of hospitalization with 90% probability.
    numhosp |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        508       64.55       64.55
          1 |        176       22.36       86.91
          2 |         61        7.75       94.66
          3 |         20        2.54       97.20
          4 |         13        1.65       98.86
          5 |          5        0.64       99.49
          6 |          1        0.13       99.62
          7 |          3        0.38      100.00
------------+-----------------------------------
      Total |        787      100.00

Variable |     Obs        Mean   Std. Dev.       Min       Max
---------+-----------------------------------------------------
 numhosp |     787    .5870394    1.036723         0         7
Constructing a confidence interval
Central limit theorem: says the sample mean is approximately normally
distributed in large samples
$\bar{X} \sim N(E(X), \mathrm{var}(X)/n)$
Writing it this way is a little tedious, so we use $\mu$ and $\sigma^2$ to
generically denote $E(X)$ and $\mathrm{var}(X)$; i.e.
$\bar{X} \sim N(\mu, \sigma^2/n)$
This implies that the sample mean can be rescaled to a standard
normal
$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$
Applying the CLT to form confidence intervals
To form a 90% confidence interval, we want an interval that
contains the true mean with probability 0.90.
Logic: For a large sample size, the sample mean is a normally
distributed random variable.
Find an interval that contains a standard normal random
variable with some pre-specified probability.
Center it using the sample mean, and scale it using the
standard error.
Step 1. Determine which two values contain 90% of the area
under the standard normal curve
Ans: –1.65 and 1.65
Step 2. Then with 90% probability, the standardized mean will
fall between –1.65 and 1.65
$-1.65 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.65$
In other words,
$\Pr\left(-1.65 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.65\right) = 0.90$
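The cutoff in Step 1 can be looked up numerically instead of from a table. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `critical_value` is made up for this illustration. It shows that the slide's 1.65 is a rounding of about 1.645.

```python
from statistics import NormalDist

def critical_value(coverage):
    """Return z such that Pr(-z < Z < z) = coverage for Z ~ N(0, 1)."""
    tail = (1.0 - coverage) / 2.0          # area left in each tail
    return NormalDist().inv_cdf(1.0 - tail)

print(round(critical_value(0.90), 3))  # 1.645
print(round(critical_value(0.95), 3))  # 1.96
```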
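Step 3. Rearrange the inequality to isolate $\mu$:
$\Pr\left(-1.65 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.65\right) = 0.90$
$\Rightarrow \Pr\left(\bar{X} - 1.65(\sigma/\sqrt{n}) < \mu < \bar{X} + 1.65(\sigma/\sqrt{n})\right) = 0.90.$
In words: start with $\bar{X}$, then add and subtract 1.65 standard
errors.
$\bar{X} \pm 1.65 \times (\sigma/\sqrt{n})$
In large samples, we can replace $\sigma$ with the sample SD $S$.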
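The recipe "sample mean plus or minus $z$ standard errors" can be written as a small function. This is a minimal sketch that uses the large-sample normal approximation and substitutes the sample SD for $\sigma$. The function name `mean_confidence_interval` and the 0/1 data in the usage lines are hypothetical and serve only as an illustration.

```python
import math
import statistics
from statistics import NormalDist

def mean_confidence_interval(data, coverage=0.90):
    """Large-sample CI for the mean: xbar +/- z * S / sqrt(n)."""
    n = len(data)
    xbar = statistics.mean(data)
    s = statistics.stdev(data)       # sample SD S, used in place of sigma
    se = s / math.sqrt(n)            # standard error of the sample mean
    z = NormalDist().inv_cdf(1 - (1 - coverage) / 2)
    return xbar - z * se, xbar + z * se

# Hypothetical data: 100 observations, 40 of them equal to 1.
example = [0] * 60 + [1] * 40
print(mean_confidence_interval(example, coverage=0.90))
print(mean_confidence_interval(example, coverage=0.95))
```

The 95% interval comes out wider than the 90% interval because more standard errors are added and subtracted.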
Properties of the confidence interval
Covers the true mean with a pre-specified probability
Increase this probability by increasing the number of standard errors
added and subtracted
For 95% coverage, add and subtract 1.96 std. errors.
$\bar{X} \pm 1.96 \times (\sigma/\sqrt{n})$
Width of an interval is determined by
• Population variance $\sigma^2$
• Sample size $n$
• Nominal coverage probability
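Both properties can be checked by simulation: draw many samples, build an interval from each, and count how often the interval contains the true mean. The sketch below makes several assumptions for illustration only: a normal population with $\mu = 2$ and $\sigma = 1$, $\sigma$ treated as known, and a function name `coverage_rate` that is not from the course materials.

```python
import math
import random
from statistics import NormalDist

def coverage_rate(n, coverage, reps=2000, mu=2.0, sigma=1.0, seed=0):
    """Return (fraction of intervals that cover mu, interval width)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - coverage) / 2)
    half_width = z * sigma / math.sqrt(n)    # sigma is treated as known here
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / reps, 2 * half_width

for coverage in (0.90, 0.95):
    for n in (25, 100):
        rate, width = coverage_rate(n, coverage)
        print(f"nominal={coverage:.2f}  n={n:3d}  observed={rate:.3f}  width={width:.3f}")
```

The observed coverage should sit close to the nominal value, the width grows with the nominal coverage, and quadrupling $n$ halves the width because the standard error scales as $1/\sqrt{n}$.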
[Figure: confidence intervals in two panels, 95% CI and 90% CI, each on a horizontal axis running from 1.4 to 2.6.]
Example 1. Incidence of pre-eclampsia
Sample 1249 women, 250 get pre-eclampsia. Find an interval that
contains the true incidence with 95% probability.
Step 1. Let $X$ be the pre-eclampsia status.
$X \sim \mathrm{Bernoulli}(p)$
where $E(X) = p$ and $\sigma^2 = \mathrm{var}(X) = p(1 - p)$.
Sample mean: $\bar{X} = 250/1249 = 0.20$.
We estimate $p$ by
$\hat{p} = \bar{X}$,
and $\sigma^2$ by
$\hat{\sigma}^2 = \hat{p}(1 - \hat{p}) = (0.2)(0.8) = 0.16$.
Step 2. Find the number of std. errors needed for 95% coverage ⇒
1.96
Step 3. Add and subtract 1.96 std. errors from the sample mean, where
std. error $= \hat{\sigma}/\sqrt{n} = \sqrt{0.16/1249} \approx 0.011$
Lower limit = 0.20 – (1.96)(0.011) = 0.18
Upper limit = 0.20 + (1.96)(0.011) = 0.22
Confidence interval: (0.18, 0.22)
How to make this a 90% interval?
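The pre-eclampsia calculation can be reproduced in a few lines, and changing the coverage argument gives the 90% version: only the multiplier changes, from 1.96 to about 1.65. This is a sketch of the large-sample interval described above; the function name `proportion_ci` is made up for illustration.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, coverage):
    """Large-sample CI for a proportion: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error
    z = NormalDist().inv_cdf(1 - (1 - coverage) / 2)
    return p_hat - z * se, p_hat + z * se

for coverage in (0.95, 0.90):
    lo, hi = proportion_ci(250, 1249, coverage)
    print(f"{coverage:.0%} CI: ({lo:.3f}, {hi:.3f})")
```

Rounded to two decimals, the 95% interval is (0.18, 0.22); the 90% interval lies strictly inside it.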
Example 2: Hospitalization data
Find a 90% confidence interval for mean number of hospitalizations
Variable |     Obs        Mean   Std. Dev.       Min       Max
---------+-----------------------------------------------------
 numhosp |     787    .5870394    1.036723         0         7
Step 1: Use summary statistics to obtain key values
Sample mean = 0.59
Sample SD = 1.04
Std error of sample mean $= 1.04/\sqrt{787} = 0.037$
Step 2: Coverage probability is 90%. Add and subtract 1.65 SEs
Step 3: Compute interval
$0.59 \pm (1.65)(0.037) \Rightarrow (0.53, 0.65)$
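As a cross-check, the summary statistics and the interval can be recomputed at full precision from the frequency table of `numhosp` instead of from rounded values. The sketch below uses only the counts shown in that table; the variable names are chosen here for illustration, and it uses the unrounded 90% multiplier (about 1.645) rather than 1.65.

```python
import math
from statistics import NormalDist

# Number of hospitalizations -> number of women (from the frequency table).
freq = {0: 508, 1: 176, 2: 61, 3: 20, 4: 13, 5: 5, 6: 1, 7: 3}

n = sum(freq.values())
mean = sum(k * c for k, c in freq.items()) / n
# Sample variance with the n - 1 divisor, matching the reported Std. Dev.
var = sum(c * (k - mean) ** 2 for k, c in freq.items()) / (n - 1)
sd = math.sqrt(var)
se = sd / math.sqrt(n)                  # standard error of the sample mean

z = NormalDist().inv_cdf(0.95)          # 90% coverage leaves 5% in each tail
lo, hi = mean - z * se, mean + z * se

print(f"n={n}  mean={mean:.4f}  sd={sd:.4f}  se={se:.4f}")
print(f"90% CI: ({lo:.2f}, {hi:.2f})")
```

The computed mean and SD should agree with the summary output (.5870394 and 1.036723), which confirms that the table was transcribed consistently.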