Chapter 5: Sampling distributions
Outline
1. Introduction
2. Sampling distribution of a proportion
3. Sampling distribution of the mean
4. Normal approximation to the binomial
5. The continuity correction
Introduction
Sampling distributions
What does it mean to take a sample of size n?
Y1 , . . . , Yn form a random sample if they are independent and
have a common distribution.
From a sample, we can calculate a sample statistic such
as the sample mean Ȳ .
Ȳ is random too.
The distribution of Ȳ is called a sampling distribution.
Sampling Picture
Chalk Board
Randomness via sampling.
Ȳ , p̂.
Sampling distribution of a proportion
Estimator for a Proportion
A dichotomous trait occurs in a population with an unknown proportion p.
You wish to estimate p.
Steps
1. Collect a sample of size n.
2. Let Y = total observations which have the trait of interest.
   Y ∼ B(n, p) (Binomial)
An estimator for the unknown p:
p̂ = Y / n
Example
What proportion (p) of the human population has type O blood? (Dichotomous because each person either has type O or not.)
You randomly sample 100 people and observe 42 with type O.
Here n = 100, Y = 42.
An estimator for p:
p̂ = Y / n = 42/100 = 0.42
What would happen if you took a different sample of 100 people? You’d get a different value of p̂.
Sampling Distribution of p̂
Y ∼ B(n, p),   p̂ = Y / n
1. What is the distribution of Y? Binomial.
2. What is the distribution of p̂? A binomial divided by n.
Chalkboard Example
Write out the distribution of p̂.
It will be a hassle if n is big!
Mean and Variance of p̂
Y ∼ B(n, p),   p̂ = Y / n
1. Moments of Y:
   E(Y) = np,   Var(Y) = np(1 − p)
2. Moments of p̂:
   E(p̂) = p,   Var(p̂) = p(1 − p) / n
Chalkboard
Motivate E(p̂) = p and Var(p̂) = p(1 − p)/n.
Example
Example: cross of two heterozygotes Aa × Aa. Probability
distribution of the offspring’s genotype:

  Offspring genotype   AA     Aa     aa
  Probability          0.25   0.50   0.25

An offspring is dominant if it has genotype AA or Aa.
Experiment: Get n = 2 offspring, count the number Y of
dominant offspring, and calculate the sample proportion
p̂ = Y /2.
We would like p̂ to be close to the “true” value p = 0.75.
p̂ is random.
Distribution of p̂ (from the binomial distribution):

  Y    0        1        2
  p̂    0.0      0.5      1.0
  IP   0.0625   0.3750   0.5625
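As a quick check, the table above can be reproduced from the binomial pmf. A minimal Python sketch (the course's R `dbinom` would do the same; `binom_pmf` here is a hypothetical helper built from `math.comb`):

```python
from math import comb

def binom_pmf(k, n, p):
    # P(Y = k) for Y ~ B(n, p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 2, 0.75
# p-hat = Y/n takes values 0.0, 0.5, 1.0 with the probabilities of Y = 0, 1, 2
dist = {k / n: binom_pmf(k, n, p) for k in range(n + 1)}
print(dist)  # {0.0: 0.0625, 0.5: 0.375, 1.0: 0.5625}
```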
Example
Larger sample size: Y = # of dominant offspring out of n = 20,
p̂ = Y /20 the sample proportion.
We still want p̂ to be close to the “true” value p = 0.75.
p̂ is still random.
What is the probability that p̂ is within 0.05 of p? Translate
into a binomial question:
IP{0.70 ≤ p̂ ≤ 0.80} = IP{0.70 ≤ Y /20 ≤ 0.80}
                    = IP{14 ≤ Y ≤ 16}
                    = IP{Y = 14} + IP{Y = 15} + IP{Y = 16}
                    = 0.56
A sample size of 20 is better than a sample size of 2!
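The 0.56 above can be verified by summing the binomial pmf directly. A sketch in Python (the slides' R `dbinom` gives the same values; `binom_pmf` is a hypothetical helper):

```python
from math import comb

def binom_pmf(k, n, p):
    # P(Y = k) for Y ~ B(n, p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# P(0.70 <= p-hat <= 0.80) = P(14 <= Y <= 16) with Y ~ B(20, 0.75)
prob = sum(binom_pmf(k, 20, 0.75) for k in range(14, 17))
print(round(prob, 2))  # 0.56
```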
Sampling distribution of the mean
Categorical versus Numerical Variables
1. Dichotomous categorical variable (e.g. left-handedness):
   the natural value to consider is the proportion.
2. Numerical variable (e.g. height):
   the natural value to consider is the mean.
Estimator for the Mean
A numerical trait occurs in a population with an unknown mean µ.
You wish to estimate µ.
Collect a random sample of size n: Y1 , Y2 , ..., Yn .
An estimator for the unknown µ:
Ȳ = (1/n) Σⁿᵢ₌₁ Yᵢ
Example
1. You want to study the serum cholesterol level of 17 year old males in the US.
2. Specifically, you are interested in the mean (µ) of serum cholesterol levels of all 17 year old males in the US.
3. You collect a sample of size 25.
4. Observe Ȳ = 30.
5. What would happen if you collected a different sample of n = 25 17 year old males?
Example
Example: weight of seeds of some variety of beans.
Sample size n = 4.

  Student #   Observations        sample mean ȳ
  1           462 368 607 483     ȳ = 480
  2           346 535 650 451     ȳ = 495.5
  3           579 677 636 529     ȳ = 605.25

Ȳ is random. How do we know its distribution?
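Each student's ȳ above is just the average of their four observations; a minimal check in Python:

```python
# Bean-weight observations (mg) from the table, one row per student
samples = {
    1: [462, 368, 607, 483],
    2: [346, 535, 650, 451],
    3: [579, 677, 636, 529],
}
means = {student: sum(obs) / len(obs) for student, obs in samples.items()}
print(means)  # {1: 480.0, 2: 495.5, 3: 605.25}
```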
We will see 3 key facts.
Key fact # 1
If Y1 , . . . , Yn is a random sample, and if the Yi ’s have mean µ
and standard deviation σ, then Ȳ has mean
µȲ = µ
and variance Var(Ȳ) = σ²/n, i.e. standard deviation
σȲ = σ/√n
Seed weight example: Assume beans have mean µ = 500 mg
and σ = 120 mg. In a sample of size n = 4, the sample mean
Ȳ has mean µȲ = 500 mg and standard deviation
σȲ = 120/√4 = 60 mg.
Key fact # 2
If Y1 , . . . , Yn is a random sample, and if the Yi ’s are all from
N (µ, σ), then
Ȳ ∼ N (µ, σ/√n)
Actually, Y1 + · · · + Yn = n Ȳ is normal too.
Seed weight example: 100 students do the same experiment.
[Figure: histograms of the 100 sample means, one panel for n = 4 and one for n = 16.]
Key fact # 3
Central limit theorem
If Y1 , . . . , Yn is a random sample from (almost) any distribution,
then, as n gets large, Ȳ is approximately normally distributed.
Note: Y1 + · · · + Yn is approximately normal too.
How big must n be?
Usually, n = 30 is big enough, unless the distribution is strongly
skewed.
Remarkable result! It explains why the normal distribution is so
common, so “normal”: it is what we get when we average over
lots of pieces. Ex: human height. Results from ...
Ex: beans are filtered, discarded if too small.
[Figure: sampling distributions of Ȳ for n = 1, 5, 10, 30.]
Example: Mixture of 2 bean varieties.
[Figure: sampling distributions of Ȳ for n = 1, 5, 10, 30, starting from a bimodal distribution.]
Exercise
Snowfall Y ∼ N (0.53, 0.21) on winter days (inches).
Take the sample mean Ȳ of a random sample of 30 winter
days, over the 10 previous years. What is the probability that
Ȳ ≤ 0.50 in?
Ȳ has mean 0.53 inches.
Ȳ has standard deviation 0.21/√30 = 0.0383 inches.
Ȳ ’s distribution is approximately normal, because the
sample size is large enough (n = 30).
IP{Ȳ ≤ 0.50} = IP{(Ȳ − 0.53)/0.0383 ≤ (0.50 − 0.53)/0.0383}
             ≈ IP{Z ≤ −0.782} = 0.217
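The normal probability above can be computed without a z-table. A Python sketch using the error function from the standard library (`normal_cdf` is a hypothetical helper, not part of the course materials):

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    # P(X <= x) for X ~ N(mu, sigma), via the error function
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma, n = 0.53, 0.21, 30
se = sigma / sqrt(n)              # standard deviation of Y-bar, ~0.0383
prob = normal_cdf(0.50, mu, se)
print(round(prob, 3))  # 0.217
```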
Normal approximation to the binomial
The normal approximation to the binomial
Example: X = # of children with side effects after a vaccine, out
of n = 200 children. Probability of side effect: p = 0.05. So
X ∼ B(200, 0.05).
What is IP{X ≤ 15}?
Direct calculation:
IP{X = 0} + IP{X = 1} + · · · + IP{X = 15}
  = 200C0 (.05)^0 (.95)^200 + · · · + 200C15 (.05)^15 (.95)^185
Or we can use a trick: the binomial might be close to a
normal distribution. Pretend X is normally distributed!
[Figure: binomial probability histograms. Top row: n = 10, 50, 200 with p = 0.05; bottom row: n = 20 with p = 0.1, 0.5, 0.9. x-axis: some possible values; y-axis: probability.]
The normal approximation to the binomial
X = Y1 + · · · + Y200 where
Y1 = 1 if child #1 has side effects, 0 otherwise,
...
Y200 = 1 if child #200 has side effects, 0 otherwise.
Apply key fact #3: if n (# of children) is large enough,
then Y1 + · · · + Yn has a normal distribution.
Use the normal distribution with X ’s mean and standard deviation:
µ = np = 10,   σ = √(np(1 − p)) = 3.08
If X ∼ B(n, p) and if n is large enough: if np ≥ 5
and n(1 − p) ≥ 5 (rule of thumb), then X ’s distribution is
approximately N (np, √(np(1 − p))).
The normal approximation to the binomial
Back to our question: IP{X ≤ 15}.
np = 10 and n(1 − p) = 190 are both ≥ 5, so X ≈ N (10, 3.08).
IP{X ≤ 15} = IP{(X − 10)/3.08 ≤ (15 − 10)/3.08}
           ≈ IP{Z ≤ 1.62}
           = 0.9474
True value:
> sum( dbinom(0:15, size=200, prob=0.05) )
[1] 0.9556444
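The same comparison can be sketched in Python, mirroring the R `dbinom` sum with `math.comb` (values differ slightly from the slide because z is not rounded to 1.62 here):

```python
from math import comb, erf, sqrt

n, p = 200, 0.05

# Exact: same as R's sum(dbinom(0:15, size=200, prob=0.05))
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(16))

# Normal approximation N(np, sqrt(np(1-p))), no continuity correction
mu, sigma = n * p, sqrt(n * p * (1 - p))
approx = 0.5 * (1 + erf((15 - mu) / (sigma * sqrt(2))))

print(round(exact, 4))   # 0.9556
print(round(approx, 4))  # 0.9476
```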
Continuity Correction
The continuity correction
[Figure: three panels showing the histogram of the # of children with side effect (0 to 20), illustrating the continuity correction.]
The continuity correction
X binomial B(200, 0.05), and Y normal N (10, 3.08).
No continuity correction:
IP{X ≤ 15} ≈ IP{Y ≤ 15} = IP{(Y − 10)/3.08 ≤ (15 − 10)/3.08}
           = IP{Z ≤ 1.62}
           = 0.9474
The continuity correction gives a better approximation:
IP{X ≤ 15} ≈ IP{Y ≤ 15.5} = IP{(Y − 10)/3.08 ≤ (15.5 − 10)/3.08}
           = IP{Z ≤ 1.78}
           = 0.9624
(true value was 0.9556)
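That the corrected value lands closer to the truth can be checked numerically. A Python sketch (using unrounded z-values, so the figures differ slightly from the slide):

```python
from math import comb, erf, sqrt

n, p = 200, 0.05
mu, sigma = n * p, sqrt(n * p * (1 - p))  # 10 and ~3.08

def phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(16))
no_correction = phi((15 - mu) / sigma)      # ~0.9476
with_correction = phi((15.5 - mu) / sigma)  # ~0.9628
# The corrected value is closer to the exact probability ~0.9556
print(abs(with_correction - exact) < abs(no_correction - exact))  # True
```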
The continuity correction
X binomial B(200, 0.05), and Y normal N (10, 3.08).
What is the probability that between 8 and 15 children get side
effects?
IP{8 ≤ X ≤ 15} ≈ IP{7.5 ≤ Y ≤ 15.5}
  = IP{(7.5 − 10)/3.08 ≤ (Y − 10)/3.08 ≤ (15.5 − 10)/3.08}
  = IP{−0.81 ≤ Z ≤ 1.78}
  = IP{Z ≤ 1.78} − IP{Z ≤ −0.81}
  = 0.7535
True value:
> sum( dbinom(8:15, size=200, prob=0.05) )
[1] 0.7423397
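Mirroring the R call above, a Python sketch for the two-sided case (again with unrounded z-values, so the approximation differs slightly from the slide's 0.7535):

```python
from math import comb, erf, sqrt

n, p = 200, 0.05
mu, sigma = n * p, sqrt(n * p * (1 - p))

def phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

# Exact: same as R's sum(dbinom(8:15, size=200, prob=0.05))
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(8, 16))
# Normal approximation with continuity correction: P(7.5 <= Y <= 15.5)
approx = phi((15.5 - mu) / sigma) - phi((7.5 - mu) / sigma)

print(round(exact, 4))   # 0.7423
print(round(approx, 3))  # ~0.754
```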