1 Inference for Population Proportions
So far we have been interested in estimating, and answering questions about, population means. In these cases our parameters of interest were either a population mean µ or a difference between two population means, µ1 − µ2.
We will now study the analysis of population proportions:
• the proportion of voters who intend to vote for the incumbent prime minister in the next election
• the proportion of cancer patients who are going to survive at least 5 years after treatment
• the proportion of batteries that last at least 6 hours
• the proportion of students who receive an A in Stat 151
We can also interpret the population proportion as the probability of the event of interest (when randomly choosing an individual from the population).
2 Estimation of a population proportion p
The sample proportion p̂ of a sample is defined as
\[ \hat{p} = \frac{x}{n} \]
where x denotes the number of members in the sample that have the specified attribute (or
number of successes), and n denotes the sample size.
It seems natural to use the sample proportion for estimating a population proportion.
In order to confirm that this is statistically reasonable, we need to study the distribution of p̂
(why is this a random variable?).
The following is also called the Central Limit Theorem for proportions.
For samples of size n:
1. (mean) $\mu_{\hat{p}} = p$
2. (standard deviation) $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$
3. (shape) If n is large then p̂ is approximately normally distributed.
The first property means that p̂ is an unbiased estimator for p.
The second property means that the larger n is, the more likely p̂ is to fall "close" to p.
The last property lets us construct confidence intervals and tests (yippee!).
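The three properties can be checked empirically with a small simulation. The following is only a sketch: the values p = 0.35 and n = 50 are borrowed from the example that follows, while the number of replications, the seed, and the function name `simulate_sample_proportions` are arbitrary choices made for illustration.

```python
import random
import statistics


def simulate_sample_proportions(p, n, reps, seed=0):
    """Draw `reps` samples of size n and return the sample proportion of each."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    proportions = []
    for _ in range(reps):
        # Each individual has the attribute with probability p.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions


p, n = 0.35, 50
p_hats = simulate_sample_proportions(p, n, reps=5000)

theoretical_sd = (p * (1 - p) / n) ** 0.5
print("mean of p-hat:", round(statistics.mean(p_hats), 3), "| theory:", p)
print("sd of p-hat:  ", round(statistics.pstdev(p_hats), 3),
      "| theory:", round(theoretical_sd, 3))
```

The simulated mean and standard deviation should land close to p and to the square root of p(1 − p)/n, and a histogram of `p_hats` would look roughly bell-shaped.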
Example: A study showed that the proportion of people in the 20 to 34 age group with an IQ (on the Wechsler Intelligence Scale) of over 120 is about 0.35. Calculate the probability that a sample of 50 contains more than 20 people with an IQ of over 120. For such a sample the boundary value is p̂ = 20/50 = 0.4. We will calculate how likely a sample proportion of 0.4 (or larger) is in a sample of size 50 when the true population proportion is 0.35. (Under the normal approximation, P(p̂ > 0.4) and P(p̂ ≥ 0.4) coincide.)
\[
\begin{aligned}
P(\hat{p} \ge 0.4) &= P\left( \frac{\hat{p} - 0.35}{\sqrt{0.35(0.65)/50}} \ge \frac{0.4 - 0.35}{\sqrt{0.35(0.65)/50}} \right) && \text{(standardize)} \\
&= P(Z \ge 0.74) \\
&= 1 - P(Z < 0.74) \\
&= 1 - 0.7704 = 0.2296 && \text{(table II)}
\end{aligned}
\]
The probability that more than 20 out of 50 people (between 20 and 34) have an IQ greater than 120 is about 0.23. Not that unlikely.
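The table lookup in this example can be reproduced numerically. The sketch below uses `statistics.NormalDist` from the Python standard library in place of table II; the helper name `prob_sample_proportion_at_least` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def prob_sample_proportion_at_least(threshold, p, n):
    """Normal approximation to P(p-hat >= threshold) for sample size n."""
    sd = sqrt(p * (1 - p) / n)       # standard deviation of p-hat
    z = (threshold - p) / sd         # standardized threshold
    return z, 1 - NormalDist().cdf(z)


z, prob = prob_sample_proportion_at_least(0.4, p=0.35, n=50)
print(f"z = {z:.2f}, P(p-hat >= 0.4) is about {prob:.2f}")
```

Because the z value is not rounded to two decimals before the lookup, the result can differ from the table-based 0.2296 in the third or fourth decimal place.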
2.1 Large-Sample Confidence Interval for a Population Proportion p
Let p be the probability of an event of interest.
We saw before that p̂ = x/n is an unbiased estimate for p, if x is the number of successes in n trials. Usually p is unknown, and based on a random sample we can calculate a (1 − α)100% confidence interval.
A (1 − α)100% Large Sample Confidence Interval for a Population Proportion p:
\[ \hat{p} \pm z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}} \]
where $z_{\alpha/2}$ (also written $z_{1-\alpha/2}$) is the 100(1 − α/2)th percentile of the standard normal distribution.
Since p is unknown, it is estimated using p̂.
The sample size is considered large when the normal approximation to the binomial distribution is adequate, namely when the number of successes and the number of failures are both at least five.
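Proof:
\[
\begin{aligned}
& P\left( \hat{p} - z_{1-\alpha/2}\sqrt{\tfrac{p(1-p)}{n}} \le p \le \hat{p} + z_{1-\alpha/2}\sqrt{\tfrac{p(1-p)}{n}} \right) \\
&= P\left( -z_{1-\alpha/2} \le \frac{p-\hat{p}}{\sqrt{p(1-p)/n}} \le z_{1-\alpha/2} \right) \\
&= P\left( \frac{p-\hat{p}}{\sqrt{p(1-p)/n}} \le z_{1-\alpha/2} \right) - P\left( \frac{p-\hat{p}}{\sqrt{p(1-p)/n}} < -z_{1-\alpha/2} \right) \\
&= \left(1 - \frac{\alpha}{2}\right) - \left(1 - \left(1 - \frac{\alpha}{2}\right)\right) \\
&= 1 - \alpha
\end{aligned}
\]
since $\frac{p-\hat{p}}{\sqrt{p(1-p)/n}}$ is, according to the Central Limit Theorem, approximately standard normally distributed.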
Remark:
A confidence interval is calculated when p is unknown, so the boundaries are computed by replacing p with the unbiased estimator p̂. This is only appropriate if n is large, and it results in an approximate confidence interval: the probability that the interval captures the parameter is approximately 1 − α.
So we use:
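Let $z_{\alpha/2}$ be the (1 − α/2) percentile of the standard normal distribution, and let np > 5 and n(1 − p) > 5. Then
\[ \left[ \hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \; ; \; \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \right] \]
is an approximate (1 − α) confidence interval for p.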
Example:
Consider flipping a coin 1000 times, with HEAD observed in only 400 of the flips. Is this a surprising number if the coin is unbiased? To answer this question, calculate a 95% confidence interval from the data and check whether 0.5 (the probability of HEAD when tossing an unbiased coin) is in the confidence interval. First check whether the conditions are met: np = n(1 − p) = 1000 · 0.5 = 500 ≥ 5. We conclude that we can apply the Central Limit Theorem and use the method described above to obtain a confidence interval.
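\[
\begin{aligned}
\left[ \hat{p} - z_{\alpha/2}\sqrt{\tfrac{\hat{p}(1-\hat{p})}{n}} \; ; \; \hat{p} + z_{\alpha/2}\sqrt{\tfrac{\hat{p}(1-\hat{p})}{n}} \right]
&= \left[ \tfrac{400}{1000} - 1.96\sqrt{\tfrac{0.4 \cdot 0.6}{1000}} \; ; \; \tfrac{400}{1000} + 1.96\sqrt{\tfrac{0.4 \cdot 0.6}{1000}} \right] \\
&= [0.4 - 0.030 \; ; \; 0.4 + 0.030] \\
&= [0.37 \; ; \; 0.43]
\end{aligned}
\]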
We can be 95% confident that the true probability of HEAD is in the interval [0.37; 0.43].
Since 0.5 is not in the interval, it seems unlikely that 0.5 is the true probability of HEAD.
Check the coin to see what makes it biased!
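The interval for the coin example can be recomputed in a few lines. This is a minimal sketch, assuming Python's standard-library `NormalDist` for the normal percentile; the function name `proportion_confidence_interval` and the choice to raise an error when the large-sample condition fails are illustrative, not part of the notes.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(x, n, confidence=0.95):
    """Approximate large-sample confidence interval for a proportion p."""
    # Large-sample condition: at least five successes and five failures.
    if x < 5 or n - x < 5:
        raise ValueError("normal approximation not adequate for this sample")
    p_hat = x / n
    # z_{alpha/2}: the (1 - alpha/2) percentile of the standard normal.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


lower, upper = proportion_confidence_interval(400, 1000)
print(f"95% CI: [{lower:.2f} ; {upper:.2f}]")
print("0.5 inside interval:", lower <= 0.5 <= upper)
```

The same function can be reused with a different `confidence` value, for example 0.99, which widens the interval.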
2.2 Choosing the Sample Size
The Margin of Error for the estimation of p is
\[ E = z_{\alpha/2}\sqrt{p(1-p)/n} \]
Choosing the sample size for estimating a proportion p follows the same argument as finding the sample size for estimating a mean µ, except that the formula is based on a different confidence interval.
Assume a probability p is to be estimated within a margin of error of E with a (1 − α)100% confidence interval. Then
\[ n \ge \left( \frac{z_{\alpha/2}}{E} \right)^2 p(1-p) \]
Since p is not known, use a guess, or use p = 0.5 as a conservative value in this formula.
Example
A poll is to be conducted to find the proportion of Canadians supporting the Liberal party within a margin of error of 3% (E = 0.03) at 95% confidence. Then
\[ n \ge \left( \frac{1.96}{0.03} \right)^2 0.5(0.5) = 1067.111 \]
A sample size of 1068 would be required to meet this goal. (This is why most polls are based on samples of a little more than 1000.)
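The sample-size formula is easy to wrap in a helper. The sketch below is illustrative only: the name `required_sample_size` and its default arguments are assumptions, and the exact normal percentile from `NormalDist` is used rather than the rounded value 1.96.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(margin, confidence=0.95, p_guess=0.5):
    """Smallest n with z_{alpha/2} * sqrt(p(1-p)/n) <= margin.

    p_guess = 0.5 is the conservative choice when p is unknown.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z / margin) ** 2 * p_guess * (1 - p_guess))


print(required_sample_size(0.03))               # poll example: 1068
print(required_sample_size(0.03, p_guess=0.2))  # smaller n if p is known to be near 0.2
```

Rounding up with `ceil` matters: rounding 1067.1 down to 1067 would give a margin of error slightly above the target.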
2.3 A Large Sample Test Concerning a Proportion p
To develop a test, we again rely on the facts we know from the CLT. The point estimator for a proportion is the sample proportion p̂. From the Central Limit Theorem we know the following about the sampling distribution of p̂:
1. $\mu_{\hat{p}} = p$
2. $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$
3. If n is large, the sampling distribution of p̂ is approximately normal.
So we get that
\[ z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \]
is approximately standard normally distributed for large sample sizes.
Using these properties, it can be proved that the following procedure is a statistical test that ensures that the probability of making a type I error is less than or equal to α.
A Large Sample Test concerning a Proportion p
1. Hypotheses:
   Test type     Hypotheses
   Upper tail    H0: p ≤ p0 versus Ha: p > p0
   Lower tail    H0: p ≥ p0 versus Ha: p < p0
   Two tail      H0: p = p0 versus Ha: p ≠ p0
   Choose α.
2. Assumption: Random sample, and the sample size n is large, that is, np̂ > 5 and n(1 − p̂) > 5.
3. Test statistic: Let p0 be a value between zero and one and define the test statistic
\[ z_0 = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \]
4. p-value and Rejection Region:
   Test type     p-value                  Rejection Region
   Upper tail    P(z > z0)                z0 > zα
   Lower tail    P(z < z0)                z0 < −zα
   Two tail      2 · P(z > abs(z0))       abs(z0) > zα/2
   where zα is the (1 − α) percentile of the standard normal distribution.
5. Decision:
If P-value ≤ α or z0 falls in the rejection region, then reject H0.
If P-value > α or z0 does not fall in the rejection region, then do not reject H0.
6. Context.
Example:
Suppose that you want to show that the proportion of adults above 40 who are participating
in fitness activities is below 0.2.
1. So you want to test (putting what you want to show into the alternative hypothesis Ha)
H0: p ≥ 0.2 vs. Ha: p < 0.2 at a significance level of α = 0.05.
2. The sample size is n = 100 and the number of people sampled who participate in those
activities equals 19, so that p̂ = 0.19, np̂ = 19 > 5 and n(1 − p̂) = 81 > 5, so the
assumptions are met (assuming the sample was randomly chosen).
3. Then
\[ z_0 = \frac{0.19 - 0.2}{\sqrt{0.2 \cdot 0.8 / 100}} = -0.25 \]
4. Now calculate the P-value. According to the choice of Ha this is a lower tail test, so the P-value is the lower tail probability: P-value = P(z < −0.25) = 0.4013 (from table II).
5. Decision: Since P-value = 0.4013 > 0.05 = α, H0 is not rejected.
6. Context: At a significance level of 5% the sample data do not provide sufficient evidence that less than 20% of adults 40 and older take part in fitness activities.
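The test procedure and the fitness example can be sketched in code as follows. The function name `one_proportion_z_test` and the string labels for the three test types are illustrative choices, and `NormalDist` from the standard library stands in for table II.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(x, n, p0, alternative):
    """Large-sample z test for a proportion.

    alternative: "upper" (Ha: p > p0), "lower" (Ha: p < p0)
                 or "two-sided" (Ha: p != p0).
    Returns the test statistic z0 and the p-value.
    """
    p_hat = x / n
    z0 = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    cdf = NormalDist().cdf
    if alternative == "upper":
        p_value = 1 - cdf(z0)
    elif alternative == "lower":
        p_value = cdf(z0)
    elif alternative == "two-sided":
        p_value = 2 * (1 - cdf(abs(z0)))
    else:
        raise ValueError("alternative must be 'upper', 'lower' or 'two-sided'")
    return z0, p_value


alpha = 0.05
z0, p_value = one_proportion_z_test(19, 100, p0=0.2, alternative="lower")
print(f"z0 = {z0:.2f}, p-value = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "do not reject H0")
```

The decision is made from the p-value; comparing z0 with the critical value −zα would give the same answer.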
2.4 Estimating the Difference between Two Population Proportions
Instead of comparing two population means, let's now compare two population proportions.
Assume you want to compare
• the rate of people who play computer games in the age groups of 20 to 30 and 30 to 40
• the proportion of defective items manufactured in two production lines
The statistic that comes to mind for estimating the difference between two population proportions is the difference in the sample proportions (p̂1 − p̂2). Let us study the sampling distribution of this statistic to construct a confidence interval.
Properties of the Sampling Distribution of the Difference between two Sample Proportions (p̂1 − p̂2)
Consider that you have two independent samples of sizes n1 and n2 from binomial populations
with parameters p1 and p2 , respectively.
The sampling distribution of (p̂1 − p̂2 ) has these properties:
1. The mean of (p̂1 − p̂2) is
\[ \mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2 \]
and the standard error is
\[ SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \]
which is estimated by
\[ \widehat{SE} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]
2. The sampling distribution of (p̂1 − p̂2) is approximately normal when the sample sizes n1 and n2 are large, that is, when
n1 p1 > 5 and n1(1 − p1) > 5 and n2 p2 > 5 and n2(1 − p2) > 5
These results now lead to the description of the estimation of (p1 − p2 ).
Large Sample Point Estimation of (p1 − p2)
Point estimate: (p̂1 − p̂2)
Margin of error: $z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
Large Sample (1 − α)100% Confidence Interval for (p1 − p2)
\[ (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \]
For this we have to assume again that n1 and n2 are large, that is,
n1 p1, n1(1 − p1), n2 p2, n2(1 − p2) are all greater than 5.
To apply the tools described above, note that p1 and p2, the population proportions, are unknown. In order to use the above procedures, we have to replace the population proportions by their estimates p̂1 and p̂2.
So you will estimate the margin of error (for 95% confidence) by
\[ \pm 1.96\,\widehat{SE} = \pm 1.96\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]
and use the following
Approximate Large Sample (1 − α)100% Confidence Interval for (p1 − p2 )
\[ (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]
For this we have to assume again that n1 and n2 are large, that is,
n1 p1, n1(1 − p1), n2 p2, n2(1 − p2) are all greater than 5.
Example:
Suppose we want to compare two therapies. The criterion for the comparison is the probability of surviving at least 5 years after therapy.
The study produced the following data:
             Population 1   Population 2
n            100            80
x            90             70
p̂ = x/n      0.9            0.875
That is, 90 out of 100 patients who underwent therapy 1 survived at least 5 years.
If we use p̂1 as the estimate for p1 and p̂2 as the estimate for p2, we find that n1 p1, n1(1 − p1), n2 p2, n2(1 − p2) are all greater than 5. So we can use the formula from above for calculating a 95% confidence interval for p1 − p2.
\[ (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} = 0.025 \pm 1.96\sqrt{\frac{0.9(0.1)}{100} + \frac{0.875(0.125)}{80}} = 0.025 \pm 0.093 \]
or [−0.068 ; 0.118]. Since 0 is captured in this interval, the data do not provide evidence that the two therapies result in different probabilities of surviving 5 years.
They may be different, but the data do not show it.
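The therapy comparison can be reproduced with a short function. This is a sketch under the same large-sample assumptions as above; the name `two_proportion_confidence_interval` is hypothetical, and the percentile comes from the standard-library `NormalDist` rather than a table.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_confidence_interval(x1, n1, x2, n2, confidence=0.95):
    """Approximate large-sample confidence interval for p1 - p2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Estimated standard error of the difference (unpooled).
    se_hat = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1_hat - p2_hat
    return diff - z * se_hat, diff + z * se_hat


lower, upper = two_proportion_confidence_interval(90, 100, 70, 80)
print(f"95% CI for p1 - p2: [{lower:.3f} ; {upper:.3f}]")
print("0 inside interval:", lower <= 0 <= upper)
```

An interval that contains 0, as here, means the data are compatible with no difference between the two survival probabilities.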
2.5 Statistical Test for Two Population Proportions p1 and p2
Notation:
               population proportion   sample size   sample proportion
population 1   p1                      n1            p̂1
population 2   p2                      n2            p̂2
Large-Sample z Test for comparing p1 and p2
• Hypotheses
   Test type     Hypotheses
   Upper tail    H0: p1 − p2 ≤ 0 versus Ha: p1 − p2 > 0
   Lower tail    H0: p1 − p2 ≥ 0 versus Ha: p1 − p2 < 0
   Two tail      H0: p1 − p2 = 0 versus Ha: p1 − p2 ≠ 0
• Assumption: Random samples, and both sample sizes are large:
   n1 p̂1 > 5, n1(1 − p̂1) > 5, n2 p̂2 > 5, n2(1 − p̂2) > 5
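• Test statistic:
\[ z_0 = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{\hat{p}_c(1-\hat{p}_c)}{n_1} + \frac{\hat{p}_c(1-\hat{p}_c)}{n_2}}} \]
where p̂c = (x1 + x2)/(n1 + n2) is the combined (pooled) sample proportion, with x1 and x2 the numbers of successes in the two samples.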
• P-value and Rejection Region:
   Test type     P-value                   Rejection Region
   Upper tail    P(z > z0)                 z0 > zα
   Lower tail    P(z < z0)                 z0 < −zα
   Two tail      2 · P(z < −abs(z0))       abs(z0) > zα/2
• Decision
• Context
Example:
Determine whether the proportions of red M&M's in the plain and peanut varieties differ at a significance level of 0.05.
The sample:
                      Plain (1)   Peanut (2)
Sample size           56          32
Number of red M&Ms    12          8
This results in p̂1 = 12/56 = 0.214, p̂2 = 8/32 = 0.25, and p̂c = (12 + 8)/(56 + 32) = 20/88 = 0.227.
1. The question asks for a test of H0: p1 − p2 = 0 vs. Ha: p1 − p2 ≠ 0, with α = 0.05.
2. Assumption: Since p̂1 n1, (1 − p̂1)n1, p̂2 n2, (1 − p̂2)n2 are all greater than 5, the assumptions are met and the test will deliver a reliable result.
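3. Test statistic:
\[ z_0 = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{\hat{p}_c(1-\hat{p}_c)}{n_1} + \frac{\hat{p}_c(1-\hat{p}_c)}{n_2}}} = \frac{0.214 - 0.25}{\sqrt{\frac{0.227(0.773)}{56} + \frac{0.227(0.773)}{32}}} = \frac{-0.036}{\sqrt{0.0031 + 0.0055}} = -0.3882 \]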
4. Rejection region: With α = 0.05 the rejection region for a two-tailed test is abs(z0) > zα/2 = 1.96. Here abs(z0) = 0.3882 < 1.96, so z0 does not fall in the rejection region.
Or, using the p-value:
two-tailed p-value = 2P(z > abs(z0)) = 2P(z > 0.3882) = 2(1 − 0.6517) = 0.6966
5. Decision: Since the P-value is greater than α = 0.05, do not reject H0 at a significance level of 0.05.
6. Context: At a significance level of 5% we do not have enough evidence that the proportion of red M&M's differs between the plain and peanut varieties.
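The M&M's test can be rerun in code as a check. The sketch below is illustrative: the function name `two_proportion_z_test` and the `alternative` labels are assumptions, and because no intermediate values are rounded, z0 and the p-value differ slightly from the hand calculation (z0 is about −0.38 and the p-value about 0.70).

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2, alternative="two-sided"):
    """Large-sample z test for H0: p1 - p2 = 0 using the pooled proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_c = (x1 + x2) / (n1 + n2)  # combined (pooled) sample proportion
    se = sqrt(p_c * (1 - p_c) / n1 + p_c * (1 - p_c) / n2)
    z0 = (p1_hat - p2_hat) / se
    cdf = NormalDist().cdf
    if alternative == "upper":
        p_value = 1 - cdf(z0)
    elif alternative == "lower":
        p_value = cdf(z0)
    elif alternative == "two-sided":
        p_value = 2 * cdf(-abs(z0))
    else:
        raise ValueError("alternative must be 'upper', 'lower' or 'two-sided'")
    return z0, p_value


alpha = 0.05
z0, p_value = two_proportion_z_test(12, 56, 8, 32)
print(f"z0 = {z0:.2f}, two-tailed p-value = {p_value:.2f}")
print("reject H0" if p_value <= alpha else "do not reject H0")
```

The conclusion is the same as in the worked example: the p-value is far above 0.05, so H0 is not rejected.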