```Chapter 11
Comparing Two Populations
11.1 Inferences concerning the difference between two population means using
independent samples
Many investigations are carried out for the purpose of comparing two populations. For
example,
Population 1 = {All nonsmokers}.
$\mu_1$ = the mean life expectancy of nonsmokers.
Population 2 = {All smokers}.
$\mu_2$ = the mean life expectancy of smokers.
A medical researcher may be interested in whether $\mu_1 > \mu_2$.

Notation
               Mean value   Variance       Standard deviation
Population 1   $\mu_1$      $\sigma_1^2$   $\sigma_1$
Population 2   $\mu_2$      $\sigma_2^2$   $\sigma_2$

                           Sample size   Sample mean   Sample variance   Sample standard deviation
Sample from population 1   $n_1$         $\bar{x}_1$   $s_1^2$           $s_1$
Sample from population 2   $n_2$         $\bar{x}_2$   $s_2^2$           $s_2$
Comparison of means focuses on the difference $\mu_1 - \mu_2$.
$\mu_1 - \mu_2 = 0$ is equivalent to $\mu_1 = \mu_2$
$\mu_1 - \mu_2 > 0$ is equivalent to $\mu_1 > \mu_2$
$\mu_1 - \mu_2 < 0$ is equivalent to $\mu_1 < \mu_2$
Definition 11.1 Two samples are said to be independent if the selection of the individuals
or objects that make up one sample does not influence the selection of those in the other
sample.
Since $\bar{x}_1$ provides an estimate of $\mu_1$ and $\bar{x}_2$ gives an estimate of $\mu_2$, it is natural to use $\bar{x}_1 - \bar{x}_2$ as a point estimate of $\mu_1 - \mu_2$. Our inferential methods will be based on information
about the sampling distribution of $\bar{x}_1 - \bar{x}_2$.

Properties of the sampling distribution of $\bar{x}_1 - \bar{x}_2$
If the random samples on which $\bar{x}_1$ and $\bar{x}_2$ are based are selected independently, then
1. The mean value of $\bar{x}_1 - \bar{x}_2$ is
$\mu_{\bar{x}_1 - \bar{x}_2} = \mu_{\bar{x}_1} - \mu_{\bar{x}_2} = \mu_1 - \mu_2$.
That is, the sampling distribution of $\bar{x}_1 - \bar{x}_2$ is always centered at the value of $\mu_1 - \mu_2$,
so $\bar{x}_1 - \bar{x}_2$ is an unbiased statistic for estimating $\mu_1 - \mu_2$.
2. The variance of $\bar{x}_1 - \bar{x}_2$ is
$\sigma^2_{\bar{x}_1 - \bar{x}_2} = \sigma^2_{\bar{x}_1} + \sigma^2_{\bar{x}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$.
The standard deviation of $\bar{x}_1 - \bar{x}_2$ is
$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$.
3. If both population distributions are normal, $\bar{x}_1$ and $\bar{x}_2$ each has a normal distribution
and $\bar{x}_1 - \bar{x}_2$ also has a normal distribution. Thus,
$z = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
has the standard normal (z) distribution.
4. If $n_1$ and $n_2$ are both large ($n_1 \ge 30$ and $n_2 \ge 30$), $\bar{x}_1$ and $\bar{x}_2$ each has approximately a
normal distribution and $\bar{x}_1 - \bar{x}_2$ has approximately a normal distribution even when
each population distribution is not itself normal. Thus,
$z = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
has approximately the standard normal (z) distribution.
Note: Properties 1 and 2 follow from the following general results:
1) The mean value of a difference equals the difference of the two individual mean
values.
2) The variance of a difference of independent quantities is the sum of the two
individual variances.
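Generally, $\sigma_1^2$ and $\sigma_2^2$ are unknown, so we must estimate them by the corresponding sample
variances, $s_1^2$ and $s_2^2$, and use
$t = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$.

The sampling distribution of $t = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$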
If (1) the two random samples are independently selected, and
(2) the population distributions are normal or $n_1$ and $n_2$ are both large ($n_1 \ge 30$ and $n_2 \ge 30$),
then the standardized variable
$t = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
has (approximately) a t distribution with degrees of freedom
$df = \frac{(V_1 + V_2)^2}{\frac{V_1^2}{n_1 - 1} + \frac{V_2^2}{n_2 - 1}}$, where $V_1 = \frac{s_1^2}{n_1}$ and $V_2 = \frac{s_2^2}{n_2}$.
df should be truncated (rounded down) to an integer.

Summary of the two-sample t test for comparing two population means
Null hypothesis: $H_0: \mu_1 - \mu_2 = \text{hypothesized value}$.
Test statistic: $t = \frac{\bar{x}_1 - \bar{x}_2 - \text{hypothesized value}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$.
The appropriate df for the two-sample t test is
$df = \frac{(V_1 + V_2)^2}{\frac{V_1^2}{n_1 - 1} + \frac{V_2^2}{n_2 - 1}}$, where $V_1 = \frac{s_1^2}{n_1}$ and $V_2 = \frac{s_2^2}{n_2}$.
df should be truncated (rounded down) to an integer.
Alternative hypothesis: $H_a: \mu_1 - \mu_2 > \text{hypothesized value}$ (upper-tailed test)
P-value: area under the appropriate t curve to the right of the computed t
Alternative hypothesis: $H_a: \mu_1 - \mu_2 < \text{hypothesized value}$ (lower-tailed test)
P-value: area under the appropriate t curve to the left of the computed t
Alternative hypothesis: $H_a: \mu_1 - \mu_2 \ne \text{hypothesized value}$ (two-tailed test)
P-value: (i) 2(area to the right of the computed t) if t is positive
         (ii) 2(area to the left of the computed t) if t is negative
Assumptions:
1. The two samples are independently selected random samples.
2. $n_1$ and $n_2$ are both large ($n_1 \ge 30$ and $n_2 \ge 30$) or the population distributions are at least
approximately normal.

The two-sample t confidence interval for the difference between two population
means
When
1. the two samples are independent random samples, and
2. the sample sizes are both large (generally $n_1 \ge 30$ and $n_2 \ge 30$) OR the population
distributions are approximately normal
the general formula for a confidence interval for $\mu_1 - \mu_2$ is
$(\bar{x}_1 - \bar{x}_2) \pm (\text{t critical value}) \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$.
The t critical value is based on
$df = \frac{(V_1 + V_2)^2}{\frac{V_1^2}{n_1 - 1} + \frac{V_2^2}{n_2 - 1}}$, where $V_1 = \frac{s_1^2}{n_1}$ and $V_2 = \frac{s_2^2}{n_2}$.
df should be truncated (rounded down) to an integer. The t critical values for the usual
confidence levels are given in Appendix Table 3 on page 708.
Example 11.1 In a study comparing the salaries of workers with college degrees and
workers without college degrees the following summary data resulted from independent
random samples on both types of workers.
Group               Sample Size   Sample Mean Salary   Sample Standard Deviation
College Degree      100           \$32,500             \$4,100
No College Degree   75            \$25,800             \$5,600
(a) Is there sufficient evidence to conclude that college graduates earn at least \$5,000
more than workers not having a college degree, on the average? Use $\alpha = 0.05$.
(b) Estimate the difference in mean salaries for workers having a college degree and
workers not having a college degree using a 95% confidence interval.
(c) What is your conclusion based on the confidence interval in part (b)?
(a)
1. Population characteristics of interest:
$\mu_1$ = true mean salary of workers having a college degree.
$\mu_2$ = true mean salary of workers not having a college degree.
$\mu_1 - \mu_2$ = difference in mean salaries.
2. Null hypothesis: $H_0: \mu_1 - \mu_2 =$ ?
3. Alternative hypothesis: $H_a: \mu_1 - \mu_2 >$ ?
4. Significance level: $\alpha = 0.05$
5. Test statistic:
$t = \frac{\bar{x}_1 - \bar{x}_2 - \text{hypothesized value}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{\bar{x}_1 - \bar{x}_2 - 5{,}000}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
6. Assumptions: This test requires two independent random samples and two large
sample sizes. The given samples were independent random samples, and the
sample sizes $n_1 = 100$ and $n_2 = 75$ are both at least 30, so the two-sample t test is appropriate.
7. Computations: $n_1 = 100$, $\bar{x}_1 = 32{,}500$, $s_1 = 4{,}100$, $n_2 = 75$, $\bar{x}_2 = 25{,}800$, and $s_2 = 5{,}600$
$t = \frac{(32{,}500 - 25{,}800) - 5{,}000}{\sqrt{\frac{(4{,}100)^2}{100} + \frac{(5{,}600)^2}{75}}} = \frac{1{,}700}{765.6588} \approx 2.22$.
8. P-value: We first compute the df for the two-sample t test:
$V_1 = \frac{s_1^2}{n_1} = \frac{4{,}100^2}{100} = 168{,}100$, $V_2 = \frac{s_2^2}{n_2} = \frac{5{,}600^2}{75} = 418{,}133.3333$,
$df = \frac{(V_1 + V_2)^2}{\frac{V_1^2}{n_1 - 1} + \frac{V_2^2}{n_2 - 1}} = \frac{(168{,}100 + 418{,}133.3333)^2}{\frac{168{,}100^2}{100 - 1} + \frac{418{,}133.3333^2}{75 - 1}} \approx 129.781$.
We truncate df to 129.
P-value = area under the t curve with df = 129 to the right of 2.22
$\approx 1 - P(z < 2.22) = 0.0132$.
9. Conclusion: Since P-value = 0.0132 < 0.05 = $\alpha$, $H_0$ is rejected at the 0.05 level.
There is sufficient evidence to conclude that college graduates earn at least \$5,000
more than workers not having a college degree on the average.
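(b) A 95% confidence interval for $\mu_1 - \mu_2$ is
$(\bar{x}_1 - \bar{x}_2) \pm (\text{t critical value}) \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
$= (32{,}500 - 25{,}800) \pm 1.96 \sqrt{\frac{(4{,}100)^2}{100} + \frac{(5{,}600)^2}{75}}$
$= 6{,}700 \pm 1.96(765.6588) = 6{,}700 \pm 1{,}500.69 = (5{,}199.31,\ 8{,}200.69)$.
With df = 129 the t critical value is close to the z critical value 1.96, which is the value used here.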
Question: What is the interpretation of the confidence interval?
(c) Since the entire interval is above 5,000, we can conclude that college graduates earn
at least \$5,000 more than workers not having a college degree on the average, the
same conclusion as in part (a).
11.2 Large-sample inferences concerning a difference between two population
proportions
Many investigations are carried out to compare the proportion of successes in one
population to the proportion of successes in a second population.

Notation
Population 1: Proportion of “successes” = $\pi_1$
Population 2: Proportion of “successes” = $\pi_2$

                           Sample size   Sample proportion of successes
Sample from population 1   $n_1$         $p_1$
Sample from population 2   $n_2$         $p_2$
When comparing the “success” proportions of two populations, it is common to focus on
the quantity $\pi_1 - \pi_2$, the difference between the two proportions. Since $p_1$ provides an
estimate of $\pi_1$ and $p_2$ provides an estimate of $\pi_2$, the obvious choice for an estimate of $\pi_1 - \pi_2$ is $p_1 - p_2$. Since the statistic $p_1 - p_2$ will be the basis for drawing inferences about $\pi_1 - \pi_2$,
we introduce some properties of its sampling distribution first.

Properties of the sampling distribution of $p_1 - p_2$
If two random samples are selected independently, the following properties hold:
1. $\mu_{p_1 - p_2} = \pi_1 - \pi_2$.
That is, the sampling distribution of $p_1 - p_2$ is centered at $\pi_1 - \pi_2$, so $p_1 - p_2$ is an unbiased
statistic for estimating $\pi_1 - \pi_2$.
2. $\sigma^2_{p_1 - p_2} = \sigma^2_{p_1} + \sigma^2_{p_2} = \frac{\pi_1(1 - \pi_1)}{n_1} + \frac{\pi_2(1 - \pi_2)}{n_2}$ and
$\sigma_{p_1 - p_2} = \sqrt{\frac{\pi_1(1 - \pi_1)}{n_1} + \frac{\pi_2(1 - \pi_2)}{n_2}}$.
3. If both $n_1$ and $n_2$ are large (that is, $n_1\pi_1 \ge 10$, $n_1(1 - \pi_1) \ge 10$, $n_2\pi_2 \ge 10$, and $n_2(1 - \pi_2) \ge 10$),
then $p_1$ and $p_2$ each has approximately a normal distribution, and $p_1 - p_2$ also has
approximately a normal distribution.
Thus, the standardized variable
$z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\frac{\pi_1(1 - \pi_1)}{n_1} + \frac{\pi_2(1 - \pi_2)}{n_2}}}$
has approximately the standard normal (z) distribution.

A large-sample test procedure
For comparisons of $\pi_1$ and $\pi_2$, the most general null hypothesis of interest has the form
$H_0: \pi_1 - \pi_2 = \text{hypothesized value}$
Since $H_0: \pi_1 - \pi_2 = 0$, which is equivalent to $\pi_1 = \pi_2$, is almost always the relevant one in
applied problems, we will focus exclusively on it.
Let $\pi$ denote the common value of the two population proportions. Then the z variable
obtained by standardizing $p_1 - p_2$ becomes
$z = \frac{p_1 - p_2}{\sqrt{\frac{\pi(1 - \pi)}{n_1} + \frac{\pi(1 - \pi)}{n_2}}}$.
Unfortunately, this cannot serve as a test statistic, because $\pi$ is unknown and thus the
denominator cannot be computed. A test statistic can be obtained by first estimating $\pi$
from the sample data and then using this estimate in the denominator of z.
When $\pi_1 = \pi_2$, either $p_1$ or $p_2$ separately gives an estimate of the common proportion $\pi$.
However, a better estimate than either of these is a weighted average of the two.
Definition 11.2 The combined estimate of the common population proportion is
$p_c = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$
= (total number of S’s in two samples) / (total sample size).
The test statistic for testing $H_0: \pi_1 - \pi_2 = 0$ results from using $p_c$ in place of $\pi$ in the
standardized variable z. When $H_0$ is true,
$z = \frac{p_1 - p_2}{\sqrt{\frac{p_c(1 - p_c)}{n_1} + \frac{p_c(1 - p_c)}{n_2}}}$
has approximately the standard normal distribution. Thus we have the following test procedure.

Summary of large-sample z tests for $\pi_1 - \pi_2 = 0$
Null hypothesis: $H_0: \pi_1 - \pi_2 = 0$
Test statistic: $z = \frac{p_1 - p_2}{\sqrt{\frac{p_c(1 - p_c)}{n_1} + \frac{p_c(1 - p_c)}{n_2}}}$
Alternative hypothesis: $H_a: \pi_1 - \pi_2 > 0$ (upper-tailed test)
P-value: area under the z curve to the right of the computed z
Alternative hypothesis: $H_a: \pi_1 - \pi_2 < 0$ (lower-tailed test)
P-value: area under the z curve to the left of the computed z
Alternative hypothesis: $H_a: \pi_1 - \pi_2 \ne 0$ (two-tailed test)
P-value: (i) 2(area to the right of the computed z) if z is positive
         (ii) 2(area to the left of the computed z) if z is negative
Assumptions:
1. The samples are independent random samples.
2. Both sample sizes are large: $n_1 p_1 \ge 10$, $n_1(1 - p_1) \ge 10$, $n_2 p_2 \ge 10$, and $n_2(1 - p_2) \ge 10$.

A confidence interval
A large-sample confidence interval for $\pi_1 - \pi_2$ is a special case of the general z interval
formula
point estimate $\pm$ (z critical value)(estimated standard deviation).
The statistic $p_1 - p_2$ gives a point estimate of $\pi_1 - \pi_2$, and the standard deviation of this
statistic is
$\sigma_{p_1 - p_2} = \sqrt{\frac{\pi_1(1 - \pi_1)}{n_1} + \frac{\pi_2(1 - \pi_2)}{n_2}}$.
An estimated standard deviation is obtained by using the sample proportions $p_1$ and $p_2$ in
place of $\pi_1$ and $\pi_2$, respectively.

A large-sample confidence interval for $\pi_1 - \pi_2$
When
1. the samples are independent random samples, and
2. both sample sizes are large: $n_1 p_1 \ge 10$, $n_1(1 - p_1) \ge 10$, $n_2 p_2 \ge 10$, and $n_2(1 - p_2) \ge 10$,
a large-sample confidence interval for $\pi_1 - \pi_2$ is
$(p_1 - p_2) \pm (\text{z critical value}) \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$.
Example 11.2 Independent samples of the opinions of registered Republican and
Democratic party members were taken concerning the weakening of the Endangered
Species Act (ESA). The results are listed in the table below.
Party Affiliation   For Weakening ESA   Against Weakening ESA
Republican          38                  62
Democrat            26                  74
Let $\pi_1$ denote the proportion of Republicans who are in favor of weakening the
Endangered Species Act and $\pi_2$ denote the proportion of Democrats who are in favor of
weakening the Endangered Species Act.
(a) Give point estimates of $\pi_1$, $\pi_2$, and $\pi_1 - \pi_2$.
(b) Is there sufficient evidence to conclude that a higher proportion of Republicans would
like to weaken the Endangered Species Act? Use $\alpha = 0.05$.
(c) Compute a 95% confidence interval for $\pi_1 - \pi_2$.
(d) What is your conclusion based on the confidence interval in part (c)?
(a) The point estimates for $\pi_1$, $\pi_2$, and $\pi_1 - \pi_2$ are
$p_1 = 38 / (38 + 62) = 0.38$, $p_2 = 26 / (26 + 74) = 0.26$, and $p_1 - p_2 = 0.38 - 0.26 = 0.12$.
(b)
1. Population characteristics of interest:
$\pi_1$ = the proportion of Republicans who are in favor of weakening the Endangered
Species Act.
$\pi_2$ = the proportion of Democrats who are in favor of weakening the Endangered
Species Act.
$\pi_1 - \pi_2$ = the difference between the proportions of Republicans and Democrats who
are in favor of weakening the Endangered Species Act.
2. Null hypothesis: $H_0: \pi_1 - \pi_2 = 0$
3. Alternative hypothesis: $H_a: \pi_1 - \pi_2 > 0$
4. Significance level: $\alpha = 0.05$
5. Test statistic:
$z = \frac{p_1 - p_2}{\sqrt{\frac{p_c(1 - p_c)}{n_1} + \frac{p_c(1 - p_c)}{n_2}}}$
6. Assumptions: This test requires two independent random samples and two large
sample sizes. The given samples were two independent random samples with sample
sizes of $n_1 = 38 + 62 = 100$ and $n_2 = 26 + 74 = 100$. Since $n_1 p_1 = 38 \ge 10$, $n_1(1 - p_1) = 62 \ge 10$,
$n_2 p_2 = 26 \ge 10$, and $n_2(1 - p_2) = 74 \ge 10$, the large-sample z test for $\pi_1 - \pi_2 = 0$
is appropriate.
7. Calculations: $n_1 = n_2 = 100$, $p_1 = 0.38$, $p_2 = 0.26$
$p_c = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{100(0.38) + 100(0.26)}{100 + 100} = 0.32$
$z = \frac{0.38 - 0.26}{\sqrt{\frac{0.32(1 - 0.32)}{100} + \frac{0.32(1 - 0.32)}{100}}} = \frac{0.12}{0.066} \approx 1.82$
8. P-value:
P-value = the area under the z curve to the right of 1.82 = 1- 0.9656 = 0.0344.
9. Conclusion: Since P-value = 0.0344 < 0.05 = $\alpha$, $H_0$ is rejected.
There is sufficient evidence to conclude that a higher proportion of Republicans would
like to weaken the Endangered Species Act.
(c) A 95% confidence interval for $\pi_1 - \pi_2$ is
$(p_1 - p_2) \pm (\text{z critical value}) \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$
$= (0.38 - 0.26) \pm (1.96) \sqrt{\frac{(0.38)(1 - 0.38)}{100} + \frac{(0.26)(1 - 0.26)}{100}}$
$= 0.12 \pm 1.96(0.0654) = 0.12 \pm 0.1282$
$= (-0.0082,\ 0.2482)$
Question: What is the interpretation of the confidence interval?
(d) Since 0 is in this interval, there is not sufficient evidence to conclude that a higher
proportion of Republicans would like to weaken the Endangered Species Act.
Question: Why does the conclusion based on the confidence interval differ from the one
based on the hypothesis test?
```
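The two-sample t computations of Section 11.1 can be checked numerically. The following is a minimal sketch in Python, not part of the original notes. The function name `two_sample_t_from_summary` and its parameters are hypothetical, chosen only for illustration. As in Example 11.1, the P-value is approximated with the standard normal curve rather than the t curve, and the 95% interval uses the z critical value, which is an assumption that is reasonable only because the df is large.

```python
import math
from statistics import NormalDist


def two_sample_t_from_summary(n1, xbar1, s1, n2, xbar2, s2, hypothesized=0.0):
    """Return (t statistic, truncated df, standard error) from summary data.

    Hypothetical helper for illustration; the name is not from the notes.
    """
    v1 = s1 ** 2 / n1  # V1 = s1^2 / n1
    v2 = s2 ** 2 / n2  # V2 = s2^2 / n2
    se = math.sqrt(v1 + v2)  # estimated standard deviation of xbar1 - xbar2
    t_stat = (xbar1 - xbar2 - hypothesized) / se
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, math.floor(df), se  # df is truncated (rounded down)


# Summary data from Example 11.1, testing a hypothesized difference of 5,000
t_stat, df, se = two_sample_t_from_summary(
    100, 32500, 4100, 75, 25800, 5600, hypothesized=5000
)

# Assumption: with df this large, the t curve is approximated by the z curve
p_upper = 1 - NormalDist().cdf(t_stat)

# 95% interval, again using the z critical value as an approximation
z_crit = NormalDist().inv_cdf(0.975)
diff = 32500 - 25800
ci = (diff - z_crit * se, diff + z_crit * se)

print(f"t = {t_stat:.2f}, df = {df}, approximate P-value = {p_upper:.4f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Run on the Example 11.1 data, this should agree with the hand computation (t of about 2.22 and df truncated to 129). The interval endpoints can differ from the worked example in the second decimal place because the code does not round the critical value to 1.96.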
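The large-sample procedures of Section 11.2 can be sketched the same way. The Python code below is an illustration only; the function names `pooled_z_test` and `difference_ci` are hypothetical and do not come from the notes. It uses the combined estimate $p_c$ in the test statistic and the separate sample proportions in the confidence interval, following the formulas given in the transcript.

```python
import math
from statistics import NormalDist


def pooled_z_test(x1, n1, x2, n2):
    """Return (z, p_c) for testing H0: pi1 - pi2 = 0 from success counts.

    Hypothetical helper for illustration; the name is not from the notes.
    """
    p1, p2 = x1 / n1, x2 / n2
    pc = (x1 + x2) / (n1 + n2)  # combined estimate of the common proportion
    se = math.sqrt(pc * (1 - pc) / n1 + pc * (1 - pc) / n2)
    return (p1 - p2) / se, pc


def difference_ci(x1, n1, x2, n2, level=0.95):
    """Large-sample confidence interval for pi1 - pi2 (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return (p1 - p2 - z_crit * se, p1 - p2 + z_crit * se)


# Counts from Example 11.2: 38 of 100 Republicans, 26 of 100 Democrats in favor
z, pc = pooled_z_test(38, 100, 26, 100)
p_value = 1 - NormalDist().cdf(z)  # upper-tailed test
low, high = difference_ci(38, 100, 26, 100)

print(f"p_c = {pc:.2f}, z = {z:.2f}, P-value = {p_value:.4f}")
print(f"95% CI: ({low:.4f}, {high:.4f})")
```

Computing the test and the interval side by side makes it easy to compare the two standard errors and the one-tailed versus two-sided critical values when working on the closing question of Example 11.2.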