Dr. Neal, WKU
MATH 183
Estimate for Difference in Means
Consider a measurement $X$ on two distinct populations: $\Omega_1$, where $X$ has mean $\mu_1$ and
variance $\sigma_1^2$, and $\Omega_2$, where $X$ has mean $\mu_2$ and variance $\sigma_2^2$. Our goal is to estimate
the difference of averages $\mu_1 - \mu_2$. To do so, we obtain a random sample of size $n_1$ from
$\Omega_1$ and let $\bar{x}_1$ be its sample mean and $S_1$ be its sample deviation. Then we obtain an
independent random sample of size $n_2$ from population $\Omega_2$ and let $\bar{x}_2$ be its sample
mean and $S_2$ be its sample deviation.
Ultimately we want to conclude, with a certain level of confidence, whether one
average is significantly higher than the other average, or whether there is no significant
difference in the averages.
With “large” populations and non-trivial measurements, it is almost certainly the
case that the averages $\mu_1$ and $\mu_2$ will be different. However, they may be close enough
to each other so that a confidence interval for the difference $\mu_1 - \mu_2$ contains the
number 0 and has a small margin of error. In this case, we might say that there is “no
statistical difference” in the means.
Average and Variance of a Difference
Let $X$ and $Y$ denote random measurements on a population and let $W = X - Y$ denote
all random differences in these measurements. The important facts that we need are:
I. The “average of the difference” equals the “difference of the averages.” That is,
$$\mu_W = \mu_X - \mu_Y.$$
II. When measurements $X$ and $Y$ are obtained independently, then the variance of the
difference is the sum of the variances. That is,
$$\sigma_W^2 = \sigma_X^2 + \sigma_Y^2.$$
III. The previous results apply to all possible differences in sample means $\bar{x}_1 - \bar{x}_2$ from
independent samples on $\Omega_1$ and $\Omega_2$. We know that all possible values of $\bar{x}_1$ have mean
$\mu_1$ and standard deviation $\sigma_1/\sqrt{n_1}$; so their variance is $\sigma_1^2/n_1$. Likewise, all possible
values of $\bar{x}_2$ have mean $\mu_2$ and variance $\sigma_2^2/n_2$. Now let $W = \bar{x}_1 - \bar{x}_2$ be all possible
differences in sample means. Then
$$\mu_W = \mu_1 - \mu_2, \qquad \sigma_W^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}, \qquad \sigma_W = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}.$$
Moreover, for large sample sizes $n_1$ and $n_2$, $W$ is approximately normally distributed.
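IV. Because $W \approx N(\mu_W, \sigma_W)$, with probability $r$, the values of $W = \bar{x}_1 - \bar{x}_2$ will be
within
$$\mu_W \pm z_{\alpha/2}\,\sigma_W,$$
where $z_{\alpha/2}$ is the appropriate $z$-score. That is, with probability $r = 1 - \alpha$ the values of
$\bar{x}_1 - \bar{x}_2$ will be within
$$(\mu_1 - \mu_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}.$$
But then with probability $r = 1 - \alpha$, the values of $\mu_1 - \mu_2$ will be within
$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}.$$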
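Facts III and IV can be checked numerically. The following is a minimal simulation sketch, not part of the original notes: it assumes normally distributed measurements, and the population values, sample sizes, trial count, and the helper name `simulate_differences` are all hypothetical choices made for illustration.

```python
import math
import random
import statistics

def simulate_differences(mu1, sigma1, n1, mu2, sigma2, n2, trials=4000, seed=1):
    """Draw many pairs of independent samples; return every xbar1 - xbar2."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        xbar1 = statistics.fmean([rng.gauss(mu1, sigma1) for _ in range(n1)])
        xbar2 = statistics.fmean([rng.gauss(mu2, sigma2) for _ in range(n2)])
        diffs.append(xbar1 - xbar2)
    return diffs

# Hypothetical population parameters and sample sizes (illustration only).
mu1, sigma1, n1 = 10.0, 3.0, 40
mu2, sigma2, n2 = 8.0, 4.0, 50

diffs = simulate_differences(mu1, sigma1, n1, mu2, sigma2, n2)

# Theoretical values from fact III.
theory_mean = mu1 - mu2
theory_sd = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)

# Simulated counterparts, plus the share of differences within
# mu_W +/- 1.96 * sigma_W (fact IV with r = 0.95).
sim_mean = statistics.fmean(diffs)
sim_sd = statistics.stdev(diffs)
coverage = sum(abs(d - theory_mean) <= 1.96 * theory_sd for d in diffs) / len(diffs)

print("mean of W:", round(sim_mean, 3), "vs theory", theory_mean)
print("sd of W:  ", round(sim_sd, 3), "vs theory", round(theory_sd, 3))
print("share within 1.96 sd:", round(coverage, 3))
```

The simulated mean and deviation should land close to $\mu_1 - \mu_2$ and $\sigma_W$, and roughly 95% of the simulated differences should fall within $1.96\,\sigma_W$ of $\mu_1 - \mu_2$.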
Confidence Interval for Difference in Means
From the previous result, we obtain the confidence interval formula for the difference in
means:
$$\mu_1 - \mu_2 \approx (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
(2-SampZInt on Calc.)
Arbitrary Measurements: Need large samples.
Normal Measurements: Any sample sizes work.
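With large samples, we may replace $\sigma_1$ and $\sigma_2$ with estimates or upper bounds:
$$\mu_1 - \mu_2 \approx (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$$
or
$$\mu_1 - \mu_2 \approx (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{U_1^2}{n_1} + \frac{U_2^2}{n_2}}$$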
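As a minimal sketch of the interval formula, the following Python function works from summary statistics. The function name `two_sample_z_interval` and the numbers in the usage line are hypothetical and not from the notes; the $z$-score comes from the standard library's `statistics.NormalDist` rather than a table.

```python
import math
from statistics import NormalDist

def two_sample_z_interval(xbar1, s1, n1, xbar2, s2, n2, confidence=0.95):
    """Large-sample z interval for mu1 - mu2 (hypothetical helper).

    s1 and s2 may be true deviations, sample deviations, or upper bounds.
    """
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)          # z_{alpha/2}
    margin = z * math.sqrt(s1**2 / n1 + s2**2 / n2)  # margin of error
    center = xbar1 - xbar2
    return center - margin, center + margin

# Made-up summary statistics, for illustration only.
low, high = two_sample_z_interval(52.0, 8.0, 100, 50.0, 6.0, 100, confidence=0.95)
print(round(low, 2), round(high, 2))
```

If the resulting interval contains 0, the two averages are not statistically distinguishable at that confidence level.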
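• If we disregard the statistics from the second population, then this formula reduces to the
confidence interval for a single mean $\mu_1$. It becomes
$$\mu_1 \approx \bar{x}_1 \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1}} = \bar{x} \pm \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}}.$$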
• To account for smaller populations of sizes $N_1$ and $N_2$, we could obtain a smaller
margin of error by using
$$\mu_1 - \mu_2 \approx (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1}\left(\frac{N_1 - n_1}{N_1 - 1}\right) + \frac{\sigma_2^2}{n_2}\left(\frac{N_2 - n_2}{N_2 - 1}\right)}$$
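To see how much the small-population correction can shrink the margin of error, here is a short sketch. The helper names `fpc` and `margin_of_error`, the population sizes, and the deviations are hypothetical values chosen for illustration only.

```python
import math

def fpc(N, n):
    """Small-population correction factor (N - n) / (N - 1)."""
    return (N - n) / (N - 1)

def margin_of_error(z, s1, n1, s2, n2, N1=None, N2=None):
    """Margin of error for mu1 - mu2; a correction is applied only when N is given."""
    v1 = s1**2 / n1 * (fpc(N1, n1) if N1 is not None else 1.0)
    v2 = s2**2 / n2 * (fpc(N2, n2) if N2 is not None else 1.0)
    return z * math.sqrt(v1 + v2)

# Hypothetical setup: samples of 50 drawn from populations of 200 and 300.
plain = margin_of_error(1.96, 10.0, 50, 12.0, 50)
corrected = margin_of_error(1.96, 10.0, 50, 12.0, 50, N1=200, N2=300)
print(round(plain, 3), round(corrected, 3))
```

Each correction factor is below 1 whenever the sample size exceeds 1, so the corrected margin is never larger than the uncorrected one.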
There are the same usual problems with this formula as with the general confidence
interval for the mean. Namely,
(i) For arbitrary measurements, we need large sample sizes to obtain accuracy and
small margins of error.
(ii) The use of the $z$-score comes from an approximate standard normal distribution
applied to all possible differences in sample means $W = \bar{x}_1 - \bar{x}_2$.
(iii) The true standard deviations are often unknown.
As with the confidence interval for the mean, we can improve the accuracy and
overcome these problems when sampling from normally distributed measurements.
Example 1. The tensile strength (in pounds per square inch) is being measured on two
different manufactures of a synthetic fiber. For a sample of 35 fibers having 15% cotton,
the sample statistics were $\bar{x}_1 = 9.8$ with $S_1 = 3.5$. For an independent sample of 30 fibers
having 35% cotton, the statistics were $\bar{x}_2 = 10.1$ with $S_2 = 3.4$.
Find a 95% confidence interval for the difference in average tensile strength among
all such manufactures of 15% cotton fibers and 35% cotton fibers. Explain the interval in
words.
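Solution. (By hand)
$$\mu_1 - \mu_2 \approx (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} = (9.8 - 10.1) \pm 1.96\sqrt{\frac{3.5^2}{35} + \frac{3.4^2}{30}} \approx -0.3 \pm 1.68;$$
or $-1.98 \le \mu_1 - \mu_2 \le 1.38$.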
That is, the average tensile strength of the 15% cotton fibers is from 1.98 psi lower to
1.38 psi higher than that of the 35% cotton fibers. With this interval, $\mu_1 - \mu_2$ could equal
0; so the average tensile strengths could be equal. From this data alone, there is not a
statistically significant difference in the average tensile strength, as evidenced by the
closeness of $\bar{x}_1 = 9.8$ and $\bar{x}_2 = 10.1$.
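The hand computation in Example 1 can be double-checked with a few lines of Python. This is only an arithmetic check using the summary statistics stated in the example and the rounded $z$-score 1.96; the variable names are arbitrary.

```python
import math

# Summary statistics from Example 1.
xbar1, s1, n1 = 9.8, 3.5, 35    # 15% cotton fibers
xbar2, s2, n2 = 10.1, 3.4, 30   # 35% cotton fibers
z = 1.96                        # z-score for 95% confidence

standard_error = math.sqrt(s1**2 / n1 + s2**2 / n2)
margin = z * standard_error
low = (xbar1 - xbar2) - margin
high = (xbar1 - xbar2) + margin

print(f"{low:.2f} <= mu1 - mu2 <= {high:.2f}")
print("interval contains 0:", low <= 0 <= high)
```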
This type of confidence interval also can be computed with the 2–SampZInt feature
(item 9) from the STAT TESTS menu. Set the Inpt to Stats, then enter the variables
(using the sample deviations for $\sigma_1$ and $\sigma_2$) and calculate.
Example 2. We wish to see if there is any apparent difference in high school grade point
average between girls and boys in the state who choose to go to college. The following
data is a random collection of high school GPAs from a group of Kentucky high school
graduates in their first year of college. Find a 90% confidence interval for the difference
between average female and average male grade point average. Explain the interval in
words.
Random Collection of Female High School GPAs
3.25  3.76  2.78  3.32  3.05  3.08
4.00  3.68  3.05  3.26  3.46  3.70
3.00  3.24  3.52  3.80  3.06  2.84
4.00  3.15  3.75  3.56  3.44  3.78
3.25  3.33  2.74  3.62  3.28  3.14

Random Collection of Male High School GPAs
3.75  4.00  3.76  2.32  2.90  3.00
4.00  2.12  3.50  2.11  3.72  3.04
3.42  4.00  2.38  2.49  2.92  2.48
2.50  3.76  2.18  3.68  4.00  2.78
2.48  2.72  2.34  3.04  3.78  4.00
Solution. Because the data sets are of the same size, we can compute the statistics of both
samples at once with the 2–Var Stats command. Enter the data into lists L1 (female)
and L2 (male) in the STAT Edit screen. Then enter the command 2–Var Stats L1, L2.
After computing the statistics, we see that the average female GPA is $\bar{x}_1 = 3.363$,
with a sample deviation of $S_1 \approx 0.3474$. The average male GPA is $\bar{x}_2 = 3.105667$ with
$S_2 \approx 0.67254$. (In each case there are 30 measurements.)
Note: If the data sets are of different sizes, then the 2–Var Stats command will not
work. In this case, simply execute the 1–Var Stats command on each list and note the
values of $n$, $\bar{x}$, and $S$ for each sample.
Now to find a 90% confidence interval for the true difference in average GPA:
$$\mu_1 - \mu_2 \approx (3.363 - 3.105667) \pm 1.645\sqrt{\frac{0.3474^2}{30} + \frac{0.67254^2}{30}} \approx 0.257333 \pm 0.227343,$$
or $0.03 \le \mu_1 - \mu_2 \le 0.485$.
Thus we can assert, with 90% confidence, that among all college-bound students in the
state, the average female GPA is from 0.03 grade points higher to 0.485 grade points
higher than the average male GPA.
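The same interval can be reproduced from the raw data. This is a hedged sketch using Python's standard library in place of the calculator; the list names are hypothetical, and the split of the GPA values into two lists of 30 is an assumption, chosen because it reproduces the means and deviations reported in the solution.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Assumed split of the listed GPAs: 30 female values, 30 male values.
female = [3.25, 3.76, 2.78, 3.32, 3.05, 3.08, 4.00, 3.68, 3.05, 3.26,
          3.46, 3.70, 3.00, 3.24, 3.52, 3.80, 3.06, 2.84, 4.00, 3.15,
          3.75, 3.56, 3.44, 3.78, 3.25, 3.33, 2.74, 3.62, 3.28, 3.14]
male = [3.75, 4.00, 3.76, 2.32, 2.90, 3.00, 4.00, 2.12, 3.50, 2.11,
        3.72, 3.04, 3.42, 4.00, 2.38, 2.49, 2.92, 2.48, 2.50, 3.76,
        2.18, 3.68, 4.00, 2.78, 2.48, 2.72, 2.34, 3.04, 3.78, 4.00]

xbar1, s1, n1 = mean(female), stdev(female), len(female)
xbar2, s2, n2 = mean(male), stdev(male), len(male)

# 90% confidence leaves 5% in each tail, so z is the 95th percentile.
z = NormalDist().inv_cdf(0.95)
margin = z * sqrt(s1**2 / n1 + s2**2 / n2)
low = (xbar1 - xbar2) - margin
high = (xbar1 - xbar2) + margin

print(f"female: mean={xbar1:.3f}, S={s1:.4f}")
print(f"male:   mean={xbar2:.6f}, S={s2:.5f}")
print(f"{low:.3f} <= mu1 - mu2 <= {high:.3f}")
```

Here `stdev` is the sample deviation (divisor $n - 1$), which is the $S$ value the calculator reports.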
Difference in Proportions
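We also can apply the formula to the special case of a difference in proportions $p_1 - p_2$.
Recall that for a proportion $p$, the population variance is $\sigma^2 = p(1 - p)$, which is
estimated by $S^2 = \hat{p}(1 - \hat{p})$, where $\hat{p}$ is the sample proportion. So for large
populations, a confidence interval for the difference in proportions is
$$p_1 - p_2 \approx (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$
(2–PropZInt on Calc.)
or, using the upper bound $p(1 - p) \le 0.25$, use
$$p_1 - p_2 \approx (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{0.25}{n_1} + \frac{0.25}{n_2}}$$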
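The constant 0.25 in the second form can be read as an upper bound on $p(1 - p)$, playing the same role as the upper bounds $U_1^2$ and $U_2^2$ in the difference-in-means formula. A short check by completing the square, added here as a supporting step rather than taken from the notes:

```latex
p(1 - p) \;=\; \frac{1}{4} - \left(p - \frac{1}{2}\right)^{2} \;\le\; \frac{1}{4} \;=\; 0.25
\qquad \text{for all } 0 \le p \le 1 .
```

Equality holds only at $p = 1/2$, so the second interval is never narrower than the first.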
• When one or more population sizes is small, then we could include its small
population correction factor to decrease the margin of error. For two small populations,
we have
$$p_1 - p_2 \approx (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1}\left(\frac{N_1 - n_1}{N_1 - 1}\right) + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}\left(\frac{N_2 - n_2}{N_2 - 1}\right)}$$
or
$$p_1 - p_2 \approx (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{0.25}{n_1}\left(\frac{N_1 - n_1}{N_1 - 1}\right) + \frac{0.25}{n_2}\left(\frac{N_2 - n_2}{N_2 - 1}\right)}$$
Example 3. A poll commissioned by the Center on Addiction and Substance Abuse at
Columbia University found that 1340 of 2000 adults and 304 out of 400 youths
interviewed believe that popular culture encourages drug use. Find a 95% confidence
interval for the true difference in proportions between adults and youths who have this
belief. Explain the interval in words.
Solution. By hand: Using the confidence interval for two “large” populations, we have
$\hat{p}_1 = \frac{1340}{2000} = 0.67$ (for adults), $\hat{p}_2 = \frac{304}{400} = 0.76$ (for youths), and then
$$p_1 - p_2 \approx (0.67 - 0.76) \pm 1.96\sqrt{\frac{(0.67)(0.33)}{2000} + \frac{(0.76)(0.24)}{400}} \approx -0.09 \pm 0.04665.$$
Or,
$$p_1 - p_2 \approx (0.67 - 0.76) \pm 1.96\sqrt{\frac{0.25}{2000} + \frac{0.25}{400}} \approx -0.09 \pm 0.0537.$$
Using the first interval $-0.09 \pm 0.04665$, we have $-0.13665 \le p_1 - p_2 \le -0.04335$, or
equivalently $0.04335 \le p_2 - p_1 \le 0.13665$. So the proportion of all youths who believe
that popular culture encourages drug use is from about 4.3 percentage points higher to
about 13.7 percentage points higher than that of adults.
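Both margins of error in Example 3 can be checked with a short sketch. The helper name `two_prop_margins` is hypothetical; the counts are those stated in the example, and the rounded $z$-score 1.96 is used as in the hand computation.

```python
import math

def two_prop_margins(x1, n1, x2, n2, z):
    """Return (difference, plug-in margin, conservative margin) for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    plug_in = z * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    conservative = z * math.sqrt(0.25 / n1 + 0.25 / n2)  # uses p(1 - p) <= 0.25
    return p1 - p2, plug_in, conservative

# Counts from Example 3: adults first, youths second.
diff, m_plug, m_cons = two_prop_margins(1340, 2000, 304, 400, z=1.96)
print(f"{diff:.2f} +/- {m_plug:.5f}")
print(f"{diff:.2f} +/- {m_cons:.4f}")
```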
Note: Do not say that the proportion is from 4.3% higher to 13.7% higher. The
difference in proportions always should be expressed in terms of percentage points.
The first type of confidence interval for a difference in (large population)
proportions also can be found with the 2–PropZInt feature (item B) from the STAT
TESTS screen.
Practice Exercises
1. A bank is offering two different credit card plans. Sample groups are being
monitored to see if there is a significant difference in the average amounts the groups
charge per quarter. The summary statistics for one quarter are below.
Group     n     x̄           S
Plan A    152   $1987.54    $392.68
Plan B    148   $2056.89    $413.12
Find a 95% confidence interval for the difference in the average amounts charged this
quarter among all people in the two groups. Explain the interval in words.
2. A nationwide survey found that 260 out of 1040 people under age 30 regularly
download music using peer-to-peer services. But only 78 out of 975 people at least 30
years old regularly do so. Find a 98% confidence interval for the difference in the true
proportions of those who download music in these age groups. Explain the interval in
words.
Answers:
1. The 2–SampZInt gives −160.6 ≤ µ1 − µ2 ≤ 21.901. The average amount charged per
quarter by Plan A customers is from $160.60 lower to $21.90 higher than the average
amount charged by Plan B customers.
2. The 2–PropZInt gives 0.13279 ≤ p1 − p2 ≤ 0.20721. The proportion of people under
30 who download music using peer-to-peer is from 13.28 pct. points higher to 20.72 pct.
points higher than the proportion of people 30 or older who do so.
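For readers without the calculator, the two answers can be reproduced with the sketch below. It is not the calculator's implementation; it applies the large-sample formulas from these notes, with a hypothetical helper `z_score` that takes the $z$-value from `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def z_score(confidence):
    """Two-sided z-score z_{alpha/2} for the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Exercise 1: Plan A minus Plan B, 95% confidence.
se_means = math.sqrt(392.68**2 / 152 + 413.12**2 / 148)
d_means = 1987.54 - 2056.89
m_means = z_score(0.95) * se_means
ex1 = (d_means - m_means, d_means + m_means)

# Exercise 2: under 30 minus 30 or older, 98% confidence.
p1, p2 = 260 / 1040, 78 / 975
se_props = math.sqrt(p1 * (1 - p1) / 1040 + p2 * (1 - p2) / 975)
m_props = z_score(0.98) * se_props
ex2 = (p1 - p2 - m_props, p1 - p2 + m_props)

print([round(v, 1) for v in ex1])
print([round(v, 4) for v in ex2])
```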