Topic 8 - Comparing two samples
• Confidence intervals/hypothesis tests for two means
• Hypothesis test for two variances
Comparing two populations
• Sometimes we want to compare two populations rather than
making decisions about a single population.
• For example, we might want to compare two population
means or two population proportions to see if they are
equal.
– Is the expected drying time for one type of paint lower than
that of another type of paint?
– Is a new drug more effective? This could mean an increased or
decreased mean versus the “established” drug, or an increased or
decreased percentage versus a control.
– Does the new method actually result in increased crop yields
or percentages, or a decrease in tons lost to insects, etc.?
Behind the scenes. What do the distributions
look like?
Comparing two population means
• Suppose we have two independent samples, $X_1, \ldots, X_m$ and
$Y_1, \ldots, Y_n$, from two separate populations.
• A natural statistic for comparing the two population
means, $\mu_X$ and $\mu_Y$, is $\bar{X} - \bar{Y}$.
• $E(\bar{X} - \bar{Y}) = E(\bar{X}) - E(\bar{Y}) = \mu_X - \mu_Y$ (from Chapter 5)
• $\mathrm{Var}(\bar{X} - \bar{Y}) = \mathrm{Var}(\bar{X}) + \mathrm{Var}(\bar{Y}) = \frac{\sigma_X^2}{m} + \frac{\sigma_Y^2}{n}$
• The distribution of $\bar{X} - \bar{Y}$ is also Normal for m and n both
large.
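A quick simulation can make the mean and variance formulas concrete. The sketch below is not part of the original slides: the sample sizes, population means, and standard deviations are hypothetical values chosen only for illustration.

```python
import random
import statistics

def simulate_diff_of_means(m, n, mu_x, mu_y, sd_x, sd_y, reps=4000, seed=1):
    """Draw `reps` pairs of independent Normal samples and return the
    simulated values of X-bar minus Y-bar."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        x_bar = statistics.fmean(rng.gauss(mu_x, sd_x) for _ in range(m))
        y_bar = statistics.fmean(rng.gauss(mu_y, sd_y) for _ in range(n))
        diffs.append(x_bar - y_bar)
    return diffs

# Hypothetical populations (not from the slides).
m, n = 30, 40
mu_x, mu_y, sd_x, sd_y = 10.0, 8.0, 2.0, 3.0
diffs = simulate_diff_of_means(m, n, mu_x, mu_y, sd_x, sd_y)

simulated_mean = statistics.fmean(diffs)
simulated_var = statistics.variance(diffs)
theory_mean = mu_x - mu_y                   # mu_X - mu_Y
theory_var = sd_x**2 / m + sd_y**2 / n      # sigma_X^2/m + sigma_Y^2/n

print(f"mean of differences: {simulated_mean:.3f} (theory {theory_mean:.3f})")
print(f"variance of differences: {simulated_var:.4f} (theory {theory_var:.4f})")
```

The simulated mean and variance should land close to the theoretical values, with the gap shrinking as `reps` grows.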
Large samples test for comparing population means
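To test $H_0: \mu_X - \mu_Y = D_0$, use the test statistic
$$Z = \frac{\bar{X} - \bar{Y} - D_0}{\sqrt{s_X^2/m + s_Y^2/n}}$$
$H_A$                      Reject $H_0$ if
$\mu_X - \mu_Y < D_0$      $Z < -z_\alpha$
$\mu_X - \mu_Y > D_0$      $Z > z_\alpha$
$\mu_X - \mu_Y \neq D_0$   $|Z| > z_{\alpha/2}$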
Home sales data
A realtor in Albuquerque wants to argue that houses in the
Northeast are more expensive on average than those in the rest of
town.
NE = 0 indicates a home was not in the Northeast.
Test the appropriate hypotheses with α = 0.01.
This is what the StatCrunch data looks like.
Summary statistics for PRICE:
Group by: NE
NE   n    Mean      Variance        Std. Dev.   Std. Err.   Median
0    39   97,282    1,026,531,010   32,040      5,130       94,000
1    78   110,769   1,612,360,830   40,154      4,547       98,500
Here’s the output in StatCrunch
Hypothesis test results:
μ1 : mean of PRICE where NE=1 (Std. Dev. not specified)
μ2 : mean of PRICE where NE=0 (Std. Dev. not specified)
μ1 - μ2 : mean difference
H0 : μ1 - μ2 = 0
HA : μ1 - μ2 > 0
Difference   n1   n2   Sample Mean   Std. Err.   Z-Stat     P-value
μ1 - μ2      78   39   13487.18      6855.115    1.967462   0.0246
What does it look like?
$$Z_{test} = \frac{110769 - 97282}{\sqrt{\frac{32040^2}{39} + \frac{40154^2}{78}}} = \frac{13487}{6855.1147} = 1.967$$
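The same calculation can be reproduced from the StatCrunch summary statistics. This is a minimal sketch, not StatCrunch's own code; the function name `two_sample_z` is hypothetical, and X is taken to be the Northeast homes (NE = 1) and Y the rest of town (NE = 0).

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean_x, mean_y, var_x, var_y, m, n, d0=0.0):
    """Large-sample Z statistic for H0: mu_X - mu_Y = d0, plus its standard error."""
    std_err = sqrt(var_x / m + var_y / n)
    return (mean_x - mean_y - d0) / std_err, std_err

# Summary statistics from the table above (means, variances, sample sizes).
z, std_err = two_sample_z(110769, 97282, 1612360830, 1026531010, m=78, n=39)

# Upper-tail p-value for HA: mu_X - mu_Y > 0.
p_value = 1 - NormalDist().cdf(z)
alpha = 0.01

print(f"Z = {z:.3f}, std. err. = {std_err:.1f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Since the p-value of about 0.0246 is larger than α = 0.01, the sketch reports that H0 is not rejected at that level.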
Large samples confidence interval for the
difference between two population means
• A large sample $(1-\alpha)100\%$ confidence interval for $\mu_X - \mu_Y$ is
$$\bar{X} - \bar{Y} \pm z_{\alpha/2}\sqrt{s_X^2/m + s_Y^2/n}$$
• For the home sales data, what is a 99% confidence interval for
the difference between sale prices in the Northeast and the rest
of town?
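One way to work this interval out is sketched below, using the summary statistics from the home sales table. The function name `large_sample_ci` is hypothetical, and X is again assumed to be the Northeast group.

```python
from math import sqrt
from statistics import NormalDist

def large_sample_ci(mean_x, mean_y, var_x, var_y, m, n, conf=0.99):
    """Large-sample confidence interval for mu_X - mu_Y."""
    z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    half_width = z_crit * sqrt(var_x / m + var_y / n)
    diff = mean_x - mean_y
    return diff - half_width, diff + half_width

lower, upper = large_sample_ci(110769, 97282, 1612360830, 1026531010, m=78, n=39)
print(f"99% CI for mu_X - mu_Y: ({lower:.0f}, {upper:.0f})")
```

The interval comes out at roughly (-4,171, 31,145). It contains 0, which agrees with failing to reject H0 at α = 0.01.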
Equal population variances
• Suppose we assume that the two populations have a
common variance $\sigma^2$.
• $\mathrm{Var}(\bar{X} - \bar{Y}) = \frac{\sigma^2}{m} + \frac{\sigma^2}{n} = \sigma^2\left(\frac{1}{m} + \frac{1}{n}\right)$
• We can then estimate this common variance using the
pooled sample variance:
$$s_p^2 = \frac{(m-1)s_X^2 + (n-1)s_Y^2}{n+m-2}$$
Small samples test for comparing population means
from Normal distributions with equal variances
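To test $H_0: \mu_X - \mu_Y = D_0$, use the test statistic
$$T = \frac{\bar{X} - \bar{Y} - D_0}{s_p\sqrt{1/m + 1/n}}$$
$H_A$                      Reject $H_0$ if
$\mu_X - \mu_Y < D_0$      $T < -t_{\alpha,n+m-2}$
$\mu_X - \mu_Y > D_0$      $T > t_{\alpha,n+m-2}$
$\mu_X - \mu_Y \neq D_0$   $|T| > t_{\alpha/2,n+m-2}$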
THC example with equal variances
The active component in marijuana is THC. An experiment was
conducted to compare two slightly different configurations of
this substance.
The THC data set contains the time until the effect was
perceived for 6 subjects exposed to each configuration.
Is there any evidence that the mean time to perception is
different between the two configurations using α = 0.01?
Here’s what the calculations look like.
Summary statistics:
Column   n   Mean        Variance    Std. Dev.   Std. Err.
THC1     6   18.786667   34.908108   5.908309    2.412057
THC2     6   18.011667   19.519497   4.418088    1.803677
Pooled standard deviation
$$s_p^2 = \frac{(6-1)34.9081 + (6-1)19.5195}{6+6-2} = 27.2138, \qquad s_p = 5.216685$$
$$s_p\sqrt{\frac{1}{m} + \frac{1}{n}} = 5.216685\sqrt{\frac{1}{6} + \frac{1}{6}} = 3.01185$$
What does it look like?
$$T_{test} = \frac{18.78667 - 18.01167}{3.01185} = 0.2573$$
p-value = 2 × (1-tail p-value) = 2 × 0.4011 = 0.8022
Twice the one tail value.
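The pooled calculation can be checked with a few lines of code. This is a sketch only: the function name `pooled_t` is hypothetical, and the critical value is an assumption read from a standard t table rather than computed, because the Python standard library has no t distribution.

```python
from math import sqrt

def pooled_t(mean_x, mean_y, var_x, var_y, m, n, d0=0.0):
    """Pooled two-sample T statistic, pooled SD, and degrees of freedom."""
    sp2 = ((m - 1) * var_x + (n - 1) * var_y) / (m + n - 2)   # pooled variance
    sp = sqrt(sp2)
    t_stat = (mean_x - mean_y - d0) / (sp * sqrt(1 / m + 1 / n))
    return t_stat, sp, m + n - 2

# THC1 and THC2 summary statistics from the table above.
t_stat, sp, df = pooled_t(18.786667, 18.011667, 34.908108, 19.519497, m=6, n=6)

# Assumption: t_{0.005, 10} = 3.169, taken from a t table
# (two-sided test at alpha = 0.01 with 10 degrees of freedom).
T_CRIT = 3.169

print(f"s_p = {sp:.6f}, T = {t_stat:.4f}, df = {df}")
print("reject H0" if abs(t_stat) > T_CRIT else "fail to reject H0")
```

Because |T| is far below the critical value, the sketch reports no evidence of a difference, in line with the large p-value of 0.8022.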
Small samples confidence interval for the
difference between two population means
• Assuming equal variances, a small sample $(1-\alpha)100\%$ confidence
interval for $\mu_X - \mu_Y$ is
$$\bar{X} - \bar{Y} \pm t_{\alpha/2,n+m-2}\, s_p\sqrt{1/m + 1/n}$$
• For the THC data, what is a 99% confidence interval for the mean
difference between the detection times for the two configurations?
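A possible working for the THC interval is sketched below. The function name `pooled_ci` is hypothetical, and the critical value t_{0.005,10} = 3.169 is an assumption taken from a standard t table.

```python
from math import sqrt

def pooled_ci(mean_x, mean_y, var_x, var_y, m, n, t_crit):
    """Pooled-variance confidence interval for mu_X - mu_Y.
    `t_crit` is t_{alpha/2, n+m-2}, supplied by the caller from a t table."""
    sp = sqrt(((m - 1) * var_x + (n - 1) * var_y) / (m + n - 2))
    half_width = t_crit * sp * sqrt(1 / m + 1 / n)
    diff = mean_x - mean_y
    return diff - half_width, diff + half_width

# Assumption: t_{0.005, 10} = 3.169 (99% interval, 10 degrees of freedom).
lower, upper = pooled_ci(18.786667, 18.011667, 34.908108, 19.519497,
                         m=6, n=6, t_crit=3.169)
print(f"99% CI for mu_X - mu_Y: ({lower:.2f}, {upper:.2f})")
```

The interval is roughly (-8.77, 10.32). It is wide and comfortably covers 0, as the test result suggested.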
Unequal population variances
• The pooled procedures we have discussed previously are
fairly robust to departures from the assumption of equal variances.
• In other words if the two population variances are
relatively close, the procedures perform well:
– The level of significance for the hypothesis test is close
to what it should be
– The coverage probability for the confidence interval is
close to what it should be
• If the variances are quite different, then we need a
different procedure.
Small samples test for comparing population means from
Normal distributions with unequal variances
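To test $H_0: \mu_X - \mu_Y = D_0$, use the test statistic
$$T = \frac{\bar{X} - \bar{Y} - D_0}{\sqrt{s_X^2/m + s_Y^2/n}}$$
with degrees of freedom
$$\nu = \frac{(s_X^2/m + s_Y^2/n)^2}{\dfrac{(s_X^2/m)^2}{m-1} + \dfrac{(s_Y^2/n)^2}{n-1}}$$
$H_A$                      Reject $H_0$ if
$\mu_X - \mu_Y < D_0$      $T < -t_{\alpha,\nu}$
$\mu_X - \mu_Y > D_0$      $T > t_{\alpha,\nu}$
$\mu_X - \mu_Y \neq D_0$   $|T| > t_{\alpha/2,\nu}$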
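The degrees-of-freedom formula is easy to mistype by hand, so a small sketch may help. The function name `welch_t` is hypothetical. It is applied here to the THC summary statistics purely as an illustration of the arithmetic.

```python
from math import sqrt

def welch_t(mean_x, mean_y, var_x, var_y, m, n, d0=0.0):
    """Unequal-variance T statistic and its approximate degrees of freedom."""
    a = var_x / m   # s_X^2 / m
    b = var_y / n   # s_Y^2 / n
    t_stat = (mean_x - mean_y - d0) / sqrt(a + b)
    nu = (a + b) ** 2 / (a ** 2 / (m - 1) + b ** 2 / (n - 1))
    return t_stat, nu

t_stat, nu = welch_t(18.786667, 18.011667, 34.908108, 19.519497, m=6, n=6)
print(f"T = {t_stat:.4f}, degrees of freedom = {nu:.2f}")
```

With equal sample sizes the statistic matches the pooled value of 0.2573, while the degrees of freedom fall from 10 to about 9.26.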
Small samples confidence interval for the
difference between two population means…
with unequal variances.
• Assuming unequal variances, a small sample $(1-\alpha)100\%$
confidence interval for $\mu_X - \mu_Y$ is
$$\bar{X} - \bar{Y} \pm t_{\alpha/2,\nu}\sqrt{s_X^2/m + s_Y^2/n}$$
• For the THC data, what is a 99% confidence interval for the mean
difference between the detection times for the two configurations?
Comparing two population variances
• Suppose two chemical companies can supply a raw material,
but we suspect the variability in concentration may differ
between the two.
• The standard deviation of concentration in a random sample
of 15 batches from company 1 was found to be 4.7 g/l
(variance 22.09). A sample of 21 batches from company 2
yielded a standard deviation of 5.8 g/l (variance 33.64).
• Is there sufficient evidence to conclude that the variability in
concentration differs for the two companies?
Test for comparing population variances from
Normal distributions
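To test $H_0: \sigma_X^2 = \sigma_Y^2$, use the test statistic
$$F = \frac{s_X^2}{s_Y^2}$$
$H_A$                            Reject $H_0$ if
$\sigma_X^2 > \sigma_Y^2$        $F > F_{\alpha,m-1,n-1}$
$\sigma_X^2 < \sigma_Y^2$        $F < F_{1-\alpha,m-1,n-1}$
$\sigma_X^2 \neq \sigma_Y^2$     $F > F_{\alpha/2,m-1,n-1}$ or $F < F_{1-\alpha/2,m-1,n-1}$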
Chemical example
• Is there sufficient evidence to conclude that the variability in
concentration differs for the two companies with α = 0.05?
• Demonstrate the F calculator.
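For readers without the F calculator to hand, the sketch below is a rough stand-in, not StatCrunch's method. It integrates the F density numerically with Simpson's rule to get a two-sided p-value. The helper names `f_pdf` and `f_cdf` are hypothetical, and company 1 is taken as X (m = 15) and company 2 as Y (n = 21).

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom."""
    if x <= 0:
        return 0.0   # correct at x = 0 whenever d1 > 2
    log_pdf = (math.lgamma((d1 + d2) / 2) - math.lgamma(d1 / 2) - math.lgamma(d2 / 2)
               + (d1 / 2) * math.log(d1 / d2)
               + (d1 / 2 - 1) * math.log(x)
               - ((d1 + d2) / 2) * math.log(1 + d1 * x / d2))
    return math.exp(log_pdf)

def f_cdf(x, d1, d2, steps=2000):
    """P(F <= x) by composite Simpson's rule on [0, x]; `steps` must be even."""
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3

m, n = 15, 21
f_stat = 22.09 / 33.64                           # s_X^2 / s_Y^2
lower_tail = f_cdf(f_stat, m - 1, n - 1)
p_value = 2 * min(lower_tail, 1 - lower_tail)    # two-sided alternative

print(f"F = {f_stat:.4f}, two-sided p-value = {p_value:.3f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

The p-value is well above 0.05, so this sketch finds no significant difference in variability between the two companies.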
Confidence interval for the ratio of two Normal
population variances
• A $(1-\alpha)100\%$ confidence interval for $\sigma_X^2/\sigma_Y^2$ is
$$\left( \frac{s_X^2/s_Y^2}{F_{\alpha/2,m-1,n-1}},\ \frac{s_X^2/s_Y^2}{F_{1-\alpha/2,m-1,n-1}} \right)$$
• For the chemical example, what is a 95% confidence interval for the
ratio of concentration variances?
The additional file for Topic 8 contains examples of large and small scale
tests on the differences in population means and proportions.
Paired data
• Sometimes we have a third variable that connects elements
from the X and Y samples.
• In this case, the assumption of independence between the
two samples may be violated.
• Is there any evidence that the first twin and the second twin
have different average weights among boy-boy twins?
• In this case, the twins are clearly connected by the mother.
• It might be better to base our test on the n pairwise
differences, $D_i = X_i - Y_i$.
Paired test for comparing population means
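To test $H_0: \mu_X - \mu_Y = D_0$, use the test statistic
$$T = \frac{\bar{D} - D_0}{s_D/\sqrt{n}}$$
$H_A$                      Reject $H_0$ if
$\mu_X - \mu_Y < D_0$      $T < -t_{\alpha,n-1}$
$\mu_X - \mu_Y > D_0$      $T > t_{\alpha,n-1}$
$\mu_X - \mu_Y \neq D_0$   $|T| > t_{\alpha/2,n-1}$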
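The paired statistic only needs the differences. The sketch below shows the mechanics on a tiny made-up data set. The numbers are hypothetical and are not the twins or car price data, and the function name `paired_t` is also hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y, d0=0.0):
    """Paired T statistic for H0: mu_X - mu_Y = d0, with n - 1 degrees of freedom."""
    diffs = [xi - yi for xi, yi in zip(x, y)]   # D_i = X_i - Y_i
    n = len(diffs)
    t_stat = (mean(diffs) - d0) / (stdev(diffs) / sqrt(n))
    return t_stat, n - 1

# Hypothetical paired measurements, for illustration only.
x = [10, 12, 9, 11]
y = [8, 11, 7, 10]

t_stat, df = paired_t(x, y)
print(f"T = {t_stat:.3f} with {df} degrees of freedom")
```

The resulting T is then compared with the t critical value for n - 1 degrees of freedom, exactly as in the table above.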
Twins example
• Load the Twins data from StatCrunch sample data sets.
• Is there any evidence that Twin A and Twin B have different
average weights among boy-boy twins with α = 0.1?
Additional pooled vs. paired
• Example: The article “Sex and Race Discrimination in the New
Car Showroom: A fact or Myth” (J. Consumer Affairs, 1977, pp.
107-113) reports the results of an experiment in which
individuals of different races and sexes visited 9 car dealerships
to request the best possible deal on a certain car. The actual
car prices obtained are shown below:
Summary data:
$\bar{x} = 4476.778,\quad s_x^2 = 40118.69,\quad s_x = 200.2965$
$\bar{y} = 4388.444,\quad s_y^2 = 18405.28,\quad s_y = 135.6661$
Is there sufficient evidence at α = 0.05 to conclude that the dealerships are
quoting different prices for the black woman and the white man?
The standard deviations are relatively close, so we could consider this as a pooled
test of differences, with the following results:
Two ways to look at the situation
Why did we get such poor results from our test?
The assumption in a pooled test is that there’s independence
of data. In other words, any values from the woman’s distribution
of prices are independent of values from the man’s distribution….
A valid comparison in that situation looks like this….
However, we know that’s not the case.
Prices from dealership 1 can be compared to each other (M to W), as can those from
dealership 2, etc. There’s a relationship between the prices, a
“pairing variable”. They are not independent and, when viewed
correctly, the data show something completely different…
Paired confidence interval for the difference
between two population means
• A small sample $(1-\alpha)100\%$ confidence interval for $\mu_X - \mu_Y$ is
$$\bar{D} \pm t_{\alpha/2,n-1}\, s_D/\sqrt{n}$$
• For the car price example, what is a 90% confidence interval for
the mean difference between the prices quoted to the black woman
vs. the white man?
• CarData
Comparing two population proportions
• A natural statistic for comparing the two population
proportions, $p_X$ and $p_Y$, is $\hat{p}_X - \hat{p}_Y$.
• $E(\hat{p}_X - \hat{p}_Y) = E(\hat{p}_X) - E(\hat{p}_Y) = p_X - p_Y$
• $\mathrm{Var}(\hat{p}_X - \hat{p}_Y)$ is estimated by $\frac{\hat{p}_X(1-\hat{p}_X)}{m} + \frac{\hat{p}_Y(1-\hat{p}_Y)}{n}$,
and equals $p(1-p)\left(\frac{1}{m} + \frac{1}{n}\right)$ with common $p$.
• The distribution of $\hat{p}_X - \hat{p}_Y$ is also Normal for m and n
both large.
Large samples test for comparing population
proportions
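To test $H_0: p_X - p_Y = 0$, use the test statistic
$$Z = \frac{\hat{p}_X - \hat{p}_Y - 0}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{m} + \frac{1}{n}\right)}}$$
$H_A$                  Reject $H_0$ if
$p_X - p_Y < 0$        $Z < -z_\alpha$
$p_X - p_Y > 0$        $Z > z_\alpha$
$p_X - p_Y \neq 0$     $|Z| > z_{\alpha/2}$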
Please note that the common p listed above is calculated as the total number of
successes overall in the study, divided by the total number of observations…..
Polio example
• The following table summarizes a study of the efficacy of the
Salk vaccine. (Please note that I changed the actual percentages who got polio in
this example to make the numbers MUCH more workable….don’t panic).
Treatment   Total Patients   Polio
Vaccine     2,000            30
Placebo     2,000            100
• Was the vaccine effective? Test at α = 0.05.
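A sketch of the pooled two-proportion test for the polio table follows. The function name `two_proportion_z` is hypothetical. X is taken to be the vaccine group and Y the placebo group, so "effective" corresponds to the alternative pX – pY < 0.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(successes_x, m, successes_y, n):
    """Z statistic for H0: p_X - p_Y = 0 using the common (pooled) proportion."""
    p_x = successes_x / m
    p_y = successes_y / n
    # Common p: total successes overall divided by total observations.
    p_pooled = (successes_x + successes_y) / (m + n)
    std_err = sqrt(p_pooled * (1 - p_pooled) * (1 / m + 1 / n))
    return (p_x - p_y) / std_err, p_pooled

z, p_pooled = two_proportion_z(30, 2000, 100, 2000)

# Lower-tail p-value for HA: p_X - p_Y < 0 (fewer polio cases with the vaccine).
p_value = NormalDist().cdf(z)

print(f"pooled p = {p_pooled:.4f}, Z = {z:.2f}, p-value = {p_value:.2e}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

Z is about -6.24, far beyond -z_0.05 = -1.645, so the sketch rejects H0 in favor of the vaccine being effective.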
Large samples confidence interval for the
difference between two population proportions
• A large sample $(1-\alpha)100\%$ confidence interval for $p_X - p_Y$ is
$$\hat{p}_X - \hat{p}_Y \pm z_{\alpha/2}\sqrt{\hat{p}_X(1-\hat{p}_X)/m + \hat{p}_Y(1-\hat{p}_Y)/n}$$
• For the Polio data, what is a 95% confidence interval for the
difference between the proportion who contract the disease under
each treatment?
$$(0.015 - 0.05) \pm 1.96\sqrt{[0.015(0.985)]/2000 + [0.05(0.95)]/2000} = -0.035 \pm 0.01093$$
$(-0.0459,\ -0.0241)$