Class 23: Thursday, Dec. 2nd
• Today: One-way analysis of variance, multiple
comparisons.
• Next week: Two-way analysis of variance.
• I will e-mail the final homework, Homework 9, to you this
weekend.
• All of the final project ideas look good. I have e-mailed
some of you my comments already and will e-mail the
rest of you my comments by tomorrow.
• Schedule:
– Thurs., Dec. 9th – Final class
– Mon., Dec. 13th (5 pm) – Preliminary results from final project
due
– Tues., Dec. 14th (5 pm) – Homework 9 due
– Tues., Dec. 21st (Noon) – Final project due.
Individual vs. Familywise Error Rate
• When several tests are considered
simultaneously, they constitute a family of tests.
• Individual Type I error rate: Probability for a
single test that the null hypothesis will be
rejected assuming that the null hypothesis is
true.
• Familywise Type I error rate: Probability for a
family of tests that at least one null hypothesis will
be rejected assuming that all of the null
hypotheses are true.
• When we consider a family of tests, we want to
make the familywise error rate small, say 0.05,
to protect against falsely rejecting a null
hypothesis.
Why Control the Familywise Error Rate?
• Suppose five children in a particular school got leukemia
last year. Is that a coincidence, or does the clustering of
cases suggest the presence of an environmental toxin
that caused the disease?
• Individual Type I error rate: Calculate the probability that
five children at this particular school would all get
leukemia this particular year. If this is small, say smaller
than 0.05, become alarmed.
• Familywise Type I error rate: Calculate the probability
that five children in any school would develop the same
severe disease in the same year. If this is small, say
smaller than 0.05, become alarmed.
• If we control the individual type I error rate, then we will
locate many disease “clusters” that are not caused by an
environmental toxin but are just coincidences.
Bonferroni Method
• General method for doing multiple comparisons
for any family of k tests.
• Denote the familywise Type I error rate we want by
p*, say p* = 0.05.
• Compute p-values for each individual test: p_1, ..., p_k.
• Reject the null hypothesis for the ith test if p_i ≤ p*/k.
• Guarantees that the familywise Type I error rate is at
most p*.
• Why Bonferroni works: If we do k tests and all
null hypotheses are true, then using Bonferroni
with p* = 0.05, we have probability 0.05/k of making
a Type I error on each test, so we expect to make
k*(0.05/k) = 0.05 errors in total (see the sketch below).
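
A minimal Python sketch of the Bonferroni rule; the p-values here are made up for illustration:

# Bonferroni: reject H0 for test i only if p_i <= p*/k.
p_values = [0.001, 0.013, 0.040, 0.0004]   # hypothetical p-values, one per test
p_star = 0.05                              # desired familywise error rate
k = len(p_values)

for i, p in enumerate(p_values, start=1):
    decision = "reject H0" if p <= p_star / k else "fail to reject H0"
    print(f"test {i}: p = {p:.4f}, cutoff = {p_star / k:.4f} -> {decision}")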
Multiplicity
• A news report says, “A 15-year study of more than
45,000 Swedish soldiers revealed that heavy users
of marijuana were six times more likely than
nonusers to develop schizophrenia.”
• Were the investigators only looking for a difference in
schizophrenia rates between heavy users and nonusers
of marijuana?
• Key question: What is their family of tests? If they
were actually looking for a difference among 100
outcomes (e.g., blood pressure, lung cancer),
Bonferroni should be used to control the familywise
Type I error rate, i.e., only consider a difference
significant if p-value is less than .05/100=.0005.
• The best way to deal with the multiple comparisons
problem is to design a new study that searches
specifically for a pattern suggested by an earlier
exploratory data analysis. Then there is only one comparison.
Bonferroni Method on Milgram’s Data
Expanded Estimates (nominal factors expanded to all levels)

Term                         Estimate   Std Error   t Ratio   Prob>|t|
Intercept                     338.25    9.067431     37.30    <.0001
Condition[Proximity]          -26.25    15.70525     -1.67    0.0966
Condition[Remote]              66.75    15.70525      4.25    <.0001
Condition[Touch-Proximity]   -70.125    15.70525     -4.47    <.0001
Condition[Voice-Feedback]     29.625    15.70525      1.89    0.0611
• If we want to test whether each of the four groups has a
mean different from the mean of all four groups, we have
four tests. Bonferroni method: Check whether p-value of
each test is <0.05/4=0.0125.
• There is strong evidence that the remote group has a
mean higher than the mean of the four groups and the
touch-proximity group has a mean lower than the mean
of the four groups.
Multiple Comparison Simulation
• In multiplecomp.JMP, 50 groups are
compared with sample sizes of ten for each
group.
• The observations for each group are
simulated from a standard normal distribution.
Thus, in fact, μ_1 = μ_2 = ... = μ_50 = 0.
• Bonferroni approach to deciding which
groups have means different from the average:
Reject the null hypothesis that a group’s mean is
the average mean of all groups only if the p-value
for the t-test is ≤ .05/50 = .001 (a Python sketch of
this simulation follows the next table).
Multiple Comparison Simulation

Iteration   # of groups with p-value < 0.05   # of groups with p-value < .0025
1           ___                               ___
2           ___                               ___
3           ___                               ___
4           ___                               ___
5           ___                               ___
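
A hedged Python sketch of this simulation; it approximates the ANOVA contrast test with a one-sample t-test of each group against the grand mean:

# Sketch of the multiplecomp.JMP simulation: 50 groups of 10 observations,
# all drawn from a standard normal, so every null hypothesis is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.standard_normal((50, 10))

grand_mean = data.mean()
pvals = np.array([stats.ttest_1samp(g, grand_mean).pvalue for g in data])

print("groups with p < 0.05:    ", int((pvals < 0.05).sum()))       # individual rate
print("groups with p < 0.05/50: ", int((pvals < 0.05 / 50).sum()))  # Bonferroni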
Pairwise Comparisons
Expanded Estimates (same JMP output as shown on the Bonferroni slide above)
• We are interested not just in which groups have
means different from the average mean,
but in pairwise comparisons between the
groups.
• For a pairwise comparison between group i and
group j, we want to test the null hypothesis that
group i and group j have the same means
versus the alternative that group i and group j
have different means, i.e., H_0: μ_i = μ_j vs. H_a: μ_i ≠ μ_j.
Pairwise Comparisons Cont.
• For Milgram’s obedience data, there are six pairwise
comparisons (in general, k groups give k(k-1)/2 pairs):
(1) Proximity vs. Remote; (2) Proximity vs. Touch-Proximity;
(3) Proximity vs. Voice-Feedback; (4) Remote vs. Touch-Proximity;
(5) Remote vs. Voice-Feedback; (6) Touch-Proximity vs. Voice-Feedback.
• Multiple comparisons situation with a family of six tests.
We want to control the familywise error rate at .05 rather
than the individual type I error rate.
• Could use Bonferroni to do this, but there is a method
called Tukey’s HSD (for “Honestly Significant
Difference”) that is specially designed to control the
familywise Type I error rate for pairwise comparisons in
ANOVA.
LSMeans Differences Tukey HSD
Alpha = 0.050, Q = 2.59695

Comparison (Mean[i] - Mean[j])     Difference   Std Err Dif   Lower CL Dif   Upper CL Dif
Remote - Proximity                   93.000      25.6466         26.397        159.603
Remote - Touch-Proximity            136.875      25.6466         70.272        203.478
Remote - Voice-Feedback              37.125      25.6466        -29.478        103.728
Voice-Feedback - Proximity           55.875      25.6466        -10.728        122.478
Voice-Feedback - Touch-Proximity     99.750      25.6466         33.147        166.353
Proximity - Touch-Proximity          43.875      25.6466        -22.728        110.478

Level                        Least Sq Mean
Remote             A         405.00000
Voice-Feedback     A B       367.87500
Proximity            B C     312.00000
Touch-Proximity        C     268.12500

Levels not connected by the same letter are significantly different.
In the JMP output, comparisons highlighted in red are those for which the null hypothesis
that the two group means are the same is rejected using the Tukey HSD procedure, which
controls the familywise Type I error rate at 0.05. The Lower CL Dif and Upper CL Dif
columns give a confidence interval for the difference in group means that adjusts for
multiple comparisons.
More on Tukey’s HSD
• Using Tukey’s HSD, the pairs for which there is strong evidence of a
difference in means, adjusting for multiple comparisons, are: Remote is
higher than Proximity, Remote is higher than Touch-Proximity, and
Voice-Feedback is higher than Touch-Proximity.
• For confidence intervals for differences in the means of each pair of
groups, if we use the usual confidence intervals, there is a good
chance that at least one of the intervals will not contain the true
difference in means between the groups.
• When making a family of confidence intervals, we want confidence
intervals that have a 95% chance of all intervals in the family
containing their true values. The confidence intervals produced by
the Tukey HSD procedure have this property.
• 95% confidence interval for difference in mean of remote group vs.
mean of proximity group using Tukey’s HSD: (26.40, 159.60).
• 95% confidence interval for the difference in mean of the remote group vs.
mean of the proximity group assuming that this is the only confidence
interval being formed (a family of one confidence interval): (42.34,
143.66). Tukey’s HSD confidence interval is wider because, in order
for every CI in a family to contain its true value when multiple CIs
are formed, each CI must be wider than it would be if only one CI
were being formed.
Tukey HSD in JMP
• Use Analyze, Fit Model to do the analysis
of variance by making the X variable the
categorical variable denoting the group.
• After Fit Model, click red triangle next to
group variable (Condition in the Milgram
study) and click LS Means Differences
Tukey HSD. Clicking LS Means
Differences Student’s t gives CIs that do
not adjust for multiple comparisons.
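
Outside JMP, Tukey’s HSD is also available in SciPy (version 1.8 or later). A minimal sketch; the four arrays below are made-up stand-ins for the 40 voltages per condition, not Milgram’s actual data:

# Hedged sketch: Tukey's HSD with scipy.stats.tukey_hsd (SciPy >= 1.8).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
remote = 405 + 64 * rng.standard_normal(40)            # stand-in samples
voice_feedback = 368 + 120 * rng.standard_normal(40)
proximity = 312 + 130 * rng.standard_normal(40)
touch_proximity = 268 + 132 * rng.standard_normal(40)

res = stats.tukey_hsd(remote, voice_feedback, proximity, touch_proximity)
print(res)                              # pairwise differences and p-values
ci = res.confidence_interval(0.95)      # simultaneous 95% CIs
print(ci.low, ci.high)                  # analogous to Lower/Upper CL Dif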
Assumptions in one-way ANOVA
• Assumptions needed for validity of one-way analysis of variance p-values and CIs:
– Linearity: automatically satisfied.
– Constant variance: Spread within each group
is the same.
– Normality: Distribution within each group is
normally distributed.
– Independence: Sample consists of
independent observations.
Rule of thumb for checking constant variance
• Constant variance: Look at standard deviation of
different groups by using Fit Y by X and clicking Means
and Std Dev.
Means and Std Deviations

Level             Number   Mean      Std Dev   Std Err Mean
Proximity         40       312.000   129.979   20.552
Remote            40       405.000    63.640   10.062
Touch-Proximity   40       268.125   131.874   20.851
Voice-Feedback    40       367.875   119.518   18.897
• Check whether (highest group standard deviation/lowest
group standard deviation)^2 is greater than 3. If greater
than 3, then constant variance is not reasonable and a
transformation should be considered. If less than 3,
then constant variance is reasonable (see the snippet below).
• (Highest group standard deviation/lowest group standard
deviation)^2 =(131.874/63.640)^2=4.29. Thus, constant
variance is not reasonable for Milgram’s data.
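
The same check as a short Python snippet, using the group standard deviations from the table above:

# Rule of thumb: (max group sd / min group sd)^2 > 3 => nonconstant variance.
sds = {"Proximity": 129.979, "Remote": 63.640,
       "Touch-Proximity": 131.874, "Voice-Feedback": 119.518}

ratio = (max(sds.values()) / min(sds.values())) ** 2
print(f"(max sd / min sd)^2 = {ratio:.2f}")   # 4.29 > 3: not reasonable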
Transformations to correct for nonconstant variance
• If the standard deviation is highest for groups with high
means, try transforming Y to log Y or √Y. If the standard
deviation is highest for groups with low means, try
transforming Y to Y^2.
Means and Std Deviations (same JMP output as on the previous slide)
• The SD is particularly low for the group with the highest mean,
so try transforming to Y^2. To make the transformation in JMP,
right-click in a new column, click New Column, then right-click
again in the created column, click Formula, and enter the
appropriate formula for the transformation (a Python analogue
is sketched below).
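
A hedged Python equivalent of the JMP formula column; df is a hypothetical DataFrame with Voltage and Condition columns, and its rows are made-up stand-ins:

# Square the response and recheck the rule-of-thumb variance ratio.
import pandas as pd

df = pd.DataFrame({                  # stand-in rows, not the real data
    "Condition": ["Remote", "Remote", "Proximity", "Proximity"],
    "Voltage":   [450.0, 375.0, 300.0, 150.0],
})

df["VoltageSq"] = df["Voltage"] ** 2                # the Y^2 transformation
sds = df.groupby("Condition")["VoltageSq"].std()    # per-group std devs
print((sds.max() / sds.min()) ** 2)                 # compare to 3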
Transformation of Milgram’s data to Squared Voltage Level
Means and Std Deviations (Voltage Squared)

Level             Number   Mean     Std Dev   Std Err Mean
Proximity         40       113816   78920.2   12478
Remote            40       167974   48541.4    7675
Touch-Proximity   40        88847   79291.3   12537
Voice-Feedback    40       149259   74053.6   11709
• Check of constant variance for transformed data:
(Highest group standard deviation/lowest group
standard deviation)^2 = 2.67. Constant variance
assumption is reasonable for voltage squared.
• Analysis of variance tests are approximately
valid for the voltage squared data; we therefore
reanalyze the data using voltage squared.
Analysis using Voltage Squared
Response: Voltage Squared

Effect Tests
Source      Nparm   DF   Sum of Squares   F Ratio   Prob > F
Condition   3       3    1.50737e11       9.8735    <.0001
Strong evidence that the group mean voltage squared levels are not all the same.
LSMeans Differences Tukey HSD
Alpha = 0.050, Q = 2.59695

Comparison (Mean[i] - Mean[j])     Difference   Std Err Dif   Lower CL Dif   Upper CL Dif
Remote - Proximity                  54157.5      15951.4        12732.6        95582.4
Remote - Touch-Proximity            79126.9      15951.4        37701.9        120552
Remote - Voice-Feedback             18714.4      15951.4       -22710.6        60139.3
Voice-Feedback - Proximity          35443.1      15951.4        -5981.8        76868.1
Voice-Feedback - Touch-Proximity    60412.5      15951.4        18987.6        101837
Proximity - Touch-Proximity         24969.4      15951.4       -16455.6        66394.3
Strong evidence that remote has higher mean voltage squared level than proximity
and touch-proximity and that voice-feedback has higher mean voltage squared level
than touch-proximity, taking into account the multiple comparisons.
Rule of Thumb for Checking Normality in ANOVA
• The normality assumption for ANOVA is that the
distribution in each group is normal. Can be checked by
looking at the boxplot, histogram and normal quantile
plot for each group.
• If there are more than 30 observations in each group,
then the normality assumption is not crucial; ANOVA
p-values and CIs will still be approximately valid even for
nonnormal data.
• If there are fewer than 30 observations per group, then we
can check normality by clicking Analyze, Distribution and
then putting the Y variable in the Y, Columns box and the
categorical variable denoting the group in the By box.
We can then create normal quantile plots for each group
and check that, for each group, the points in the normal
quantile plot lie within the confidence bands. If there is
nonnormality, we can try a transformation such as
log Y and see if the transformed data are approximately
normally distributed in each group (a scripted version of
this check is sketched below).
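
Outside JMP, a hedged sketch of the same per-group check using normal quantile (probability) plots; the group data here are simulated placeholders, not the study data:

# Hypothetical sketch: normal quantile plot for each group.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
groups = {name: rng.normal(5.0, 1.6, 14)        # stand-ins, n = 14 per group
          for name in ["AMPUTEE", "CRUTCHES", "HEARING", "NONE", "WHEELCHAIR"]}

fig, axes = plt.subplots(1, len(groups), figsize=(15, 3))
for ax, (name, y) in zip(axes, groups.items()):
    stats.probplot(y, dist="norm", plot=ax)     # points near line => normal
    ax.set_title(name)
plt.tight_layout()
plt.show()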
One-way Analysis of Variance: Steps in Analysis
1. Check assumptions (constant variance,
normality, independence). If constant variance
is violated, try transformations.
2. Use the effect test (commonly called the F-test)
to test whether all group means are the same.
3. If the effect test finds that at least two group
means differ, use Tukey’s HSD procedure to
investigate which groups are different, taking
into account the fact that multiple comparisons
are being done. (A compact sketch of these steps follows.)
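
A compact Python sketch of the three steps; the group samples are simulated placeholders:

# Hypothetical end-to-end sketch of the one-way ANOVA workflow.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
samples = [rng.normal(m, 1.0, 30) for m in (0.0, 0.0, 0.5, 1.0)]  # 4 groups

# Step 1: constant-variance rule of thumb.
sds = [s.std(ddof=1) for s in samples]
print("(max sd / min sd)^2 =", round((max(sds) / min(sds)) ** 2, 2))

# Step 2: F-test that all group means are equal.
print(stats.f_oneway(*samples))

# Step 3: if the F-test rejects, Tukey's HSD for the pairwise comparisons.
print(stats.tukey_hsd(*samples))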
Example: Discrimination against the Handicapped
• Study of how physical handicaps affect people’s
perception of employment qualifications.
• Researchers prepared five videotaped job interviews,
using the same two male actors for each. The tapes
differed only in that the applicant appeared with a
different handicap in each: (i) in a wheelchair; (ii) on
crutches; (iii) hearing impaired; (iv) one leg amputated;
(v) no handicap.
• Each tape was shown to 14 students from a U.S. university.
Students rated the qualifications of the candidate on a
0 to 10 point scale based on the tape.
• Questions of interest: Do subjects systematically
evaluate qualifications differently according to
candidate’s handicap? If so, which handicaps
produce different evaluations?
Checking Assumptions
Oneway Analysis of SCORE By HANDICAP
[Figure: side-by-side boxplots of SCORE (scale 1 to 9) by HANDICAP:
AMPUTEE, CRUTCHES, HEARING, NONE, WHEELCHAIR.]
Means and Std Deviations

Level        Number   Mean      Std Dev   Std Err Mean   Lower 95%   Upper 95%
AMPUTEE      14       4.42857   1.58572   0.42380        3.5130      5.3441
CRUTCHES     14       5.92143   1.48178   0.39602        5.0659      6.7770
HEARING      14       4.05000   1.53259   0.40960        3.1651      4.9349
NONE         14       4.90000   1.79358   0.47935        3.8644      5.9356
WHEELCHAIR   14       5.34286   1.74828   0.46725        4.3334      6.3523
Constant variance is reasonable: (largest standard deviation/smallest standard
deviation)^2 = (1.79/1.48)^2 = 1.46. There are fewer than 30 observations per group,
so we need to check normality; a check of the normal quantile plot for each
group indicates that normality is OK.
Do all videotapes have the same mean?
Response SCORE

Effect Tests
Source     Nparm   DF   Sum of Squares   F Ratio   Prob > F
HANDICAP   4       4    30.521429        2.8616    0.0301

Expanded Estimates (nominal factors expanded to all levels)

Term                   Estimate    Std Error   t Ratio   Prob>|t|
Intercept              4.9285714   0.195173     25.25    <.0001
HANDICAP[AMPUTEE]      -0.5        0.390347     -1.28    0.2048
HANDICAP[CRUTCHES]     0.9928572   0.390347      2.54    0.0134
HANDICAP[HEARING]      -0.878571   0.390347     -2.25    0.0278
HANDICAP[NONE]         -0.028571   0.390347     -0.07    0.9419
HANDICAP[WHEELCHAIR]   0.4142857   0.390347      1.06    0.2925
The test of H_0: the mean of all five videotapes is the same vs. H_A: at least two of
the videotapes have different means has p-value 0.0301. This is evidence that there
is some difference in the means of the videotapes.
How do the videotapes compare?
LSMeans Differences Tukey HSD
Alpha = 0.050, Q = 2.80582

Comparison (Mean[i] - Mean[j])   Difference   Std Err Dif   Lower CL Dif   Upper CL Dif
CRUTCHES - AMPUTEE                1.49286      0.61719       -0.23888       3.22459
CRUTCHES - HEARING                1.87143      0.61719        0.13970       3.60316
CRUTCHES - NONE                   1.02143      0.61719       -0.71030       2.75316
CRUTCHES - WHEELCHAIR             0.57857      0.61719       -1.15316       2.31030
WHEELCHAIR - AMPUTEE              0.91429      0.61719       -0.81745       2.64602
WHEELCHAIR - HEARING              1.29286      0.61719       -0.43888       3.02459
WHEELCHAIR - NONE                 0.44286      0.61719       -1.28888       2.17459
NONE - AMPUTEE                    0.47143      0.61719       -1.26030       2.20316
NONE - HEARING                    0.85000      0.61719       -0.88173       2.58173
AMPUTEE - HEARING                 0.37857      0.61719       -1.35316       2.11030
The only conclusion we can draw about how the videotapes compare, taking
account of the fact that we are making multiple comparisons, is that
CRUTCHES has a higher mean than HEARING.