Class 23: Thursday, Dec. 2nd
• Today: One-way analysis of variance, multiple comparisons.
• Next week: Two-way analysis of variance.
• I will e-mail the final homework, Homework 9, to you this weekend.
• All of the final project ideas look good. I have e-mailed some of you my comments already and will e-mail the rest of you my comments by tomorrow.
• Schedule:
– Thurs., Dec. 9th: Final class
– Mon., Dec. 13th (5 pm): Preliminary results from final project due
– Tues., Dec. 14th (5 pm): Homework 9 due
– Tues., Dec. 21st (Noon): Final project due

Individual vs. Familywise Error Rate
• When several tests are considered simultaneously, they constitute a family of tests.
• Individual Type I error rate: Probability for a single test that the null hypothesis will be rejected, assuming that the null hypothesis is true.
• Familywise Type I error rate: Probability for a family of tests that at least one null hypothesis will be rejected, assuming that all of the null hypotheses are true.
• When we consider a family of tests, we want to make the familywise error rate small, say 0.05, to protect against falsely rejecting a null hypothesis.

Why Control the Familywise Error Rate
• Five children in a particular school got leukemia last year. Is that a coincidence, or does the clustering of cases suggest the presence of an environmental toxin that caused the disease?
• Individual Type I error rate: Calculate the probability that five children at this particular school would all get leukemia this particular year. If this is small, say smaller than 0.05, become alarmed.
• Familywise Type I error rate: Calculate the probability that five children in any school would develop the same severe disease in the same year. If this is small, say smaller than 0.05, become alarmed.
• If we control only the individual Type I error rate, then we will locate many disease “clusters” that are not caused by an environmental toxin but are just coincidences.
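
To see how quickly false alarms accumulate when only the individual error rate is controlled, the short Python sketch below computes the familywise Type I error rate for a family of k tests, each run at level 0.05. It assumes the tests are independent, which the lecture does not state; the function name is hypothetical and chosen only for illustration.

```python
def familywise_error_rate(alpha: float, k: int) -> float:
    """Probability of at least one false rejection among k tests.

    Assumption (not from the lecture): the k tests are independent,
    every null hypothesis is true, and each test is run at level alpha.
    """
    return 1.0 - (1.0 - alpha) ** k


if __name__ == "__main__":
    # Each test alone has a 5% chance of a false alarm, but the chance of
    # at least one false alarm somewhere in the family grows with k.
    for k in (1, 5, 20, 100):
        rate = familywise_error_rate(0.05, k)
        print(f"k = {k:3d} tests: familywise error rate = {rate:.3f}")
```

Under this independence assumption, the familywise rate is already above 0.6 for a family of 20 tests, which is why a family-level correction is needed.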

Bonferroni Method
• General method for doing multiple comparisons for any family of k tests.
• Denote the familywise Type I error rate we want by $p^*$, say $p^* = 0.05$.
• Compute p-values for each individual test: $p_1, \ldots, p_k$.
• Reject the null hypothesis for the ith test if $p_i < p^*/k$.
• Guarantees that the familywise Type I error rate is at most $p^*$.
• Why Bonferroni works: If we do k tests and all null hypotheses are true, then using Bonferroni with $p^* = 0.05$, we have probability 0.05/k of making a Type I error on each test and expect to make k × (0.05/k) = 0.05 errors in total. The probability of making at least one error cannot exceed the expected number of errors, so the familywise error rate is at most 0.05.

Multiplicity
• A news report says, “A 15 year study of more than 45,000 Swedish soldiers revealed that heavy users of marijuana were six times more likely than nonusers to develop schizophrenia.”
• Were the investigators only looking for a difference in schizophrenia between heavy and non-heavy users of marijuana?
• Key question: What is their family of tests? If they were actually looking for a difference among 100 outcomes (e.g., blood pressure, lung cancer), Bonferroni should be used to control the familywise Type I error rate, i.e., only consider a difference significant if the p-value is less than .05/100 = .0005.
• The best way to deal with the multiple comparisons problem is to design a study to search specifically for a pattern that was suggested by an exploratory data analysis. Then there is only one comparison.

Bonferroni method on Milgram’s data
Expanded Estimates (nominal factors expanded to all levels)
Term                         Estimate   Std Error   t Ratio   Prob>|t|
Intercept                    338.25     9.067431    37.30     <.0001
Condition[Proximity]         -26.25     15.70525    -1.67     0.0966
Condition[Remote]            66.75      15.70525    4.25      <.0001
Condition[Touch-Proximity]   -70.125    15.70525    -4.47     <.0001
Condition[Voice-Feedback]    29.625     15.70525    1.89      0.0611
• If we want to test whether each of the four groups has a mean different from the mean of all four groups, we have four tests. Bonferroni method: Check whether the p-value of each test is < 0.05/4 = 0.0125.
• There is strong evidence that the remote group has a mean higher than the mean of the four groups and that the touch-proximity group has a mean lower than the mean of the four groups.

Multiple Comparison Simulation
• In multiplecomp.JMP, 50 groups are compared with sample sizes of ten for each group.
• The observations for each group are simulated from a standard normal distribution. Thus, in fact, $\mu_1 = \mu_2 = \cdots = \mu_{50} = 0$.
• Bonferroni approach to deciding which groups have means different from the average: Reject the null hypothesis that a group’s mean is the average mean of all groups only if the p-value for the t-test is less than .05/50 = .001.

Multiple Comparison Simulation
Iteration   # of groups with p-value < 0.05   # of groups with p-value < .0025
1
2
3
4
5

Pairwise Comparisons
Expanded Estimates (nominal factors expanded to all levels)
Term                         Estimate   Std Error   t Ratio   Prob>|t|
Intercept                    338.25     9.067431    37.30     <.0001
Condition[Proximity]         -26.25     15.70525    -1.67     0.0966
Condition[Remote]            66.75      15.70525    4.25      <.0001
Condition[Touch-Proximity]   -70.125    15.70525    -4.47     <.0001
Condition[Voice-Feedback]    29.625     15.70525    1.89      0.0611
• We are interested not just in which groups have means that are different from the average mean, but in pairwise comparisons between the groups.
• For a pairwise comparison between group i and group j, we want to test the null hypothesis that group i and group j have the same mean versus the alternative that group i and group j have different means, i.e., $H_0: \mu_i = \mu_j$ vs. $H_a: \mu_i \neq \mu_j$.

Pairwise Comparisons Cont.
• For Milgram’s obedience data, there are six pairwise comparisons: (1) Proximity vs. Remote; (2) Proximity vs. Touch-Proximity; (3) Proximity vs. Voice-Feedback; (4) Remote vs. Touch-Proximity; (5) Remote vs. Voice-Feedback; (6) Touch-Proximity vs. Voice-Feedback.
• This is a multiple comparisons situation with a family of six tests. We want to control the familywise error rate at .05 rather than the individual Type I error rate.
• We could use Bonferroni to do this, but there is a method called Tukey’s HSD (stands for “Honestly Significant Difference”) that is specially designed to control the familywise Type I error rate for pairwise comparisons in ANOVA.

LSMeans Differences Tukey HSD
Alpha = 0.050, Q = 2.59695
Each row gives Mean[i]-Mean[j] for one pair of levels; the comparison in the opposite direction has the same magnitude with the signs reversed.
Level i          Level j           Mean[i]-Mean[j]   Std Err Dif   Lower CL Dif   Upper CL Dif
Remote           Proximity         93                25.6466       26.3972        159.603
Remote           Touch-Proximity   136.875           25.6466       70.2722        203.478
Remote           Voice-Feedback    37.125            25.6466       -29.478        103.728
Voice-Feedback   Proximity         55.875            25.6466       -10.728        122.478
Voice-Feedback   Touch-Proximity   99.75             25.6466       33.1472        166.353
Proximity        Touch-Proximity   43.875            25.6466       -22.728        110.478

Level             Letters   Least Sq Mean
Remote            A         405.00000
Voice-Feedback    A B       367.87500
Proximity         B C       312.00000
Touch-Proximity   C         268.12500
Levels not connected by the same letter are significantly different.

Comparisons shown in red in the JMP output are those for which the null hypothesis that the group means are the same is rejected using the Tukey HSD procedure, which controls the familywise Type I error rate at 0.05. A confidence interval for the difference in group means that adjusts for multiple comparisons is given by Lower CL Dif and Upper CL Dif (the third and fourth lines of each cell in the JMP output).

More on Tukey’s HSD
• Using Tukey’s HSD, the pairs for which there is strong evidence of a difference in means, adjusting for multiple comparisons, are: remote is higher than proximity, remote is higher than touch-proximity, and voice-feedback is higher than touch-proximity.
• If we use the usual confidence intervals for the differences in the means of each pair of groups, there is a good chance that at least one of the intervals will not contain the true difference in means between the groups.
• When making a family of confidence intervals, we want confidence intervals that have a 95% chance of all intervals in the family containing their true values. The confidence intervals produced by the Tukey HSD procedure have this property.
• 95% confidence interval for the difference in mean of the remote group vs. mean of the proximity group using Tukey’s HSD: (26.40, 159.60).
• 95% confidence interval for the difference in mean of the remote group vs. mean of the proximity group assuming that this is the only confidence interval being formed (family of one confidence interval): (42.34, 143.66).
Tukey’s HSD confidence interval is wider because, for all of the CIs in a family to contain their true values simultaneously with 95% probability, each CI must be wider than it would be if only one CI were being formed.

Tukey HSD in JMP
• Use Analyze, Fit Model to do the analysis of variance by making the X variable the categorical variable denoting the group.
• After Fit Model, click the red triangle next to the group variable (Condition in the Milgram study) and click LS Means Differences Tukey HSD. Clicking LS Means Differences Student’s t gives CIs that do not adjust for multiple comparisons.

Assumptions in one-way ANOVA
• Assumptions needed for validity of one-way analysis of variance p-values and CIs:
– Linearity: automatically satisfied.
– Constant variance: Spread within each group is the same.
– Normality: The distribution within each group is normal.
– Independence: Sample consists of independent observations.

Rule of thumb for checking constant variance
• Constant variance: Look at the standard deviations of the different groups by using Fit Y by X and clicking Means and Std Dev.
Means and Std Deviations
Level             Number   Mean      Std Dev   Std Err Mean
Proximity         40       312.000   129.979   20.552
Remote            40       405.000   63.640    10.062
Touch-Proximity   40       268.125   131.874   20.851
Voice-Feedback    40       367.875   119.518   18.897
• Check whether (highest group standard deviation/lowest group standard deviation)^2 is greater than 3.
If greater than 3, then constant variance is not reasonable and a transformation should be considered. If less than 3, then constant variance is reasonable.
• (Highest group standard deviation/lowest group standard deviation)^2 = (131.874/63.640)^2 = 4.29. Thus, constant variance is not reasonable for Milgram’s data.

Transformations to correct for nonconstant variance
• If the standard deviation is highest for groups with high means, try transforming Y to $\log Y$ or $\sqrt{Y}$. If the standard deviation is highest for groups with low means, try transforming Y to $Y^2$.
Means and Std Deviations
Level             Number   Mean      Std Dev   Std Err Mean
Proximity         40       312.000   129.979   20.552
Remote            40       405.000   63.640    10.062
Touch-Proximity   40       268.125   131.874   20.851
Voice-Feedback    40       367.875   119.518   18.897
• The SD is particularly low for the group with the highest mean. Try transforming to $Y^2$. To make the transformation, right click in a new column, click New Column, then right click again in the created column, click Formula and enter the appropriate formula for the transformation.

Transformation of Milgram’s data to Squared Voltage Level
Means and Std Deviations
Level             Number   Mean     Std Dev   Std Err Mean
Proximity         40       113816   78920.2   12478
Remote            40       167974   48541.4   7675
Touch-Proximity   40       88847    79291.3   12537
Voice-Feedback    40       149259   74053.6   11709
• Check of constant variance for transformed data: (Highest group standard deviation/lowest group standard deviation)^2 = (79291.3/48541.4)^2 = 2.67. The constant variance assumption is reasonable for voltage squared.
• Analysis of variance tests are approximately valid for the voltage squared data, so we reanalyze the data using voltage squared.

Analysis using Voltage Squared
Response: Voltage Squared
Effect Tests
Source      Nparm   DF   Sum of Squares   F Ratio   Prob > F
Condition   3       3    1.50737e11       9.8735    <.0001
Strong evidence that the group mean voltage squared levels are not all the same.
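
The rule-of-thumb check for constant variance applied above to the raw and squared voltages is simple arithmetic, sketched below in Python. The standard deviations are copied from the Means and Std Deviations tables; the function name and the dictionary layout are illustrative choices, not part of the JMP workflow.

```python
def variance_ratio(std_devs):
    """(largest group SD / smallest group SD)^2, the quantity compared with 3."""
    values = list(std_devs)
    return (max(values) / min(values)) ** 2


# Group standard deviations copied from the Means and Std Deviations tables.
voltage_sd = {
    "Proximity": 129.979,
    "Remote": 63.640,
    "Touch-Proximity": 131.874,
    "Voice-Feedback": 119.518,
}
voltage_squared_sd = {
    "Proximity": 78920.2,
    "Remote": 48541.4,
    "Touch-Proximity": 79291.3,
    "Voice-Feedback": 74053.6,
}

if __name__ == "__main__":
    for label, sds in (("voltage", voltage_sd),
                       ("voltage squared", voltage_squared_sd)):
        ratio = variance_ratio(sds.values())
        # Rule of thumb from the lecture: a ratio above 3 means constant
        # variance is not reasonable and a transformation should be tried.
        verdict = "not reasonable" if ratio > 3 else "reasonable"
        print(f"{label}: ratio = {ratio:.2f}, constant variance {verdict}")
```

The two ratios reproduce the 4.29 and 2.67 quoted in the slides, so the squared voltages pass the check while the raw voltages do not.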

LSMeans Differences Tukey HSD
Alpha = 0.050, Q = 2.59695
Each row gives Mean[i]-Mean[j] for one pair of levels; the comparison in the opposite direction has the same magnitude with the signs reversed.
Level i          Level j           Mean[i]-Mean[j]   Std Err Dif   Lower CL Dif   Upper CL Dif
Remote           Proximity         54157.5           15951.4       12732.6        95582.4
Remote           Touch-Proximity   79126.9           15951.4       37701.9        120552
Remote           Voice-Feedback    18714.4           15951.4       -22710.6       60139.3
Voice-Feedback   Proximity         35443.1           15951.4       -5981.8        76868.1
Voice-Feedback   Touch-Proximity   60412.5           15951.4       18987.6        101837
Proximity        Touch-Proximity   24969.4           15951.4       -16455.6       66394.3
Strong evidence that remote has a higher mean voltage squared level than proximity and touch-proximity and that voice-feedback has a higher mean voltage squared level than touch-proximity, taking into account the multiple comparisons.

Rule of Thumb for Checking Normality in ANOVA
• The normality assumption for ANOVA is that the distribution in each group is normal. It can be checked by looking at the boxplot, histogram and normal quantile plot for each group.
• If there are more than 30 observations in each group, then the normality assumption is not important; ANOVA p-values and CIs will still be approximately valid even for nonnormal data.
• If there are fewer than 30 observations per group, then we can check normality by clicking Analyze, Distribution and then putting the Y variable in the Y, Columns box and the categorical variable denoting the group in the By box. We can then create normal quantile plots for each group and check that, for each group, the points in the normal quantile plot are within the confidence bands. If there is nonnormality, we can try a transformation such as $\log Y$ and see whether the transformed data are approximately normally distributed in each group.

One-way Analysis of Variance: Steps in Analysis
1. Check assumptions (constant variance, normality, independence).
If constant variance is violated, try transformations.
2. Use the effect test (commonly called the F-test) to test whether all group means are the same.
3. If the effect test finds that at least two group means differ, use Tukey’s HSD procedure to investigate which groups are different, taking into account the fact that multiple comparisons are being done.

Example: Discrimination against the Handicapped
• Study of how physical handicaps affect people’s perception of employment qualifications.
• Researchers prepared five videotaped job interviews, using the same two male actors for each. The tapes differed only in that the applicant appeared with a different handicap in each: (i) wheelchair; (ii) on crutches; (iii) hearing impaired; (iv) one leg amputated; (v) no handicap.
• Each tape was shown to 14 students from a U.S. university. Students rated the qualifications of the candidate on a 0 to 10 point scale based on the tape.
• Questions of interest: Do subjects systematically evaluate qualifications differently according to the candidate’s handicap? If so, which handicaps produce different evaluations?

Checking Assumptions
[Figure: Oneway Analysis of SCORE By HANDICAP; y-axis SCORE, x-axis HANDICAP (AMPUTEE, CRUTCHES, HEARING, NONE, WHEELCHAIR)]
Means and Std Deviations
Level        Number   Mean      Std Dev   Std Err Mean   Lower 95%   Upper 95%
AMPUTEE      14       4.42857   1.58572   0.42380        3.5130      5.3441
CRUTCHES     14       5.92143   1.48178   0.39602        5.0659      6.7770
HEARING      14       4.05000   1.53259   0.40960        3.1651      4.9349
NONE         14       4.90000   1.79358   0.47935        3.8644      5.9356
WHEELCHAIR   14       5.34286   1.74828   0.46725        4.3334      6.3523
Constant variance is reasonable: (largest standard deviation/smallest standard deviation)^2 = (1.79/1.48)^2 = 1.46. There are fewer than 30 observations per group, so we need to check normality; a check of the normal quantile plot for each group indicates that normality is OK.

Do all videotapes have the same mean?
Response: SCORE
Effect Tests
Source     Nparm   DF   Sum of Squares   F Ratio   Prob > F
HANDICAP   4       4    30.521429        2.8616    0.0301

Expanded Estimates (nominal factors expanded to all levels)
Term                   Estimate    Std Error   t Ratio   Prob>|t|
Intercept              4.9285714   0.195173    25.25     <.0001
HANDICAP[AMPUTEE]      -0.5        0.390347    -1.28     0.2048
HANDICAP[CRUTCHES]     0.9928572   0.390347    2.54      0.0134
HANDICAP[HEARING]      -0.878571   0.390347    -2.25     0.0278
HANDICAP[NONE]         -0.028571   0.390347    -0.07     0.9419
HANDICAP[WHEELCHAIR]   0.4142857   0.390347    1.06      0.2925
The test of H_0: the mean of all five videotapes is the same vs. H_A: at least two of the videotapes have different means has p-value 0.0301. Evidence that there is some difference in the means of the videotapes.

How do the videotapes compare?
LSMeans Differences Tukey HSD
Alpha = 0.050, Q = 2.80582
Each row gives Mean[i]-Mean[j] for one pair of levels; the comparison in the opposite direction has the same magnitude with the signs reversed.
Level i      Level j      Mean[i]-Mean[j]   Std Err Dif   Lower CL Dif   Upper CL Dif
CRUTCHES     HEARING      1.87143           0.61719       0.1397         3.60316
CRUTCHES     AMPUTEE      1.49286           0.61719       -0.2389        3.22459
CRUTCHES     NONE         1.02143           0.61719       -0.7103        2.75316
CRUTCHES     WHEELCHAIR   0.57857           0.61719       -1.1532        2.3103
WHEELCHAIR   HEARING      1.29286           0.61719       -0.4389        3.02459
WHEELCHAIR   AMPUTEE      0.91429           0.61719       -0.8174        2.64602
WHEELCHAIR   NONE         0.44286           0.61719       -1.2889        2.17459
NONE         HEARING      0.85              0.61719       -0.8817        2.58173
NONE         AMPUTEE      0.47143           0.61719       -1.2603        2.20316
AMPUTEE      HEARING      0.37857           0.61719       -1.3532        2.1103
The only conclusion we can make about how the videotapes compare, taking account of the fact that we are making multiple comparisons, is that Crutches has a higher mean than Hearing.
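
The Tukey HSD limits in the output above can be reproduced from the group means, the reported Q and the reported Std Err Dif, since each interval in the output equals the difference plus or minus Q times Std Err Dif. The Python sketch below does this arithmetic for all ten pairs. It is only an illustration that takes Q and Std Err Dif as given from the JMP output rather than computing them; the function and variable names are hypothetical.

```python
from itertools import combinations

# Values copied from the JMP output above.
Q = 2.80582            # Tukey HSD multiplier reported by JMP
STD_ERR_DIF = 0.61719  # standard error of each pairwise difference
means = {
    "AMPUTEE": 4.42857,
    "CRUTCHES": 5.92143,
    "HEARING": 4.05000,
    "NONE": 4.90000,
    "WHEELCHAIR": 5.34286,
}


def tukey_interval(diff, std_err_dif, q):
    """Interval diff +/- q * std_err_dif, matching the Lower/Upper CL Dif columns."""
    margin = q * std_err_dif
    return diff - margin, diff + margin


# A pair is flagged as different when its adjusted interval excludes zero.
significant = []
for a, b in combinations(means, 2):
    diff = means[a] - means[b]
    lower, upper = tukey_interval(diff, STD_ERR_DIF, Q)
    excludes_zero = lower > 0 or upper < 0
    if excludes_zero:
        significant.append((a, b))
    flag = "  *" if excludes_zero else ""
    print(f"{a} - {b}: {diff:8.5f}  ({lower:8.4f}, {upper:8.4f}){flag}")
```

Only the Crutches vs. Hearing interval excludes zero, which agrees with the conclusion stated above.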