Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Psych 613
Fall 2016
Problem Set 3 Solutions
1) Continue the cancer example from the previous problem set
More detail on the data file: Patients with advanced cancers of the stomach, bronchus,
colon, ovary or breast were treated with vitamin c (ascorbate). The purpose of the study
was to determine whether patient survival differed depending on the organ affected by
the cancer. Adapted from Cameron, E. and Pauling, L. (1978) Supplemental ascorbate
in the supportive treatment of cancer: re-evaluation of prolongation of survival times in
terminal human cancer. Proceedings of the National Academy of Science,
75, 4538-42.
(a) Present a graph of the five cell means with 1 standard error around each mean.
SPSS Syntax
graph
/errorbar(sterror 1) = survival by cantype.
R Syntax
install.packages("gplots")
library(gplots)
plotmeans(formula = survival ~ cantype,
data = cancer,
p = .68,
ylab = "Survival Time in Days",
xlab = "Fives types of Cancer Patients")
(b) Test the research hypothesis that there is a difference among the five means. Use =
0.05. Interpret the result in words, you can refer to the graph in part 1a.
We reject the null
that there is no difference among the five cell means, F(4, 59) =
6.43, p = .0002.
SPSS Syntax
oneway survival by cantype.
R Syntax
aov1 <- aov(formula = survival ~ cantype,
data = cancer)
summary(aov1)
cantype
Residuals
Df Sum Sq
Mean Sq
4 11535761 2883940
59 26448144 448274
F value Pr(>F)
6.433
0.000229 ***
(c) Using only the F test for the ANOVA you ran above, can the researchers claim
support for the hypothesis that ascorbates will have the best effect on survival rate for
breast cancer patients? What else in addition to the F test is needed to make this claim?
No computation is necessary, just explain.
No, the F test for the ANOVA does not test the hypothesis that ascorbates will
have the best effect on survival rate for breast cancer patients (see answer 1 b).
One could compare survival time among the 5 groups, say, with a Tukey test or a
contrast in order to determine whether ascorbates have the strongest effect on
breast cancer patients.
(d) Examine the statistical assumptions for these data. Use boxplots and normal
probability plots. Do you think assumptions are violated in the data set? Explain your
conclusions and use the graphs to support your claims. No need to perform a
transformation or take remedial measures for this part.
The boxplots for ovary and breast cancer patients look quite different compared
to the rest of the patient groups. For example, breast cancer group has a few
outliers and the ovary cancer group’s median is not centered within the
interquartile range.
SPSS Syntax
examine
variables = survival by cantype
/plot boxplot npplot spreadlevel
/statistics = none
/nototal.
R Syntax
boxplot(formula = survival ~ cantype,
data = cancer,
ylab = "Patient Survival",
xlab = "Organ Affected by Cancer")
for(i in 1:length(levels(cancer$cantype))){
qqnorm(cancer[cancer$cantype == paste0(levels(cancer$cantype)[i]),"survival"],
main = paste0("Normal Q-Q Plot for ", levels(cancer$cantype)[i], " Cancer
Survival"))
qqline(cancer[cancer$cantype == paste0(levels(cancer$cantype)[i]),"survival"])
install.packages("car")
library(car)
spreadLevelPlot(survival ~ cantype,
data = cancer)
(e) Using the spread and level plot you computed in the previous problem set, perform
the power transformation suggested by the plot (round up or down to the nearest rung
on the ladder of power transformations, e.g., square, identity, sqrt, log, reciprocal). Redo
the boxplot but this time on the transformed variable. Does the boxplot look better, the
same, worse? Explain.
Fewer points fall outside the whiskers and the interquartile ranges among the
groups have more similar shapes.
SPSS Syntax
compute survival.cubed = survival ** (1/3).
examine
variables = survival.cubed by cantype
/plot boxplot
/statistics = none
/nototal.
R Syntax
cancer$survival.cubed <- with(cancer, survival^(1/3))
boxplot(formula = survival.cubed ~ cantype,
data = cancer,
ylab = "Survival Time in Days (cubed)",
xlab = "Fives types of Cancer Patients")
(f) Recompute the ANOVA F on the transformed variable from part 1e. Did the F change
much?
Raw survival time: F(4, 59) = 6.43
Squared survival time: F(4, 59) = 6.48
Cubed survival time: F(4, 59) = 5.93
SPSS Syntax
oneway survival survival.cubed survival.squared by cantype.
R Syntax
lapply(X = c("survival","survival.squared","survival.cubed"),
FUN = function(x){
summary(aov(formula = as.formula(paste0(x,"~ cantype")),
data = cancer))
})
cantype
Residuals
Df Sum Sq
Mean Sq
4 11535761 2883940
59 26448144 448274
F value Pr(>F)
6.433
0.000229 ***
cantype
Residuals
Df Sum Sq Mean Sq F value
Pr(>F)
4
3295
823.8
6.484 0.000215 ***
59
7495
127.0
cantype
Residuals
Df Sum Sq Mean Sq F value
Pr(>F)
4 169.2
42.31
5.939 0.000438 ***
59 420.3
7.12
(g) The researcher is interested in testing whether breast cancer patients who were
given ascorbate acid lived longer than stomach cancer patients given ascorbate acid.
What contrast would the researcher perform to test this research question? Calculate
the contrast value using the original untransformed data and test the null hypothesis
that the population contrast value is 0 (using α = 0.05). Also, calculate the 95% CI
around the contrast value.
If the levels are ordered Stomach, Bronchus, Colon, Ovary, and Breast, then we
can set our contrasts weights to -1, 0, 0, 0, 1, which compares survival days
between breast and stomach cancer patients. When given ascorbate acid, breast
cancer patients tended to live longer than stomach cancer patients, t(11.33) =
2.88. The 95% CI is (-1658.761, -561.057).
SPSS Syntax
oneway survival by cantype
/statistics all
/contrasts = -1, 0, 0, 0, 1.
R Syntax
install.packages("gmodels")
library(gmodels)
fit.contrast(model = aov1, varname = "cantype",
coeff = c(1, 0, 0, 0, -1),
conf.int = .95,
df = TRUE)
Estimate
cantype c=( 1 0 0 0 -1 ) -1109.909
upper CI
cantype c=( 1 0 0 0 -1 ) -561.057
Std. Error
274.2895
t value
Pr(>|t|)
DF lower CI
-4.046488 0.0001533048 59 -1658.761
(h) Perform a Tukey test on all pairwise means. Summarize in words your findings from
the Tukey test.
SPSS Syntax
oneway survival by cantype
/statistics all
/ranges=tukey.
R Syntax
TukeyHSD(aov3)
diff
lwr
upr
p adj
Bronchus-Stomach -0.2213573 -2.98852004 2.545805 0.9994136
Colon-Stomach
1.4429273 -1.32423543 4.210090 0.5874049
Ovary-Stomach
2.6847237 -1.02208101 6.391528 0.2610615
Breast-Stomach
4.2551424 1.17828191 7.332003 0.0022990
Colon-Bronchus
1.6642846 -0.91180367 4.240373 0.3732192
Ovary-Bronchus
2.9060810 -0.66035172 6.472514 0.1616993
Breast-Bronchus
4.4764998 1.57028022 7.382719 0.0005371
Ovary-Colon
1.2417964 -2.32463633 4.808229 0.8632854
Breast-Colon
2.8122151 -0.09400439 5.718435 0.0624480
Breast-Ovary
1.5704187 -2.24131623 5.382154 0.7741468
(i) One of the Tukey pairwise tests involves comparing the breast cancer group to the
stomach cancer group, as you did earlier. Explain the underlying difference between the
contrast you did earlier and the Tukey test that compared the means of the same two
groups.
When one plans to test the difference between groups before data is collected,
the probability of a false positive is 5%. However, when one can search the data
for the greatest possible difference (the greatest vs. the smallest mean), the
probability of a false positive exceeds 5%. The Tukey procedure uses the
sampling distribution of the range (i.e. snooping and comparing the smallest vs.
greatest mean), not the sampling distribution of the mean (e.g. planned contrast).
Tukey’s range distribution produces larger critical values, making it harder to
reject the null.
(j) Calculate the contrast you performed earlier between the means of the breast and
stomach cancer groups using Scheffe’s procedure at α = 0.05. Also, calculate the 95%
confidence interval for the contrast using the Scheffe procedure. Note: compute this part
using the equations presented in the lecture notes as in general Scheffe tests on
contrasts are not supported by statistical packages so in real situations you’ll need to do
them by hand if you use SPSS or you can write your own functions in R.
Scheffe: Estimate = -1109.9, S = 872.2, Scrit = 3.18, df = 59, 95% CI (-1982.1, -237.7)
R Syntax
From previous output, we know the estimate is 1109.91, SE = 274.29, and df = 59
Scrit = sqrt((5-1)*qf(1-.05,5-1,59))
S = Scrit*274.29
Lower = 1109.91 - S
Upper = 1109.91 + S
For those of you you want a more general purpose function/code in R:
scheffe <- function(ihat, se.ihat, ngroups, df.error,alpha=.05){
ScheffeFcritical <- (ngroups - 1) * qf(1-alpha,ngroups-1,df.error)
Scheffetcritical <- sqrt(ScheffeFcritical)
S <- Scheffetcritical*se.ihat
result <- data.frame(ihat = ihat,
se.ihat = se.ihat,
df.error = df.error,
Scheffetcritical = Scheffetcritical,
S = S,
lowerCI = ihat - S,
upperCI = ihat + S)
return(result)
}
ihat <- sum(with(cancer, tapply(survival, cantype, mean)) * c(1, 0, 0, 0, -1))
se.ihat <- fit.contrast(model = aov1, varname = "cantype",
coeff = c(1, 0, 0, 0, -1),
conf.int = .95,
df = TRUE)[1,"Std. Error"]
df.error <- fit.contrast(model = aov1, varname = "cantype",
coeff = c(1, 0, 0, 0, -1),
Co
nf.int = .95,
df = TRUE)[1,"DF"]
scheffe(ihat = ihat,
se.ihat = se.ihat,
ngroups = 5,
df.error = df.error,
alpha = .05)
ihat
se.ihat
-1109.909 274.2895
df.error
59
Scheffetcritical S
lowerCI
upperCI
3.179878
872.2073 -1982.116 -237.7018
(k) Explain the underlying difference between the contrast you did earlier between the
breast and cancer groups and the Scheffe test that compared the means of the same
two groups.
To determine the critical value for an inferential test, the Scheffe procedure uses
the sampling distribution of Fmaximum, which corresponds to the contrast that gives
the contrast with the greatest sums of squares (SSC). It uses the F critical for the
omnibus test, making the critical value for a given contrast harder to “beat.”