Math 261
Solutions to Exam III
Spring 2008
Name:
Instructions: This exam is given under the honor code. Any violation will result in failure of the
course. You have three hours to work on the exam. You may use your text and notes as well as a
calculator and online resources. You may not, however, communicate with anyone other than the
instructor during the exam.
Write your answers clearly in your blue book. Partial credit can only be given if supporting calculations are shown. A total of 130 points are possible for this exam.
Be sure to refer to the attached output at the end of the exam.
1. [20 points] The table below shows the prices charged (in US$) for a simple random sample
of commonly prescribed drugs by i) a U.S. retail pharmacy and ii) a web-based pharmacy in
Canada.
Cost per 100 pills (US$)
Drug    United States    Canada
A       131              83
B       136              72
C       374              219
D       370              166
E       61               17
F       252              214
G       263              112
H       349              50
I       243              134
J       166              42
K       365              200
L       216              105
(a) For each of the following questions, clearly indicate the appropriate statistical procedure
required. Assume drug prices are normally distributed. You do not need to carry
out any calculations.
If a hypothesis test is required, your answer should include these four components:
(1) type of parameter (proportions or means), (2) design (one-sample, two-sample, or
matched pairs), (3) distribution (z-test or t-test), and (4) alternative (one- or two-sided).
If a confidence interval is required, your answer should include these three components:
(1) type of parameter (proportions or means), (2) design (one-sample, two-sample, or
matched pairs), and (3) distribution (z-interval or t-interval).
(i) Are drug prices typically cheaper at the Canadian pharmacy than at the U.S. outlet?
Solution: This requires a hypothesis test: 1) means, 2) matched pairs, 3) t-test, and 4)
one-sided alternative because we are asking whether they are “cheaper.”
(ii) What average savings in dollars would be expected by using the Canadian web-based
pharmacy?
Solution: This requires a confidence interval: 1) means, 2) matched pairs, and 3) t-interval.
(iii) At the U.S. pharmacy, what percent of drugs cost more than $200 per 100 pills?
Solution: This requires a confidence interval: 1) proportions, 2) one sample, and 3)
z-interval.
(iv) Is the percent of drugs that cost more than $200 per 100 pills the same at the two
pharmacies?
Solution: This requires a hypothesis test: 1) proportions, 2) two samples, 3) z-test, and
4) two-sided because we’re testing whether there’s a difference.
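Although part (a) asks only for the choice of procedure, the matched pairs analysis named in (i) and (ii) can be sketched as follows. This is an illustration, not part of the original key: it assumes the price table pairs each drug's U.S. price with the Canadian price in the same row, and the critical values 1.796 and 2.201 are standard t-table entries for 11 degrees of freedom.

```python
import math
import statistics

# Prices per 100 pills (US$) for drugs A-L. Pairing the two columns row by
# row is an assumption about how the printed table should be read.
us = [131, 136, 374, 370, 61, 252, 263, 349, 243, 166, 365, 216]
canada = [83, 72, 219, 166, 17, 214, 112, 50, 134, 42, 200, 105]

# Matched pairs: work with the per-drug differences (U.S. minus Canada).
diffs = [u - c for u, c in zip(us, canada)]
n = len(diffs)
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)      # sample SD, n - 1 in the denominator
se_diff = sd_diff / math.sqrt(n)

# (i) One-sided t-test of H0: mu_d = 0 versus Ha: mu_d > 0.
t_stat = mean_diff / se_diff
T_CRIT_ONE_SIDED = 1.796               # t.05 with 11 df, from a t table
reject_h0 = t_stat > T_CRIT_ONE_SIDED

# (ii) 95% t-interval for the mean savings per 100 pills.
T_CRIT_TWO_SIDED = 2.201               # t.025 with 11 df, from a t table
ci_low = mean_diff - T_CRIT_TWO_SIDED * se_diff
ci_high = mean_diff + T_CRIT_TWO_SIDED * se_diff

print(f"mean difference = {mean_diff}, t = {t_stat:.2f}, reject H0: {reject_h0}")
print(f"95% CI for mean savings: ({ci_low:.1f}, {ci_high:.1f})")
```

The same differences feed both the test and the interval, which is why both answers list "matched pairs" as the design.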
(b) Suppose differences in drug prices were in fact not normally distributed but were skewed
with potential extreme values. Identify a test that would be appropriate for answering
the question “Are drug prices typically cheaper at the Canadian pharmacy than at
the U.S. outlet?” Briefly justify your test. [Note: You do not need to carry out the
procedures or perform any calculations.]
Solution: This would require a sign test with a one-sided alternative since the sign test does
not require the condition of normality to hold.
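For illustration only, a one-sided sign test on the same data might look like the sketch below. It again assumes the two price columns are paired row by row, and it computes the exact binomial tail probability rather than relying on a normal approximation.

```python
from math import comb

# Prices per 100 pills (US$) for drugs A-L, paired row by row (an assumption
# about how the printed table should be read).
us = [131, 136, 374, 370, 61, 252, 263, 349, 243, 166, 365, 216]
canada = [83, 72, 219, 166, 17, 214, 112, 50, 134, 42, 200, 105]

# Keep only non-zero differences; ties carry no sign information.
diffs = [u - c for u, c in zip(us, canada) if u != c]
n = len(diffs)
n_positive = sum(d > 0 for d in diffs)   # drugs that cost more in the U.S.

# Under H0 each sign is + with probability 1/2, so the number of positive
# signs is Binomial(n, 1/2). One-sided p-value: P(X >= n_positive).
p_value = sum(comb(n, k) for k in range(n_positive, n + 1)) / 2 ** n

print(f"{n_positive} of {n} differences positive, one-sided p = {p_value:.6f}")
```

Only the signs of the differences are used, so skewness or extreme values in their sizes do not affect the result.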
2. [35 points] A biologist wished to study the effects of ethanol on sleep time. A sample of
20 rats, matched for age and other characteristics, was selected, and each rat was given an
oral injection having a particular concentration of ethanol per body weight. The rapid eye
movement (REM) sleep time for each rat was then recorded for a 24-hr period, with the results
shown in the following table. [Note: The data are available in the SPSS file ratrem.sav
available from the usual class web page. Be sure to use Group as the grouping variable. You
will use the variable Ethanol in the next problem.]
Group           1              2         3         4
Treatment       0 (control)    1 g/kg    2 g/kg    4 g/kg
Observations    88.6           63.0      44.9      31.0
                73.2           53.9      59.5      39.6
                91.4           69.2      40.2      45.3
                68.0           50.1      56.3      25.2
                75.2           71.5      38.7      22.7
(a) Consider an ANOVA for these data. First, define the parameters of interest and state
H0 and Ha for this study.
Solution: Let µ1 represent the mean REM for rats in the control group, µ2 the mean REM
for rats receiving 1 g/kg concentration of ethanol, and so on. Then we have H0 : µ1 = µ2 =
µ3 = µ4 versus Ha : the means are not all equal.
(b) Use SPSS to carry out the ANOVA computations. Report the ANOVA table and test
the global null hypothesis (F test) result at the α = .05 level.
Solution:
ANOVA
Sleep time in minutes for a 24-hour period
                  Sum of Squares    df    Mean Square    F       Sig.
Between Groups    5882.4            3     1960.8         21.1    8.32E-06
Within Groups     1487.4            16    93.0
Total             7369.8            19
We have strong evidence (F=21.1, p <.001) that the means are not all equal.
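The entries of the ANOVA table can be checked by hand from the raw data. The sketch below is not the SPSS procedure itself; it simply recomputes the sums of squares and the F statistic from their definitions, with variable names chosen for illustration.

```python
from statistics import fmean

# REM sleep times by treatment group (five rats per group).
groups = {
    "0 g/kg (control)": [88.6, 73.2, 91.4, 68.0, 75.2],
    "1 g/kg": [63.0, 53.9, 69.2, 50.1, 71.5],
    "2 g/kg": [44.9, 59.5, 40.2, 56.3, 38.7],
    "4 g/kg": [31.0, 39.6, 45.3, 25.2, 22.7],
}

all_obs = [y for ys in groups.values() for y in ys]
grand_mean = fmean(all_obs)

# Between-groups SS: group size times squared distance of each group mean
# from the grand mean.
ss_between = sum(len(ys) * (fmean(ys) - grand_mean) ** 2 for ys in groups.values())

# Within-groups SS: squared distance of each observation from its group mean.
ss_within = sum((y - fmean(ys)) ** 2 for ys in groups.values() for y in ys)

df_between = len(groups) - 1             # k - 1
df_within = len(all_obs) - len(groups)   # N - k

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

print(f"SS between = {ss_between:.1f}, SS within = {ss_within:.1f}")
print(f"MS between = {ms_between:.1f}, MS within = {ms_within:.1f}, F = {f_stat:.1f}")
```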
(c) State the ANOVA condition regarding SDs and use SPSS to check that it holds for this
data set. Support your conclusions using statistics and a rough sketch. Comment on
your findings.
Solution:
We assume the population standard deviations σi are the same for all 4 groups. We can use
SPSS to check this by calculating the SDs and by plotting residuals against the predicted
values. The ratio of the largest to smallest SD is 10.18/9.34 = 1.09, which is much smaller
than 2. A plot of the residuals versus the predicted values is shown below:
The plot gives no reason to doubt the equality of standard deviations.
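The SD ratio quoted in this solution can be reproduced with a few lines. This is a minimal sketch of the "largest SD over smallest SD is less than 2" rule of thumb, computed directly from the data rather than from SPSS.

```python
from statistics import stdev

# REM sleep times by group number.
groups = {
    1: [88.6, 73.2, 91.4, 68.0, 75.2],
    2: [63.0, 53.9, 69.2, 50.1, 71.5],
    3: [44.9, 59.5, 40.2, 56.3, 38.7],
    4: [31.0, 39.6, 45.3, 25.2, 22.7],
}

# Sample standard deviation (n - 1 denominator) for each group.
sds = {g: stdev(ys) for g, ys in groups.items()}

# Rule of thumb: pooling the SDs is reasonable if this ratio is below 2.
sd_ratio = max(sds.values()) / min(sds.values())
equal_sd_plausible = sd_ratio < 2

for g, s in sds.items():
    print(f"group {g}: SD = {s:.2f}")
print(f"largest/smallest SD = {sd_ratio:.2f}")
```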
(d) In class and in the ANOVA homework assignment, you used a diagram with underlines
to summarize the results of a post hoc comparison of means. Produce such diagrams for
the REM data set using the Bonferroni and Newman-Keuls methods, one diagram for
each method. Summarize each diagram in a sentence or two.
Solution:
Using the Bonferroni method, we find significant differences, α = .05, in all pairs of means
except those adjacent to each other. That is, we fail to reject µ1 = µ2 , µ2 = µ3 , and
µ3 = µ4 .
Group    4        3        2        1
ȳi       32.76    47.92    61.54    79.28
         --------------
                  --------------
                           --------------
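The Bonferroni comparisons can be sketched as below, using the group means and the within-groups mean square from the ANOVA table. The critical value 3.008 is an assumption: it is the approximate two-sided t value for α = .05/6 with 16 degrees of freedom as read from a table, and is not reported in the original solution.

```python
import math
from itertools import combinations

# Group means and ANOVA quantities from the solution above.
means = {1: 79.28, 2: 61.54, 3: 47.92, 4: 32.76}
MSE = 93.0            # within-groups mean square, 16 df
N_PER_GROUP = 5

# Assumed table value: t with 16 df and upper-tail area .05 / (2 * 6),
# i.e. a two-sided .05 level split across the 6 pairwise comparisons.
T_CRIT_BONFERRONI = 3.008

# Standard error of a difference between two group means.
se_diff = math.sqrt(MSE * (1 / N_PER_GROUP + 1 / N_PER_GROUP))

# Smallest difference in sample means that is declared significant.
min_sig_diff = T_CRIT_BONFERRONI * se_diff

# Pairs whose means are NOT significantly different (these get an underline).
not_significant = [
    (a, b)
    for a, b in combinations(sorted(means), 2)
    if abs(means[a] - means[b]) < min_sig_diff
]

print(f"minimum significant difference = {min_sig_diff:.2f}")
print("pairs not significantly different:", not_significant)
```

Only the three adjacent pairs fall below the threshold, which matches the three underlines in the diagram.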
In contrast, with the Newman-Keuls method, we reject equality of all pairs of means. Hence,
all group means are found significantly different from all other group means so that we simply
obtain the diagram below.
Group    4        3        2        1
ȳi       32.76    47.92    61.54    79.28
(e) Compared to other methods, the Bonferroni multiple comparisons procedure is often
described as “conservative.” Referring to your diagrams in the previous question, how
is this “conservative” characteristic of the Bonferroni method evident here? Explain
briefly.
Solution:
The Bonferroni method is conservative because it rejects fewer of the null hypotheses than
does the Newman-Keuls method. More specifically, it has less power: it requires a larger
difference in sample means to find significance at the same level α.
(f) In a single sentence, what does your ANOVA say about how ethanol affects the REM
sleep time of rats?
Solution:
The greater the concentration of ethanol, the less REM experienced on average for rats in
the study. Increased concentration of ethanol in injections reduces REM of rats.
3. [30 points] Another approach to analyzing the data of Problem (2) would be to carry out a
regression of REM against the ethanol concentration. With this approach in mind, again use
the SPSS data file ratrem.sav and answer the questions which follow.
(a) Calculate the linear regression of Y (REM) on X (Ethanol concentration). Record the
equation of the line using the variable names.
Solution: 4
The fitted line is
$\widehat{\mathrm{REM}} = 75.2 - 11.3 \times \mathrm{Ethanol}$.
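As a check on the SPSS output, the least-squares coefficients can be computed directly from the 20 observations. The sketch below uses the textbook formulas b1 = Sxy/Sxx and b0 = ȳ − b1·x̄; the variable names are illustrative.

```python
from statistics import fmean

# Ethanol dose (g/kg) and REM sleep time (minutes) for the 20 rats.
ethanol = [0] * 5 + [1] * 5 + [2] * 5 + [4] * 5
rem = [
    88.6, 73.2, 91.4, 68.0, 75.2,   # 0 g/kg (control)
    63.0, 53.9, 69.2, 50.1, 71.5,   # 1 g/kg
    44.9, 59.5, 40.2, 56.3, 38.7,   # 2 g/kg
    31.0, 39.6, 45.3, 25.2, 22.7,   # 4 g/kg
]

x_bar = fmean(ethanol)
y_bar = fmean(rem)

# Sums of squares and cross-products about the means.
sxx = sum((x - x_bar) ** 2 for x in ethanol)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ethanol, rem))

b1 = sxy / sxx              # slope: change in mean REM per 1 g/kg of ethanol
b0 = y_bar - b1 * x_bar     # intercept: estimated mean REM at 0 g/kg

print(f"fitted line: REM = {b0:.1f} + ({b1:.1f}) * Ethanol")
```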
(b) Assuming that the linear model is applicable, construct a 90% confidence interval for β1
and interpret your interval in the context of this setting.
Solution: 8
The confidence interval has the form b1 ± tα/2 (n − 2) SEb1. Here, b1 = −11.3, t.05 (18) =
1.734 (from the calculator), and SEb1 = 1.49 from the output. Plugging in, we get −11.3 ±
(1.734) × (1.49) or −11.3 ± 2.58, which gives (−13.88, −8.72). Interpreting, we expect that for
every 1 g/kg increase in ethanol concentration, the mean REM in minutes per 24 hours will
decrease by between 8.72 minutes and 13.88 minutes.
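The standard error of the slope and the 90% interval can also be reproduced from the raw data. In this sketch the critical value 1.734 is the t value with 18 degrees of freedom and upper-tail area .05 quoted in the solution; everything else is computed from the observations.

```python
import math
from statistics import fmean

ethanol = [0] * 5 + [1] * 5 + [2] * 5 + [4] * 5
rem = [
    88.6, 73.2, 91.4, 68.0, 75.2,
    63.0, 53.9, 69.2, 50.1, 71.5,
    44.9, 59.5, 40.2, 56.3, 38.7,
    31.0, 39.6, 45.3, 25.2, 22.7,
]

n = len(rem)
x_bar, y_bar = fmean(ethanol), fmean(rem)
sxx = sum((x - x_bar) ** 2 for x in ethanol)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ethanol, rem))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual standard deviation with n - 2 degrees of freedom.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ethanol, rem))
s_resid = math.sqrt(sse / (n - 2))

# Standard error of the slope and the 90% confidence interval.
se_b1 = s_resid / math.sqrt(sxx)
T_CRIT = 1.734                     # t with 18 df, upper-tail area .05
ci_low = b1 - T_CRIT * se_b1
ci_high = b1 + T_CRIT * se_b1

print(f"SE(b1) = {se_b1:.2f}, 90% CI for slope: ({ci_low:.2f}, {ci_high:.2f})")
```

Carrying the unrounded slope through gives roughly (−13.92, −8.76); the solution's (−13.88, −8.72) comes from rounding b1 to −11.3 before forming the interval.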
(c) Assuming that the linear model is applicable, find estimates of the mean and standard
deviation of REM at an ethanol level of 3 g/kg.
Solution: 6
The mean is µY|X = ŷ = 75.2 − 11.3 × (3) = 41.3. The standard deviation is the standard
deviation of the regression, which is found from the SPSS output to be sY|X = 9.85.
(d) Assuming that the linear model is applicable, find the value of r² and interpret it in the
context of this problem.
Solution: 4
From the output we have r² = .763. We conclude that 76.3 percent of the variability in
REM is explained by the linear regression of REM on ethanol concentration.
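Both r² and the residual standard deviation used in part (c) follow from the same decomposition of the total sum of squares. The following is a minimal sketch computed from the raw data, not a reproduction of the SPSS output.

```python
import math
from statistics import fmean

ethanol = [0] * 5 + [1] * 5 + [2] * 5 + [4] * 5
rem = [
    88.6, 73.2, 91.4, 68.0, 75.2,
    63.0, 53.9, 69.2, 50.1, 71.5,
    44.9, 59.5, 40.2, 56.3, 38.7,
    31.0, 39.6, 45.3, 25.2, 22.7,
]

n = len(rem)
x_bar, y_bar = fmean(ethanol), fmean(rem)
sxx = sum((x - x_bar) ** 2 for x in ethanol)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ethanol, rem))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Total variability in REM and the part left over after the linear fit.
sst = sum((y - y_bar) ** 2 for y in rem)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ethanol, rem))

r_squared = 1 - sse / sst              # share of variability explained
s_resid = math.sqrt(sse / (n - 2))     # standard deviation of the regression

# Estimated mean REM at 3 g/kg, using the unrounded coefficients.
mean_at_3 = b0 + b1 * 3

print(f"r^2 = {r_squared:.3f}, s = {s_resid:.2f}, estimated mean at 3 g/kg = {mean_at_3:.1f}")
```

With unrounded coefficients the estimated mean at 3 g/kg is about 41.2; the 41.3 in part (c) comes from using the rounded line.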
(e) Use SPSS to plot the residuals versus predicted values from the regression in a). Then
make a normal probability plot of the residuals. Do these plots call into question the
use of a linear model and regression inference procedures? Explain briefly.
Solution: 8
The residual versus predicted plot (below) shows some curvature and suggests a linear model
is likely inappropriate:
The normal probability plot of the residuals (below) gives no basis for doubting the normality
of the residuals.
Overall, we do have concern that the linear model is not appropriate.
4. [45 points] This question deals with the Western Electric data set
(electric.sav) we looked at in the first lab of the quarter, an observational prospective
study of health among men 40 to 55 years old at the beginning of the study.
For each of the following questions, please carry out the appropriate statistical procedure
using information provided in the output. Do not try to use the SPSS data file to
carry out the required procedure. Instead, just use the information provided in
the output at the end of the exam for this question. If a test of hypothesis is required,
define the parameters of interest, state the hypotheses, identify the required statistical test,
calculate the test statistic and p-value and interpret the results in context. If a confidence
interval is required, define the parameters of interest, identify the confidence interval needed,
calculate the interval, and interpret the interval in context.
(a) Based on this data set, was day of death of participants (for whom day of death is
known) consistent with an equal probability of death for each day of the week? Use
the appropriate portion of the output to carry out a statistical procedure to answer this
question.
Solution:
Let pi denote the probability that a death occurs on day i. We will carry out a goodness of fit
test of the hypothesis H0 : pi = 1/7 for all i versus the alternative that this equiprobability
model does not fit. We use the chi-square goodness of fit test with 6 degrees of freedom to
get
$$\chi^2_S = \sum_i \frac{(O_i - E_i)^2}{E_i} = 3.4.$$
Since the p-value is $P(\chi^2_6 > 3.4) = 0.76$, we have insufficient evidence to reject the null
hypothesis that all days of death are equally probable. The variability in deaths from day to
day is easily explained by chance.
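The observed day-of-death counts are in the SPSS output, which is not reproduced here, so the sketch below does not recompute the statistic from data. It shows a hypothetical helper, `gof_statistic`, for the equal-probability model, and checks the quoted p-value using the closed-form upper tail of the chi-square distribution, which holds when the degrees of freedom are even.

```python
import math

def gof_statistic(observed):
    """Chi-square goodness-of-fit statistic when every category is
    expected to be equally likely (E_i = total / number of categories)."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

def chi2_upper_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom > x), valid for even df only.
    Uses exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k          # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total

# Statistic and degrees of freedom quoted in the solution (7 days - 1).
p_value = chi2_upper_tail_even_df(3.4, 6)
print(f"P(chi-square_6 > 3.4) = {p_value:.2f}")
```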
(b) Based on this data set, is incidence of coronary heart disease related to family history of
coronary heart disease (CHD) or are these two variables independent? Use the appropriate portion of the output to carry out a statistical procedure to answer this question.
Solution:
Let p1 represent the probability of CHD given no family history of CHD and p2 , the probability
given family history of CHD. A chi-square test of independence is needed to test H0 : p1 = p2
versus the alternative H1 : p1 < p2 . This test is equivalent to testing whether the two
variables are independent (H0 ) or related (H1 ). Using the data from the output, we calculate
the test statistic as
$$\chi^2_S = \sum_i \frac{(O_i - E_i)^2}{E_i} = 5.57.$$
Since the p-value is $(1/2) \times P(\chi^2_1 > 5.57) = 0.5 \times 0.018 = 0.009$, we have strong evidence to
reject the null hypothesis of independence in favor of the alternative that the probability of
developing CHD is related to family history of the disease.
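The p-value arithmetic can be checked without statistical tables. A chi-square variable with 1 degree of freedom is the square of a standard normal, so its upper tail can be written with the complementary error function. The sketch below uses only the statistic quoted in the solution, since the 2×2 table itself is in the SPSS output.

```python
import math

def chi2_upper_tail_df1(x):
    """P(chi-square with 1 df > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2))

chi2_stat = 5.57                     # test statistic quoted in the solution

# Tail area for the chi-square test of independence (a two-sided statement).
p_two_sided = chi2_upper_tail_df1(chi2_stat)

# The solution halves this for the directional alternative p1 < p2.
p_one_sided = 0.5 * p_two_sided

print(f"P(chi-square_1 > 5.57) = {p_two_sided:.3f}, halved = {p_one_sided:.3f}")
```

Halving is only appropriate when the observed difference lies in the direction stated by the alternative, which the solution takes as given.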
(c) What proportion of deaths of men in the population represented by this sample occur
on the weekend? [Note: Be sure to attach a 95% confidence interval and the other
information requested in the problem statement above.]
Solution:
We wish to make inference about p, the population proportion of deaths that occur on
the weekend. We will use the one-sample z-interval $\tilde{p} \pm z_{\alpha/2} \sqrt{\tilde{p}(1 - \tilde{p})/(n + 4)}$. Given
the information in the output, we have n = 110, $\tilde{p} = (35 + 2)/(110 + 4) = 0.324$, and
$SE_{\tilde{p}} = \sqrt{0.324(1 - 0.324)/(110 + 4)} = 0.044$. Since we require a 95% confidence interval,
we have 0.324 ± 1.96(0.044) or (0.239, 0.410) . Hence, we are 95% confident that the true
population proportion of deaths on the weekend is between 0.239 and 0.410.
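The interval arithmetic can be sketched as follows, using the counts quoted in the solution (35 weekend deaths out of 110) and the "plus four" adjustment that adds two successes and two failures before computing the proportion.

```python
import math

weekend_deaths = 35      # count quoted in the solution
n = 110                  # deaths with a known day of the week
Z_CRIT = 1.96            # standard normal critical value for 95% confidence

# Plus-four estimate: add 2 successes and 2 failures.
p_tilde = (weekend_deaths + 2) / (n + 4)
se_p_tilde = math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))

ci_low = p_tilde - Z_CRIT * se_p_tilde
ci_high = p_tilde + Z_CRIT * se_p_tilde

print(f"p~ = {p_tilde:.3f}, SE = {se_p_tilde:.3f}")
print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
```

Carrying full precision gives an upper limit of about 0.4105, so the last digit can differ slightly from a hand calculation with rounded intermediate values.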
SPSS Output for Problem 4: The Western Electric Data