Math 261 Solutions to Exam III Spring 2008

Name:

Instructions: This exam is given under the honor code. Any violation will result in failure of the course. You have three hours to work on the exam. You may use your text and notes as well as a calculator and online resources. You may not, however, communicate with anyone other than the instructor during the exam. Write your answers clearly in your blue book. Partial credit can only be given if supporting calculations are shown. A total of 130 points are possible for this exam. Be sure to refer to the attached output at the end of the exam.

1. [20 points] The table below shows the prices charged (in US$) for a simple random sample of commonly prescribed drugs by i) a U.S. retail pharmacy and ii) a web-based pharmacy in Canada.

Cost per 100 pills (US$)

Drug    United States    Canada
A       131              83
B       136              72
C       374              219
D       370              166
E       61               17
F       252              214
G       263              112
H       349              50
I       243              134
J       166              42
K       365              200
L       216              105

(a) For each of the following questions, clearly indicate the appropriate statistical procedure required. Assume drug prices are normally distributed. You do not need to carry out any calculations.

If a hypothesis test is required, your answer should include these four components: (1) type of parameter (proportions or means), (2) design (one-sample, two-sample, or matched pairs), (3) distribution (z-test or t-test), and (4) alternative (one- or two-sided).

If a confidence interval is required, your answer should include these three components: (1) type of parameter (proportions or means), (2) design (one-sample, two-sample, or matched pairs), and (3) distribution (z-interval or t-interval).

(i) Are drug prices typically cheaper at the Canadian pharmacy than at the U.S. outlet?
Solution: This requires a hypothesis test: 1) means, 2) matched pairs, 3) t-test, and 4) one-sided alternative because we are asking whether they are “cheaper.”

(ii) What average savings in dollars would be expected by using the Canadian web-based pharmacy?

Solution: This requires a confidence interval: 1) means, 2) matched pairs, and 3) t-interval.

(iii) At the U.S. pharmacy, what percent of drugs cost more than $200 per 100 pills?

Solution: This requires a confidence interval: 1) proportions, 2) one sample, and 3) z-interval.

(iv) Is the percent of drugs that cost more than $200 per 100 pills the same at the two pharmacies?

Solution: This requires a hypothesis test: 1) proportions, 2) two samples, 3) z-test, and 4) two-sided alternative because we are testing whether there is a difference.

(b) Suppose differences in drug prices were in fact not normally distributed but were skewed with potential extreme values. Identify a test that would be appropriate for answering the question “Are drug prices typically cheaper at the Canadian pharmacy than at the U.S. outlet?” Briefly justify your test. [Note: You do not need to carry out the procedures or perform any calculations.]

Solution: This would require a sign test with a one-sided alternative, since the sign test does not require the differences to be normally distributed.

2. [35 points] A biologist wished to study the effects of ethanol on sleep time. A sample of 20 rats, matched for age and other characteristics, was selected, and each rat was given an oral injection having a particular concentration of ethanol per body weight. The rapid eye movement (REM) sleep time for each rat was then recorded for a 24-hr period, with the results shown in the following table. [Note: The data are available in the SPSS file ratrem.sav available from the usual class web page. Be sure to use Group as the grouping variable. You will use the variable Ethanol in the next problem.]
Group           1              2         3         4
Treatment       0 (control)    1 g/kg    2 g/kg    4 g/kg
Observations    88.6           63.0      44.9      31.0
                73.2           53.9      59.5      39.6
                91.4           69.2      40.2      45.3
                68.0           50.1      56.3      25.2
                75.2           71.5      38.7      22.7

(a) Consider an ANOVA for these data. First, define the parameters of interest and state \(H_0\) and \(H_a\) for this study.

Solution: Let \(\mu_1\) represent the mean REM for rats in the control group, \(\mu_2\) the mean REM for rats receiving the 1 g/kg concentration of ethanol, and so on. Then we have \(H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4\) versus \(H_a\): the means are not all equal.

(b) Use SPSS to carry out the ANOVA computations. Report the ANOVA table and the result of the test of the global null hypothesis (F test) at the \(\alpha = .05\) level.

Solution:

ANOVA: Sleep time in minutes for a 24-hour period

                  Sum of Squares    df    Mean Square    F       Sig.
Between Groups    5882.4            3     1960.8         21.1    8.32E-06
Within Groups     1487.4            16    93.0
Total             7369.8            19

We have strong evidence (F = 21.1, p < .001) that the means are not all equal.

(c) State the ANOVA condition regarding SDs and use SPSS to check that it holds for this data set. Support your conclusions using statistics and a rough sketch. Comment on your findings.

Solution: We assume the population standard deviations \(\sigma_i\) are the same for all 4 groups. We can use SPSS to check this by calculating the sample SDs and by plotting residuals against the predicted values. The ratio of the largest to smallest SD is 10.18/9.34 = 1.09, which is much smaller than 2. A plot of residuals versus predicted values is shown below:

[Figure: residuals versus predicted values]

The plot gives no reason to doubt the equality of standard deviations.

(d) In class and in the ANOVA homework assignment, you used a diagram with underlines to summarize the results of a post hoc comparison of means. Produce such diagrams for the REM data set using the Bonferroni and Newman-Keuls methods, one diagram for each method. Summarize each diagram in a sentence or two.
Solution: Using the Bonferroni method at \(\alpha = .05\), we find significant differences between all pairs of means except those adjacent to each other. That is, we fail to reject \(\mu_1 = \mu_2\), \(\mu_2 = \mu_3\), and \(\mu_3 = \mu_4\). Means joined by an underline are not significantly different.

Group    4        3        2        1
ȳ_i      32.76    47.92    61.54    79.28
         --------------
                  --------------
                           --------------

In contrast, with the Newman-Keuls method, we reject equality for all pairs of means. Hence, every group mean is found to be significantly different from every other group mean, and the diagram has no underlines:

Group    4        3        2        1
ȳ_i      32.76    47.92    61.54    79.28

(e) Compared to other methods, the Bonferroni multiple comparisons procedure is often described as “conservative.” Referring to your diagrams in the previous question, how is this “conservative” characteristic of the Bonferroni method evident here? Explain briefly.

Solution: The Bonferroni method is conservative because it rejects fewer of the null hypotheses than does the Newman-Keuls method. More specifically, it is less powerful: it requires a larger difference in sample means to find significance at the same level \(\alpha\).

(f) In a single sentence, what does your ANOVA say about how ethanol affects the REM sleep time of rats?

Solution: The greater the concentration of ethanol, the less REM sleep the rats in the study experienced on average. Increased concentration of ethanol in injections reduces the REM sleep time of rats.

3. [30 points] Another approach to analyzing the data of Problem 2 would be to carry out a regression of REM against the ethanol concentration. With this approach in mind, again use the SPSS data file ratrem.sav and answer the questions which follow.

(a) Calculate the linear regression of Y (REM) on X (Ethanol concentration). Record the equation of the line using the variable names.

Solution [4 points]: The fitted line is \(\widehat{\text{REM}} = 75.2 - 11.3 \times \text{Ethanol}\).

(b) Assuming that the linear model is applicable, construct a 90% confidence interval for \(\beta_1\) and interpret your interval in the context of this setting.
Solution [8 points]: The confidence interval has the form \(b_1 \pm t_{\alpha/2}(n-2)\, SE_{b_1}\). For a 90% interval \(\alpha/2 = .05\), so here \(b_1 = -11.3\), \(t_{.05}(18) = 1.734\) (from the calculator), and \(SE_{b_1} = 1.49\) from the output. Plugging in, we get \(-11.3 \pm (1.734)(1.49)\), or \(-11.3 \pm 2.58\), which gives (−13.88, −8.72). Interpreting, we are 90% confident that for every 1 g/kg increase in ethanol concentration, the mean REM in minutes per 24 hours decreases by between 8.72 minutes and 13.88 minutes.

(c) Assuming that the linear model is applicable, find estimates of the mean and standard deviation of REM at an ethanol level of 3 g/kg.

Solution [6 points]: The estimated mean is \(\hat{\mu}_{Y|X} = \hat{y} = 75.2 - 11.3 \times (3) = 41.3\). The estimated standard deviation is the standard deviation of the regression, which is found from the SPSS output to be \(s_{Y|X} = 9.85\).

(d) Assuming that the linear model is applicable, find the value of \(r^2\) and interpret it in the context of this problem.

Solution [4 points]: From the output we have \(r^2 = .763\). We conclude that 76.3 percent of the variability in REM is explained by the linear regression of REM on ethanol concentration.

(e) Use SPSS to plot the residuals versus predicted values from the regression in (a). Then make a normal probability plot of the residuals. Do these plots call into question the use of a linear model and regression inference procedures? Explain briefly.

Solution [8 points]: The residual versus predicted plot (below) shows some curvature and suggests a linear model is likely inappropriate:

[Figure: residuals versus predicted values]

The normal probability plot of the residuals (below) gives no basis for doubting the normality of the residuals.

[Figure: normal probability plot of the residuals]

Overall, we do have concern that the linear model is not appropriate.

4. [45 points] This question deals with the Western Electric data set (electric.sav) we looked at in the first lab of the quarter, an observational prospective study of health among men 40 to 55 years old at the beginning of the study.
For each of the following questions, please carry out the appropriate statistical procedure using information provided in the output. Do not try to use the SPSS data file to carry out the required procedure. Instead, just use the information provided in the output at the end of the exam for this question. If a test of hypothesis is required, define the parameters of interest, state the hypotheses, identify the required statistical test, calculate the test statistic and p-value, and interpret the results in context. If a confidence interval is required, define the parameters of interest, identify the confidence interval needed, calculate the interval, and interpret the interval in context.

(a) Based on this data set, was day of death of participants (for whom day of death is known) consistent with an equal probability of death for each day of the week? Use the appropriate portion of the output to carry out a statistical procedure to answer this question.

Solution: Let \(p_i\) denote the probability that a death occurs on day \(i\). We will carry out a goodness of fit test of the hypothesis \(H_0: p_i = 1/7\) for all \(i\) versus the alternative that this equiprobability model does not fit. We use the chi-square goodness of fit test with 6 degrees of freedom to get

\[ \chi^2_S = \sum_i \frac{(O_i - E_i)^2}{E_i} = 3.4 . \]

Since the p-value is \(P(\chi^2_6 > 3.4) = 0.76\), we have insufficient evidence to reject the null hypothesis that all days of death are equally probable. The variability in deaths from day to day is easily explained by chance.

(b) Based on this data set, is incidence of coronary heart disease related to family history of coronary heart disease (CHD) or are these two variables independent? Use the appropriate portion of the output to carry out a statistical procedure to answer this question.

Solution: Let \(p_1\) represent the probability of CHD given no family history of CHD and \(p_2\) the probability given family history of CHD.
A chi-square test of independence is needed to test \(H_0: p_1 = p_2\) versus the alternative \(H_1: p_1 < p_2\). This test is equivalent to testing whether the two variables are independent (\(H_0\)) or related (\(H_1\)). Using the data from the output, we calculate the test statistic as

\[ \chi^2_S = \sum_i \frac{(O_i - E_i)^2}{E_i} = 5.57 . \]

Since the p-value is \((1/2) \times P(\chi^2_1 > 5.57) = 0.5 \times 0.018 = 0.009\), we have strong evidence to reject the null hypothesis of independence in favor of the alternative that the probability of developing CHD is related to family history of the disease.

(c) What proportion of deaths of men in the population represented by this sample occur on the weekend? [Note: Be sure to attach a 95% confidence interval and the other information requested in the problem statement above.]

Solution: We wish to make inference about \(p\), the population proportion of deaths that occur on the weekend. We will use the one-sample z-interval

\[ \tilde{p} \pm z_{\alpha/2} \sqrt{\frac{\tilde{p}(1 - \tilde{p})}{n + 4}} . \]

Given the information in the output, we have \(n = 110\), \(\tilde{p} = (35 + 2)/(110 + 4) = 0.324\), and

\[ SE_{\tilde{p}} = \sqrt{\frac{0.324(1 - 0.324)}{110 + 4}} = 0.044 . \]

Since we require a 95% confidence interval, we have \(0.324 \pm 1.96(0.044)\), or (0.239, 0.410). Hence, we are 95% confident that the true population proportion of deaths on the weekend is between 0.239 and 0.410.

SPSS Output for Problem 4: The Western Electric Data
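The plus-four interval from Problem 4(c) can be cross-checked with a few lines of Python. This is a minimal sketch and not part of the original exam key: the function name `plus_four_interval` is hypothetical, and the inputs (35 weekend deaths out of n = 110) are the values quoted in the 4(c) solution. Only the standard library is used.

```python
import math
from statistics import NormalDist


def plus_four_interval(successes, n, confidence=0.95):
    """Plus-four z-interval for a proportion (hypothetical helper).

    Adds 2 successes and 2 failures before computing the usual
    large-sample interval, as in the Problem 4(c) solution.
    """
    p_tilde = (successes + 2) / (n + 4)
    se = math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p_tilde, se, (p_tilde - z * se, p_tilde + z * se)


# Values taken from the 4(c) solution: 35 weekend deaths among 110.
p_tilde, se, (lower, upper) = plus_four_interval(35, 110)
print(f"p~ = {p_tilde:.3f}, SE = {se:.3f}, interval = ({lower:.3f}, {upper:.3f})")
```

Because the code carries full precision through each step, its endpoints may differ from the hand calculation in the third decimal place, where the solution rounds the estimate and standard error before combining them.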