SOLUTIONS TO FINAL EXAM VERSION 1

1) A) No, since the Minitab p-value for the intercept is not less than .01. This p-value corresponds to a two-tailed alternative hypothesis, which is what we want here. We do not divide Minitab's p-value by 2, since we do not have a right-tailed alternative hypothesis.

B) We compute the t-statistic by hand from the Minitab output as t = (10.905 − 10)/1.124 = .8052. We want to test the null hypothesis H0: β1 = 10 against the two-tailed alternative hypothesis HA: β1 ≠ 10. We can reject the null hypothesis if |t| > t.025, where from Table 6 with DF = 12 we find that t.025 = 2.179. (We can read the degrees of freedom DF = 12 directly from the Minitab output for Residual, or calculate it for this simple regression as n − 2 = 14 − 2 = 12.) Since |.8052| does not exceed 2.179, we do not reject the null hypothesis. So we do not have evidence at the 5% level of significance that the true coefficient of Nicotine is different from 10.

C) Looking at Figure 1, we see that the outlier has a Nicotine content of approximately 2 and a CO of approximately 24. From the table, we find that this is the data point for Bull Durham.

D) Here, we want to test the null hypothesis H0: β1 = 0 against the alternative hypothesis HA: β1 > 0. Since the estimated coefficient (β̂1 = 10.905) is positive, we can calculate our right-tailed p-value by taking Minitab's two-tailed p-value and dividing it by two. We don't know exactly what Minitab's p-value is, since Minitab only provides the first three digits after the decimal point. But we can say that the two-tailed p-value is less than .0005, and therefore the right-tailed p-value is less than .00025.

2) A) Note that the standard error for Nicotine has gone way up: it is 3.939 in this multiple regression model, compared with 1.124 in the simple regression model. Furthermore, the estimated coefficient (which is now negative!) has gotten closer to zero.
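The hand computation in part 1B above can be reproduced in a few lines. This sketch is not part of the original solutions; the variable names are my own, and it assumes scipy is available. The estimate 10.905 and standard error 1.124 are the values quoted from the Minitab output.

```python
# Check of the two-tailed t-test in part 1B: H0: beta1 = 10 vs. HA: beta1 != 10.
from scipy.stats import t as t_dist

beta_hat, se, beta_null, df = 10.905, 1.124, 10.0, 12

t_stat = (beta_hat - beta_null) / se      # hand value: .8052
t_crit = t_dist.ppf(1 - 0.05 / 2, df)     # t_{.025,12} from Table 6: 2.179

reject = abs(t_stat) > t_crit             # False: we fail to reject H0
print(t_stat, t_crit, reject)
```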
The net result is that we have a much smaller t-statistic than in the simple regression. The main problem here is presumably collinearity: Figure 4 shows that the two explanatory variables in this multiple regression, Nicotine and Tar, are themselves highly correlated. So we have a problem similar to what happened with the Satellite Dish data set that we discussed in class, or the Home Price data set given in the handouts.

B) The p-value for the coefficient of Nicotine in a left-tailed test is (since the estimated coefficient is negative) .357/2 = .1785, which is not close to .05, so we do not have strong evidence that the true coefficient of Nicotine is negative. Even if the true coefficient were zero, we would find an estimated coefficient at least as negative as the one we got here (as measured by the t-statistic) about 18% of the time.

C) Apparently yes, since R² has gone up from 88.7% in the simple regression to 95.1% in the multiple regression. But we know that R² always goes up when a new variable is included in a regression, so maybe we're just seeing that effect here. You might try to argue that R² has gone up by "a lot", but we haven't studied ways to gauge such an increase in R² more formally. The tool we did examine to compare different models is the model selection criterion, AICC. But I didn't ask you to calculate it here.

3) There is no contradiction here, since the meaning and interpretation of the coefficient of Nicotine depend on what other variables are in the model. We had a similar situation with the Home Price data set (for the coefficient of Age). Note that the coefficient of Nicotine is not significant in the multiple regression model. So given a model containing both Tar and Nicotine, there's no evidence that Nicotine affects the expected value of CO.
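The standard-error inflation described in part 2A can be illustrated with a small simulation. This is not from the exam: the data below are entirely made up, with two nearly identical predictors standing in for Nicotine and Tar, and it assumes numpy is available.

```python
# Illustration of collinearity inflating a coefficient's standard error.
# All data here are simulated; only x1 actually drives y.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # x2 is almost a copy of x1
y = 10 * x1 + rng.normal(size=n)

def coef_se(X, y):
    """OLS coefficient standard errors: sqrt of diag of MSE * (X'X)^{-1}."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    mse = resid @ resid / (X.shape[0] - X.shape[1])
    return np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))

ones = np.ones(n)
se_simple = coef_se(np.column_stack([ones, x1]), y)[1]
se_multiple = coef_se(np.column_stack([ones, x1, x2]), y)[1]
print(se_simple, se_multiple)   # the SE for x1 is far larger in the multiple model
```

The same coefficient gets a much larger standard error once the near-duplicate predictor enters the model, which is exactly the pattern seen in the Nicotine/Tar regression.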
It's hard to say whether one of the values for the coefficient of Nicotine is more trustworthy than the other (we would be trying to compare two very different things, as described above), but we could decide whether one of the models is more trustworthy than the other by comparing the AICC values.

4) A) Plugging in to the estimated regression equation with x1 = 1, x2 = 10, x3 = 1 yields ŷ = 9.684 − 2.916(1) + .9227(10) − 6.207(1) = 9.788 (mg/cigarette of CO).

B) We estimate that for every additional milligram of Tar per cigarette, the expected CO output increases by .9227 milligrams/cigarette, after controlling for Nicotine content and Weight.

C) We have F = MSR/MSE, that is, 87.93 = 78.526/.893, where MSR is the regression mean square, MSR = SSR/DF(Regression) = 235.579/3 = 78.526, and MSE = SSE/DF(Residual) = 8.930/10 = .893.

D) The p-value for the F-test is 0.000, in other words, less than 5/10000. If the true coefficients of Nicotine, Tar and Weight were all zero (so that the multiple regression is completely useless), we would be extremely unlikely to find such a large signal-to-noise ratio, F = 87.93 or larger. This provides strong evidence that the regression is not useless, in other words that at least one of the true coefficients is nonzero. The validity of these statements depends on all of the assumptions of the regression model holding, for example that the errors are independent and normally distributed with mean zero and constant variance.

5) A) The CI takes the form β̂3 ± t.025 SE(β̂3). We find from Table 6 with DF = 10 that t.025 = 2.228. Thus the CI is −6.207 ± 2.228(3.386) = −6.207 ± 7.544 = (−13.751, 1.337).

B) The linear regression model assumptions must hold, including the assumptions on the errors mentioned above, and also the assumption that the expected value of CO given the explanatory variables is a linear combination of Nicotine, Tar and Weight.

6) A) The 99% CI has the form x̄ ± t.005 s/√n. We have n = 14, DF = n − 1 = 13, and we find from Table 6 that t.005 = 3.012.
So the 99% CI is 13.21 ± 3.012(1.16) = 13.21 ± 3.49 = (9.72, 16.70).

B) The method used to construct this interval would, in 95% of all random samples that could be selected from the population of all brands, produce an interval containing the true expected CO content, that is, the population mean.

C) We cannot talk about the probability that the true mean μ is between 9.72 and 16.70, since there is nothing random about μ or about the endpoints 9.72 and 16.70.

D) We need to assume that we took a random sample from the population of all brands (this assumption is impossible to check, based on the information given) and that CO content is normally distributed. To check the normality assumption, we can look at Figure 1. Considering just the vertical coordinate (CO), we see that the outlier, Bull Durham (with a CO content of 23.5 from Table 1), is far above the average. More precisely, from the one-sample T output above, we see that the sample standard deviation of CO is 4.34 and the sample mean is 13.21. Therefore, the CO content for Bull Durham is z = (23.5 − 13.21)/4.34 = 2.37 standard deviations above the mean. So we have some cause to worry that the normality assumption may not hold here. If we ignore the estimation error and assume the data are normally distributed, the chance of finding a standard normal value at least 2.37 standard deviations above the mean is (approximately) just .01. So this seems to suggest reasonably strongly that the population is not normal. On the other hand, the fact that the sample mean and sample standard deviation are subject to sampling error, combined with the fact that we are looking at the maximum of 14 values, weakens the evidence of non-normality. No precise conclusion on normality can be drawn on the basis of the information provided. One could run a normality test in Minitab and examine the p-value, but that output was not provided.
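The interval and z-score in problem 6 can be double-checked numerically. This sketch is not part of the original solutions and assumes scipy is available; the inputs (n = 14, mean = 13.21, SE mean = 1.16, s = 4.34) are the values quoted from the one-sample T output.

```python
# Check of the 99% CI in 6A and the Bull Durham z-score in 6D.
from scipy.stats import t as t_dist

n, xbar, se_mean, s = 14, 13.21, 1.16, 4.34

t_crit = t_dist.ppf(1 - 0.01 / 2, n - 1)                  # t_{.005,13}: 3.012
lo, hi = xbar - t_crit * se_mean, xbar + t_crit * se_mean  # (9.72, 16.70)

z_outlier = (23.5 - xbar) / s    # Bull Durham is about 2.37 SDs above the mean
print(lo, hi, z_outlier)
```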
7) Based on the linear regression assumptions, we know that the errors are independent and normally distributed with mean zero and constant variance σ². We need to determine the value of σ. Let's use the notation from the first part of the course and refer to the 100 values of the errors as x1, x2, …, xn. Then since n = 100, we have .025 = Pr{x1 + ⋯ + x100 > 10} = Pr{X̄ > .1}. Since the errors have mean zero, we have E[X̄] = μX̄ = 0. The standard deviation of the sampling distribution of the sample mean is σX̄ = σ/√100. And since X̄ is normally distributed (an average of normals is normal), we have, after converting to z-scores, .025 = Pr{Std Normal > 10(.1)/σ} = Pr{Std Normal > 1/σ}. Since Pr{Std Normal > 1.96} = .025, we conclude that 1.96 = 1/σ, so that σ = 1/1.96 = .5102. Now we can answer the original question. The probability that at least one of the errors is less than −1 is equal to one minus the probability that they all exceed −1, that is, 1 − [Pr{Std Normal > −1/.5102}]^100 = 1 − [Pr{Std Normal > −1.96}]^100 = 1 − .975^100 = 1 − .0795 = .9205.

8) If all of the y values are doubled, ȳ will double, and the total sum of squares SST = Σ(yi − ȳ)² will quadruple. Due to the doubling of the least squares estimates, the ŷ values will also double, since ŷ = β̂0 + β̂1 x1 + ⋯ + β̂k xk and the explanatory variables are being held fixed. Thus the regression sum of squares SSR = Σ(ŷi − ȳ)² will quadruple. We conclude that R² = SSR/SST will remain unchanged.

9) The 95% CI has the form x̄ ± t.025 s/√10. Thus from the endpoints given we can compute the sample mean as x̄ = (−.1453 + .2511)/2 = .0529. From Table 6 with DF = n − 1 = 9 we find that t.025 = 2.262. The width of the CI is .2511 − (−.1453) = .3964 = 2(2.262)s/√10, and therefore s = √10(.3964)/[2(2.262)] = .2771. Finally, we can compute the t-statistic as t = (x̄ − .1)/(s/√10) = (.0529 − .1)/(.2771/√10) = −.5375.

10) Since the null hypothesis is true here, there is a 5% probability that a particular one of the samples (chosen in advance, say, the first one) leads to a p-value less than .05.
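The arithmetic in problems 7 and 9 above can be verified numerically. This is a sketch, not part of the original solutions; it assumes scipy is available, and the variable names are my own.

```python
# Checks of problems 7 and 9 using standard normal and t quantiles.
from math import sqrt
from scipy.stats import norm, t as t_dist

# Problem 7: solve .025 = Pr{Std Normal > 1/sigma} for sigma, then compute
# P(at least one of 100 errors < -1) = 1 - P(one error > -1)^100.
sigma = 1 / norm.ppf(0.975)                  # 1/1.96 = .5102
p_answer = 1 - norm.sf(-1 / sigma) ** 100    # 1 - .975^100 = .9205

# Problem 9: recover xbar, s, and the t-statistic from the CI (-.1453, .2511).
lo, hi, n = -0.1453, 0.2511, 10
xbar = (lo + hi) / 2                         # .0529
t025 = t_dist.ppf(0.975, n - 1)              # t_{.025,9}: 2.262
s = sqrt(n) * (hi - lo) / (2 * t025)         # .2771
t_stat = (xbar - 0.1) / (s / sqrt(n))        # -.5375
print(p_answer, xbar, s, t_stat)
```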
Since the samples are independent of each other, Y is the number of "successes" in 50 independent trials, where "success" corresponds to "p-value < .05". Thus Y has a binomial distribution with n = 50 and p = .05, where this time p represents the probability of "success" at a given trial. From the formulas for the variance and standard deviation of a binomial distribution, we have Var(Y) = npq = 50(.05)(.95) = 2.375, so SD(Y) = √2.375 = 1.541.
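The binomial calculation in problem 10 can be confirmed in a few lines (a sketch, not part of the original solutions):

```python
# Check of problem 10: Y ~ Binomial(n = 50, p = .05).
from math import sqrt

n, p = 50, 0.05
var_y = n * p * (1 - p)    # npq = 2.375
sd_y = sqrt(var_y)         # about 1.541
print(var_y, sd_y)
```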