SOLUTIONS TO FINAL EXAM
VERSION 1
1)
A) No, since the Minitab p-value for the intercept is not less than .01. This p-value
corresponds to a two-tailed alternative hypothesis, which is what we want here.
We do not divide Minitab’s p-value by 2, since we do not have a right-tailed
alternative hypothesis.
B) We compute the t-statistic by hand from the Minitab output as t =
(10.905−10)/1.124 = .8052. We want to test the null hypothesis H0: β1 = 10
versus the two-tailed alternative hypothesis HA: β1 ≠ 10. We can reject the null
hypothesis if |t| > t.025, where from Table 6 with DF=12 we find that
t.025 = 2.179. (We can read the degrees of freedom DF=12 directly from the
Minitab output for Residual, or calculate it for this simple regression as
n−2=14−2=12.) Now, since |.8052| does not exceed 2.179, we do not reject the
null hypothesis. So we do not have evidence at the 5% level of significance that
the true coefficient of Nicotine is different from 10.
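The hand computation in part B can be sketched in a few lines of Python (not part of the original solutions; all numbers are taken from the Minitab output quoted above):

```python
# t-test of H0: beta1 = 10 vs. HA: beta1 != 10 for the Nicotine coefficient
beta1_hat = 10.905   # estimated coefficient from the Minitab output
se_beta1 = 1.124     # its standard error
beta1_null = 10.0    # hypothesized value under H0

t_stat = (beta1_hat - beta1_null) / se_beta1
t_crit = 2.179       # t_.025 with DF = 12, from Table 6

reject = abs(t_stat) > t_crit
print(round(t_stat, 4), reject)  # 0.8052 False
```

Since the test is two-tailed, the rejection rule compares |t| with t_.025 rather than t_.05.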
C) Looking at Figure 1, we see that the outlier has a Nicotine content of
approximately 2 and a CO of approximately 24. From the table, we find that the
data point for Bull Durham is the one.
D) Here, we want to test the null hypothesis H0: β1 = 0 versus the alternative
hypothesis HA: β1 > 0. Since the estimated coefficient (β̂1 = 10.905) is positive,
we can calculate our right-tailed p-value by taking Minitab’s 2-tailed p-value and
dividing by two. We don’t know exactly what Minitab’s p-value is, since it only
provides us with the first three digits after the decimal point. But we can say that
the two-tailed p-value is less than .0005, and therefore the right-tailed p-value is
less than .00025.
2)
A) Note that the standard error for Nicotine has gone way up. It is 3.939 in this
multiple regression model, compared with 1.124 in the simple regression model.
Furthermore, the estimated coefficient (which is now negative!) has gotten closer
to zero. The net result is that we have a much smaller t-statistic than in the simple
regression. The main problem here is presumably collinearity. Figure 4 shows that
the two explanatory variables in this multiple regression, Nicotine and Tar, are
themselves highly correlated. So we have a similar problem to what happened
with the Satellite Dish data set that we discussed in class, or the Home Price data
set given in the handouts.
B) The p-value for the coefficient of Nicotine in a left-tailed test is (since the
estimated coefficient is negative) .357/2 = .1785, which is not close to .05, so we
do not have strong evidence that the true coefficient of Nicotine is negative. Even
if the true coefficient were zero, we would find an estimated coefficient at least as
negative as the one we got here (as measured by the t-statistic) about 18% of the
time.
C) Apparently yes, since the R2 has gone up from 88.7% in the simple regression to
95.1% in the multiple regression. But we know that R2 always goes up when a
new variable is included in a regression. So maybe we’re just seeing that effect
here. You might try to argue that the R2 has gone up by “a lot”, but we haven’t
studied ways to gauge such an increase in R2 more formally. The tool we did
examine to compare different models is the model selection method, AICC. But I
didn’t ask you to calculate it here.
3) There is no contradiction here, since the meaning and interpretation of the coefficient
of Nicotine depends on what other variables are in the model. We had a similar
situation with the Home Price data set (for the coefficient of Age). Note that the
coefficient of Nicotine is not significant in the multiple regression model. So given a
model containing both Tar and Nicotine, there’s no evidence that Nicotine affects the
expected value of CO. It’s hard to say if one of the values for the coefficient of
Nicotine is more trustworthy than the other (we would be trying to compare two very
different things, as described above), but we could decide whether one of the models
is more trustworthy than the other by comparing the AICC values.
4)
A) Plugging in to the estimated regression equation with x1 = 1, x2 = 10, x3 = 1 yields
ŷ = 9.684 − 2.916(1) + .9227(10) − 6.207(1) = 9.788 (mg/cigarette of CO).
B) We estimate that for every additional milligram of Tar per cigarette, the expected
CO output increases by .9227 milligrams/cigarette, after controlling for Nicotine
content and Weight.
C) We have F=MSR/MSE, that is, 87.93=78.526/.893, where MSR is the regression
mean square, MSR=SSR/DF(Regression)=235.579/3=78.526, and
MSE=SSE/DF(Residual)=8.930/10=.893.
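The ANOVA arithmetic in part C can be reproduced directly (sums of squares and degrees of freedom from the Minitab output; the computed F agrees with Minitab's reported 87.93 up to rounding):

```python
# ANOVA table quantities
ssr, df_reg = 235.579, 3   # regression sum of squares and its DF
sse, df_res = 8.930, 10    # residual sum of squares and its DF

msr = ssr / df_reg         # regression mean square
mse = sse / df_res         # residual mean square
f_stat = msr / mse         # signal-to-noise ratio, about 87.9
print(round(msr, 3), round(mse, 3), round(f_stat, 2))
```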
D) The p-value for the F-test is 0.000, in other words, less than 5/10000. If the true
coefficients of Nicotine, Tar and Weight were all zero (so that the multiple
regression is completely useless) we would be extremely unlikely to find such a
large signal-to-noise ratio, F=87.93 or larger. This provides strong evidence that
the regression is not useless, in other words that at least one of the true
coefficients is nonzero. The validity of these statements depends on all of the
assumptions of the regression model holding, for example that the errors are
independent and normally distributed with mean zero and constant variance.
5)
A) The CI takes the form β̂3 ± t.025 SE(β̂3). We find from Table 6 with DF=10 that
t.025 = 2.228. Thus the CI is
−6.207 ± 2.228(3.386) = −6.207 ± 7.544 = (−13.751, 1.337).
B) The linear regression model assumptions must hold, including the assumptions on
the errors mentioned above, and also the assumption that the expected value of CO
given the explanatory variables is a linear combination of Nicotine, Tar and Weight.
6)
A) The 99% CI has form x̄ ± t.005 s/√n. We have n=14, DF=n−1=13, and we find
from Table 6 that t.005 = 3.012. So the 99% CI is
13.21 ± 3.012(1.16) = 13.21 ± 3.49 = (9.72, 16.7).
B) The method used to construct this interval would, in 95% of all random samples
that could be selected from the population of all brands, contain the true expected
CO content, that is, the population mean.
C) We cannot talk about the probability that the true mean μ is between 9.72 and
16.7, since there is nothing random about μ or the endpoints 9.72 and 16.7.
D) We need to assume that we took a random sample from the population of all
brands (this assumption is impossible to check, based on the information given)
and that CO content is normally distributed. To check the normal distribution
assumption, we can look at Figure 1. Considering just the vertical coordinate
(CO), we see that the outlier, Bull Durham (with a CO content of 23.5 from Table
1), is far above the average. More precisely, from the one-sample T output above
we see that the sample standard deviation of CO is 4.34 and the sample mean
is 13.21. Therefore, the CO content for Bull Durham is z=(23.5−13.21)/4.34=2.37
standard deviations above the mean. So we have some cause to worry that the
normality assumption may not hold here. If we ignore the estimation error and
assume the data are normally distributed, the chance of observing a value at
least 2.37 standard deviations above the mean is (approximately) just .01. So
this seems to suggest reasonably strongly that the population is not normal. On the
other hand, the fact that the sample mean and sample standard deviation are
subject to sampling error, combined with the fact that we are looking at the
maximum of 14 values, weakens the evidence of non-normality. No precise
conclusion on normality can be drawn on the basis of the information provided.
One could do a normality test in Minitab and examine the p-value, but that output
was not provided.
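The z-score argument for Bull Durham can be sketched as (all values from Table 1 and the one-sample T output):

```python
# How unusual is Bull Durham's CO content within this sample?
x = 23.5          # Bull Durham's CO, from Table 1
x_bar = 13.21     # sample mean CO
s = 4.34          # sample standard deviation of CO

z = (x - x_bar) / s
print(round(z, 2))  # 2.37
```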
7) Based on the linear regression assumptions, we know that the errors are independent
and normally distributed with mean zero and constant variance, σ². We need to
determine the value of σ. Let’s use the notation from the first part of the course and
refer to the 100 values of the errors as x1, x2, …, x100. Then since n=100, we have
.025 = Pr{x1 + … + x100 > 10} = Pr{X̄ > .1}. Since the errors have mean zero, we have
E[X̄] = μx̄ = 0. The standard deviation of the sampling distribution of the sample
mean is σx̄ = σ/√100. And since X̄ is normally distributed (an average of normals
is normal), we have after converting to z-scores that
.025 = Pr{Std Normal > 10(.1)/σ}. Since Pr{Std Normal > 1.96} = .025, we
conclude that 1.96 = 1/σ, so that σ = 1/1.96 = .5102. Now, we can answer the
original question. The probability that at least one of the errors is less than −1 is equal
to one minus the probability that they all exceed −1, that is,
1 − [Pr{Std Normal > −1/.5102}]^100 = 1 − [Pr{Std Normal > −1.96}]^100 = 1 − .975^100
= 1 − .0795 = .9205.
8) If all of the y values are doubled, ȳ will double, and the total sum of squares
SST = Σ(yi − ȳ)² will quadruple. Due to the doubling of the least squares estimates,
the ŷ values will also double, since ŷ = β̂0 + β̂1x1 + … + β̂kxk and the explanatory
variables are being held fixed. Thus the regression sum of squares
SSR = Σ(ŷi − ȳ)² will quadruple. We conclude that R² = SSR/SST will remain
unchanged.
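The invariance of R² under doubling of y can be verified numerically on a toy data set (the data below are made up for illustration; the least squares fit is computed by hand):

```python
# R^2 for a simple least squares regression of ys on xs
def r_squared(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx                   # slope estimate
    b0 = y_bar - b1 * x_bar          # intercept estimate
    sst = sum((y - y_bar) ** 2 for y in ys)
    ssr = sum((b0 + b1 * x - y_bar) ** 2 for x in xs)
    return ssr / sst

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.7]       # hypothetical responses
r2_original = r_squared(xs, ys)
r2_doubled = r_squared(xs, [2 * y for y in ys])
print(abs(r2_original - r2_doubled) < 1e-9)  # True
```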
9) The 95% CI has form x̄ ± t.025 s/√10. Thus from the information given we can
compute the sample mean as x̄ = (−.1453 + .2511)/2 = .0529. From Table 6 with
DF=n−1=9 we find that t.025 = 2.262. The width of the CI is
.2511 − (−.1453) = .3964 = 2(2.262)s/√10 and therefore
s = √10(.3964)/[2(2.262)] = .2771. Finally, we can compute the t-statistic as
t = (x̄ − .1)/(s/√10) = (.0529 − .1)/(.2771/√10) = −.5375.
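Working backwards from the reported interval can be sketched as follows (the CI endpoints and n are the given information; everything else is recovered from them):

```python
import math

# Recover the sample statistics from the 95% CI (-.1453, .2511) with n = 10
lo, hi = -0.1453, 0.2511
n = 10
t_crit = 2.262                                 # t_.025 with DF = 9, from Table 6

x_bar = (lo + hi) / 2                          # the CI is centered at the sample mean
s = math.sqrt(n) * (hi - lo) / (2 * t_crit)    # back out s from the CI width
t_stat = (x_bar - 0.1) / (s / math.sqrt(n))    # t-statistic for H0: mu = .1
print(round(x_bar, 4), round(s, 4), round(t_stat, 4))  # 0.0529 0.2771 -0.5375
```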
10) Since the null hypothesis is true here, there is a 5% probability that a particular one of
the samples (chosen in advance, say, the first one) leads to a p-value less than .05.
Since the samples are independent of each other, Y is the number of “successes” in 50
independent trials, where “success” corresponds to “p-value<.05”. Thus Y has a
binomial distribution with n=50 and p=.05, where this time p represents the probability
of “success” at a given trial. From the formula for the standard deviation of a binomial
distribution we have Var(Y) = npq = 50(.05)(.95) = 2.375, so SD(Y) = √2.375 = 1.541.
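The binomial moments can be computed directly (n and p as given in the problem):

```python
import math

# Y = number of p-values below .05 in 50 independent samples, under a true H0
n, p = 50, 0.05
q = 1 - p

mean_y = n * p              # expected number of false rejections
var_y = n * p * q           # binomial variance
sd_y = math.sqrt(var_y)     # binomial standard deviation
print(mean_y, round(var_y, 3), round(sd_y, 3))  # 2.5 2.375 1.541
```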