Solutions by Victor Ying
Exercise 10.1
(a) Plot wages versus LOS. Describe the relationship. There is one woman with relatively high
wages for her length of service. Circle this point and do not use it in the rest of this exercise.
(b) Find the least-squares line. Summarize the significance test for the slope. What do you
conclude?
The least-squares line is

ŷ (predicted wages) = 43.44 + 0.07295·LOS

(the intercept and slope estimates are the midpoints of the confidence intervals reported below). The standard error for the estimate of the slope is 0.02560. The corresponding t- and p-values are 2.85 and 0.00605, respectively. With a p-value this small, we conclude that there is significant evidence against the null hypothesis that the slope is zero.
(c) State carefully what the slope tells you about the relationship between wages and length of
service.
The slope tells us that for every unit increase in LOS, the average wage will increase by 0.07295
units.
(d) Give a 95% confidence interval for the slope.
A 95% confidence interval for the slope is [0.02169093, 0.1242139].
Code and Output:
> confint( m1)
                  2.5 %     97.5 %
(Intercept) 38.95561398 47.9195425
LOS          0.02169093  0.1242139
Exercise 10.2
Refer to the previous exercise. Analyze the data with the outlier included. How does this change the
estimates of the parameters β0, β1, and σ? What effect does the outlier have on the results of the
significance test for the slope?
Without the outlier (previous exercise), the point estimates were β̂0 = 43.44, β̂1 = 0.07295, and σ̂ equal to the residual standard error of that fit. With the outlier included, the estimates of β0, β1, and σ all change. The outlier has the effect of lowering the significance of the test for the slope, increasing the p-value from 0.00605 to 0.0186.
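A minimal R sketch of this comparison, assuming the data are in a data frame called wages.data with columns wages and LOS, and that the circled woman is in row out (both names are placeholders, not taken from the original code):
out <- 1                                                   # hypothetical row index of the outlying observation
m.without <- lm( wages ~ LOS, data=wages.data[ -out, ])    # fit without the outlier
m.with    <- lm( wages ~ LOS, data=wages.data)             # fit with the outlier included
coef( m.without); sigma( m.without)                        # beta0-hat, beta1-hat and sigma-hat without the outlier
coef( m.with);    sigma( m.with)                           # the same estimates with the outlier included
summary( m.with)$coefficients[ "LOS", ]                    # slope estimate, SE, t value and p-value with the outlier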
Exercise 10.5
In Example 10.8 we examined the yield in bushels per acre of corn for the years 1966, 1976, 1986, and
1996. Data for all years between 1957 and 1996 appear in Table 10.2.
(a) Plot the yield versus year. Describe the relationship. Are there any outliers or unusual years?
[Scatterplot of Yield (bushels per acre) versus Year (Exercise 10.5)]
Comment:
The relationship looks roughly linear. There don’t appear to be any outliers. As for unusual years,
there are a few years in the 70s and 80s that appear to have a lower yield than what might be
expected.
(b) Perform the regression analysis and summarize the results. How rapidly has yield increased
over time?
The least-squares line is

ŷ (predicted Yield) = −3544.6463 + 1.8396·Year
The significance tests for intercept and slope are both highly significant—the p-value for the test
of the intercept being equal to zero was p = 2.77e-15 and for the test of the slope being equal
to zero was p = 1.26e-15.
The average yield has increased by approximately 1.84 bushels per acre per year.
Code and Output:
> m5= lm( Yield ~ Year)
> summary( m5)
Call:
lm(formula = Yield ~ Year)
Residuals:
    Min      1Q  Median      3Q     Max 
-27.890  -6.102   1.240   7.270  15.072 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3544.6463   278.3787  -12.73 2.77e-15 ***
Year            1.8396     0.1408   13.06 1.26e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.28 on 38 degrees of freedom
Multiple R-squared: 0.8178,  Adjusted R-squared: 0.813
F-statistic: 170.6 on 1 and 38 DF,  p-value: 1.257e-15
Exercise 10.6
(a) Find the equation of the least-squares line.
The equation is:

ŷ = 1.23235 + 0.20221·X

Code and Output:
> m6= lm( Y ~ X)
> summary( m6)
Call:
lm(formula = Y ~ X)
Residuals:
     Min       1Q   Median       3Q      Max 
-0.55162 -0.17595 -0.09349  0.17381  0.81315 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.23235    0.28604   4.308  0.00353 ** 
X            0.20221    0.01145  17.663  4.6e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4345 on 7 degrees of freedom
Multiple R-squared: 0.9781,  Adjusted R-squared: 0.9749
F-statistic: 312 on 1 and 7 DF,  p-value: 4.596e-07
(b) Test the null hypothesis that the slope is zero and describe your conclusion.
The p-value for this test is p = 4.6e-07, so we reject the null hypothesis at the 0.05 level. We conclude that there is very strong evidence that the slope is not zero.
(c) Give a 95% confidence interval for the slope.
A 95% confidence interval for the slope is [0.1751401, 0.2292829].
Code and Output:
> confint( m6)
                2.5 %    97.5 %
(Intercept) 0.5559843 1.9087240
X           0.1751401 0.2292829
(d) The parameter β0 corresponds to natural gas consumption for cooking, hot water, and other uses when there is no demand for heating. Give a 95% confidence interval for this parameter.
A 95% confidence interval for the intercept is [0.5559843, 1.9087240].
Exercise 10.8
(a) Plot the data. Does the trend in lean over time appear to be linear?
Comment:
Yes, the trend appears to be linear.
Plot:
[Scatterplot of Lean versus Year (Exercise 10.8)]
(b) What is the equation of the least-squares line? What percentage of the variation in lean is
explained by this line?
The equation of the least-squares line is

ŷ (predicted Lean) = −61.1209 + 9.3187·Year

From the R-squared value of 0.988, approximately 98.8% of the variation in lean is explained by this line.
Code and Output:
> m8= lm( Lean ~ Year)
> summary( m8)
Call:
lm(formula = Lean ~ Year)
Residuals:
    Min      1Q  Median      3Q     Max 
-5.9670 -3.0989  0.6703  2.3077  7.3956 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -61.1209    25.1298  -2.432   0.0333 *  
Year          9.3187     0.3099  30.069  6.5e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.181 on 11 degrees of freedom
Multiple R-squared: 0.988,  Adjusted R-squared: 0.9869
F-statistic: 904.1 on 1 and 11 DF,  p-value: 6.503e-12
(c) Give a 95% confidence interval for the average rate of change (tenths of a millimeter per year)
of the lean.
A 95% confidence interval for the average rate of change (tenths of a millimeter per year) of the lean is 9.3187 ± t*(0.3099), where t* = 2.201 is the 0.975 critical value of the t(11) distribution, i.e. approximately [8.64, 10.00].
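The same interval can be read off directly in R (a quick check, assuming the model object m8 from part (b) is still available):
> confint( m8)[ "Year", ]# 95% limits for the slope; should agree with 9.3187 +/- 2.201*0.3099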
Exercise 10.9
(a) In 1918 the lean was 2.9071. (The coded value is 71.) Using the least-squares equation for the
years 1975 to 1987, calculate a predicted value for the lean in 1918.
The predicted value for the lean in 1918 is 106.6154 tenths of a millimeter.
Code and Output:
> predict( m8, newdata=data.frame( Year=c( 18)), interval="prediction",
level=0.95)
       fit      lwr     upr
1 106.6154 62.59376 150.637
(b) Although the least-squares line gives an excellent fit to the data for 1975 to 1987, this pattern
did not extend back to 1918. Write a short statement explaining why this conclusion follows
from the information available. Use numerical and graphical summaries to support your
explanation.
The conclusion (that the linear pattern does not extend back to 1918) follows because the lean predicted by the least-squares line for 1918 (2.9107 meters, coded value 106.6) does not match the actual lean in 1918 (2.9071 meters, coded value 71). As for the part of the question asking for numerical and graphical summaries, I'm not sure which summaries the author has in mind; it is also not obvious how any numerical or graphical summary by itself would explain why the pattern does not extend back to 1918.
Exercise 10.10
(a) The engineers working on the Leaning Tower of Pisa are most interested in how much the
tower will lean if no corrective action is taken. Use the least-squares equation to predict the
tower’s lean in the year 1997.
The predicted value for the lean in 1997 is 842.7912 tenths of a millimeter.
Code and Output:
> predict( m8, newdata=data.frame( Year=c( 97)), interval="prediction",
level=0.95)
       fit      lwr      upr
1 842.7912 828.2892 857.2932
(b) To give a margin of error for the lean in 1997, would you use a confidence interval for a mean
response or a prediction interval? Explain your choice.
We would use a prediction interval, because we are predicting the tower's actual lean in a single future year (1997), not estimating a mean response.
Exercise 10.13
Exercise 10.6 gives information about the regression of natural gas consumption on degree-days for a
particular household.
(a) What is the t statistic for testing H0: β1 = 0?
The t statistic was 17.663.
(b) For the alternative Ha: β1 > 0, what critical value would you use for a test at the stated significance level? Do you reject H0 at this level?
We would use the corresponding upper critical value of the t(7) distribution; at the 0.05 level, for example, this is t* = 1.895. Since t = 17.663 far exceeds this, yes, we would reject H0 at this level.
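A one-line R check of the critical value, assuming the 5% significance level and the 7 residual degrees of freedom from the Exercise 10.6 fit:
> qt( p=0.95, df=7)# upper 5% critical value of the t(7) distribution, approximately 1.895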
(c) How would you report the P-value for this test?
We would report p= 2.297770e-07.
Code and Output:
> pt( q=17.663, df=7, lower.tail=F)
[1] 2.297770e-07
Exercise 10.15
(a) Find x̄ and ∑(xi − x̄)² from the data.
We have:
x̄ = 13.07
and from the equation:
SE(β̂1) = s / √∑(xi − x̄)²
we have:
∑(xi − x̄)² = ( s / SE(β̂1) )² = ( 1.757 / 0.08344 )² ≈ 443.4
Code and Output:
> mean( Spheres)
[1] 13.07
> m15= lm( Vein ~ Spheres)
> summary( m15)
Call:
lm(formula = Vein ~ Spheres)
Residuals:
    Min      1Q  Median      3Q     Max 
-2.2136 -1.2829 -0.3227  0.9887  3.0295 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.03119    1.22391   0.843    0.424    
Spheres      0.90197    0.08344  10.810 4.73e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.757 on 8 degrees of freedom
Multiple R-squared: 0.9359,  Adjusted R-squared: 0.9279
F-statistic: 116.8 on 1 and 8 DF,  p-value: 4.733e-06
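A short R sketch of the part (a) calculation, using the residual standard error and the slope standard error from the summary above (s and se are local names introduced here):
s  <- 1.757     # residual standard error from summary( m15)
se <- 0.08344   # standard error of the slope from summary( m15)
( s/ se)^2      # sum of (x_i - x_bar)^2, approximately 443.4
mean( Spheres)  # x_bar = 13.07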
(b) We expect x and y to be positively associated. State hypotheses in terms of the slope of the
population regression line that expresses this expectation, and carry out a significance test.
What conclusions do you draw?
The hypotheses are H0: β1 = 0 versus Ha: β1 > 0.
The p-value of the significance test, carried out on a t-value of 10.810 from a t(8) distribution, is 2.365e-06 (obtained either by dividing the two-sided p-value from the output in part (a), p = 4.73e-06, by 2, or by using the pt() function in R).
The conclusion is that there is strong evidence that the slope is positive.
Code and Output:
Either
> 4.73e-06/2
[1] 2.365e-06
or
> pt( q=10.810 , df=8, lower.tail=F)
[1] 2.365891e-06
(c) Find a 99% confidence interval for the slope.
A 99% confidence interval is [0.6219957, 1.181953].
Code and Output:
> confint( m15, level=.99)
                 0.5 %   99.5 %
(Intercept) -3.0754859 5.137872
Spheres      0.6219957 1.181953
(d) Suppose that we observe a value of Spheres equal to 15.0 for one dog. Give a 90% interval for
predicting the variable Vein for that dog.
A 90% prediction interval for Vein corresponding to a value of 15.0 for Spheres is [11.12175, 17.99987].
Code and Output:
> predict( m15, newdata=data.frame( Spheres=c( 15.0)),
interval="prediction", level=0.90)
       fit      lwr      upr
1 14.56081 11.12175 17.99987
Exercise 10.17
(a) Plot the data. Are there any outliers or unusual points?
There are no outliers or unusual points.
Plot:
[Scatterplot of I versus V (Exercise 10.17)]
(b) Find the least-squares fit to the data, and estimate 1/R for this wire. Then give a 95%
confidence interval for 1/R.
The least-squares fit is

Î = −0.06485 + 1.18445·V

the estimate for 1/R is 1.18445, and a 95% confidence interval for 1/R is [0.9365513, 1.4323495].
Code and Output:
> m17= lm( I ~ V)
> summary( m17)
Call:
lm(formula = I ~ V)
Residuals:
        1         2         3         4         5 
-0.007373  0.070402 -0.091823 -0.067158  0.095952 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.06485    0.11416  -0.568 0.609729    
V            1.18445    0.07790  15.206 0.000618 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09515 on 3 degrees of freedom
Multiple R-squared: 0.9872,  Adjusted R-squared: 0.9829
F-statistic: 231.2 on 1 and 3 DF,  p-value: 0.0006176

> confint( m17)
                 2.5 %    97.5 %
(Intercept) -0.4281749 0.2984698
V            0.9365513 1.4323495
(c) If β̂1 estimates 1/R, then 1/β̂1 estimates R. Estimate the resistance R. Similarly, if L and U represent the lower and upper confidence limits for 1/R, then the corresponding limits for R are given by 1/U and 1/L, as long as L and U are positive. Use this fact and your answer to (b) to find a 95% confidence interval for R.
The estimate of R is 1/1.18445 ≈ 0.844. A 95% confidence interval for R is
[1/1.4323495, 1/0.9365513] ≈ [0.698, 1.068].
(d) Ohm’s law states that β0 in the model is 0. Calculate the test statistic for this hypothesis and give an approximate P-value.
The test statistic for this hypothesis (see Code and Output for part (b)) is t = −0.568, and the corresponding p-value from a two-sided test is approximately 0.61.
Exercise 10.19
(a) Plot the data. Are there any outliers or unusual points?
There don’t appear to be any outliers. There are a few different VO2 values corresponding to
the same HR value, which might be unusual.
Plot:
[Scatterplot of VO2 versus HR (Exercise 10.19)]
(b) Compute the least-squares regression line for predicting oxygen uptake from heart rate for this individual.
The least-squares regression line is

V̂O2 = −2.80435 + 0.03865·HR
Code and Output:
> m19= lm( V02 ~ HR)
> m19
Call:
lm(formula = V02 ~ HR)
Coefficients:
(Intercept)           HR  
   -2.80435      0.03865  
(c) Test the null hypothesis that the slope of the regression line is 0. Explain in words the
meaning of your conclusions from this test.
The p-value for a two-sided test of the slope being 0 is 1.00e-11.
The meaning is that there is a statistically significant linear relationship between the two
variables.
Code and Output:
> summary( m19)
Call:
lm(formula = V02 ~ HR)
Residuals:
      Min        1Q    Median        3Q       Max 
-0.355944 -0.035899  0.003056  0.056513  0.220056 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.80435    0.25833  -10.86 4.59e-09 ***
HR           0.03865    0.00240   16.10 1.00e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1205 on 17 degrees of freedom
Multiple R-squared: 0.9385,  Adjusted R-squared: 0.9348
F-statistic: 259.3 on 1 and 17 DF,  p-value: 1.000e-11
(d) Calculate a 95% confidence interval for the oxygen uptake of this individual on a future
occasion when his heart rate is 95. Repeat the calculation for heart rate 110.
95% confidence intervals for the oxygen uptake of this individual when his heart rate is 95 and
110 are
[0.7833762, 0.9518154] and [1.3871265, 1.5076283], respectively.
Code and Output:
> predict( m19, newdata=data.frame( HR=c( 95, 110)),
interval="confidence", level=0.95)
        fit       lwr       upr
1 0.8675958 0.7833762 0.9518154
2 1.4473774 1.3871265 1.5076283
(e) From what you have learned in (a), (b), (c), and (d) of this exercise, do you think that the
researchers should use predicted VO2 in place of measured VO2 for this individual under
similar experimental conditions? Explain your answer.
Yes, researchers may use predicted VO2 in place of measured VO2 for this individual. I am not sure how to tie the explanation to parts (a), (b), (c), and (d) specifically, but the adjusted R-squared value of 0.9348 indicates that a straight-line model does a good job of explaining the relationship between VO2 and heart rate, so predictions from it should be reliable under similar experimental conditions.
Exercise 10.20
Calculate the t statistic for testing H0: β1 = 0. Specify an appropriate alternative hypothesis for this problem and give an approximate p-value for the test. Then explain your conclusion in words a physician can understand.
The t statistic is t = 0.83/0.065 ≈ 12.77, an appropriate alternative hypothesis is Ha: β1 ≠ 0, and an approximate p-value is 6.7e-21.
The conclusion is that there is a statistically significant straight-line relationship between the traditional procedure and the new procedure; each one-unit increase in the measurement from the new procedure is associated with an increase of about 0.83 units in the measurement from the traditional procedure.
Code and Output:
> 2* pt( 0.83/0.065, df=81-2, lower.tail=F)
[1] 6.71074e-21
Exercise 10.21
(a) It is reasonable to suppose that greater airflow will cause more evaporation. State
hypotheses to test this belief and calculate the test statistic. Find an approximate P-value for
the significance test and report your conclusion.
The hypotheses are H0: β1 = 0 versus Ha: β1 > 0.
The test statistic is t = β̂1 / SE(β̂1), and an approximate P-value is the upper-tail probability of this statistic under the t distribution with n − 2 degrees of freedom.
The conclusion is that there is significant evidence that greater airflow causes more evaporation.
(b) Construct a 95% confidence interval for the additional evaporation experienced when airflow increases by 1 unit.
A 95% confidence interval for the slope is
β̂1 ± ( t* )( SE(β̂1) ),
where t* is the 0.975 critical value of the t distribution with n − 2 degrees of freedom.
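A hedged R sketch of both calculations; the slope estimate, its standard error, and the sample size below are placeholders, not values from the exercise:
b1    <- 0.40    # placeholder slope estimate
se.b1 <- 0.10    # placeholder standard error of the slope
n     <- 20      # placeholder number of observations
t.stat <- b1/ se.b1                            # test statistic for H0: beta1 = 0
pt( t.stat, df=n-2, lower.tail=FALSE)          # one-sided P-value for Ha: beta1 > 0
b1 + c( -1, 1)* qt( 0.975, df=n-2)* se.b1      # 95% confidence interval for the slope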
Exercise 10.26
Return to the data on current versus voltage given in the Ohm’s law experiment in Exercise 10.17.
(a) Compute all values for the ANOVA table.
Code and Output:
> anova( m17)
Analysis of Variance Table

Response: I
          Df  Sum Sq Mean Sq F value    Pr(>F)    
V          1 2.09316 2.09316  231.21 0.0006176 ***
Residuals  3 0.02716 0.00905
(b) State the null hypothesis tested by the ANOVA F statistic, and explain in plain language what
this hypothesis says.
The null hypothesis tested by the ANOVA F statistic is (Moore and McCabe 2006, p 655) H0: β1 = 0.
In plain language, this hypothesis says that y is not linearly related to x.
(c) What is the distribution of this F statistic when H0 is true? Find an approximate P-value for the test of H0.
The distribution of this F statistic when H0 is true is an F(1, 3) distribution.
An approximate P-value is 0.0006.
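A quick R check of this P-value, using the F statistic and degrees of freedom from the ANOVA table above:
> pf( q=231.21, df1=1, df2=3, lower.tail=FALSE)# approximately 0.0006, matching Pr(>F) in the table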
Exercise 10.28
(a) The correlation between monthly income and birth weight was r=0.39. Calculate the t statistic
for testing the null hypothesis that the correlation is 0 in the entire population of infants.
The t statistic for testing the null hypothesis that the correlation is 0 is given by (Moore and McCabe, p 664)
t = r√(n − 2) / √(1 − r²) = 0.39·√38 / √(1 − 0.39²) ≈ 2.611
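The same calculation in R, using the reported correlation and the 38 degrees of freedom implied by the pt() call in part (c):
r <- 0.39
n <- 40                          # so that n - 2 = 38 degrees of freedom
r* sqrt( n-2)/ sqrt( 1- r^2)     # t statistic, approximately 2.611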
(b) The researchers expected that higher birth weights would be associated with higher incomes.
Express this expectation as an alternative hypothesis for the population correlation.
The alternative hypothesis expressing this expectation is Ha: ρ > 0.
(c) Determine a P-value for H0 versus the alternative that you specified in (b). What conclusion
does your test suggest?
The P-value is 0.006427687. This suggests that monthly income and birth weight are related—
specifically, it suggests that there is a positive correlation between the two.
Code and output:
> pt( 2.61086302, df=38, lower.tail=F)
[1] 0.006427687
Exercise 10.29
(a) The correlation between parental control and self-esteem was r = -0.19. Calculate the t
statistic for testing the null hypothesis that the population correlation is 0.
The t statistic is:
t = r√(n − 2) / √(1 − r²) = (−0.19)·√(n − 2) / √(1 − (−0.19)²)
(b) Find an approximate P-value for testing H0 versus the two-sided alternative and report your
conclusion.
An approximate p-value for testing the null hypothesis against the two-sided alternative is p = 3.202573e-07. The conclusion is that the correlation is statistically significantly different from zero; however, since the correlation coefficient is only −0.19, the linear relationship, although real, is weak.
Exercise 10.34
(a) Plot the data and describe the pattern. Is it reasonable to summarize this kind of relationship
with a correlation?
The pattern looks somewhat linear. It seems reasonable to summarize this kind of relationship with a correlation because the correlation between the variables is 0.994.

[Scatterplot matrix of Humerus and Femur (Exercise 10.34)]
(b) Find the correlation and perform the significance test. Summarize the results and report your
conclusion.
The correlation is r = 0.9941, the t statistic is t = 15.9405, and the corresponding p-value for testing the null hypothesis against the two-sided alternative is 0.0005368.
Code and Output:
> cor.test( Humerus, Femur)
Pearson's product-moment correlation
data: Humerus and Femur
t = 15.9405, df = 3, p-value = 0.0005368
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.910380 0.999633
sample estimates:
cor
0.9941486
Exercise 10.35
(a) Plot the data and describe the relationship between the two scores.
There appears to be a somewhat linear relationship between the two scores.
[Scatterplot of Round2 versus Round1 (Exercise 10.35)]
(b) Find the correlation between the two scores and test the null hypothesis that the population
correlation is 0. Summarize your results.
The correlation is r = 0.6871, the t statistic is t = 2.9903, and the corresponding p-value for testing the null hypothesis against the two-sided alternative is 0.01357. In summary, there is evidence of a linear relationship.
Code and Output:
> cor.test( Round1, Round2)
Pearson's product-moment correlation
data: Round1 and Round2
t = 2.9903, df = 10, p-value = 0.01357
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1868394 0.9043687
sample estimates:
cor
0.6870684
(c) The plot shows one outlier. Recompute the correlation and redo the significance test without
this observation. Write a short summary explaining the effect of the outlier on the correlation
and significance test in (b).
The correlation becomes r = 0.8419, the t statistic t = 4.6806, and the corresponding p-value for testing the null hypothesis against the two-sided alternative 0.001151. In summary, the outlier in part (b) had the effect of reducing the correlation and increasing the p-value.
Code and Output:
> detach( data)
> data=data[-8,]
> attach( data)
> cor.test( Round1, Round2)
Pearson's product-moment correlation
data: Round1 and Round2
t = 4.6806, df = 9, p-value = 0.001151
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4890076 0.9579716
sample estimates:
cor
0.841913
Exercise 10.36
(a) Find the equation of the least-squares line for predicting GHP from FVC.
The equation is
ŷ (predicted GHP) = b0 + b1·FVC
where the slope and intercept were found using the equations (Moore and McCabe, p 157)
b1 = r·( sy / sx )
and
b0 = ȳ − b1·x̄
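A hedged R sketch of these two formulas, assuming vectors named GHP and FVC holding the data from Exercise 10.36:
r  <- cor( FVC, GHP)
b1 <- r* sd( GHP)/ sd( FVC)        # slope: b1 = r * s_y / s_x
b0 <- mean( GHP) - b1* mean( FVC)  # intercept: b0 = y_bar - b1 * x_bar
c( b0, b1)                         # should match coef( lm( GHP ~ FVC))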
(b) Give the results of the significance test for the null hypothesis that the slope is 0. (Hint: What
is the relation between this test and the test for a zero correlation?)
Testing the null hypothesis that the slope is 0 is equivalent to testing the null hypothesis that the correlation is zero. Recall that the t statistic for testing zero correlation is
t = r√(n − 2) / √(1 − r²)
and hence the p-value for a test that the slope is zero, against the two-sided alternative, is the same as the p-value for the test of zero correlation.
Exercise 10.39
(a) Plot the data with SAT on the x axis and ACT on the y axis. Describe the overall pattern and
any unusual observations.
The overall relationship looks linear. There’s a potential outlier (observation 42).
Plot:
(b) Find the least-squares regression line and draw it on your plot. Give the results of the
significance test for the slope.
The least-squares line is
ŷ (predicted ACT) = 1.626 + 0.02137·SAT
The significance test for the slope yields a p-value of 1.80e-15. Thus we strongly reject the null hypothesis that the slope is zero.
Code and Outputs:
> m39=lm( ACT ~ SAT)#-> a=1.626282, b=0.021374
> summary(m39)
Call:
lm(formula = ACT ~ SAT)
Residuals:
     Min       1Q   Median       3Q      Max 
-6.29776 -1.78047  0.06872  1.29578 10.39677 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.626282   1.844230   0.882    0.382    
SAT         0.021374   0.001983  10.778 1.80e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.744 on 58 degrees of freedom
Multiple R-squared: 0.667,  Adjusted R-squared: 0.6612
F-statistic: 116.2 on 1 and 58 DF,  p-value: 1.796e-15
(c) What is the correlation between the two tests?
The correlation is r = √0.667 ≈ 0.817 (the square root of the R-squared value from the fit, taken with a positive sign because the slope is positive).
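A quick check in R, assuming the SAT and ACT vectors used to fit m39 are still available:
> cor( SAT, ACT)# about 0.817
> sqrt( summary( m39)$r.squared)# the same value, recovered from the R-squared of the fit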
Exercise 10.40
(a) What is the mean of these predicted values? Compare it with the mean of the ACT scores.
The mean of the predicted values is 20.75833 (from the output below); the mean of the ACT scores, computed in Exercise 10.41, is 21.13333.
Code and Outputs:
> ACT.predicted=predict( m39, newdata=data.frame( SAT=SAT))
> mean( ACT.predicted)
[1] 20.75833
(b) Compare the standard deviation of the predicted values with the standard deviation of the
actual ACT scores. If least-squares regression is used to predict ACT scores for a large number
of students such as these, the average predicted value will be accurate but the variability of
the predicted scores will be too small.
The standard deviation of predicted values is 3.410658, while the standard deviation of actual
ACT scores is 5.263797.
Code and Output:
> s.ACT.predicted=sd( ACT.predicted)#== 3.410658
> s.ACT=sd( ACT)#== 5.263797
(c) Find the SAT score for a student who is one standard deviation above the mean (z = (x − x̄)/s = 1). Find the predicted ACT score and standardize this score. (Use the means and standard deviations from this set of data for these calculations.)
Student #6 scored a 1440 on the SAT, which is 2.92781269 standard deviations above the mean.
The predicted ACT score for this student is 32.40439, which when standardized also becomes
2.927812—in other words it is 2.927812 standard deviations above the mean predicted ACT
score.
Code and Output:
> predict( m39, newdata=data.frame( SAT=c(1440)))#== 32.40439
> (32.40439-mean( ACT.predicted))/s.ACT.predicted#== 2.927812
(d) Repeat part (c) for a student whose SAT score is one standard deviation below the mean (z = −1).
Student #7 scored a 490 on the SAT, which is 2.34669184 standard deviations below the mean.
The predicted ACT score for this student is 12.09939, which is also 2.34669184 standard
deviations below the mean predicted ACT score.
(e) What do you conclude from parts (c) and (d)? Perform additional calculation for different z’s
if needed.
We conclude that when using this least-squares line to predict values, the prediction will be the
same number of standard deviations above/below the mean of predicted values as the
explanatory variable is above/below the mean of explanatory variables.
Exercise 10.41
(a) Using the data in Table 10.4, find the values of a0 and a1 for the new prediction line.
Using the formula a1 = sx/sy, we get values for a0 and a1 that seem wrong (the slope is far too large). However, if we use the formula a1 = sy/sx, we get the values a1 = 0.02617112 and a0 = ȳ − a1·x̄ = −2.752178.
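A hedged R sketch comparing the two candidate slopes, again assuming the SAT and ACT vectors from Exercise 10.39:
a1.wrong <- sd( SAT)/ sd( ACT)            # s_x / s_y: clearly too large for these data
a1       <- sd( ACT)/ sd( SAT)            # s_y / s_x: slope of the new prediction line
a0       <- mean( ACT) - a1* mean( SAT)   # the new line passes through (x_bar, y_bar)
c( a0, a1)                                # should be about -2.752178 and 0.02617112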
(b) Plot the data with the least-squares line and the new prediction line.
[Two scatterplots of ACT versus SAT (Exercise 10.41), each showing the least-squares line and the new prediction line: one using the a1 = sx/sy formula and one using the a1 = sy/sx formula]
(c) Use the new line to find predicted ACT scores. Find the mean and the standard deviation of
these scores. How do they compare with the mean and standard deviation of the ACT scores?
Using the formula a1 = sy/sx instead of the least-squares formula b1 = r·( sy / sx ) to determine the slope yields new predicted ACT scores with a mean of 21.13333 and a standard deviation of 4.713725. These are the same as the mean and standard deviation of the ACT scores.
Code and Output:
> ACT.predicted.new <- (function(SATscore){ -2.752178 +
0.02617112*SATscore})( SAT)
> mean( ACT.predicted.new)#== 21.13333
[1] 21.13333
> mean( ACT)#== 21.13333
[1] 21.13333
> sd( ACT.predicted.new)#== 4.713725
[1] 4.713725
> sd( ACT)#== 4.713726
[1] 4.713726
Exercise 3.4.1
(a) Based on the output for model (3.7) a business analyst concluded:
[…]
Provide a detailed critique of this conclusion.
The discernible pattern indicates that an improper model has been fit. Also, the outlier
(observation 13) in the plot of studentized residuals versus Distance warrants some concern.
(b) Does the ordinary straight line regression model (3.7) seem to fit the data well? If not,
carefully describe how the model can be improved.
The ordinary straight-line regression fits the data reasonably well, but the model can be improved by removing the outlier. Even after the model is refit with this outlier removed, there is still a discernible pattern in the plot of studentized residuals versus Distance:
[Plot of standardized residuals versus Distance after refitting with the outlier removed (Exercise 3.4.1)]
The pattern indicates that a quadratic term should be added. Adding a quadratic term yields:
[Plot of standardized residuals versus Distance for the model with the quadratic term (Exercise 3.4.1)]
The problem of non-random residuals appears to have been fixed; however, the improvement in adjusted R-squared when going from the
Fare ~ Distance
model to the
Fare ~ Distance + DistanceSquared
model is small—the change is only from 99.63% to 99.87%. It might be possible to improve the model further by looking for more outliers and then refitting the model with them removed.
Exercise 3.4.2
Is the following statement true or false? If you believe that the statement is false, provide a brief
explanation.
Suppose that a straight line regression model has been fit to a bivariate data set of the form (x1, y1), (x2, y2), …, (xn, yn). Furthermore, suppose that the distribution of X appears to be normal while the Y variable is highly skewed. A plot of standardized residuals from the least squares regression line produces a quadratic pattern with increasing variance when plotted against x. In this case, one should consider adding a quadratic term in X to the regression model and thus consider a model of the form Y = β0 + β1x + β2x² + e.
Response:
I agree. Regarding the plot of standardized residuals, Sheather writes that if a plot of residuals against X produces a discernible pattern, then the shape of the pattern provides information on the function of x that is missing from the model (p 49). He goes on to write that if the residuals from the straight-line fit of Y and X have a quadratic pattern, then we can conclude that there is a need for a quadratic term to be added to the original straight-line regression model (p 50).
Regarding the issue of increasing variance, Sheather writes that there are two methods to deal with
it—transformations and weighted least squares.
Exercise 3.4.3
Part A
(a) Develop a simple linear regression model based on least squares that predicts advertising
revenue per page from circulation (i.e. feel free to transform either the predictor or the
response variable or both variables). Ensure that you provide justification for you choice of
model.
A simple linear regression model is
log( AdRevenue) = β0 + β1·log( Circulation) + e
Justification:
The plot of AdRevenue ~ Circulation has x values that are spread too far apart—a log transformation on the Xs will help “bring” them closer together. The plot of AdRevenue ~ log( Circulation) has y values that are too far apart—a log transformation on the Ys will, again, help bring them closer together. The final model
log( AdRevenue) ~ log( Circulation)
looks good visually. A plot of all three models is below:
[Three scatterplots (Exercise 3.4.3): AdRevenue ~ Circulation, AdRevenue ~ log( Circulation), and log( AdRevenue) ~ log( Circulation)]
(b) Find a 95% prediction interval for the advertising revenue per page for magazines with the
following circulations:
1) 0.5 million
A 95% prediction interval for advertising revenue is [51.82406, 106.5485].
2) 20 million
A 95% prediction interval for advertising revenue is [359.89585, 758.7626].
Code and Output:
> logCirculation= log( Circulation)
> logAdRevenue= log( AdRevenue)
> m343= lm( logAdRevenue ~ logCirculation)
> logAdRevenue.predicted= predict( m343,
newdata=data.frame( logCirculation= log( c( 0.5, 20))),
interval="prediction", level=.95)
> AdRevenue.predicted= exp( logAdRevenue.predicted)
> AdRevenue.predicted
        fit       lwr      upr
1  74.30864  51.82406 106.5485
2 522.56626 359.89585 758.7626
(c) Describe any weaknesses in your model.
Interpretation of the least squares coefficients becomes difficult with the log transformation
applied to both explanatory and response variables.
Part B
(a) Develop a polynomial regression model based on least-squares that directly predicts the
effect on advertising revenue per page of an increase in circulation of 1 million people (i.e. do
not transform either the predictor nor the response variable). Ensure that you provide
detailed justification for your choice of model. [Hint: Consider polynomial model of order up
to 3.]
A polynomial regression model based on least squares that directly predicts the effect on advertising revenue per page of an increase in circulation of 1 million people is:
AdRevenue = 47.95376 + 62.65352·Circulation − 4.67795·Circulation² + 0.10817·Circulation³
(this is the cubic model m343b.3 below, fit after removing the bad leverage points).
Detailed Justification:
This is how we arrived at the above model:
1) We first fit three models:
   i. AdRevenue ~ Circulation
   ii. AdRevenue ~ Circulation + CirculationSquared
   iii. AdRevenue ~ Circulation + CirculationSquared + CirculationCubed
2) We identified the leverage points.
3) For each model, we identified which of the leverage points were bad (using the rule that flags points whose standardized residuals fall outside the interval −2 to 2 as bad)—the bad leverage points were 2, 20, and 49 for the first model; 2, 4, 20, 49 for the second; and 2, 8, 20, 49 for the third.
4) We removed the bad leverage points for each model and then refit each model to the new data set (the data set with the bad leverage points removed).
5) Finally, we compared the adjusted R-squared values for all three models.
Model 3 resulted in the highest adjusted R-squared value, so that is why it was chosen.
Code and Outputs:
#IDENTIFY LEVERAGE POINTS:
leverage.vals= lm.influence( m343b.2)$hat
leverage.vals[ leverage.vals > 4/ length( Circulation)]# these points are leverage points: 2, 4, 6, 8, 20, 46, 49
#CALCULATE STANDARDIZED RESIDUALS FOR ABOVE LEVERAGE PTS AND DETERMINE WHICH ONES ARE OUTLIERS/BAD:
rstandard.vals1= rstandard( m343b.1)[ c( 2, 4, 6, 8, 20, 46, 49)]
rstandard.vals1[ rstandard.vals1 > 2 | rstandard.vals1 < -2]# these are bad leverage points: 2, 20, 49
rstandard.vals2= rstandard( m343b.2)[ c( 2, 4, 6, 8, 20, 46, 49)]
rstandard.vals2[ rstandard.vals2 > 2 | rstandard.vals2 < -2]# these are bad leverage points: 2, 4, 20, 49
rstandard.vals3= rstandard( m343b.3)[ c( 2, 4, 6, 8, 20, 46, 49)]
rstandard.vals3[ rstandard.vals3 > 2 | rstandard.vals3 < -2]# these are bad leverage points: 2, 8, 20, 49
#REMOVE BAD LEVERAGE POINTS (i.e., OUTLIERS) AND REFIT EACH MODEL:
ad.new1= ad[ -c(2, 20, 49),]
detach( ad)
attach( ad.new1)
m343b.1= lm( AdRevenue ~ Circulation)
ad.new2= ad[ -c(2, 4, 20, 49),]
detach( ad.new1)
attach( ad.new2)
m343b.2= lm( AdRevenue ~ Circulation + CirculationSquared)
ad.new3= ad[ -c(2, 8, 20, 49),]
detach( ad.new2)
attach( ad.new3)
m343b.3= lm( AdRevenue ~ Circulation + CirculationSquared +
CirculationCubed)
#COMPARE THE THREE MODELS AGAIN WHEN THE OUTLIERS FOR EACH MODEL ARE REMOVED:
> summary( m343b.1)
Call:
lm(formula = AdRevenue ~ Circulation)
Residuals:
    Min      1Q  Median      3Q     Max 
-51.266 -22.737  -7.741  12.815 129.115 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  95.8273     4.8479   19.77   <2e-16 ***
Circulation  24.7248     0.9011   27.44   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 34.56 on 65 degrees of freedom
Multiple R-squared: 0.9205,  Adjusted R-squared: 0.9193
F-statistic: 752.8 on 1 and 65 DF,  p-value: < 2.2e-16
> summary( m343b.2)
Call:
lm(formula = AdRevenue ~ Circulation + CirculationSquared)
Residuals:
    Min      1Q  Median      3Q     Max 
-74.773 -13.330  -2.778  11.169 107.490 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         66.2968     7.1849   9.227 2.62e-13 ***
Circulation         43.2744     3.7622  11.502  < 2e-16 ***
CirculationSquared  -0.8077     0.1681  -4.806 9.92e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 29.63 on 63 degrees of freedom
Multiple R-squared: 0.8789,  Adjusted R-squared: 0.875
F-statistic: 228.6 on 2 and 63 DF,  p-value: < 2.2e-16
> summary( m343b.3)
Call:
lm(formula = AdRevenue ~ Circulation + CirculationSquared +
CirculationCubed)
Residuals:
    Min      1Q  Median      3Q     Max 
-81.992 -12.923  -1.105  11.428  99.918 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        47.95376    9.77958   4.903 7.12e-06 ***
Circulation        62.65352    8.13960   7.697 1.33e-10 ***
CirculationSquared -4.67795    1.44208  -3.244  0.00190 ** 
CirculationCubed    0.10817    0.03726   2.903  0.00511 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.36 on 62 degrees of freedom
Multiple R-squared: 0.933,  Adjusted R-squared: 0.9298
F-statistic: 287.8 on 3 and 62 DF,  p-value: < 2.2e-16
(b) Find a 95% prediction interval for the advertising revenue per page for magazines with the
following circulations:
(i) 0.5 million
A 95% prediction interval for the first model is [ ].
A 95% prediction interval for the second model is [ ].
A 95% prediction interval for the third model is [ ].
(ii) 20 million
A 95% prediction interval for the first model is [ ].
A 95% prediction interval for the second model is [ ].
A 95% prediction interval for the third model is [ ].
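A hedged R sketch of how these intervals would be computed from the three refit models; it assumes the CirculationSquared and CirculationCubed columns defined for the code above:
new.circ <- data.frame( Circulation= c( 0.5, 20))
new.circ$CirculationSquared <- new.circ$Circulation^2
new.circ$CirculationCubed   <- new.circ$Circulation^3
predict( m343b.1, newdata=new.circ, interval="prediction", level=0.95)
predict( m343b.2, newdata=new.circ, interval="prediction", level=0.95)
predict( m343b.3, newdata=new.circ, interval="prediction", level=0.95)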
(c) Describe any weaknesses in your model.
The weakness in our model (i.e. model 3) is that its prediction intervals are much wider than those of the other two models.
Part C
(a) Compare the model in Part A with that in Part B. Decide which provides a better model. Give
reasons to justify your choice.
I am not sure; perhaps the model in Part A, because the models in Part B give such different prediction intervals.
(b) Compare the prediction intervals in Part A with those in Part B. In each case, decide which
interval you would recommend. Give reasons to justify each choice.
I am not sure; the intervals are all quite different, and it isn't clear why any particular one would be better than the others.