Stat 103
Homework Three Solutions
Spring 2014
Instructions: The tentative due date is next Wednesday, 2-26-2014.
1. When asked the model used in simple linear regression, a student responds: $E(Y_i) = \beta_0 + \beta_1 X_i + \varepsilon_i$. What was the student's mistake? The student has confused the mean value of $Y_i$ (which lies on the line $E(Y) = \beta_0 + \beta_1 X$) with the value of one particular $Y_i$, which includes a term for random error: the model is $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, and its mean is $E(Y_i) = \beta_0 + \beta_1 X_i$.
2. In a study of the relationship between physical activity and the frequency of colds for senior citizens, participants were asked to record their time spent exercising over a 5-year period. The study demonstrated a negative linear relationship between time spent exercising and the frequency of colds. The researchers claimed that increasing the time spent in exercise is an effective strategy for reducing the frequency of colds for senior citizens. What mistake have the researchers committed? The study was
observational, and it is impossible to establish cause and effect in regression from observational
studies (which is why controlled experiments are conducted whenever feasible). There are lots of
confounding variables which may have influenced the results. For example, senior citizens who have
health issues may be unable to exercise regularly.
3. When asked about the importance, to the simple linear regression model, of assuming that the distribution of the error variable is normal, a student responds, "for the least squares to be valid, the distribution of $\varepsilon$ must be normal." What was the student's mistake? The assumption of normality is not a requirement for fitting the regression line to the observations. Normality only becomes a factor when we wish to conduct inference on the parameters, construct confidence intervals for predictions, etc.
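As an illustration (a sketch, not part of the solution; the data below are invented), least squares fits a line perfectly well even when the errors are strongly skewed and non-normal:

import numpy as np

# Fit by least squares with heavily skewed (centered exponential) errors.
# Normality is NOT needed to obtain the fitted line, only for inference.
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
errors = rng.exponential(scale=1.0, size=X.size) - 1.0   # mean zero, skewed
Y = 2.0 + 0.5 * X + errors                               # illustrative true line

b1, b0 = np.polyfit(X, Y, 1)    # ordinary least squares: slope, then intercept
print(b0, b1)                   # close to the true 2.0 and 0.5 despite the skew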
4. A special case of the simple linear regression model is the model given by $Y_i = \beta_1 X_i + \varepsilon_i$, which corresponds to fitting a regression line through the origin. (We may have reasons for believing this is appropriate.) Using calculus, find the least squares estimator of $\beta_1$. (Hint: you don't need to take partial derivatives here because there is only one parameter to estimate.) The sum of squares to be minimized is $\sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - b_1 X_i)^2$, where we've replaced $\beta_1$ by its estimator $b_1$. The ordinary derivative to be taken is with respect to $b_1$,
$$\frac{d}{db_1} \sum (Y_i - b_1 X_i)^2 = -2 \sum X_i (Y_i - b_1 X_i).$$
Setting the derivative equal to 0, we get $-2 \sum X_i (Y_i - b_1 X_i) = 0$. Solving for $b_1$,
$$b_1 = \frac{\sum X_i Y_i}{\sum X_i^2}.$$
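As a quick numerical check (a sketch; the X values and true slope below are invented for illustration), the closed-form estimator agrees with a no-intercept fit from numpy:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([1.0, 4.0, 10.0, 11.0, 14.0])       # illustrative X values
Y = 3.0 * X + rng.normal(0, 0.6, size=X.size)    # true line through the origin

b1_formula = np.sum(X * Y) / np.sum(X ** 2)      # b1 = sum(X*Y) / sum(X^2)
b1_lstsq = np.linalg.lstsq(X[:, None], Y, rcond=None)[0][0]  # no-intercept fit
print(b1_formula, b1_lstsq)                      # the two estimates agree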
5. A special case of the simple linear regression model might be given by $Y_i = \beta_0 + \varepsilon_i$, which corresponds to fitting a regression line with zero slope. (We would never do such a thing in practice, of course.) Using calculus, and the hint in problem 4, find the least squares estimator of $\beta_0$. Comment on the result. The sum of squares to be minimized is $\sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - b_0)^2$, where we've replaced $\beta_0$ by its estimator $b_0$. The ordinary derivative to be taken is with respect to $b_0$,
$$\frac{d}{db_0} \sum (Y_i - b_0)^2 = -2 \sum (Y_i - b_0).$$
Setting the derivative equal to 0, we get $-2 \sum (Y_i - b_0) = 0$. Solving for $b_0$,
$$b_0 = \frac{\sum Y_i}{n} = \bar{Y}.$$
This should seem obvious in hindsight. If we have no information on the value of X, or the variables are not linearly related, then our best estimate for the mean value of the variable Y is simply the mean of the sample, $\bar{Y}$.
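A one-line numerical check (with made-up data): minimizing $\sum (Y_i - c)^2$ over c lands exactly on the sample mean.

import numpy as np
from scipy.optimize import minimize_scalar

Y = np.array([2.3, 5.1, 4.7, 6.0, 3.9])                # made-up sample
res = minimize_scalar(lambda c: np.sum((Y - c) ** 2))  # minimize the SSE over c
print(res.x, Y.mean())                                 # the minimizer is the mean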
6. Prove that $b_0$, the least squares estimator of $\beta_0$ in the simple linear regression model $Y = \beta_0 + \beta_1 X + \varepsilon$, is unbiased. From the equation $b_0 = \bar{Y} - b_1 \bar{X}$, and using the unbiasedness of $b_1$,
$$E(b_0) = E(\bar{Y} - b_1 \bar{X}) = E(\bar{Y}) - \bar{X} E(b_1) = E(\bar{Y}) - \bar{X} \beta_1.$$
Now,
$$E(\bar{Y}) = \frac{\sum E(Y_i)}{n} = \frac{\sum (\beta_0 + \beta_1 X_i)}{n} = \frac{n \beta_0 + \beta_1 \sum X_i}{n} = \beta_0 + \bar{X} \beta_1$$
(see discussion in problem 11). Finally, we have $E(b_0) = E(\bar{Y}) - \bar{X} \beta_1 = \beta_0 + \bar{X} \beta_1 - \bar{X} \beta_1 = \beta_0$, as we were asked to prove.
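The algebra can also be checked by simulation (a sketch; the parameter values are chosen arbitrarily): over many replications the average of $b_0$ settles at $\beta_0$.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([1.0, 4.0, 10.0, 11.0, 14.0])
beta0, beta1, sigma, reps = 5.0, 3.0, 0.6, 50_000

b0_sum = 0.0
for _ in range(reps):
    Y = beta0 + beta1 * X + rng.normal(0, sigma, X.size)
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b0_sum += Y.mean() - b1 * X.mean()               # b0 = Ybar - b1 * Xbar
print(b0_sum / reps)   # averages to about 5.0 = beta0, i.e., b0 is unbiased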
7. In fitting the simple linear regression model, it was found that the ith observation fell exactly on the
line. Would removing the observation from the data affect the least squares line fitted to the remaining
n - 1 observations? Explain. No. I think it may be easier to imagine adding a new observation to data for which a regression line has been calculated. If the new point lies on the existing line, then its addition will not change the line. This is essentially just our problem in reverse (since "removing" the point obviously doesn't affect the line). * A formal proof appears at the end of this document *
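In the spirit of that reversal, a quick numerical check (data invented): add a point that falls exactly on the fitted line and refit; the coefficients do not move.

import numpy as np

X = np.array([1.0, 2.0, 4.0, 7.0])
Y = np.array([3.0, 4.5, 9.1, 14.6])
b1, b0 = np.polyfit(X, Y, 1)             # fit the original n points

X2 = np.append(X, 5.0)                   # add a new point lying on the line...
Y2 = np.append(Y, b0 + b1 * 5.0)
b1_new, b0_new = np.polyfit(X2, Y2, 1)   # ...and refit with n + 1 points
print((b0, b1), (b0_new, b1_new))        # essentially identical coefficients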
8. A simple linear regression model for sales versus advertising expenditures for a product returns the following output:
• Estimated regression equation: $\hat{S} = 350 - 0.21 A$
• Two-sided P-value for estimated slope = 0.91
What is wrong with the conclusion, "The more spent on advertising, the fewer units are sold"? The P-value for the test of the slope tells us that the linear association between sales and advertising expenditures is weak or nonexistent (since the slope $\beta_1$ is plausibly zero). So we should not be interpreting the value returned by our formula (or software) for the slope.
9. A value of the coefficient of determination, $R^2$, near 1 is sometimes interpreted to imply that the linear regression line can be used to make precise predictions of Y given knowledge of X. Is this interpretation a necessary consequence of the definition of $R^2$? Explain. No. $R^2$ measures the relative reduction in the variation of Y when the variation in X is considered, versus the variation in Y without considering X. If the data contain an outlier for Y (which might be nothing more than a typo), the regression line may be able to "fit" the outlier much better than is otherwise possible. Thus the "relative" improvement is substantial, yet X may actually have at best a weak linear association with Y.
10. Suppose that the assumption of the simple linear regression model that $\varepsilon \sim N(0, \sigma^2)$ is violated in the following way: the variance is greater for larger X.
a. Does $\beta_1 = 0$ still imply that there is no linear association between X and Y? Explain. Yes. If the slope parameter for the simple linear regression model, $\beta_1$, equals zero, then there is no linear association between the variables. (That is why we conduct a hypothesis test of the slope.)
b. Does $\beta_1 = 0$ still imply that there is no association between X and Y? Explain. No! Good explanation: There may be another form to the relationship, e.g., quadratic. Better explanation: There absolutely must be an association between the variables, even if $E(Y \mid X) = E(Y)$ for all values of X, because the value of X affects the distribution of Y, since the variance of Y is a function of X.
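Part b can be visualized with a quick simulation (a sketch; the variance function below is invented): the mean of Y is flat in X, so $\beta_1 = 0$, yet the spread of Y plainly depends on X, so the variables are associated.

import numpy as np

rng = np.random.default_rng(0)
X = np.repeat([1.0, 5.0, 10.0], 2000)    # three X levels, many draws each
Y = 4.0 + rng.normal(0.0, 0.3 * X)       # E(Y|X) is constant; SD grows with X

for x in (1.0, 5.0, 10.0):
    y = Y[X == x]
    print(x, y.mean(), y.std())  # means all near 4; SDs near 0.3, 1.5, and 3.0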
11. Five observations on Y were obtained corresponding to X = 1, 4, 10, 11, and 14. Assuming that $\sigma = 0.6$, $\beta_0 = 5$, and $\beta_1 = 3$, what are the expected values of MSE and MSR? $E(\mathrm{MSE}) = \sigma^2 = 0.6^2 = 0.36$. $E(\mathrm{MSR})$ requires more work. Because $\mathrm{SSR} = \sum (\hat{Y}_i - \bar{Y})^2$, we have to find the expected values of each of the fitted values $\hat{Y}_i$ and of $\bar{Y}$. We've proved in class that $E(\hat{Y}_i) = \beta_0 + \beta_1 X_i$, so we use the true regression line $\hat{Y} = 5 + 3X$, for X = 1, 4, 10, 11, and 14, to generate the five expected fitted values $\hat{Y}$ = 8, 17, 35, 38, and 47. Next, the expected value of the mean, $E(\bar{Y})$, should be the mean of the expected values, so
$$E(\bar{Y}) = \frac{E(Y_1) + E(Y_2) + E(Y_3) + E(Y_4) + E(Y_5)}{5} = \frac{8 + 17 + 35 + 38 + 47}{5} = 29.$$
These calculations lead to
$$E(\mathrm{MSR}) = \frac{E(\mathrm{SSR})}{1} = \sum \big( E(\hat{Y}_i) - E(\bar{Y}) \big)^2 = 441 + 144 + 36 + 81 + 324 = 1026.$$
(Strictly speaking, because the fitted values are themselves random, the exact result is $E(\mathrm{MSR}) = \sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2 = 0.36 + 9(114) = 1026.36$; the calculation above drops the small $\sigma^2$ term.)
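These expectations can be checked by simulation (a sketch built from the problem's X values and parameters): averaging MSE and MSR over many replications reproduces 0.36 and roughly 1026.4.

import numpy as np

rng = np.random.default_rng(1)
X = np.array([1.0, 4.0, 10.0, 11.0, 14.0])
beta0, beta1, sigma = 5.0, 3.0, 0.6
n, reps = X.size, 100_000

mse_sum, msr_sum = 0.0, 0.0
for _ in range(reps):
    Y = beta0 + beta1 * X + rng.normal(0, sigma, n)
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b0 = Y.mean() - b1 * X.mean()
    fitted = b0 + b1 * X
    mse_sum += np.sum((Y - fitted) ** 2) / (n - 2)   # MSE = SSE / (n - 2)
    msr_sum += np.sum((fitted - Y.mean()) ** 2)      # MSR = SSR / 1
print(mse_sum / reps, msr_sum / reps)    # about 0.36 and 1026.4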
*** Proof for Problem 7 ***
Suppose $b_0$ and $b_1$ solve the "normal" equations (see the linear regression notes)
$$n b_0 + b_1 \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} Y_i, \qquad b_0 \sum_{i=1}^{n} X_i + b_1 \sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} X_i Y_i. \tag{1.1}$$
Next, suppose (without loss of generality) that the nth observation lies on the regression line, so $Y_n = b_0 + b_1 X_n$. Then the normal equations can be rewritten as
$$(n-1) b_0 + b_1 \sum_{i=1}^{n-1} X_i + (b_0 + b_1 X_n) = \sum_{i=1}^{n-1} Y_i + (b_0 + b_1 X_n),$$
$$b_0 \sum_{i=1}^{n-1} X_i + b_1 \sum_{i=1}^{n-1} X_i^2 + b_0 X_n + b_1 X_n^2 = \sum_{i=1}^{n-1} X_i Y_i + X_n (b_0 + b_1 X_n), \tag{1.2}$$
where I've used the fact that the nth observation lies on the line defined by $b_0$ and $b_1$. After cancelling common terms, we see that $b_0$ and $b_1$ also solve the new "normal" equations for the remaining n - 1 observations,
$$(n-1) b_0 + b_1 \sum_{i=1}^{n-1} X_i = \sum_{i=1}^{n-1} Y_i, \qquad b_0 \sum_{i=1}^{n-1} X_i + b_1 \sum_{i=1}^{n-1} X_i^2 = \sum_{i=1}^{n-1} X_i Y_i. \tag{1.3}$$
This shows that omitting an observation that lies on the line has no effect on the value of the estimated regression coefficients $b_0$ and $b_1$, and hence has no effect upon the regression line.
{Note that if $E(Y \mid X) = E(Y)$ for all values of X, then $E(Y^2 \mid X) = f(X)$ for some increasing function f (because the variance of Y increases as X increases). It may even turn out that it is the variable $Y^2$ that is linearly related to X, which could be explored through a transformation of the variable Y.}