Stat 103 Homework Three Solutions, Spring 2014

Instructions: The tentative due date is next Wednesday, 2-26-2014.

1. When asked for the model used in simple linear regression, a student responds: $E(Y_i) = \beta_0 + \beta_1 X_i + \varepsilon_i$. What was the student's mistake?

The student has confused the mean value of $Y_i$ (which lies on the line $Y = \beta_0 + \beta_1 X$) with the value of one particular $Y_i$, which includes a term for random error. The model is $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$; its mean is $E(Y_i) = \beta_0 + \beta_1 X_i$, with no error term.

2. In a study of the relationship between physical activity and the frequency of colds for senior citizens, participants were asked to record their time spent exercising over a 5-year period. The study demonstrated a negative linear relationship between time spent exercising and the frequency of colds. The researchers claimed that increasing the time spent in exercise is an effective strategy for reducing the frequency of colds for senior citizens. What mistake have the researchers committed?

The study was observational, and it is impossible to establish cause and effect in regression from observational studies (which is why controlled experiments are conducted whenever feasible). There are many confounding variables that may have influenced the results. For example, senior citizens who have health issues may be unable to exercise regularly.

3. When asked about the importance to the simple linear regression model of assuming that the distribution of the error variable $\varepsilon$ is normal, a student responds: "for least squares to be valid, the distribution of $\varepsilon$ must be normal." What was the student's mistake?

The assumption of normality is not a requirement for fitting the regression line to the observations. Normality only becomes a factor when we wish to conduct inference on the parameters, construct confidence intervals for predictions, etc.

4. A special case of the simple linear regression model is the model given by $Y_i = \beta_1 X_i + \varepsilon_i$, which corresponds to fitting a regression line through the origin. (We may have reasons for believing this is appropriate.) Using calculus, find the least squares estimator of $\beta_1$. (Hint: you don't need to take partial derivatives here because there is only one parameter to estimate.)

The sum of squares to be minimized is $\sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - b_1 X_i)^2$, where we've replaced $\beta_1$ by its estimator $b_1$. The ordinary derivative to be taken is with respect to $b_1$:

$$\frac{d}{db_1} \sum (Y_i - b_1 X_i)^2 = -2 \sum X_i (Y_i - b_1 X_i).$$

Setting the derivative equal to 0, we get $-2 \sum X_i (Y_i - b_1 X_i) = 0$. Solving for $b_1$,

$$b_1 = \frac{\sum X_i Y_i}{\sum X_i^2}.$$

5. A special case of the simple linear regression model might be given by $Y_i = \beta_0 + \varepsilon_i$, which corresponds to fitting a regression line with zero slope. (We would never do such a thing in practice, of course.) Using calculus, and the hint in problem 4, find the least squares estimator of $\beta_0$. Comment on the result.

The sum of squares to be minimized is $\sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - b_0)^2$, where we've replaced $\beta_0$ by its estimator $b_0$. The ordinary derivative to be taken is with respect to $b_0$:

$$\frac{d}{db_0} \sum (Y_i - b_0)^2 = -2 \sum (Y_i - b_0).$$

Setting the derivative equal to 0, we get $-2 \sum (Y_i - b_0) = 0$. Solving for $b_0$,

$$b_0 = \frac{\sum Y_i}{n} = \bar{Y}.$$

This should seem obvious in hindsight. If we have no information on the value of X, or the variables are not linearly related, then our best estimate for the mean value of the variable Y is simply the mean of the sample, $\bar{Y}$.

6. Prove that $b_0$, the least squares estimator of $\beta_0$ in the simple linear regression model $Y = \beta_0 + \beta_1 X + \varepsilon$, is unbiased.

From the equation $b_0 = \bar{Y} - b_1 \bar{X}$,

$$E(b_0) = E(\bar{Y} - b_1 \bar{X}) = E(\bar{Y}) - \bar{X}\,E(b_1) = E(\bar{Y}) - \bar{X}\beta_1.$$

Now,

$$E(\bar{Y}) = \frac{\sum E(Y_i)}{n} = \frac{\sum (\beta_0 + \beta_1 X_i)}{n} = \frac{n\beta_0}{n} + \beta_1 \frac{\sum X_i}{n} = \beta_0 + \beta_1 \bar{X}$$

(see the discussion in problem 11). Finally, we have

$$E(b_0) = E(\bar{Y}) - \bar{X}\beta_1 = \beta_0 + \beta_1 \bar{X} - \bar{X}\beta_1 = \beta_0,$$

as we were asked to prove.
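The closed-form estimators in problems 4 and 5 are easy to sanity-check numerically. Here is a minimal sketch of such a check (my own illustration, not part of the assignment), using made-up data and NumPy's generic least squares solver as the reference:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 4.0, 10.0, 11.0, 14.0])
y = 3.0 * x + rng.normal(0.0, 0.6, size=x.size)  # data from a through-origin line

# Problem 4: through-origin slope, b1 = sum(X*Y) / sum(X^2)
b1_hand = np.sum(x * y) / np.sum(x * x)
# The same fit via the generic solver (design matrix is the single column X)
b1_solver = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)[0][0]
print(b1_hand, b1_solver)  # the two slopes agree

# Problem 5: zero-slope model, b0 = mean(Y)
b0_hand = np.mean(y)
b0_solver = np.linalg.lstsq(np.ones((y.size, 1)), y, rcond=None)[0][0]
print(b0_hand, b0_solver)  # both equal the sample mean
```

Agreement between the hand-derived formulas and the solver on any data set is exactly what the calculus in problems 4 and 5 guarantees.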
7. In fitting the simple linear regression model, it was found that the ith observation fell exactly on the line. Would removing the observation from the data affect the least squares line fitted to the remaining n - 1 observations? Explain.

No. I think it may be easier to imagine adding a new observation to data for which a regression line has been calculated. If the new point lies on the existing line, then its addition will not change the line. This is essentially just our problem reversed (since "removing" the point again obviously doesn't affect the line). * A formal proof appears at the end of this document *

8. A simple linear regression model for sales versus advertising expenditures for a product returns the following output:

Estimated regression equation: $\hat{S} = 350 - 0.21A$
Two-sided P-value for estimated slope = 0.91

What is wrong with the conclusion, "The more spent on advertising, the fewer units are sold"?

The P-value for the test of the slope tells us that the linear association between sales and advertising expenditures is weak or nonexistent (since the slope $\beta_1$ is plausibly zero). So we should not be interpreting the value returned by our formula (or software) for the slope.

9. A value of the coefficient of determination, $R^2$, near 1 is sometimes interpreted to imply that the linear regression line can be used to make precise predictions of Y given knowledge of X. Is this interpretation a necessary consequence of the definition of $R^2$? Explain.

No. $R^2$ measures the relative reduction in the variation of Y when the variation in X is considered, versus the variation in Y without considering X. If the data contain an outlier in Y (which might be nothing more than a typo), the regression line may be able to "fit" the outlier much better than is otherwise possible. Thus the "relative" improvement is substantial, yet X may actually have at best a weak linear association with Y.

10. Suppose that the assumption of the simple linear regression model that $\varepsilon \sim N(0, \sigma^2)$ is violated in the following way: the variance is greater for larger X.

a. Does $\beta_1 = 0$ still imply that there is no linear association between X and Y? Explain.

Yes. If the slope parameter of the simple linear regression model, $\beta_1$, equals zero, then there is no linear association between the variables. (That is why we conduct a hypothesis test of the slope.)

b. Does $\beta_1 = 0$ still imply that there is no association between X and Y? Explain.

No! Good explanation: there may be another form to the relationship, e.g., quadratic. Better explanation: there absolutely must be an association between the variables, even if $E(Y \mid X) = E(Y)$ for all values of X, because the value of X affects the distribution of Y, since the variance of Y is a function of X. {Note that if $E(Y \mid X) = E(Y)$ for all values of X, then $E(Y^2 \mid X) = f(X)$ for some increasing function f (because the variance of Y increases as X increases). It may even turn out that it is the variable $Y^2$ that is linearly related to X, which could be explored through a transformation of the variable Y.}

11. Five observations on Y were obtained corresponding to X = 1, 4, 10, 11, and 14. Assuming that $\sigma = 0.6$, $\beta_0 = 5$, and $\beta_1 = 3$, what are the expected values of MSE and MSR?

$E(\text{MSE}) = \sigma^2 = 0.6^2 = 0.36$.

$E(\text{MSR})$ requires more work. Because $\text{SSR} = \sum (\hat{Y}_i - \bar{Y})^2$, we have to find the expected values of each of the fitted values $\hat{Y}_i$ and of $\bar{Y}$. We've proved in class that $E(\hat{Y}_i) = \beta_0 + \beta_1 X_i$, so we use the true regression line $\hat{Y} = 5 + 3X$, for X = 1, 4, 10, 11, and 14, to generate the five expected fitted values 8, 17, 35, 38, and 47. Next, the expected value of the mean, $E(\bar{Y})$, should be the mean of the expected values, so

$$E(\bar{Y}) = \frac{E(Y_1) + E(Y_2) + E(Y_3) + E(Y_4) + E(Y_5)}{5} = 29.$$

These calculations lead to $\sum (E(\hat{Y}_i) - E(\bar{Y}))^2 = \beta_1^2 \sum (X_i - \bar{X})^2 = 1026$; adding the $\sigma^2$ contributed by the sampling variability of the fitted values gives $E(\text{MSR}) = \sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2 = 0.36 + 1026 = 1026.36$.
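Both expectations in problem 11 can be checked by simulation. The following sketch (my own illustration, not part of the assignment) repeatedly draws data from the true line $Y = 5 + 3X + \varepsilon$ with $\sigma = 0.6$, fits ordinary least squares, and averages the resulting MSE and MSR values; the averages should land near 0.36 and 1026.36:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([1.0, 4.0, 10.0, 11.0, 14.0])
n, reps = x.size, 20000
mse_vals, msr_vals = [], []

for _ in range(reps):
    y = 5.0 + 3.0 * x + rng.normal(0.0, 0.6, size=n)
    # Ordinary least squares fit of Y on X
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x
    mse_vals.append(np.sum((y - y_hat) ** 2) / (n - 2))  # MSE = SSE / (n - 2)
    msr_vals.append(np.sum((y_hat - y.mean()) ** 2))     # MSR = SSR / 1

print(np.mean(mse_vals))  # close to E(MSE) = 0.36
print(np.mean(msr_vals))  # close to E(MSR) = 1026.36
```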
*** Proof for Problem 7 ***

Suppose $b_0$ and $b_1$ solve the "normal" equations (see the linear regression notes):

$$n b_0 + b_1 \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} Y_i, \qquad b_0 \sum_{i=1}^{n} X_i + b_1 \sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} X_i Y_i. \quad (1.1)$$

Next, suppose (without loss of generality) that the nth observation lies on the regression line. Then the normal equations can be rewritten as

$$(n-1) b_0 + b_1 \sum_{i=1}^{n-1} X_i + (b_0 + b_1 X_n) = \sum_{i=1}^{n-1} Y_i + (b_0 + b_1 X_n),$$
$$b_0 \sum_{i=1}^{n-1} X_i + b_1 \sum_{i=1}^{n-1} X_i^2 + X_n (b_0 + b_1 X_n) = \sum_{i=1}^{n-1} X_i Y_i + X_n (b_0 + b_1 X_n), \quad (1.2)$$

where I've used the fact that the nth observation lies on the line defined by $b_0$ and $b_1$, so $Y_n = b_0 + b_1 X_n$. After cancelling common terms, we see that $b_0$ and $b_1$ also solve the new "normal" equations for the remaining n - 1 observations:

$$(n-1) b_0 + b_1 \sum_{i=1}^{n-1} X_i = \sum_{i=1}^{n-1} Y_i, \qquad b_0 \sum_{i=1}^{n-1} X_i + b_1 \sum_{i=1}^{n-1} X_i^2 = \sum_{i=1}^{n-1} X_i Y_i. \quad (1.3)$$

This shows that omitting an observation that lies on the line has no effect on the value of the estimated regression coefficients $b_0$ and $b_1$, and hence has no effect upon the regression line.
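As an empirical companion to the proof, here is a brief sketch (my own illustration, with made-up data) that fits a least squares line, appends a point lying exactly on that line, refits, and confirms the coefficients are unchanged:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0, 9.0])
y = np.array([2.1, 3.9, 8.2, 13.8, 18.1])

# Fit the least squares line to the original data
# (np.polyfit returns the slope first for degree 1)
b1, b0 = np.polyfit(x, y, 1)

# Append a new observation that lies exactly on the fitted line
x_new = np.append(x, 5.0)
y_new = np.append(y, b0 + b1 * 5.0)

b1_new, b0_new = np.polyfit(x_new, y_new, 1)
print(np.allclose([b0, b1], [b0_new, b1_new]))  # True: the line is unchanged
```

Running the same check with a point off the line prints False: by the normal-equation argument above, an off-line point always moves the fit.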