Stat 511 Fall 2006 Final Exam

Statistics 511 Final
Dec. 18, 2006

The following rules apply.
1. You may use 3 sheets of paper for any information you need - double-sided, any font.
2. You may use a calculator.
3. You may not collaborate or copy.
4. You may not use outside resources, such as the internet. You also may not store notes or formulas on your calculator.
5. Failure to comply with items 3 and 4 could lead to a reduction in your grade or to disciplinary action.

I have read the rules above and agree to comply with them.

Signature ________________________________________________
Name (printed) ___________________________________________

1. A problem of continuing interest concerns the effect of air pollution on human health. For an early study, 16 variables were collected in 60 large metropolitan areas of the US. (Source: McDonald, G.C. and Schwing, R.C. (1973) 'Instabilities of regression estimates relating air pollution to mortality', Technometrics, vol. 15, 463-482.) We are particularly interested in whether mortality is related to the pollution variables HC, NOX, and SO2, after adjusting for the other variables. The variables are

1. PREC Average annual precipitation in inches
2. JANT Average January temperature in degrees F
3. JULT Same for July
4. OVR65 % of 1960 SMSA population aged 65 or older
5. POPN Average household size
6. EDUC Median school years completed by those over 22
7. HOUS % of housing units which are sound & with all facilities
8. DENS Population per sq. mile in urbanized areas, 1960
9. NONW % non-white population in urbanized areas, 1960
10. WWDRK % employed in white collar occupations
11. POOR % of families with income < $3000
12. HC Relative hydrocarbon pollution potential
13. NOX Same for nitric oxides
14. SO2 Same for sulphur dioxide
15. HUMID Annual average % relative humidity at 1pm
16. MORT Total age-adjusted mortality rate per 100,000 (the dependent variable)

Computer output for this problem can be found on pages 2-9 of the Computing Handout.

a) Consider the full model. Is there evidence that HC, NOX and SO2 are significant predictors of MORT when the other variables are in the model? Support your answer with a statistical test.

H0: β14 = β15 = β16 = 0
HA: at least one of β14, β15 or β16 is nonzero when the other variables are in the model
Formula for test statistic: F* = [SSR(HC, NOX, SO2 | PREC, JANT, JULT, OVR65, POPN, EDUC, HOUS, DENS, NONW, WWDRK, POOR, HUMID)/3] / MSE
d.f.: 3, 44
Test: F* = [(417.06688 + 9159.28597 + 31.63982)/3] / 1220.00049 = 2.625134
P-value: 0.0622
Conclusion (in words): There is weak evidence (p≈0.06) that at least one of HC, NOX or SO2 is a predictor of mortality. (Failing to reject the null hypothesis is also a reasonable answer.)

b) What assumptions must be satisfied for the test in part a) to be valid?

The observations must be independent and normally distributed, with a mean that depends linearly on the predictors and errors that have mean zero and constant variance.

c) Is there evidence of multicollinearity in the model? Briefly support your answer.

HC and NOX appear to be multicollinear with other variables in the model, as they have high VIF. (Since none of the other variables have high VIF, it is likely that HC and NOX are highly correlated with each other.)

d) The investigators felt that a model with fewer variables would be more readily interpretable. Using the output given on p. 4 of the Computing Handout, about how many variables are needed to attain predictive power similar to the predictive power of the full model? Briefly justify your answer.

SBC appears to be minimized at p = 5 parameters, which is 4 variables. R² continues to increase for 5 and 6 variables before tapering off. So about 4-6 variables appear to be needed to obtain predictive power similar to the full model.
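
The arithmetic behind the partial F test in part a) can be checked with a short script. This is only a sketch: it uses the three sequential sums of squares and the MSE quoted in part a), the helper names f_pdf and f_upper_tail are illustrative and do not come from the Computing Handout, and the tail probability is obtained by simple numerical integration (Simpson's rule) rather than by the routine that produced the handout's output.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom.

    Returning 0 at x <= 0 is correct for d1 > 2, which holds here (d1 = 3).
    """
    if x <= 0.0:
        return 0.0
    a, b = d1 / 2.0, d2 / 2.0
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_pdf = (a * math.log(d1 / d2) + (a - 1.0) * math.log(x)
               - (a + b) * math.log1p(d1 * x / d2) - log_beta)
    return math.exp(log_pdf)

def f_upper_tail(f_obs, d1, d2, n=20000):
    """P(F > f_obs), integrating the density on [0, f_obs] by Simpson's rule.

    n is the number of subintervals and must be even.
    """
    h = f_obs / n
    total = f_pdf(0.0, d1, d2) + f_pdf(f_obs, d1, d2)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return 1.0 - total * h / 3.0

# Values quoted in part a): extra sums of squares for HC, NOX, SO2 and full-model MSE.
ss_extra = 417.06688 + 9159.28597 + 31.63982
mse_full = 1220.00049

f_star = (ss_extra / 3) / mse_full          # numerator d.f. = 3
p_value = f_upper_tail(f_star, 3, 44)       # denominator d.f. = 60 - 16 = 44

print(f"F* = {f_star:.4f}, p-value = {p_value:.4f}")
```

The denominator degrees of freedom follow from 60 observations less 16 estimated parameters (15 predictors plus the intercept), which agrees with the 3, 44 stated in the solution.
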
e) Based on the all subsets regression, the investigator decided to use the model that included the predictors PREC, JANT, JULT, NONW and SO2. Based on this model, is there evidence that any of the pollution variables are significant predictors of MORT?

H0: β5 = 0
HA: β5 ≠ 0
Formula for test statistic: t* = b5/s(b5), or F* = SSI/MSE = SSII/MSE (Type I and Type II sums of squares)
d.f.: 54 (t-test) or 1, 54 (F-test)
Test: t* = 4.05
P-value: 0.0002
Conclusion (in words): SO2 is a highly significant predictor of mortality.

f) The investigators noted that none of the candidate models included HC or NOX and concluded that these 2 variables are not important causal factors in MORTality. Is there statistical support for this conclusion?

There is no evidence of linear association between MORTality and HC or NOX when all other variables are in the model. However, we cannot readily infer causality (or lack of causality) from an observational study. (I will accept any reasonable answer that includes the idea that there can at best be weak evidence of causality from this type of study. We also cannot conclude a causal link with SO2, only a strong association. For example, SO2 may be strongly correlated with some other factor that is the cause, such as particulates.)

2. On Jan. 28, 1986, the space shuttle Challenger exploded less than 2 minutes into its flight, killing all on board. The forecast temperature was 31ºF, more than 20 degrees colder than any of the previous 24 flights. The night before the disaster, a team of engineers had met to determine whether the cold weather indicated that the flight should be cancelled. Among other data they investigated was the failure of a critical connector called an O-ring. O-rings had failed in 7 of 23 previous flights. (The O-rings were lost at sea for 1 flight.) We will look at the question of whether O-ring failure could be associated with temperature. The dependent variable is "Damage".
Damage=1 if any of the 6 O-rings failed during the flight.
Damage=0 if none of the O-rings failed during the flight.
Some computer output associated with this problem is given on pp. 10-13 of the Computing Handout.

a) Consider the loess curve on p. 10. Based on this curve, what is the estimated probability of at least one failure at 70ºF and at 31ºF? (You will have to extrapolate, but these were the data available at the time.)

At 70ºF: 21%. At 31ºF: 100%.

b) Based on the fitted logistic regression, what is the estimated probability of at least one failure at 70ºF and at 31ºF?

p = exp(b0 + b1 Temp)/(1 + exp(b0 + b1 Temp))
At 70ºF: p = 22.95%. At 31ºF: p = 99.96%.

c) Is there any evidence of lack of fit of the logistic regression model? Briefly justify your answer.

The Hosmer and Lemeshow test does not give any evidence of lack of fit. The loess fit to the residuals is very flat, indicating no lack of fit.

d) The engineers who gave the "OK" for the flight stated that there was no relationship between temperature and the probability of O-ring failure. Do a formal test of this hypothesis. Do you agree with their conclusion?

H0: β1 = 0
HA: β1 ≠ 0
Test: LRT = 7.952 (df = 1); Wald test = 4.6008
P-value: 0.0048 (LRT); 0.0320 (Wald)
Conclusion (in words): There is a significant association between temperature and the probability of at least one O-ring failure. The probability increases as the temperature decreases.

Note: The engineers failed to discover the relationship because they considered only the 7 flights on which O-rings failed, rather than considering all 23 flights for which data were available. As far as I know, there are still no statisticians on the panel that makes these decisions.

3. In State College, many homes use electric heat instead of a furnace. Of course, these homes also use electricity to run appliances and lights and for cooling in the summer.
In a study of electricity use, homeowners recorded their daily electricity usage (in kilowatt hours) for 55 consecutive months. The average daily usage was recorded for each month. The "average" temperature for each month was computed by averaging the daily maximum and minimum temperatures. A polynomial regression model was fitted to the data. Some computer output for the problem is given on pp. 14-15.

a) Using unpooled sequential tests, find an appropriate degree for the polynomial.

The simplest way to do this is to note that the test statistic at each step is F* = SSI/MSE, where SSI is the Type I (sequential) sum of squares for the highest remaining power k, and that we fail to reject H0: βk = 0 if F* < F(0.95; 1, 50) = 4.03. F* < 4.03 implies that SSI < 4.03*MSE = 402.61. So we fail to reject H0 for the 4th and 3rd powers, but we reject β2 = 0. We conclude that a quadratic polynomial provides an appropriate fit to the data.

b) Actually, the homeowners travelled several times during the period of data collection. The investigators decided to use data only from days when they were at home. The monthly usage variable is the mean only of the ni days in month i for which the homeowners were actually at home. Suppose that the daily usage values are independent with constant variance on those days. What is the variance of the average usage in month i?

Call the daily variance σ². Then the variance of the average usage is σ²/ni.

c) Suppose that the investigators wanted to take into account the number of days in the month for which the homeowners were at home by using weighted least squares. What weights would be used?

The weights should be proportional to 1/variance = ni/σ². Since σ² is not known, the weights should be ni.

d) Suppose the MSE from the weighted least squares regression in part c is 80. What is the estimated variance of the daily electricity usage?

Using weighted least squares with weights kV⁻¹, where V is the variance of the responses, the MSE estimates k. In this case the weights ni equal σ² times 1/(σ²/ni), so k is σ². The estimate of the daily variance is therefore 80.
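
The claim in part d), that the weighted MSE estimates the daily variance when the weights are ni, can be illustrated by simulation. The sketch below does not use the exam data. It assumes a straight-line mean in temperature (rather than the quadratic chosen in part a), hypothetical coefficients B0 and B1, 2000 simulated months instead of 55, and a daily variance of 80 chosen to match the number in part d).

```python
import random

random.seed(511)           # fixed seed so the run is repeatable

SIGMA2 = 80.0              # assumed daily variance (the value to recover)
B0, B1 = 90.0, -0.8        # hypothetical intercept and slope for mean usage
M = 2000                   # number of simulated months (far more than 55)

temps, means, days = [], [], []
for _ in range(M):
    n_i = random.randint(10, 31)          # days at home in month i
    x_i = random.uniform(20.0, 80.0)      # monthly average temperature
    daily = [B0 + B1 * x_i + random.gauss(0.0, SIGMA2 ** 0.5)
             for _ in range(n_i)]
    temps.append(x_i)
    means.append(sum(daily) / n_i)        # has variance sigma^2 / n_i (part b)
    days.append(n_i)

# Weighted least squares with weights w_i = n_i (part c), using the
# closed-form solution for a straight line.
w_sum = sum(days)
x_bar = sum(w * x for w, x in zip(days, temps)) / w_sum
y_bar = sum(w * y for w, y in zip(days, means)) / w_sum
sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(days, temps, means))
sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(days, temps))
b1_hat = sxy / sxx
b0_hat = y_bar - b1_hat * x_bar

# Weighted MSE: sum of w_i * residual^2 over (M - 2) degrees of freedom.
sse_w = sum(w * (y - b0_hat - b1_hat * x) ** 2
            for w, x, y in zip(days, temps, means))
mse_w = sse_w / (M - 2)

print(f"weighted MSE = {mse_w:.1f} (daily variance used: {SIGMA2})")
```

Each weighted squared residual has expectation close to ni times σ²/ni = σ², which is why the weighted MSE lands near the daily variance rather than near the variance of a monthly mean. With only 55 months, as in the exam, the estimate would be considerably noisier.
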