Stat 511
Fall 2006
Final Exam
Statistics 511
Final
Dec. 18, 2006
The following rules apply.
1. You may use 3 sheets of paper for any information you need - double-sided,
any font.
2. You may use a calculator.
3. You may not collaborate or copy.
4. You may not use outside resources, such as the internet. As well, you may
not store notes or formulas on your calculator.
5. Failure to comply with items 3 and 4 could lead to reduction in your grade,
or disciplinary action.
I have read the rules above and agree to comply with them.
Signature ________________________________________________
Name (printed) ___________________________________________
1. A problem of continuing interest concerns the effect of air pollution on human health. For an
early study, 16 variables were collected in 60 large metropolitan areas of the US. (Source:
McDonald, G.C. and Schwing, R.C. (1973) 'Instabilities of regression estimates relating air pollution to mortality',
Technometrics, vol.15, 463-482.)
We are particularly interested in whether mortality is related to the pollution variables HC, NOX,
and SO2, after adjusting for the other variables.
The variables are
1. PREC Average annual precipitation in inches
2. JANT Average January temperature in degrees F
3. JULT Same for July
4. OVR65 % of 1960 SMSA population aged 65 or older
5. POPN Average household size
6. EDUC Median school years completed by those over 22
7. HOUS % of housing units which are sound & with all facilities
8. DENS Population per sq. mile in urbanized areas, 1960
9. NONW % non-white population in urbanized areas, 1960
10. WWDRK % employed in white collar occupations
11. POOR % of families with income < $3000
12. HC Relative hydrocarbon pollution potential
13. NOX Same for nitric oxides
14. SO2 Same for sulphur dioxide
15. HUMID Annual average % relative humidity at 1pm
16. MORT Total age-adjusted mortality rate per 100,000
(The dependent variable)
Computer output for this problem can be found on pages 2-9 of the Computing Handout.
a) Consider the full model. Is there evidence that HC, NOX and SO2 are significant predictors of
MORT when the other variables are in the model? Support your answer with a statistical test.
H0: β14 = β15 = β16 = 0
HA: at least one of β14, β15 or β16 is nonzero when the
other variables are in the model
Formula for test statistic:
F* = [SSR(HC, NOX, SO2 | PREC, JANT, JULT, OVR65, POPN, EDUC, HOUS, DENS, NONW, WWDRK, POOR, HUMID)/3] / MSE
d.f.: 3, 44
Test: F* = [(417.06688 + 9159.28597 + 31.63982)/3] / 1220.00049 = 2.625134
P-value: 0.0622
Conclusion (in words): There is weak evidence (p≈0.06) that at least one of HC, NOX or SO2 is
a predictor of mortality. (Failing to reject the null hypothesis is also a reasonable answer.)
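The arithmetic in part a) can be checked with a short script. This is a minimal sketch using only the Python standard library: the sums of squares, MSE and degrees of freedom are the values quoted above, while the helper names (f_pdf, f_upper_tail) and the use of Simpson's rule to approximate the tail area are assumptions made for illustration, not the method used to produce the handout.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom."""
    if x <= 0.0:
        return 0.0  # the density is 0 at x = 0 when d1 > 2, as it is here
    log_beta = (math.lgamma(d1 / 2) + math.lgamma(d2 / 2)
                - math.lgamma((d1 + d2) / 2))
    log_f = (0.5 * (d1 * math.log(d1 * x) + d2 * math.log(d2)
                    - (d1 + d2) * math.log(d1 * x + d2))
             - math.log(x) - log_beta)
    return math.exp(log_f)

def f_upper_tail(f_star, d1, d2, steps=2000):
    """Approximate P(F > f_star) with Simpson's rule on [0, f_star]."""
    h = f_star / steps  # steps must be even for Simpson's rule
    total = f_pdf(0.0, d1, d2) + f_pdf(f_star, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return 1.0 - total * h / 3.0

# Extra sums of squares for HC, NOX and SO2, and the full-model MSE (from part a).
extra_ss = [417.06688, 9159.28597, 31.63982]
mse_full = 1220.00049
df_num, df_den = 3, 44

f_star = (sum(extra_ss) / df_num) / mse_full
p_value = f_upper_tail(f_star, df_num, df_den)
print(f"F* = {f_star:.4f}, approximate p-value = {p_value:.4f}")
```

A statistics package would normally supply the F tail probability directly; the numerical integration here only keeps the sketch free of third-party dependencies.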
b) What assumptions must be satisfied for the test in part a) to be valid?
The responses must be independent and normally distributed, with a mean that depends linearly on
the predictor variables. Equivalently, the errors must be independent and normal with mean zero
and constant variance.
c) Is there evidence of multicollinearity in the model? Briefly support your answer.
HC and NOX appear to be multicollinear with other variables in the model, as they have high
VIF. (Since none of the other variables have high VIF, it is likely that HC and NOX are highly
correlated with each other.)
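For reference, the usual definition of the variance inflation factor for predictor j is VIF_j = 1/(1 - R_j²), where R_j² comes from regressing predictor j on the other predictors. The sketch below only illustrates that formula; the R² values passed in are hypothetical and are not taken from the Computing Handout.

```python
def vif(r_squared_j):
    """Variance inflation factor from the R^2 of predictor j on the other predictors."""
    if not 0.0 <= r_squared_j < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared_j)

# Hypothetical R^2 values, chosen only to show how quickly VIF grows.
for r2 in (0.0, 0.5, 0.9, 0.99):
    print(f"R^2 = {r2:.2f} -> VIF = {vif(r2):.1f}")
```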
d) The investigators felt that a model with fewer variables would be more readily interpretable.
Using the output given on p. 4 of the Computing Handout, about how many variables are needed
to attain predictive power similar to the predictive power of the full model? Briefly justify your
answer.
SBC appears to be minimized at p=5 parameters, which corresponds to 4 predictor variables plus
the intercept. R² continues to increase for 5 and 6 variables, before tapering off. So, about
4-6 variables appear to be needed to obtain predictive power similar to the full model.
e) Based on the all subsets regression, the investigator decided to use the model that included
predictors: PREC, JANT, JULT, NONW and SO2.
Based on this model, is there evidence that any of the pollution variables are significant
predictors of MORT?
H0: β5 = 0
HA: β5 ≠ 0
Formula for test statistic: t* = b5/s(b5) or F* = SSI/MSE = SSII/MSE
d.f.: 54 (t-test) or 1, 54 (F-test)
Test: t* = 4.05
P-value: 0.0002
Conclusion (in words): SO2 is a highly significant predictor of mortality.
f) The investigators noted that none of the candidate models included HC or NOX and concluded
that these 2 variables are not important causal factors in MORTality. Is there statistical support
for this conclusion?
There is no evidence of linear association between MORTality and HC or NOX, when all other
variables are in the model. However, we cannot readily infer causality (or lack of causality)
from an observational study.
(I will accept any reasonable answer that includes the idea that there can at best be weak
evidence of causality from this type of study. We also cannot conclude a causal link with SO2,
only a strong association. For example, it may be that SO2 is strongly correlated with some
other factor that is the cause, such as particulates.)
2. On Jan. 28, 1986, the space shuttle Challenger exploded less than 2 minutes into its flight,
killing all on board. The forecast temperature was 31ºF, more than 20 degrees colder than any of
the previous 24 flights. The night before the disaster, a team of engineers had met to determine
whether the cold weather indicated that the flight should be cancelled. Among other data they
investigated was the failure of a critical connector called an O-ring. O-rings had failed in 7 of 23
previous flights. (The O-rings were lost at sea for 1 flight.) We will look at the question of
whether O-ring failure could be associated with temperature.
The dependent variable is "Damage". Damage=1 if any of the 6 O-rings failed during the flight.
Damage=0 if none of the O-rings failed during the flight. Some computer output associated with
this problem is given on p. 10 – 13 of the Computing Handout.
a) Consider the loess curve on p. 10. Based on this curve, what is the estimated probability of at
least one failure at 70ºF and at 31ºF? (You will have to extrapolate – but these were the data
available at the time.)
At 70ºF: 21%
At 31ºF: 100%
b) Based on the fitted logistic regression, what is the estimated probability of at least one failure
at 70ºF and at 31ºF?
p = exp(b0 + b1 Temp)/(1 + exp(b0 + b1 Temp))
At 70ºF: p = 22.95%
At 31ºF: p = 99.96%
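The logistic formula above can be sketched in a few lines. The fitted coefficients b0 and b1 appear only in the Computing Handout, so this example does not use them. Instead, as an assumption made purely for illustration, it back-solves the line on the logit scale that passes through the two reported probabilities; because those probabilities are rounded, the recovered coefficients are only rough approximations of the fitted ones. The function names are hypothetical.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """exp(eta) / (1 + exp(eta)), written in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-eta))

def line_through(t1, p1, t2, p2):
    """Intercept and slope of the logit-scale line through two (temp, prob) points."""
    b1 = (logit(p2) - logit(p1)) / (t2 - t1)
    b0 = logit(p1) - b1 * t1
    return b0, b1

# Reported estimates: 22.95% at 70 degrees F and 99.96% at 31 degrees F.
b0, b1 = line_through(70.0, 0.2295, 31.0, 0.9996)
print(f"approximate b0 = {b0:.3f}, approximate b1 = {b1:.4f}")

for temp in (70.0, 31.0):
    print(f"Temp = {temp:.0f}: p = {inv_logit(b0 + b1 * temp):.4f}")
```

The negative slope recovered this way matches the direction of the relationship discussed in part d): the estimated failure probability rises as temperature falls.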
c) Is there any evidence of lack of fit of the logistic regression model? Briefly justify your
answer.
The Hosmer and Lemeshow test does not give any evidence of lack of fit. The loess fit to the
residuals is very flat, indicating no lack of fit.
d) The engineers who gave the "OK" for the flight stated that there was no relationship between
temperature and the probability of O-ring failure. Do a formal test of this hypothesis. Do you
agree with their conclusion?
H0: 1=0
Test: LRT = 7.952 (df = 1)
P-value: (LRT 0.0048)
HA:1 ≠0
Wald test = 4.6008
(Wald 0.0320)
Conclusion (in words). This is a significant association between temperature and the probability
of at least one o-ring failure. The probability increases as the temperature decreases.
Note: The engineers failed to discover the relationship because they considered only the 7 flights
on which o-rings failed, rather than considering all 23 flights for which data were available. As
far as I know, there are still no statisticians on the panel that makes these decisions.
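The two p-values in part d) can be reproduced from the reported statistics, assuming both are referred to a chi-square distribution with 1 degree of freedom. For 1 degree of freedom the upper tail has a closed form, P(X > x) = erfc(sqrt(x/2)), so the standard library is enough. The function name is hypothetical.

```python
import math

def chi2_df1_upper_tail(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

p_lrt = chi2_df1_upper_tail(7.952)    # likelihood ratio statistic from part d)
p_wald = chi2_df1_upper_tail(4.6008)  # Wald chi-square statistic from part d)
print(f"LRT p-value  = {p_lrt:.4f}")
print(f"Wald p-value = {p_wald:.4f}")
```

The identity holds because a chi-square variable with 1 degree of freedom is the square of a standard normal variable.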
3. In State College, many homes use electric heat instead of a furnace. Of course, these homes
also use electricity to run appliances and lights and for cooling in the summer.
In a study of electricity use, homeowners recorded their daily electricity usage (in kilowatt
hours) for 55 consecutive months. The average daily usage was recorded for each month. The
"average" temperature for each month was computed by averaging the daily maximum and
minimum temperatures. A polynomial regression model was fitted to the data.
Some computer output for the problem is given on p. 14 - 15.
a) Using unpooled sequential tests, find an appropriate degree for the polynomial.
The simplest way to do this is to note that the test statistic at each step is
F* = SSI/MSE, where MSE comes from the full (4th-degree) model, and we fail to reject
H0: βk = 0 if F* < F(0.95; 1, 50) = 4.03.
F* < 4.03 implies that SSI < 4.03*MSE = 402.61.
So we fail to reject H0 for the 4th and 3rd powers, but we reject β2 = 0.
We conclude that a quadratic polynomial provides an appropriate fit to the data.
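The unpooled sequential procedure can be sketched as a loop from the highest power downward, stopping at the first power whose sequential F* reaches the critical value. The critical value 4.03 and the threshold 402.61 are the values quoted above. The sequential sums of squares below are hypothetical placeholders (the real ones are on p. 14 - 15 of the Computing Handout), chosen only so that the loop reaches the same conclusion as the solution. The function name is also hypothetical.

```python
def choose_degree(seq_ss_by_power, mse_full, f_crit):
    """Return the highest power whose 1-df sequential F* = SS/MSE is significant."""
    for power in sorted(seq_ss_by_power, reverse=True):
        f_star = seq_ss_by_power[power] / mse_full
        if f_star >= f_crit:
            return power  # first significant term when working downward
    return 0  # no polynomial term is significant

f_crit = 4.03               # F(0.95; 1, 50), as quoted in the solution
mse_full = 402.61 / f_crit  # MSE implied by the threshold 4.03 * MSE = 402.61

# HYPOTHETICAL sequential sums of squares, for illustration only.
hypothetical_ss = {1: 5000.0, 2: 2500.0, 3: 150.0, 4: 40.0}

degree = choose_degree(hypothetical_ss, mse_full, f_crit)
print(f"chosen degree: {degree}")
```

Because the tests are unpooled, the same full-model MSE is used as the denominator at every step.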
b) Actually, the homeowners travelled several times during the period of data collection. The
investigators decided to use data only from days when they were at home. The monthly usage
variable is the mean only of the ni days in month i for which the homeowners were actually at
home. Suppose that the daily usage values are independent with constant variance on those days.
What is the variance of the average usage in month i?
Call the daily variance σ². Then the variance of the average usage is σ²/ni.
c) Suppose that the investigators wanted to take into account the number of days in the month for
which the homeowners were at home by using weighted least squares. What weights would be
used?
The weights should be proportional to 1/variance = ni/σ². Since σ² is not known, the weights
should be ni.
d) Suppose the MSE from the weighted least squares regression in part c is 80. What is the
estimated variance of the daily electricity usage?
Using weighted least squares with weight matrix W = kV⁻¹, where V is the covariance matrix of
the responses, the MSE estimates k. Here V = σ²·diag(1/ni) and W = diag(ni), so k = σ².
So the estimate of the daily variance is 80.