Download 1 Full model [35] - UC Davis Plant Sciences

Homework 3. Multiple linear regression modeling.
1 Full model [35]
Use the Spartina.jmp data. Follow the steps described in the notes “Chapter 8. Regression
Diagnostics.”
1.1 For each section 8.1.1-8.1.7 write one sentence indicating the result and
stating that you were able to complete the steps, or record the step you
are not able to perform or understand. Pose a question whose answer will
help you complete the step or understand the procedure. [35]
See notes for detailed procedures and results.
2 Variable selection [65]
2.1 Use the Fit Model platform to obtain a complete report of the regression of
biomass on sal, pH, K, Na, and Zn (same as above).
2.1.1 Obtain the VIFs and standardized regression coefficients for the full model. Indicate whether
there is a severe collinearity problem. [20]
See notes for detailed procedures and results.
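As a cross-check outside JMP, the VIFs can be computed as the diagonal of the inverse of the predictors' correlation matrix. The sketch below is only an illustration under stated assumptions: the three toy columns are made up (they are not the Spartina data), the helper names are hypothetical, and the VIF > 10 cutoff mentioned in the comment is a common rule of thumb that may differ from the one in the course notes.

```python
from statistics import mean

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    centered = []
    for col in columns:
        m = mean(col)
        centered.append([v - m for v in col])

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    norms = [dot(c, c) ** 0.5 for c in centered]
    k = len(columns)
    return [[dot(centered[i], centered[j]) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vifs(columns):
    """VIF_k = k-th diagonal element of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[k][k] for k in range(len(columns))]

# Made-up toy predictors: x2 nearly duplicates x1, so both get large VIFs.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
x3 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
toy_vifs = vifs([x1, x2, x3])
# Rule of thumb (an assumption here): VIF > 10 suggests severe collinearity.
print([round(v, 1) for v in toy_vifs])
```

With only two predictors the result reduces to 1/(1 - r^2), where r is their correlation, which is a quick way to sanity-check the function.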
2.2 Use the Fit Model platform to select the best subset of variables.
2.2.1 What type of goal would motivate variable selection? [5]
Variable selection is a means of dealing with collinearity. It is useful for getting more precise
estimates of the partial regression coefficients when the goal is to understand which factors affect
biomass and how much each factor affects it. Remedial measures for collinearity are only necessary
when the data are observational and the focus is on the estimated partial regression coefficients
instead of on just making predictions for the response variable.
2.2.2 Use the Fit Model platform, enter all 5 X’s in the Effects box and bmss in the Y box as
before. Select the Stepwise personality and click Run. In the new window, select the red triangle
and choose All possible regressions. Using Ctrl-click or right-click, add Cp to the
table of results for all models that appears inside the window. Then, use the same menu to
choose Make into Data Table. Save your new table. Add a formula to the table to calculate the
AIC. Note that RMSE is the square root of the MSE for each model, that is, of SSE/(n − p),
where p is the number of estimated parameters including the intercept. [20]
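The AIC formula column needs SSE, which the table of all models does not list directly but which can be recovered from RMSE. The sketch below assumes the least-squares form AIC_p = n ln(SSE_p) − n ln(n) + 2p; the course notes may use a version that differs by an additive constant, which would not change the ranking of models fitted to the same n. The numbers are made up, and n = 45 is an assumption inferred from the 15-row validation third used later in this homework.

```python
import math

def sse_from_rmse(rmse, n, p):
    """Recover SSE from RMSE, using RMSE = sqrt(SSE / (n - p))."""
    return rmse ** 2 * (n - p)

def aic(sse, n, p):
    """Assumed definition: AIC_p = n*ln(SSE_p) - n*ln(n) + 2p."""
    return n * math.log(sse) - n * math.log(n) + 2 * p

# Hypothetical values for illustration only (not results from Spartina.jmp).
n = 45          # number of observations (assumed)
p = 3           # estimated parameters: intercept + 2 predictors
rmse = 400.0    # RMSE as it would appear in the table (made up)

sse = sse_from_rmse(rmse, n, p)
print(sse, aic(sse, n, p))
```

In the JMP formula editor the same calculation would be written with the table's RMSE and Number columns, with p equal to Number + 1 when the intercept is included.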
2.2.3 For Cp, AIC, and R², plot the best value at each level of Number (the number of predictors
in the model) against Number. On the graphs, write the best subset of X variables based on each
criterion. [20]
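Picking the best model at each level of Number can also be done programmatically once the table has been saved. The following is a minimal sketch: the rows are made-up placeholders (not values from Spartina.jmp), and the field names and the helper `best_per_size` are hypothetical. Smaller is better for Cp and AIC, and larger is better for R².

```python
def best_per_size(rows, criterion, larger_is_better=False):
    """Return {number_of_predictors: best row} for one criterion."""
    best = {}
    for row in rows:
        size = row["number"]
        current = best.get(size)
        if current is None:
            best[size] = row
            continue
        if larger_is_better:
            better = row[criterion] > current[criterion]
        else:
            better = row[criterion] < current[criterion]
        if better:
            best[size] = row
    return best

# Made-up placeholder rows standing in for the saved table of all models.
models = [
    {"terms": "pH",       "number": 1, "r2": 0.60, "cp": 7.4, "aic": 561.0},
    {"terms": "Na",       "number": 1, "r2": 0.07, "cp": 70.1, "aic": 599.0},
    {"terms": "pH,Na",    "number": 2, "r2": 0.66, "cp": 2.3, "aic": 556.0},
    {"terms": "pH,K",     "number": 2, "r2": 0.65, "cp": 3.1, "aic": 557.0},
    {"terms": "pH,Na,Zn", "number": 3, "r2": 0.66, "cp": 4.0, "aic": 558.0},
]

for name, larger in (("cp", False), ("aic", False), ("r2", True)):
    picks = best_per_size(models, name, larger_is_better=larger)
    # One (Number, best value, terms) triple per model size: the points to plot.
    print(name, [(k, picks[k][name], picks[k]["terms"]) for k in sorted(picks)])
```

Because R² never decreases as predictors are added, its plot is usually read for the point where the curve levels off, not for its maximum.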
3 Variance of Yhat and of Beta hat [35]
3.1.1 Run a new model with only the X variables selected on the basis of AIC.
3.1.2 Using the custom test, estimate bmss at the average values of the X’s in both the 5-variable
model (step 2.1) and the reduced model from step 3.1.1. [15]
3.1.3 Compare the regression coefficients and their variances between the full and reduced
model and discuss the reasons for the differences. Compare the variances of the Custom
tests and their values and discuss the differences. There may be a larger difference
between models in the variances of regression coefficients than in estimated Y. Why?
[20]
The selection of a subset of variables resulted in less collinearity. Thus, the variances of the
coefficients for Na and pH are reduced by 50% in the model with only those two predictors. The
variance of the prediction of Y does not change much, because a prediction near the centroid of
the predictors is not affected by collinearity.
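The contrast can be made explicit with the standard least-squares variance formulas. This is a sketch in notation that is assumed here and not taken from the notes: R_k² is the R² from regressing predictor k on the other predictors, x_h is the vector of predictor values (including the leading 1) at which Y is estimated, and σ² is the error variance, estimated in practice by the MSE.

```latex
\begin{aligned}
\operatorname{Var}(b_k)
  &= \frac{\sigma^2}{\sum_i (x_{ik}-\bar{x}_k)^2}\cdot\frac{1}{1-R_k^2}
   = \frac{\sigma^2}{\sum_i (x_{ik}-\bar{x}_k)^2}\,\mathrm{VIF}_k, \\
\operatorname{Var}(\hat{Y}_h)
  &= \sigma^2\,\mathbf{x}_h^{\top}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{x}_h, \\
\operatorname{Var}(\hat{Y}_h)\Big|_{\mathbf{x}_h=\bar{\mathbf{x}}}
  &= \frac{\sigma^2}{n}.
\end{aligned}
```

The coefficient variance is inflated by VIF_k, so it drops when collinear predictors are removed. At the centroid the fitted value equals the mean of Y, whose variance σ²/n involves no VIF; the estimated variance changes between models only through the MSE.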
4 Validation [15]
4.1.1 Split the data into 3 randomly selected subsets by using a column with random numbers.
Estimate parameters using the reduced model with 2/3 of the data and calculate the
MSPR with the remaining data. The predicted values for the validation data set are
calculated by Save Columns Prediction Formula. [15]
For this, create a column with the following formula:
Random Uniform()
Then select Table -> Sort and choose the column with the random numbers as the sort column. After
sorting, select the last 15 rows and exclude them by selecting Rows -> Exclude/Unexclude. Run the
Fit Model platform as before, with Na and pH as the predictors and bmss as the response. In the
output window select Save columns -> Predicted values. In the new column with the predicted
values, select and delete all values except for the last 15, which belong to the excluded
(validation) rows. Use the remaining values and the formula for MSPR to calculate MSPR.
I obtained MSPR = 94416.
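The split and the MSPR calculation can be sketched outside JMP as follows. This assumes the usual definition, MSPR = Σ(Y_i − Ŷ_i)² / n*, summed over the n* validation cases; the notes' formula should take precedence if it differs. The function names, the seed, and the toy numbers are hypothetical and are not meant to reproduce the value reported above.

```python
import random

def split_indices(n, n_validation, seed=0):
    """Shuffle row indices and hold out the last n_validation of them."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    return order[:-n_validation], order[-n_validation:]

def mspr(observed, predicted):
    """Mean squared prediction error over the validation cases."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted)) / len(observed)

# 45 rows with 15 held out mirrors the one-third split described above
# (45 is an assumption inferred from the 15 excluded rows).
train_rows, validation_rows = split_indices(45, 15)

# Made-up observed and predicted responses for three validation cases.
toy_mspr = mspr([10.0, 12.0, 9.0], [11.0, 10.0, 9.0])
print(len(train_rows), len(validation_rows), toy_mspr)
```

Comparing MSPR with the MSE of the model fitted to the training rows gives a rough indication of how well the reduced model predicts new data.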