Homework 3. Multiple linear regression modeling.

1 Full model [35]

Use the Spartina.jmp data. Follow the steps described in the notes "Chapter 8. Regression Diagnostics."

1.1 For each section 8.1.1-8.1.7, write one sentence indicating the result and stating that you were able to complete the steps, or record the step you were not able to perform or understand. Pose a question whose answer will help you complete the step or understand the procedure. [35]

See notes for detailed procedures and results.

2 Variable selection [65]

2.1 Use the Fit Model platform to obtain a complete report of the regression of biomass on sal, pH, K, Na, and Zn (same as above).

2.1.1 Obtain the VIFs and standardized regression coefficients for the full model. Indicate whether there is a severe collinearity problem. [20]

See notes for detailed procedures and results.

2.2 Use the Fit Model platform to select the best subset of variables.

2.2.1 What type of goal would motivate variable selection? [5]

Variable selection is a way to deal with collinearity. It is useful for obtaining more precise estimates of the partial regression coefficients when the goal is to understand which factors affect biomass, and how much each factor affects it. Remedial measures for collinearity are necessary only when the data are observational and the focus is on the estimated partial regression coefficients rather than on simply making predictions for the response variable.

2.2.2 Use the Fit Model platform, entering all 5 X's in the Effects box and bmss in the Y box as before. Select the Stepwise personality and Run. In the new window, click the red triangle and choose All possible regressions. Using Ctrl-click or right-click, add Cp to the table of results for all models that appears inside the window. Then use the same menu to choose Make into Data Table, and save your new table. Add a formula to the table to calculate the AIC. Note that the RMSE reported for each model is the square root of the MSE, not of the SSE, so recover the SSE for the AIC formula as SSE = RMSE^2 x (n - p).
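The AIC column (and a Cp check) can also be computed outside JMP. A minimal sketch, assuming the common textbook forms AIC = n ln(SSE/n) + 2p and Cp = SSE_p / MSE_full - (n - 2p), where p counts all estimated parameters including the intercept; the numbers below are illustrative placeholders, not the Spartina results:

```python
import math

def aic(sse: float, n: int, p: int) -> float:
    """AIC for a least-squares model: n*ln(SSE/n) + 2p (p includes the intercept)."""
    return n * math.log(sse / n) + 2 * p

def mallows_cp(sse: float, mse_full: float, n: int, p: int) -> float:
    """Mallows' Cp: SSE_p / MSE_full - (n - 2p)."""
    return sse / mse_full - (n - 2 * p)

# Hypothetical values only: a 3-parameter subset model with SSE = 100,
# full-model MSE = 2, and n = 50 observations.
print(mallows_cp(100, 2, 50, 3))            # -> 6.0
print(round(aic(100, 50, 3), 3))            # -> 40.657
```

Some texts write the AIC penalty as 2(p + 1) or add constant terms; any version that differs only by a constant ranks the candidate models identically, but match whichever form your notes use.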
[20]

2.2.3 For Cp, AIC, and R2, plot the best value for each number of predictors (the Number column) against that number. On the graphs, write the best subset of X variables according to each criterion. [20]

3 Variance of Yhat and of Beta hat [35]

3.1.1 Run a new model with only the X variables selected on the basis of AIC.

3.1.2 Using the custom test, estimate bmss at the average values of the X's, both in the 5-variable model (step 2.1) and in the reduced model from step 3.1.1. [15]

3.1.3 Compare the regression coefficients and their variances between the full and reduced models and discuss the reasons for the differences. Compare the values and variances of the custom tests and discuss the differences. There may be a larger difference between models in the variances of the regression coefficients than in the estimated Y. Why? [20]

The selection of a subset of variables reduced the collinearity. As a result, the variances of the coefficients for Na and pH are reduced by about 50% in the model with only those two predictors. The variance of the prediction of Y does not change much, because it is not affected by collinearity when the prediction is made near the centroid of the predictors.

4 Validation [15]

4.1.1 Split the data into 3 randomly selected subsets by using a column of random numbers. Estimate the parameters of the reduced model with 2/3 of the data and calculate the MSPR with the remaining third. The predicted values for the validation data set are obtained with Save Columns -> Prediction Formula. [15]

For this, create a column with the following formula: Random Uniform(). Then select Tables -> Sort and choose the column with the random numbers as the sort column. After sorting, select the last 15 rows and exclude them by selecting Rows -> Exclude/Unexclude. Run the Fit Model platform as before, with Na and pH as the predictors and bmss as the response. In the output window select Save Columns -> Prediction Formula. In the new column with the predicted values, select and delete all values except for the last 15.
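The splitting and refitting steps described above can be sketched outside JMP as well. A minimal version with numpy, using synthetic placeholder data in place of the Spartina table (the column names Na, pH, and bmss follow the assignment; the sample size of 45, the coefficients, and every numeric value are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder data standing in for the Spartina table (45 rows assumed).
n = 45
Na = rng.normal(20000, 5000, n)
pH = rng.normal(4.5, 1.0, n)
bmss = 3000 - 0.05 * Na + 400 * pH + rng.normal(0, 300, n)

# The "Random Uniform()" column plus Tables -> Sort amounts to a
# random permutation of the rows.
order = np.argsort(rng.uniform(size=n))
Na, pH, bmss = Na[order], pH[order], bmss[order]

# Exclude the last 15 rows (the validation set); fit the reduced
# model bmss ~ Na + pH on the remaining 30 rows.
train, valid = slice(0, n - 15), slice(n - 15, n)
X_train = np.column_stack([np.ones(n - 15), Na[train], pH[train]])
b = np.linalg.lstsq(X_train, bmss[train], rcond=None)[0]

# Apply the saved prediction formula to the 15 held-out rows.
X_valid = np.column_stack([np.ones(15), Na[valid], pH[valid]])
pred = X_valid @ b
```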
Use the remaining 15 predicted values and the MSPR formula to calculate the MSPR. I obtained MSPR = 94416.
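The MSPR itself is the mean squared prediction error over the n* held-out observations, MSPR = sum over i of (Y_i - Yhat_i)^2 / n*. A one-line sketch (the three observations below are made-up numbers, not the data behind the MSPR = 94416 result above):

```python
import numpy as np

def mspr(y_valid, y_pred):
    """Mean squared prediction error over the validation set."""
    y_valid = np.asarray(y_valid, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_valid - y_pred) ** 2))

# Illustrative values only: errors of -10, 10, -30 give (100+100+900)/3.
print(mspr([100.0, 200.0, 300.0], [110.0, 190.0, 330.0]))  # -> 366.6666666666667
```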