STAT 360: Final – Takehome Portion
Fall 2016
Points: 75
Name(s): _________________________________ _________________________________

Consider once again the 2016 US Election dataset used previously in our class. The information provided in this dataset is at the county level. County data is not yet available for Alaska and hence Alaska is excluded. The following variables are contained in this dataset.

Response Variable:
Percent_Republican: Percentage of voters who voted for Donald Trump

Predictor Variables:
Percent_Democrat: Percentage of voters who voted for Hillary Clinton
Percent_Households_Married_Family: Percentage of households with married parents and children under the age of 18
Percent_BachelorsDegree: Percentage who have attained a bachelor’s degree (age 25 or older)
Percent_Born_in_US: Percentage who were born in US
Percent_in_Labor_Force: Percentage who are currently in labor force (16 or older)
Median_Household_Income: Median household income
Percent_No_Health_Insurance: Percentage without health insurance
Percent_Below_Poverty: Percent whose income in the past 12 months is below the poverty line
Percent_Female: Percent female
Percent_White: Percentage white
Percent_Hispanic: Percentage Hispanic

For this investigation, you are to determine which predictor variables, collectively, influenced how people voted in the 2016 US Election. When modeling Percent_Republican it would obviously be advantageous to know Percent_Democrat; however, in practice this predictor variable cannot be obtained until after the election and thus will not be used in our model building process.

Structure of Model:
Mean Function: $E(\%\,\text{Republican} \mid\ ?\ ?\ ?) = \beta_0 + \cdots$
Variance Function: $\mathrm{VAR}(\%\,\text{Republican} \mid\ ?\ ?\ ?) = \sigma^2$

Step 1: Check for Multicollinearity

In JMP, use Analyze > Multivariate Methods and obtain a correlation matrix that includes the response and all the potential predictors identified above.

1. Identify on the table below which predictor variables, if any, will be removed due to concerns of multicollinearity. (2 pts)

Variable | Role | Removed because of Multicollinearity? (Yes / No)
Percent_Democrat | Exclude |
Percent_Republican | Response |
Percent_Households_Married_Family | Predictor |
Percent_BachelorsDegree | Predictor |
Percent_Born_in_US | Predictor |
Percent_In_Labor_Force | Predictor |
Median_Household_Income | Predictor |
Percent_No_Health_Insurance | Predictor |
Percent_Below_Poverty | Predictor |
Percent_Female | Predictor |
Percent_White | Predictor |
Percent_Hispanic | Predictor |

Step 2: Use Model Building Methods to Find Best Model

In JMP, select Analyze > Fit Model and specify Percent_Republican as the response. Specify the predictor variables in the Construct Model Effects box. Under the Personality drop-down box in the Fit Model dialog window, select Stepwise.

2. Identify whether or not each predictor was included in the final model under the Forward and Backward approaches. Use Minimum BIC as the criterion here. (2 pts)

Was predictor in final model?

Variable | Role | Criteria: Minimum BIC, Direction: Forward | Criteria: Minimum BIC, Direction: Backward
Percent_Households_Married_Family | Predictor | |
Percent_BachelorsDegree | Predictor | |
Percent_Born_in_US | Predictor | |
Percent_In_Labor_Force | Predictor | |
Median_Household_Income | Predictor | |
Percent_No_Health_Insurance | Predictor | |
Percent_Below_Poverty | Predictor | |
Percent_Female | Predictor | |
Percent_White | Predictor | |
Percent_Hispanic | Predictor | |

3. Identify whether or not each predictor was included in the final model under the Forward and Backward approaches. Use Minimum AICc as the criterion here. (2 pts)

Was predictor in final model?
Variable | Role | Criteria: Minimum AICc, Direction: Forward | Criteria: Minimum AICc, Direction: Backward
Percent_Households_Married_Family | Predictor | |
Percent_BachelorsDegree | Predictor | |
Percent_Born_in_US | Predictor | |
Percent_In_Labor_Force | Predictor | |
Median_Household_Income | Predictor | |
Percent_No_Health_Insurance | Predictor | |
Percent_Below_Poverty | Predictor | |
Percent_Female | Predictor | |
Percent_White | Predictor | |
Percent_Hispanic | Predictor | |

4. Briefly discuss any differences between the four approaches (Minimum BIC / Minimum AICc and Forward / Backward) used to identify the best model. (2 pts)

5. Make a determination for your final model. Identify the predictors in your final model. Also, rank the importance of each predictor using either the t-Ratio values or the Log Worth values provided in JMP. (3 pts)

Variable | Role | Predictor in Final Model (Yes / No) | Rank Importance of this Predictor using t-Ratio or Log Worth [1 = Most Important]
Percent_Households_Married_Family | Predictor | |
Percent_BachelorsDegree | Predictor | |
Percent_Born_in_US | Predictor | |
Percent_In_Labor_Force | Predictor | |
Median_Household_Income | Predictor | |
Percent_No_Health_Insurance | Predictor | |
Percent_Below_Poverty | Predictor | |
Percent_Female | Predictor | |
Percent_White | Predictor | |
Percent_Hispanic | Predictor | |

Step 3: Checking Model Diagnostics

6. The Variance Inflation Factor (VIF) can be used to evaluate the remaining collinearity of predictors after the model has been fit. In the Parameter Estimates portion of the JMP output, right click and select VIF. Predictors with a VIF value greater than 10 warrant further consideration. Verify that our model does not appear to suffer from collinear predictors. Provide output as evidence. (2 pts)

Collinearity Rule: $\text{VIF} > 10$

7. From the final model output, select Save Columns > Cook’s D Influence. Recall that Cook’s D is a combined measure of leverage and outlyingness. Identify any observations that have a high Cook’s D value.
These observations may warrant additional consideration as they may have an adverse influence on our model. (4 pts)

Influence Rule:
Major Concern: $\text{Cook's Distance} > 1$
Minor Concern: $\text{Cook's Distance} > \dfrac{4}{n}$

a. Identify any counties that are at a level of major concern for Cook’s Distance.
b. About how many counties rise to the level of minor concern? Are these counties from any particular state or are they spread out across many states?
c. From a modeling perspective, why might it be a concern if the counties identified above tend to be from only a handful of states? Discuss.

Step 4: Checking Model Assumptions

8. Obtain the following scatterplots so that model assumptions can be verified. Add a flexible kernel smoother to each plot to identify trend. (3 pts)
a. Plot #1: Residuals (y-axis) against Predicted Value (x-axis). This plot is provided at the bottom of the Fit Model output.
b. Remaining Plots: Residuals (y-axis) against each predictor variable in your final model. These must be constructed “by hand” using Fit Y by X.

9. Review the scatterplots provided above. Provide general comments regarding each of the following model assumptions. (4 pts)
a. Model is of the correct form [random scatter above/below y=0 line]
b. Model does not suffer from non-constant variance [no megaphone patterns]
c. Observations are uncorrelated [no autocorrelation, i.e. snake or extreme bouncing back and forth]

10. Obtain a histogram of the residuals. Do the residuals appear to be normally distributed? Briefly explain. (2 pts)

A map will be used to identify counties for which our model is fitting poorly. The following process is used to create such a map using the Studentized Residuals.

Outlier Rule:
Model is over-predicting badly: $\text{Studentized Residual} < -2$
Model is under-predicting badly: $\text{Studentized Residual} > 2$

Step 1: Select Save Columns > Studentized Residuals.

Step 2: Create a new column called Outlier. Right click on the column heading and select Formula.
In the Formula box, under Conditional, select IF.

Step 3: The IF statement needs three conditions: predicting too low, predicting too high, and OK. To add a condition, click on the expression box, i.e. the expr box. The border of the box will turn red. Next, click on the [icon] twice to add an expression to the IF statement.

Step 4: Next, specify the conditions in the two expression boxes as shown here. [Screenshot of the formula editor] “-1” will be used to identify counties where Clinton did substantially better than expected; “1” will be used to identify counties where Trump did substantially better than expected. Click OK and the new variable will be created.

Step 5: Finally, plot the new variable. Select Graph > Graph Builder. Place the Name variable into the Map Shape box in the lower-left corner of the plot. Slide the newly created Outlier variable onto the map.

11. Delete my map and replace it with yours. For what regions of the US did your model tend to under-predict Trump’s ability to secure votes? For what regions of the US did your model tend to over-predict Trump’s ability to secure votes? Discuss. (4 pts)

Consider the following recent post on Facebook from a friend of mine. This friend actually provided a Google Sheet with the data and invited any of us to play around with it a bit if we wanted.

[Figure: Friend’s Post and Accompanying Graph]

Download the dataset provided on our course website. I have provided you the Google version of the dataset, which is what my friend posted. A snippet of the dataset is provided here.

Consider the following model structure.

Mean Function #1: $E(\text{Price} \mid \text{Mileage}, \text{Year}, \text{Type}) = \beta_0 + \beta_1 \cdot \text{Mileage} + \beta_2 \cdot \text{Year} + \beta_3 \cdot \text{Type}$
Variance Function: $\mathrm{VAR}(\text{Price} \mid \text{Mileage}, \text{Year}, \text{Type}) = \sigma^2$

12. Consider the statement “Note: I have not controlled for model year or trim level. I don’t think they matter.” (2 pts each)
a. Construct a scatterplot between Price (y-axis) and Year (x-axis).
Is the claim that year is not important supported when looking at this scatterplot?
b. Obtain an added variable plot for Year for the model above. Is the claim that year is not important supported when looking at this plot? Discuss.

13. Fit the following updated mean function.

Mean Function #2: $E(\text{Price} \mid \text{Mileage}, \text{Type}) = \beta_0 + \beta_1 \cdot \text{Mileage} + \beta_2 \cdot \text{Type}$
Variance Function: $\mathrm{VAR}(\text{Price} \mid \text{Mileage}, \text{Type}) = \sigma^2$

a. Why should a statistician believe that Mean Function #2 is better than Mean Function #1? Discuss. (2 pts)
b. Write out the mean function for Type = Sierra. (1 pt)
c. Write out the mean function for Type = GrandCaravan. (1 pt)

14. Consider yet another mean function for this data.

Mean Function #3: $E(\text{Price} \mid \text{Mileage}, \text{Type}) = \beta_0 + \beta_1 \cdot \text{Mileage} + \beta_2 \cdot \text{Type} + \beta_3 \cdot \text{Mileage} \cdot \text{Type}$
Variance Function: $\mathrm{VAR}(\text{Price} \mid \text{Mileage}, \text{Type}) = \sigma^2$

Why might a statistician believe that Mean Function #3 is better than Mean Function #2? Discuss. (3 pts)

15. Which mean function would be considered the best statistical model (#1, #2, or #3)? Interpret the $R^2$ and RMSE values from this model. (3 pts)

For this last portion of this assignment, an investigation of ACT performance for students in Minnesota will be done. For the 2015-2016 school year, the Department of Education required that all students in public schools take the ACT exam. My understanding is this requirement was for only one year and students will not be required to take this exam in the future. This requirement reduced the average ACT score, which is often used to evaluate quality of instruction across states, for the 2015-2016 school year because students who would normally not take the ACT were required to take it this past year. The following rankings show the dramatic drop in Minnesota from 2015 to 2016.
[Figure: MN Ranking 2015: 12th, but highest with larger % taking exam]
Source: http://www.act.org/content/dam/act/unsecured/documents/Condition-ofCollege-and-Career-Readiness-Report-2015-United-States.pdf

[Figure: MN Ranking Drops Dramatically in 2016, but highest with 100% taking exam]
Source: http://www.act.org/content/dam/act/unsecured/documents/CCCR_National_2016.pdf

Goal: Develop a model that allows you to compute an “adjusted” score for each school district for Year = 2016. Because this requirement was a one-time thing, the MN Department of Education may consider using your “adjusted” score in place of the actual score. If an adjusted score cannot be reliably determined, they will leave the actual score in place and use an * to identify that a change in policy occurred for that one year.

Process:
1. The ACT score for 2016 cannot be used in modeling as it is known to be biased downward. Thus, data from 2015 will be used to build the model.
2. After a model is obtained from the 2015 data, the formula obtained (from the 2015 data) will be used to make predictions for 2016. The values for all predictor variables have been updated and represent the 2015-2016 school year.
3. Lastly, the predictions obtained for 2016 will allow us to identify which school districts were most adversely affected by this one-time requirement.

Download the ACT data for 2015 from the course website. The following table identifies the possible predictors and the response variable = Average Composite ACT 2015.
Variable Name | Description | Role
MMR | MN Dept of Education global measurement used for accountability, recognition, and support | Predictor
PerStudent_Funding | Per Student Funding in school | Predictor
Math_MCA | Math MCA score [MCA = MN's version of a standardized test; Grade 11] | Predictor
Reading_MCA | Reading MCA score [MCA = MN's version of a standardized test; used previous year data as test given in Grade 10] | Predictor
Science_MCA | Science MCA score [MCA = MN's version of a standardized test] | Predictor
Dropout | Percent of Students who dropped out in this school | Predictor
Graduate | Percent of Students who graduated; used 4-year graduation rate | Predictor
TotalStudents | Total Number of Students in Grade 11 | Predictor
Percent_Minority | Percent of Students in Grade 11 that are minority | Predictor
Percent_Free | Percent of Students in Grade 11 that qualify for free lunch | Predictor
Percent_Reduced | Percent of Students in Grade 11 that qualify for reduced lunch | Predictor
Percent_FR | Percent of Students in Grade 11 that qualify for either free or reduced lunch | Predictor
Avg_Teacher_Salary | Average Teacher Salary | Predictor
Average_TeacherYears_Experience | Average Teacher Years of Experience | Predictor
Average_Teacher_Age | Average Teacher Age | Predictor
Percent_Teachers_AdvancedDegree | Percent of Teachers that have an advanced degree (beyond Bachelor's degree) | Predictor
Avg Comp 2011 | Average Composite ACT Score for 2011 |
Avg Comp 2012 | Average Composite ACT Score for 2012 |
Avg Comp 2013 | Average Composite ACT Score for 2013 |
Avg Comp 2014 | Average Composite ACT Score for 2014 |
Avg Comp 2015 | Average Composite ACT Score for 2015 | Response

16. Build a model to predict Average Composite ACT 2015. Provide output for your final model. You should provide a discussion of the following regarding your final model. (6 pts)
a. Multicollinearity
b. Process for determining best model
c. Evaluation of model diagnostics and model assumptions, e.g. Cook’s D and checking assumptions via residual plots

17. Obtain the predicted values from your model; I will call these predictions Predicted2015_Regression. Obtain a plot of the Actual values (y-axis) against Predicted2015_Regression (x-axis). Does your model appear to be doing a good job of predicting? Discuss. (2 pts)

18. Discuss the meaning of the $R^2$ value and RMSE value for your model. (2 pts)

19. Your boss, who does not have a background in modeling, suggests that predictions for 2015 could more easily be obtained by simply averaging the ACT scores from the four previous years. These predictions will be called Predicted2015_Average.

$\text{Predicted2015\_Average} = \dfrac{\text{Avg Comp 2011} + \text{Avg Comp 2012} + \text{Avg Comp 2013} + \text{Avg Comp 2014}}{4}$

Compute the $R^2$ value when using Predicted2015_Average. Did this simple method do better than your model? Discuss. (3 pts)

20. The simple approach above does not take into consideration any information about the actual students who took the ACT in this year. Consider the following hybrid approach, which combines the simple approach with the modeling approach. The hybrid approach used here places equal weight on both sets of predicted values. The ½ values in this equation can be adjusted as one sees fit. Weights of 1 and 0, respectively, would use only the regression predictions, and weights of 0 and 1, respectively, would use only the moving average approach.

$\text{Predicted2015\_Hybrid} = \dfrac{1}{2} \cdot \text{Predicted2015\_Regression} + \dfrac{1}{2} \cdot \text{Predicted2015\_Average}$

Obtain the $R^2$ value when using Predicted2015_Hybrid with equal weights. Does this model outperform the other two individual models? Discuss. (3 pts)

Technical Note: Careful consideration is required when computing this $R^2$ and comparing it against the other $R^2$ values because some schools have missing information; thus, predictions are not available for all schools. This is a problem because $R^2$ calculations use a SUM, and a sum cannot be compared fairly when a varying number of observations are used to compute its value.
I computed my $R^2$ values here by only using observations for which a residual was present for the hybrid approach, which implies a residual is present for the regression and average approaches.

Practical Note: If you ignore the Technical Note above I will not mark this problem wrong. However, if you are the type of person who likes to get things exactly correct, then I’d encourage you to take the note above into consideration so that $R^2$ values can be fairly compared.

Download the ACT Data for 2016 so that “adjusted” scores can be computed for 2016. Use the Predicted2015_Regression equation and apply it to the 2016 data to obtain Predicted2016_Regression values for each school. You can copy and paste the big ugly formula; see the note below.

Note: When fitting a model, you can select Save Columns > Prediction Formula. The actual formula is provided in the data. You can then copy this formula from the 2015 data and paste it into a new column in the 2016 data.

Next, obtain the moving average prediction for the 2016 data akin to what was done above.

$\text{Predicted2016\_Average} = \dfrac{\text{Avg Comp 2012} + \text{Avg Comp 2013} + \text{Avg Comp 2014} + \text{Avg Comp 2015}}{4}$

Next, create the Hybrid predictions akin to what was done above. These will be used as the “adjusted” scores for 2016.

$\text{Predicted2016\_Hybrid} = \dfrac{1}{2} \cdot \text{Predicted2016\_Regression} + \dfrac{1}{2} \cdot \text{Predicted2016\_Average}$

Finally, compute the following quantity for each school district.

$\text{Change Score} = \text{Avg Comp 2016} - \text{Predicted2016\_Hybrid}$

21. List a few school districts that were most hurt by the MN Department of Education requirement that everybody must take the ACT. That is, what schools tended to have large negative Change Scores? (4 pts)

22. Suppose a Change Score were positive. This means either 1) the requirement that everybody take the test actually made the scores increase, or 2) the modeling approach under-predicted badly for this school. Were there any schools where the Change Score was positive? If so, list a few of these schools. (3 pts)

23. Your boss wants your professional opinion on whether or not the MN Department of Education should use an “adjusted” score for 2016. What is your opinion? Discuss. (4 pts)

24. According to the ACT website, Minnesota’s overall rank was #24 for 2016. Compute the average “adjusted” score across all schools. What would Minnesota’s rank be if your “adjusted” scores were used instead? (2 pts)
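
The Technical Note under question 20 says that R2 values are only fairly comparable when every approach is scored on the same set of schools. A minimal sketch of that bookkeeping is given below in Python rather than JMP. It assumes R2 is computed as 1 - SSE/SST (the note only says the calculation uses a sum), and the function names and the small made-up data are hypothetical illustrations, not values from the course dataset.

```python
def r_squared(actual, predicted):
    """R^2 computed as 1 - SSE/SST (an assumed definition, see lead-in)."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - sse / sst


def compare_on_common_rows(actual, pred_regression, prior_years, weight=0.5):
    """Score regression, average, and hybrid predictions on the same schools.

    prior_years holds one tuple of the four previous yearly averages per
    school. None marks a missing value; a school is dropped from all three
    comparisons if any piece needed for its hybrid prediction is missing.
    """
    rows = []
    for a, reg, years in zip(actual, pred_regression, prior_years):
        if a is None or reg is None or any(y is None for y in years):
            continue  # no hybrid residual, so skip this school everywhere
        avg = sum(years) / len(years)
        hybrid = weight * reg + (1 - weight) * avg
        rows.append((a, reg, avg, hybrid))

    actual_common = [r[0] for r in rows]
    return {
        "n": len(rows),
        "regression": r_squared(actual_common, [r[1] for r in rows]),
        "average": r_squared(actual_common, [r[2] for r in rows]),
        "hybrid": r_squared(actual_common, [r[3] for r in rows]),
    }


# Hypothetical toy data: four schools, the last one lacks a regression prediction.
actual_2015 = [20, 22, 24, 21]
predicted_regression = [21, 22, 23, None]
previous_four_years = [
    (19, 19, 19, 19),
    (22, 22, 22, 22),
    (25, 25, 25, 25),
    (21, 21, 21, 21),
]

result = compare_on_common_rows(actual_2015, predicted_regression, previous_four_years)
print(result)
```

In this toy example only three schools enter each calculation, so the three R2 values share the same SST and can be compared directly. Changing `weight` away from 0.5 corresponds to adjusting the ½ values in the hybrid equation.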