STAT 360: Final – Take-Home Portion
Fall 2016
Points: 75
Name(s):_________________________________
_________________________________
Consider once again the 2016 US Election dataset used previously in our class. The information
provided in this dataset is at the county level. County data is not yet available for Alaska and
hence Alaska is excluded. The following variables are contained in this dataset.
Response Variables:
• Percent_Republican: Percentage of voters who voted for Donald Trump
Predictor Variables:
• Percent_Democrat: Percentage of voters who voted for Hillary Clinton
• Percent_Households_Married_Family: Percentage of households with married parents and children under the age of 18
• Percent_BachelorsDegree: Percentage who have attained a bachelor's degree – age 25 or older
• Percent_Born_in_US: Percentage who were born in the US
• Percent_In_Labor_Force: Percentage who are currently in the labor force – age 16 or older
• Median_Household_Income: Median household income
• Percent_No_Health_Insurance: Percentage without health insurance
• Percent_Below_Poverty: Percentage whose income in the past 12 months is below the poverty line
• Percent_Female: Percentage female
• Percent_White: Percentage white
• Percent_Hispanic: Percentage Hispanic
For this investigation, you are to determine which predictor variables, collectively, influenced how people voted in the 2016 US Election. When modeling Percent_Republican it would obviously be advantageous to know Percent_Democrat; however, in practice this predictor variable cannot be obtained until after the election and thus will not be used in our model building process.
Structure of Model:
Mean Function: E(% Republican | ? ? ?) = β0 + ⋯
Variance Function: VAR(% Republican | ? ? ?) = σ²
Step 1: Check for Multicollinearity:
In JMP, use Analyze > Multivariate Methods to obtain a correlation matrix that includes the response and all the potential predictors identified above.
1. Identify in the table below which predictor variables, if any, will be removed due to concerns of multicollinearity. (2 pts) (An optional non-JMP sketch of this check follows the table.)
Variable | Role | Removed because of Multicollinearity? (Yes / No)
Percent_Democrat | Exclude |
Percent_Republican | Response |
Percent_Households_Married_Family | Predictor |
Percent_BachelorsDegree | Predictor |
Percent_Born_in_US | Predictor |
Percent_In_Labor_Force | Predictor |
Median_Household_Income | Predictor |
Percent_No_Health_Insurance | Predictor |
Percent_Below_Poverty | Predictor |
Percent_Female | Predictor |
Percent_White | Predictor |
Percent_Hispanic | Predictor |
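Optional note (not graded): if you want to double-check this step outside JMP, here is a minimal Python sketch. It assumes the data table has been exported to a CSV with the variable names above; the file name and the 0.8 cutoff are my own illustrative choices.

    import pandas as pd

    # Hypothetical file name: export the JMP data table to CSV first.
    election = pd.read_csv("election_2016_counties.csv")

    cols = ["Percent_Republican",
            "Percent_Households_Married_Family", "Percent_BachelorsDegree",
            "Percent_Born_in_US", "Percent_In_Labor_Force",
            "Median_Household_Income", "Percent_No_Health_Insurance",
            "Percent_Below_Poverty", "Percent_Female",
            "Percent_White", "Percent_Hispanic"]

    # Pairwise correlations among the response and all candidate predictors.
    corr = election[cols].corr()
    print(corr.round(2))

    # Flag predictor pairs with large absolute correlation (the 0.8 cutoff is a judgment call).
    predictors = cols[1:]
    for i, a in enumerate(predictors):
        for b in predictors[i + 1:]:
            if abs(corr.loc[a, b]) > 0.8:
                print(f"{a} vs {b}: r = {corr.loc[a, b]:.2f}")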
Step 2: Use Modeling Building Methods to Find Best Model
In JMP, select Analyze > Fit Model and specify Percent_Republican as the response. Specify the predictor variables in the Construct Model Effects box. Under the Personality drop-down box in the Fit Model dialog window, select Stepwise.
2. Identify whether or not each predictor was included in the final model under the Forward and Backward approaches. Use Minimum BIC as the criterion here. (2 pts)
Was predictor in final model?
Variable | Role | Criteria: Minimum BIC, Direction: Forward | Criteria: Minimum BIC, Direction: Backward
Percent_Households_Married_Family | Predictor | |
Percent_BachelorsDegree | Predictor | |
Percent_Born_in_US | Predictor | |
Percent_In_Labor_Force | Predictor | |
Median_Household_Income | Predictor | |
Percent_No_Health_Insurance | Predictor | |
Percent_Below_Poverty | Predictor | |
Percent_Female | Predictor | |
Percent_White | Predictor | |
Percent_Hispanic | Predictor | |
3. Identify whether or not each predictor was included in the final model under the Forward and Backward approaches. Use Minimum AICc as the criterion here. (2 pts)
Was predictor in final model?
Variable | Role | Criteria: Minimum AICc, Direction: Forward | Criteria: Minimum AICc, Direction: Backward
Percent_Households_Married_Family | Predictor | |
Percent_BachelorsDegree | Predictor | |
Percent_Born_in_US | Predictor | |
Percent_In_Labor_Force | Predictor | |
Median_Household_Income | Predictor | |
Percent_No_Health_Insurance | Predictor | |
Percent_Below_Poverty | Predictor | |
Percent_Female | Predictor | |
Percent_White | Predictor | |
Percent_Hispanic | Predictor | |
4. Briefly discuss any differences between the four approaches (Minimum BIC / Minimum AICc
and Forward / Backward) used to identify the best model. (2 pts)
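Optional note (not graded): JMP's Stepwise personality is what you should use here, but the rough Python sketch below shows the idea behind forward selection with an information criterion; backward elimination works the same way in reverse (start with the full model and drop the term whose removal most improves the criterion). The file name, candidate list, and small-sample AICc formula are illustrative assumptions, not JMP's exact implementation.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    def fit_score(data, response, predictors, criterion):
        """Fit OLS of response on the given predictors; return BIC or a small-sample AICc."""
        X = sm.add_constant(data[predictors]) if predictors else np.ones((len(data), 1))
        res = sm.OLS(data[response], X).fit()
        if criterion == "bic":
            return res.bic
        k = res.df_model + 1                       # fitted coefficients, including the intercept
        return res.aic + 2 * k * (k + 1) / (res.nobs - k - 1)

    def forward_select(data, response, candidates, criterion="bic"):
        """Greedy forward selection: keep adding the predictor that lowers the criterion the most."""
        data = data.dropna(subset=[response] + list(candidates))
        selected, remaining = [], list(candidates)
        best = fit_score(data, response, selected, criterion)
        while remaining:
            scores = {p: fit_score(data, response, selected + [p], criterion) for p in remaining}
            winner = min(scores, key=scores.get)
            if scores[winner] >= best:             # no further improvement -> stop
                break
            best = scores[winner]
            selected.append(winner)
            remaining.remove(winner)
        return selected

    # election = pd.read_csv("election_2016_counties.csv")           # hypothetical file name
    # candidates = [...]                                             # the ten predictors above
    # print(forward_select(election, "Percent_Republican", candidates, criterion="bic"))
    # print(forward_select(election, "Percent_Republican", candidates, criterion="aicc"))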
5. Make a determination for your final model. Identify the predictors in your final model. Also, rank the importance of each predictor using either the t-Ratio values or the Log Worth values provided in JMP. (3 pts)
Variable | Role | Predictor in Final Model (Yes / No) | Rank Importance of this Predictor using t-Ratio or Log Worth [1 = Most Important]
Percent_Households_Married_Family | Predictor | |
Percent_BachelorsDegree | Predictor | |
Percent_Born_in_US | Predictor | |
Percent_In_Labor_Force | Predictor | |
Median_Household_Income | Predictor | |
Percent_No_Health_Insurance | Predictor | |
Percent_Below_Poverty | Predictor | |
Percent_Female | Predictor | |
Percent_White | Predictor | |
Percent_Hispanic | Predictor | |
Step 3: Checking Model Diagnostics
6. The Variance Inflation Factor (VIF) can be used to evaluate the remaining collinearity among predictors after the model has been fit. In the Parameter Estimates portion of the JMP output, right-click and select VIF. Predictors with a VIF value greater than 10 warrant further consideration. Verify that your model does not appear to suffer from collinear predictors. Provide output as evidence. (2 pts)
Collinearity Rule: VIF > 10
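Optional note (not graded): the same VIF check can be reproduced outside JMP with a sketch along these lines; the file name and the two predictors shown are placeholders for your own final model.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    election = pd.read_csv("election_2016_counties.csv")             # hypothetical file name
    final_predictors = ["Percent_BachelorsDegree", "Percent_White"]  # placeholder: use your final model's predictors

    X = sm.add_constant(election[final_predictors].dropna())
    for i, name in enumerate(X.columns):
        if name != "const":                                          # skip the intercept column
            print(name, round(variance_inflation_factor(X.values, i), 2))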
7. From the final model output, select Save Columns > Cook's D Influence. Recall that Cook's D is a combined measure of leverage and outlyingness. Identify any observations that have a high Cook's D value. These observations may warrant additional consideration, as they may have an adverse influence on our model. (4 pts) (An optional non-JMP sketch of this check follows part c below.)
Influence Rule:
Major Concern: Cook's Distance > 1
Minor Concern: Cook's Distance > 4/n
a. Identify any counties that are at a level of major concern for Cook’s Distance.
b. About how many counties rise to the level of minor concern? Are these counties from any particular state, or are they spread out across many states?
c. From a modeling perspective, why might it be a concern if counties identified above
tend to be from only a handful of states? Discuss.
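Optional note (not graded): the Cook's D screening above could be reproduced outside JMP roughly as follows; the file name and the placeholder model are assumptions, and Name is the county label column used later for the map.

    import pandas as pd
    import statsmodels.formula.api as smf

    election = pd.read_csv("election_2016_counties.csv")            # hypothetical file name

    # Placeholder final model -- substitute the predictors you kept.
    fit = smf.ols("Percent_Republican ~ Percent_BachelorsDegree + Percent_White",
                  data=election).fit()

    # Cook's D, one value per county actually used in the fit.
    cooks = pd.Series(fit.get_influence().cooks_distance[0], index=fit.fittedvalues.index)
    n = int(fit.nobs)

    print("Major concern (D > 1):  ", (cooks > 1).sum(), "counties")
    print("Minor concern (D > 4/n):", (cooks > 4 / n).sum(), "counties")
    # The ten most influential counties, labeled by the Name column.
    print(election.loc[cooks.sort_values(ascending=False).head(10).index, "Name"])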
Step 4: Checking Model Assumptions
8. Obtain the following scatterplots so that the model assumptions can be verified. Add a flexible kernel smoother to each plot to identify any trend. (3 pts) (An optional non-JMP sketch of plot #1 follows part b.)
a. Plot #1: Residuals (y-axis) against Predicted Values (x-axis); this plot is provided at the bottom of the Fit Model output.
b. Remaining plots: Residuals (y-axis) against each predictor variable in your final model; these must be constructed “by hand” using Fit Y by X.
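Optional note (not graded): plot #1 can be reproduced outside JMP along these lines, with a lowess curve standing in for JMP's flexible kernel smoother; the file name and the placeholder model are assumptions.

    import matplotlib.pyplot as plt
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    election = pd.read_csv("election_2016_counties.csv")            # hypothetical file name
    fit = smf.ols("Percent_Republican ~ Percent_BachelorsDegree + Percent_White",
                  data=election).fit()                               # placeholder final model

    # Residuals vs. predicted values, with a lowess curve in place of JMP's kernel smoother.
    smooth = sm.nonparametric.lowess(fit.resid, fit.fittedvalues, frac=0.5)
    plt.scatter(fit.fittedvalues, fit.resid, s=8, alpha=0.5)
    plt.plot(smooth[:, 0], smooth[:, 1], color="red")
    plt.axhline(0, linestyle="--", color="gray")
    plt.xlabel("Predicted Percent_Republican")
    plt.ylabel("Residual")
    plt.show()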
9. Review the scatterplots provided above. Provide general comments regarding each of the following model assumptions. (4 pts)
a. The model is of the correct form [random scatter above/below the y = 0 line]
b. The model does not suffer from non-constant variance [no megaphone patterns]
c. The observations are uncorrelated [no autocorrelation, i.e., no snaking or extreme bouncing back and forth]
10. Obtain a histogram of the residuals. Do the residuals appear to be normally distributed? Briefly explain. (2 pts)
A map will be used to identify the counties for which our model fits poorly. The following process is used to create such a map using the Studentized Residuals. (An optional non-JMP sketch of the flagging step follows Step 5.)
Outlier Rule:
Model is over-predicting badly: Studentized Residual < -2
Model is under-predicting badly: Studentized Residual > 2
Step 1: Select Save Columns > Studentized Residuals.
Step 2: Create a new column called Outlier. Right-click on the column heading and select Formula. In the Formula box, under Conditional, select IF.
Step 3: The IF statement will need three conditions: predicting too low, predicting too high, and OK. To add a condition, click on the expression box (i.e., the expr box); the border of the box will turn red. Next, click on the insert button shown in the screenshot twice to add the expressions to the IF statement.
Step 4: Specify the conditions in the two expression boxes as shown here. “-1” will be used to identify counties where Clinton did substantially better than expected, and “1” will be used to identify counties where Trump did substantially better than expected. Click OK and the new variable will be created.
Step 5: Finally, plot the new variable. Select Graph > Graph Builder. Place the Name variable into the Map Shape box in the lower-left corner of the plot. Slide the newly created Outlier variable onto the map.
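Optional note (not graded): the IF formula in Steps 2-4 amounts to a three-way recode of the studentized residuals. A rough non-JMP sketch of that recode (not of the map itself) is given below; the file name and placeholder model are assumptions.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    election = pd.read_csv("election_2016_counties.csv")            # hypothetical file name
    fit = smf.ols("Percent_Republican ~ Percent_BachelorsDegree + Percent_White",
                  data=election).fit()                               # placeholder final model

    # Externally studentized residuals, aligned with the rows used in the fit.
    stud = pd.Series(fit.get_influence().resid_studentized_external,
                     index=fit.fittedvalues.index)

    # Same three-way split as the JMP IF formula:
    #   -1 : over-predicting badly (Clinton did substantially better than expected)
    #    1 : under-predicting badly (Trump did substantially better than expected)
    #    0 : OK
    election["Outlier"] = pd.Series(np.select([stud < -2, stud > 2], [-1, 1], default=0),
                                    index=stud.index)
    print(election["Outlier"].value_counts(dropna=False))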
11. Delete my map and replace it with yours. For what regions of the US did your model tend to
under-predict Trump’s ability to secure votes? For what regions of the US did your model
tend to over-predict Trump’s ability to secure votes? Discuss. (4 pts)
Consider the following recent post on Facebook from a friend of mine. This friend actually provided a Google Sheet with the data so any of us could play around with it a bit if we wanted.
[Friend’s Post and Accompanying Graph shown here]
Download the dataset provided on our course website. I have provided the Google version of the dataset, which is what my friend posted. A snippet of the dataset is provided here.
Consider the following model structure.
Mean Function #1: E(Price | Mileage, Year, Type) = β0 + β1*Mileage + β2*Year + β3*Type
Variance Function: VAR(Price | Mileage, Year, Type) = σ²
12. Consider the statement “Note: I have not controlled for model year or trim level. I don’t think they matter.” (2 pts each)
a. Construct a scatterplot of Price (y-axis) against Year (x-axis). Is the claim that year is not important supported when looking at this scatterplot?
b. Obtain an added-variable plot for Year for the model above. Is the claim that year is not important supported when looking at this plot? Discuss. (An optional non-JMP sketch of this plot follows.)
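Optional note (not graded): an added-variable (partial regression) plot for Year can be built outside JMP by plotting the residuals of Price regressed on the other terms against the residuals of Year regressed on those same terms. A minimal sketch, with a made-up file name, is given below; a flat, patternless cloud in this plot would support the claim that Year adds little once Mileage and Type are accounted for.

    import matplotlib.pyplot as plt
    import pandas as pd
    import statsmodels.formula.api as smf

    cars = pd.read_csv("used_car_prices.csv")    # hypothetical file name for my friend's data

    # Added-variable plot for Year:
    # residuals of Price ~ Mileage + Type plotted against residuals of Year ~ Mileage + Type.
    price_resid = smf.ols("Price ~ Mileage + C(Type)", data=cars).fit().resid
    year_resid = smf.ols("Year ~ Mileage + C(Type)", data=cars).fit().resid

    plt.scatter(year_resid, price_resid, s=10)
    plt.xlabel("Year, adjusted for Mileage and Type (residuals)")
    plt.ylabel("Price, adjusted for Mileage and Type (residuals)")
    plt.show()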
13. Fit the following updated mean function.
Mean Function #2: E(Price | Mileage, Type) = β0 + β1*Mileage + β2*Type
Variance Function: VAR(Price | Mileage, Type) = σ²
a. Why should a statistician believe that Mean Function #2 is better than Mean Function
#1? Discuss. (2 pts)
b. Write out the mean function for Type = Sierra. (1 pt)
c. Write out the mean function for Type = GrandCaravan. (1 pt)
14. Consider yet another mean function for this data.
Mean Function #3: E(Price | Mileage, Type) = β0 + β1*Mileage + β2*Type + β3*Mileage*Type
Variance Function: VAR(Price | Mileage, Type) = σ²
Why might a statistician believe that Mean Function #3 is better than Mean Function #2?
Discuss. (3 pts)
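Optional note (not graded): here is a minimal sketch of fitting Mean Functions #2 and #3 outside JMP; the file name is made up, and the partial F-test shown is one reasonable way (not the only way) to compare the two.

    import pandas as pd
    import statsmodels.formula.api as smf

    cars = pd.read_csv("used_car_prices.csv")        # hypothetical file name

    # Mean Function #2: one common Mileage slope, separate intercepts by Type.
    fit2 = smf.ols("Price ~ Mileage + C(Type)", data=cars).fit()

    # Mean Function #3: adds the Mileage*Type interaction, so each Type gets its own slope.
    fit3 = smf.ols("Price ~ Mileage * C(Type)", data=cars).fit()

    print(fit3.summary())
    # One way to compare #2 and #3: a partial F-test of the interaction terms.
    print(fit3.compare_f_test(fit2))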
15. Which mean function would be considered the best statistical model (#1, #2, or #3)? Interpret the R2 and RMSE values from this model. (3 pts)
For this last portion of the assignment, we will investigate ACT performance for students in Minnesota. For the 2015-2016 school year, the Department of Education required all students in public schools to take the ACT exam. My understanding is that this requirement was for only one year and students will not be required to take this exam in the future.
This requirement caused the average ACT score, which is often used to evaluate quality of instruction across states, to be reduced for the 2015-2016 school year, because students who would normally not take the ACT were required to take it this past year. The following rankings show the dramatic drop in Minnesota's position from 2015 to 2016.
MN Ranking 2015 – 12th [but highest among states with a larger % taking the exam]
MN Ranking drops dramatically in 2016 [but highest among states with 100% taking the exam]
Source: http://www.act.org/content/dam/act/unsecured/documents/Condition-ofCollege-and-Career-Readiness-Report-2015-United-States.pdf
Source: http://www.act.org/content/dam/act/unsecured/documents/CCCR_National_2016.pdf
Goal: Develop a model that allows you to compute an “adjusted” score for each school district for Year = 2016. Because this requirement was a one-time event, the MN Department of Education may consider using your “adjusted” score in place of the actual score. If an adjusted score cannot be reliably determined, they will leave the actual score in place and use an * to identify that a change in policy occurred for that one year.
Process:
1. The ACT score for 2016 cannot be used in modeling as it is known to be biased
downward. Thus, data from 2015 will be used to build the model.
2. After a model is obtained from the 2015 data, the formula obtained (from the 2015
data) will be used to make predictions for 2016. The values for all predictor variables
have been updated and represent the 2015-2016 school year.
3. Lastly, the predictions obtained for 2016 will allow us to identify which school districts
were most adversely affected by this one-time requirement.
Download the ACT data for 2015 from the course website. The following table identifies the possible predictors and the response variable (Average Composite ACT 2015).
Variable Name | Description | Role
MMR | MN Dept of Education global measurement used for accountability, recognition, and support | Predictor
PerStudent_Funding | Per-student funding in school | Predictor
Math_MCA | Math MCA score [MCA = MN's version of a standardized test; Grade 11] | Predictor
Reading_MCA | Reading MCA score [MCA = MN's version of a standardized test; used previous year's data as the test is given in Grade 10] | Predictor
Science_MCA | Science MCA score [MCA = MN's version of a standardized test] | Predictor
Dropout | Percent of students who dropped out in this school | Predictor
Graduate | Percent of students who graduated; used 4-year graduation rate | Predictor
TotalStudents | Total number of students in Grade 11 | Predictor
Percent_Minority | Percent of students in Grade 11 that are minority | Predictor
Percent_Free | Percent of students in Grade 11 that qualify for free lunch | Predictor
Percent_Reduced | Percent of students in Grade 11 that qualify for reduced lunch | Predictor
Percent_FR | Percent of students in Grade 11 that qualify for either free or reduced lunch | Predictor
Avg_Teacher_Salary | Average teacher salary | Predictor
Average_TeacherYears_Experience | Average teacher years of experience | Predictor
Average_Teacher_Age | Average teacher age | Predictor
Percent_Teachers_AdvancedDegree | Percent of teachers that have an advanced degree (beyond a bachelor's degree) | Predictor
Avg Comp 2011 | Average Composite ACT Score for 2011 |
Avg Comp 2012 | Average Composite ACT Score for 2012 |
Avg Comp 2013 | Average Composite ACT Score for 2013 |
Avg Comp 2014 | Average Composite ACT Score for 2014 |
Avg Comp 2015 | Average Composite ACT Score for 2015 | Response
16. Build a model to predict Average Composite ACT 2015. Provide output for your final model. You should provide a discussion of the following regarding your final model. (6 pts)
a. Multicollinearity
b. Process for determining the best model
c. Evaluation of model diagnostics and model assumptions, e.g., Cook's D and checking assumptions via residual plots
17. Obtain the predicted values from your model; I will call these predictions Predicted2015_Regression.
Obtain a plot of the actual values (y-axis) against Predicted2015_Regression (x-axis). Does your model appear to be doing a good job of predicting? Discuss. (2 pts)
18. Discuss the meaning of the R2 value and RMSE value for your model. (2 pts)
19. Your boss, who does not have a background in modeling, suggests that predictions for 2015 could more easily be obtained by simply averaging the ACT scores from the four previous years. These predictions will be called Predicted2015_Average.
Predicted2015_Average = (Avg Comp 2011 + Avg Comp 2012 + Avg Comp 2013 + Avg Comp 2014) / 4
Compute the R2 value when using Predicted2015_Average. Did this simple method do better
than your model? Discuss. (3 pts)
20. The simple approach above does not take into consideration any information about the
actual students who took the ACT in this year. Consider the following hybrid approach
which will combine the simple approach with the modeling approach. The hybrid approach
used here has placed equal weight on both sets of predicted values. The ½ values in this
equation can be adjusted as one sees fit. A weight of 1 and 0, respectively, would just use
the regression predictions and a weight of 0 and 1, respectively, would just use the moving
average approach.
Predicted2015_Hybrid = (1/2)*Predicted2015_Regression + (1/2)*Predicted2015_Average
Obtain the R2 value when using the Predicted2015_Hybrid with equal weights. Does this
model outperform the other two individual models? Discuss. (3 pts)
Technical Note: Careful consideration is required when computing this R2 and comparing it
against the other R2 values because some schools have missing information; thus,
predictions are not available for all schools. This is a problem because R2 calculations use a
SUM and a sum cannot be compared fairly when a varying number of observations are used
to compute its value. I computed my R2 values here by only using observations for which a
residual was present for the hybrid approach – which implies a residual is present for the
regression and average approaches.
Practical Note: If you ignore the Technical Note above I will not mark this problem wrong.
However, if you are the type of person who likes to get things exactly correct, then I’d
encourage you to take the note above into consideration so that R2 values can be fairly
compared.
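Optional note (not graded): one way to honor the Technical Note outside JMP is to compute all three R2 values over the same set of schools, namely those with a hybrid prediction available. A rough sketch, assuming the 2015 table was exported with the Predicted2015_Regression column already saved from JMP (Save Columns > Prediction Formula), is given below; the file name is made up.

    import pandas as pd

    act = pd.read_csv("act_2015.csv")    # hypothetical file name, with Predicted2015_Regression included

    prior_years = ["Avg Comp 2011", "Avg Comp 2012", "Avg Comp 2013", "Avg Comp 2014"]
    act["Predicted2015_Average"] = act[prior_years].mean(axis=1, skipna=False)   # requires all four years
    act["Predicted2015_Hybrid"] = (0.5 * act["Predicted2015_Regression"]
                                   + 0.5 * act["Predicted2015_Average"])

    # Compare all three predictions on the SAME rows: those with a hybrid residual available.
    ok = act["Predicted2015_Hybrid"].notna() & act["Avg Comp 2015"].notna()
    y = act.loc[ok, "Avg Comp 2015"]

    def r_squared(actual, predicted):
        sse = ((actual - predicted) ** 2).sum()
        sst = ((actual - actual.mean()) ** 2).sum()
        return 1 - sse / sst

    for col in ["Predicted2015_Regression", "Predicted2015_Average", "Predicted2015_Hybrid"]:
        print(col, round(r_squared(y, act.loc[ok, col]), 3))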
Download the ACT Data for 2016 so that “adjusted” scores can be computed for 2016. Use the
Predicted2015_Regression equation and apply it to the 2016 data to obtain
Predicted2016_Regression values for each school. You can copy and paste the big ugly formula
– see note below.
Note: When fitting a model, you can select Save Columns > Prediction Formula. The
actual formula is provided in the data. You can then copy this formula from the 2015
data and paste it into a new column in the 2016 data.
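Optional note (not graded): outside JMP, the same copy-the-formula step amounts to fitting on the 2015 data and calling predict on the 2016 data. A minimal sketch follows; the file names and the two predictors shown are placeholders for your own final model.

    import pandas as pd
    import statsmodels.formula.api as smf

    act2015 = pd.read_csv("act_2015.csv")    # hypothetical file names
    act2016 = pd.read_csv("act_2016.csv")

    # Placeholder model -- substitute the predictors from your final model in question 16.
    fit = smf.ols("Q('Avg Comp 2015') ~ Math_MCA + Percent_FR", data=act2015).fit()

    # The fitted 2015 equation applied to the updated (2015-2016 school year) predictor values.
    act2016["Predicted2016_Regression"] = fit.predict(act2016)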
Next, obtain the moving average prediction for the 2016 data akin to what was done above.
Predicted2016_Average = (Avg Comp 2012 + Avg Comp 2013 + Avg Comp 2014 + Avg Comp 2015) / 4
Next, create the Hybrid predictions akin to what was done above. These will be used as the “adjusted” scores for 2016.
Predicted2016_Hybrid = (1/2)*Predicted2016_Regression + (1/2)*Predicted2016_Average
Finally, compute the following quantity for each school district.
Change Score = (Avg Comp 2016 − Predicted2016_Hybrid)
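Optional note (not graded): the 2016 moving average, hybrid prediction, and Change Score can be computed in a few lines; the file name, the assumption that Predicted2016_Regression is already in the table, and the "District" label column are all placeholders.

    import pandas as pd

    # Hypothetical file name; assumes Predicted2016_Regression has already been added as above.
    act2016 = pd.read_csv("act_2016_with_predictions.csv")

    later_years = ["Avg Comp 2012", "Avg Comp 2013", "Avg Comp 2014", "Avg Comp 2015"]
    act2016["Predicted2016_Average"] = act2016[later_years].mean(axis=1, skipna=False)
    act2016["Predicted2016_Hybrid"] = (0.5 * act2016["Predicted2016_Regression"]
                                       + 0.5 * act2016["Predicted2016_Average"])
    act2016["Change_Score"] = act2016["Avg Comp 2016"] - act2016["Predicted2016_Hybrid"]

    # "District" is a placeholder for whatever the school identifier column is called.
    # Districts most hurt by the requirement (largest negative change scores) ...
    print(act2016.sort_values("Change_Score")[["District", "Change_Score"]].head(10))
    # ... and any districts whose 2016 scores came in above the adjusted prediction.
    print(act2016.loc[act2016["Change_Score"] > 0, ["District", "Change_Score"]])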
21. List a few school districts that were most hurt by the MN Department of Education requirement that everybody take the ACT. That is, which schools tended to have large negative Change Scores? (4 pts)
22. Suppose a Change Score were positive. This means either 1) the requirement that everybody take the test actually made the scores increase, or 2) the modeling approach under-predicted badly for this school. Were there any schools where the Change Score was positive? If so, list a few of these schools. (3 pts)
23. Your boss wants your professional opinion on whether or not the MN Department of
Education should use an “adjusted” score for 2016. What is your opinion? Discuss. (4 pts)
24. According to the ACT website, Minnesota’s overall rank was #24 for 2016. Compute the
average “adjusted” score across all schools. What would Minnesota’s rank be if your
“adjusted” scores were used instead? (2 pts)