Regression Forecasting and
Model Building
Forecasting company revenue with
multiple linear regression
Forecasting Revenue: An Example of
Regression Model Building
• Setting: A possibly large set of predictor variables
(i.e. revenue drivers) is used to predict future quarterly
revenues from data collected over the previous 7 years.
• Goal: Find an equation (model) that explains
variation in Y with a smaller set of predictors that
are all related to Y but not too strongly related to each
other (i.e. without multicollinearity). Predict revenues for
the next four quarters. Your dependent variable will be revenues
or seasonally adjusted revenues, depending upon
whether your data has pronounced seasonality.
• When you speculate on predictors, it is not unusual for
many of them to be strongly related to each other. This is
especially the case when one variable is mostly derived from
another. Predictors that are too highly correlated create
multicollinearity: they add essentially no
additional information while interfering with each other in
fitting the dependent variable.
• Starting Point: Examine multicollinearity by checking
correlations with a correlation matrix and by generating
VIF values. Among highly correlated predictors you have some
choice in which to keep: prefer variables that have better forecasts
available or that you believe, in theory, should be most related to revenues.
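The correlation-matrix check can be sketched in a few lines of Python. This is only an illustration: the driver names, the numbers, and the 0.9 cutoff are assumptions made for the example, not values from these slides.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def highly_correlated_pairs(predictors, threshold=0.9):
    """Return (name1, name2, r) for predictor pairs with |r| >= threshold.

    The 0.9 default is an assumed cutoff for illustration only.
    """
    pairs = []
    for (name1, v1), (name2, v2) in combinations(predictors.items(), 2):
        r = pearson(v1, v2)
        if abs(r) >= threshold:
            pairs.append((name1, name2, round(r, 3)))
    return pairs


# Hypothetical quarterly revenue drivers (made-up numbers).
drivers = {
    "stores": [10, 11, 12, 13, 14, 15, 16, 17],
    # Mostly derived from the number of stores, so nearly collinear with it.
    "employees": [101, 109, 122, 131, 138, 151, 160, 172],
    "ad_spend": [5, 3, 6, 2, 7, 4, 6, 3],
}

flagged = highly_correlated_pairs(drivers)
print(flagged)
```

Each flagged pair is a candidate for dropping one of its two variables; which one to keep remains a judgment call based on forecast availability and theory.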
Variance Inflation Factors
• Variance Inflation Factor (VIF) – Measure of
how highly correlated each independent
variable is with the other predictors in the
model. Used to identify Multicollinearity.
• Values larger than 10 for a predictor imply large
inflation of the standard errors of the regression
coefficients due to this variable being in the model.
• Inflated standard errors lead to insignificant
t-statistics for regression coefficients and wider
confidence intervals.
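For reference, the usual textbook definition of the VIF is given below. The symbol R_j^2 is notation introduced here (it does not appear on the slides): it is the R-squared obtained when predictor j is regressed on all the other predictors.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j > 10 \iff R_j^2 > 0.9,
\qquad
\mathrm{SE}(b_j) \propto \sqrt{\mathrm{VIF}_j}
```

Under this definition, a VIF of 10 means that more than 90% of the variation in that predictor is explained by the other predictors, and the standard error of its coefficient is about sqrt(10) ≈ 3.2 times what it would be if the predictor were uncorrelated with the others.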
Forecasting Revenue: An Example of
Regression Model Building
• Run a multiple regression to look at VIF values (and D-W
values). Delete one of the variables that have
VIF > 10. Use the correlation matrix to see which pairs of
high-VIF variables are highly correlated. For each pair,
delete the one that has the highest VIF, or the high-VIF
variable that may not have forecasts available or has other
problems (such as non-linearity). There is some flexibility
in this step and it may require some investigation.
• Repeat until all VIFs are smaller than 10. This will result in a
reduced set of variables to use in finding an equation using
All Possible Regressions.
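The delete-and-repeat loop above can be sketched as follows. This is a simplified illustration, not the MegaStat procedure: it always drops the variable with the highest VIF, whereas the slides leave room for judgment (forecast availability, non-linearity). It relies on the standard result that the VIFs are the diagonal of the inverse of the predictors' correlation matrix. The data and names are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [rv - factor * cv for rv, cv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vifs(predictors):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    names = list(predictors)
    corr = [[pearson(predictors[a], predictors[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


def prune_by_vif(predictors, limit=10.0):
    """Drop the highest-VIF predictor until every VIF is at or below limit."""
    kept = dict(predictors)
    while len(kept) > 1:
        current = vifs(kept)
        worst = max(current, key=current.get)
        if current[worst] <= limit:
            break
        del kept[worst]  # simplification: always drop the highest VIF
    return kept


# Hypothetical quarterly revenue drivers (made-up numbers).
drivers = {
    "stores": [10, 11, 12, 13, 14, 15, 16, 17],
    "employees": [101, 109, 122, 131, 138, 151, 160, 172],
    "ad_spend": [5, 3, 6, 2, 7, 4, 6, 3],
}

kept = prune_by_vif(drivers)
print(sorted(kept), vifs(kept))
```

In practice you would replace the `del kept[worst]` line with your own choice among the high-VIF candidates, then rerun the check.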
• Best Model Process using the data. Use MegaStat All Possible
Regressions to find an equation that has the fewest variables, all of them
significant (p-value < .05), with a small standard error and
a large adjusted R-squared.
• MegaStat orders the models starting from the highest adjusted R-squared and lowest
standard error. Look near the top for best-model candidates with all significant
p-values and the fewest predictors. You can use the formula
=IF(COUNTIF(predictor range,">.05")>0,"","OK") to help
identify the models whose predictors are all significant, then compare the OK models by looking at
adjusted R-squared and standard error.
• If MegaStat provides a Cp statistic, it summarizes each possible model, and the
“best” model can be selected based on this statistic. Ideally you select the model
with the fewest predictors p that has Cp ≈ p and all p-values < .05.
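The spreadsheet check above has a direct analogue in code. The sketch below assumes you have already copied each candidate model's summary out of the All Possible Regressions output; the `Candidate` record, the predictor names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    predictors: tuple   # names of the predictors in the model
    adj_r2: float       # adjusted R-squared
    std_error: float    # standard error of the regression
    p_values: tuple     # one p-value per predictor


def all_significant(model, alpha=0.05):
    """Same test as the COUNTIF formula: 'OK' when no p-value exceeds alpha."""
    return all(p <= alpha for p in model.p_values)


def shortlist(models, alpha=0.05):
    """Keep the OK models, best adjusted R-squared first, fewer predictors on ties."""
    ok = [m for m in models if all_significant(m, alpha)]
    return sorted(ok, key=lambda m: (-m.adj_r2, len(m.predictors)))


# Made-up summaries for three candidate models.
candidates = [
    Candidate(("gdp", "stores", "ad_spend"), 0.952, 41.0, (0.001, 0.020, 0.310)),
    Candidate(("gdp", "stores"), 0.949, 42.1, (0.001, 0.004)),
    Candidate(("gdp",), 0.901, 58.7, (0.0001,)),
]

best_first = shortlist(candidates)
for model in best_first:
    print(model.predictors, model.adj_r2, model.std_error)
```

Here the three-predictor model is excluded because one p-value is above .05, even though its adjusted R-squared is marginally the highest.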
• Again you have some flexibility here to choose a set of
variables with desirable qualities (e.g. good forecasts).
• Minor differences in adjusted R-squared and standard error are not
likely to have a significant impact on your forecast results.
• Keep in mind that p-values only indicate confidence that the slope
is not zero. You need only be confident enough; smaller
p-values do not translate into better forecasts. Predictors with
p-values that are small but larger than .05 may still be good for your
forecasts.
• If you have to go far down the list, sacrificing R-squared and
standard error, consider using a model with less significant
predictors, or swap out one or more variables for one of the highly
correlated variables you left out previously.
Validating Your Model
• When you forecast with speculative predictors, it’s possible that the
data is related to the dependent variable only by coincidence
(“spurious correlation”), especially with small amounts of time
series data. To help address this we will use a “hold out sample”
to validate that the relationships actually exist.
• Validation with holdout sample: Run the regression with the best
model selected leaving out the last two quarters of data. Forecast
the quarters you held out with 95% prediction intervals.
– Check the assumptions for the validation model. If not valid, can you fix?
Transform data?
– Do the actual values fall within the lower and upper prediction limits
implying that the predictions seem reasonable?
– If not, try using an alternative model from the all possible regressions
options or see if there is a reason that quarters held out are different in some
way. Look at the quarterly reports and see if they might suggest use of a
dummy variable. Redo the validation process.
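The holdout check can be illustrated with a deliberately reduced case: one predictor instead of a multiple regression, made-up data, and a t critical value looked up by hand (the Python standard library has no t distribution). All names and numbers below are assumptions for the sketch.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares line; returns the pieces needed for prediction intervals."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))  # standard error of the regression
    return intercept, slope, s, x_bar, sxx, n


def prediction_interval(fit, x0, t_crit):
    """Prediction interval (lower, forecast, upper) for a new observation at x0."""
    intercept, slope, s, x_bar, sxx, n = fit
    y_hat = intercept + slope * x0
    half_width = t_crit * s * sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    return y_hat - half_width, y_hat, y_hat + half_width


# Hypothetical driver and revenue for 10 quarters (made-up numbers).
driver = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
revenue = [5.5, 6.5, 9.5, 10.5, 13.5, 14.5, 17.5, 18.5, 21.5, 22.5]

holdout = 2
fit = fit_line(driver[:-holdout], revenue[:-holdout])

# 95% two-sided t critical value for 8 - 2 = 6 degrees of freedom (from a t table).
T_CRIT = 2.447

results = []
for x0, actual in zip(driver[-holdout:], revenue[-holdout:]):
    lower, forecast, upper = prediction_interval(fit, x0, T_CRIT)
    results.append((actual, lower, upper, lower <= actual <= upper))
    print(f"actual={actual:.1f} forecast={forecast:.2f} "
          f"interval=({lower:.2f}, {upper:.2f})")
```

If any held-out actual falls outside its interval, that is the signal to revisit the model or investigate those quarters, as described above.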
Regression Diagnostics
Model Assumptions:
• Residual plots or other diagnostics can be used to check the
model assumptions:
-- Plot of Residuals versus each variable should be a random cloud.
U-shaped (or rainbow) → Nonlinear relationship
-- Plot of Residuals versus predicted should be a random cloud.
Wedge shaped → Non-constant (increasing) variability
-- Residuals should be mound-shaped (normal). Use skewness/kurtosis or a
normal probability plot to check.
-- Plot of Residuals versus Time order (Time series data) should be random
cloud. If D-W < 1.3, residuals are not independent.
Cook’s D is a check for influential observations that may have large
impacts on the equation. Check data for accuracy or errors (e.g. typos,
wrong units, etc.).
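The D-W value referred to above is the Durbin-Watson statistic, computed from the residuals in time order. A minimal sketch follows; the 1.3 cutoff is the one given on this slide, and the residual series are made up to show the two extremes.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals (residuals in time order)."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def residuals_look_dependent(residuals, cutoff=1.3):
    """Apply the slide's rule of thumb: D-W below the cutoff signals dependence."""
    return durbin_watson(residuals) < cutoff


# Made-up residual series.
alternating = [1.0, -1.0, 1.0, -1.0]   # sign flips every period
persistent = [1.0, 1.0, -1.0, -1.0]    # runs of same-signed residuals

print(durbin_watson(alternating), residuals_look_dependent(alternating))
print(durbin_watson(persistent), residuals_look_dependent(persistent))
```

Values near 2 indicate little first-order autocorrelation; runs of same-signed residuals pull the statistic toward 0.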
Detecting Influential Observations
Studentized Residuals – Residuals divided by their estimated
standard errors. Observations in dark blue are considered outliers
from the equation.
Leverage Values – Measure of how far an observation is from the
others in terms of the levels of the independent variables (not the
dependent variable). Observations in dark blue are considered to be
outliers in the X values.
Cook’s D – Measure of aggregate impact of each observation on the
group of regression coefficients, as well as the group of fitted values.
Values larger than 1 are considered highly influential.
Influential observations may suggest quarters to research to see if
something special happened that may suggest a dummy variable.
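To make the three measures concrete, the sketch below computes them for a one-predictor regression using the standard textbook formulas (internally studentized residuals). The single predictor and the data are assumptions chosen for brevity; MegaStat reports the same quantities for the multiple-regression case.

```python
from math import sqrt


def influence_measures(x, y):
    """Leverage, studentized residual and Cook's D for each observation
    in a simple (one-predictor) least-squares regression."""
    n, p = len(x), 2  # p counts the estimated coefficients: intercept and slope
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance

    rows = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - x_bar) ** 2 / sxx
        studentized = e / sqrt(s2 * (1 - leverage))
        cooks_d = (e ** 2 * leverage) / (p * s2 * (1 - leverage) ** 2)
        rows.append({"leverage": leverage,
                     "studentized": studentized,
                     "cooks_d": cooks_d})
    return rows


# Made-up data: the last quarter sits far from the others in x
# and well off the trend of the first five points.
x = [1, 2, 3, 4, 5, 20]
y = [2, 4, 6, 8, 10, 10]

rows = influence_measures(x, y)
for i, row in enumerate(rows):
    flag = "  <-- highly influential" if row["cooks_d"] > 1 else ""
    print(i, {k: round(v, 3) for k, v in row.items()}, flag)
```

In this example only the last observation exceeds the Cook's D cutoff of 1, so it is the quarter to research first.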
The Final Forecasts
• Add the last two quarters back into your data set and refit
the equation using the same variables, then forecast the next four
quarters. Recheck the assumptions now that you have 2
additional data points.
• Do the forecasts make sense? Superimpose your forecasts
on a time series plot of revenues and ensure that the
forecasts seem reasonable. If not, try to explain or find
• Document all your data and forecast sources.
• Write a report that documents all aspects of the
forecasting process.