Multiple Regression Analysis
Farideh H. Dehkordi-Vakil
Multiple Regression

• Regression models may contain more than one independent variable.
• Example. In a study of direct operating cost, Y, for 67 branch offices of a consumer finance chain, four independent variables were considered:
  • X1: average size of loan outstanding during the year,
  • X2: average number of loans outstanding,
  • X3: total number of new loan applications processed, and
  • X4: office salary scale index.
• The model for this example is

  Y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + ε
Formal Statement of the Model

• General regression model:

  Y = β0 + β1 x1 + β2 x2 + … + βk xk + ε

  • β0, β1, …, βk are parameters
  • X1, X2, …, Xk are known constants
  • ε, the error terms, are independent N(0, σ²)
Meaning of Regression Coefficients

• The values of the regression parameters βi are not known. We estimate them from data.
• βi indicates the change in the mean response per unit increase in Xi when the rest of the independent variables in the model are held constant.
• The parameters βi are frequently called partial regression coefficients because they reflect the partial effect of one independent variable when the rest of the independent variables are included in the model and are held constant.
Analysis of Variance Results

• The sum of squares decomposition and the associated degrees of freedom are:

  Σ(yi − ȳ)²  =  Σ(ŷi − ȳ)²  +  Σ(yi − ŷi)²
      SST     =      SSR     +      SSE

  df:  n − 1  =  k  +  (n − k − 1)
Analysis of Variance Table

Source       Sum of Squares   df          Mean Square             F-test
Regression   SSR              k           MSR = SSR/k             MSR/MSE
Error        SSE              n − k − 1   MSE = SSE/(n − k − 1)
Total        SST              n − 1
F-test for Regression Relation

• To test the statistical significance of the regression relation between the dependent variable y and the set of variables x1, …, xk, i.e., to choose between the alternatives:

  H0: β1 = β2 = … = βk = 0
  Ha: not all βi (i = 1, …, k) equal zero

• We use the test statistic:

  F = MSR / MSE
F-test for Regression Relation

• The decision rule at significance level α is:

  Reject H0 if F > F(α; k, n − k − 1)

  where the critical value F(α; k, n − k − 1) can be found from an F-table.
• The existence of a regression relation by itself does not assure that useful predictions can be made by using it.
• Note that when k = 1, this test reduces to the F-test for testing in simple linear regression whether or not β1 = 0.
Interval estimation of βi

• For our regression model, we have:

  (bi − βi) / s(bi)  has a t-distribution with n − k − 1 degrees of freedom

• Therefore, an interval estimate for βi with 1 − α confidence coefficient is:

  bi ± t(α/2; n − k − 1) s(bi)

  where  s(bi) = √( MSE / Σ(x − x̄)² )
Tests for βk

• To test:

  H0: βi = 0
  Ha: βi ≠ 0

• We may use the test statistic:

  t = bi / s(bi)

• Reject H0 if

  t > t(α/2; n − k − 1)   or   t < −t(α/2; n − k − 1)
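A hedged Python sketch (not from the original slides) of how these coefficient t-tests and interval estimates can be produced with statsmodels; the DataFrame and column names are assumptions for illustration:

```python
# Hedged sketch: coefficient t-tests and confidence intervals with statsmodels.
# 'data' is a hypothetical pandas DataFrame holding the response and the predictors.
import pandas as pd
import statsmodels.api as sm

def coef_inference(data, response, predictors, alpha=0.05):
    X = sm.add_constant(data[predictors])     # adds the intercept column
    fit = sm.OLS(data[response], X).fit()
    table = pd.DataFrame({
        "b": fit.params,                      # estimated coefficients b_i
        "s(b)": fit.bse,                      # standard errors
        "t": fit.tvalues,                     # t = b_i / s(b_i)
        "p-value": fit.pvalues,               # two-sided p-values, df = n - k - 1
    })
    ci = fit.conf_int(alpha=alpha).rename(columns={0: "lower", 1: "upper"})
    return table.join(ci)
```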
Selecting the best Regression Equation

• After a lengthy list of potentially useful independent variables has been compiled, some of the independent variables can be screened out. An independent variable
  • may not be fundamental to the problem,
  • may be subject to large measurement error, or
  • may effectively duplicate another independent variable in the list.
Selecting the best Regression Equation

• Once the investigator has tentatively decided upon the functional forms of the regression relations (linear, quadratic, etc.), the next step is to obtain a subset of the independent variables (x) that "best" explain the variability in the dependent variable y.
Selecting the best Regression Equation

• An automatic search procedure that sequentially develops the subset of x variables to be included in the regression model is called a stepwise procedure.
• It was developed to economize on computational effort.
• It ends with the identification of a single regression model as "best".
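In the example that follows, the slides drop the least significant variable one at a time. A hedged sketch of that backward-elimination idea (my own illustration, not the stepwise routine of any particular package) might look like this:

```python
# Hedged sketch: backward elimination based on the marginal (partial) t-tests.
# Starting from the full model, repeatedly drop the predictor with the largest
# p-value until every remaining predictor is significant at level alpha.
import statsmodels.api as sm

def backward_eliminate(data, response, predictors, alpha=0.05):
    keep = list(predictors)
    fit = sm.OLS(data[response], sm.add_constant(data[keep])).fit()
    while len(keep) > 0:
        pvals = fit.pvalues.drop("const")      # ignore the intercept
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:
            break                              # all remaining terms are significant
        keep.remove(worst)                     # drop the least significant term
        fit = sm.OLS(data[response], sm.add_constant(data[keep])).fit()
    return fit
```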
Example: Sales Forecasting

• Sales Forecasting
  • Multiple regression is a popular technique for predicting product sales with the help of other variables that are likely to have a bearing on sales.
• Example
  • The growth of cable television has created vast new potential in the home entertainment business. The following table gives the values of several variables measured in a random sample of 20 local television stations which offer their programming to cable subscribers. A TV industry analyst wants to build a statistical model for predicting the number of subscribers that a cable station can expect.
Example: Sales Forecasting

• Y = Number of cable subscribers (SUBSCRIB)
• X1 = Advertising rate which the station charges local advertisers for one minute of prime time space (ADRATE)
• X2 = Kilowatt power of the station's non-cable signal (KILOWATT)
• X3 = Number of families living in the station's area of dominant influence (ADI), a geographical division of radio and TV audiences (APIPOP)
• X4 = Number of competing stations in the ADI (COMPETE)
Example: Sales Forecasting

• The sample data are fitted by a multiple regression model using Excel.
• The marginal t-test provides a way of choosing the variables for inclusion in the equation.
• The fitted model is

  SUBSCRIBE = β0 + β1·ADRATE + β2·APIPOP + β3·COMPETE + β4·SIGNAL
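The slides fit this model in Excel; a roughly equivalent Python sketch is shown below (the file name is hypothetical, and the column names are assumed to match the variable labels above):

```python
# Hedged sketch: fitting the cable-subscriber model with statsmodels instead of Excel.
# 'stations' is assumed to be a pandas DataFrame with the 20 observations and the
# columns SUBSCRIB, ADRATE, APIPOP, COMPETE, SIGNAL used in these slides.
import pandas as pd
import statsmodels.api as sm

stations = pd.read_csv("cable_stations.csv")            # hypothetical file name

X = sm.add_constant(stations[["ADRATE", "APIPOP", "COMPETE", "SIGNAL"]])
model = sm.OLS(stations["SUBSCRIB"], X).fit()

print(model.summary())        # R-squared, ANOVA F, and the marginal t-tests
```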
Example: Sales Forecasting

• Excel summary output

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.884267744
R Square            0.781929444
Adjusted R Square   0.723777295
Standard Error      142.9354188
Observations        20

ANOVA
             df   SS            MS            F             Significance F
Regression   4    1098857.84    274714.4601   13.44626923   7.52E-05
Residual     15   306458.0092   20430.53395
Total        19   1405315.85

            Coefficients    Standard Error   t Stat         P-value       Lower 95%   Upper 95%
Intercept   51.42007002     98.97458277      0.51952803     0.610973806   -159.539    262.3795
AD_Rate     -0.267196347    0.081055107      -3.296477624   0.004894126   -0.43996    -0.09443
Signal      -0.020105139    0.045184758      -0.444954014   0.662706578   -0.11641    0.076204
APIPOP      0.440333955     0.135200486      3.256896248    0.005307766   0.152161    0.728507
Compete     16.230071       26.47854322      0.61295181     0.549089662   -40.2076    72.66778
Example: Sales Forecasting

• Do we need all four variables in the model?
• Based on the partial t-tests, the variables Signal and Compete are the least significant variables in our model.
• Let's drop the least significant variables one at a time.
Example: Sales Forecasting

• Excel summary output (Signal dropped)

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.882638739
R Square            0.779051144
Adjusted R Square   0.737623233
Standard Error      139.3069743
Observations        20

ANOVA
             df   SS            MS           F             Significance F
Regression   3    1094812.92    364937.64    18.80498277   1.69966E-05
Residual     16   310502.9296   19406.4331
Total        19   1405315.85

            Coefficients    Standard Error   t Stat        P-value       Lower 95%      Upper 95%
Intercept   51.31610447     96.4618242       0.531983558   0.602046756   -153.1737817   255.806
AD_Rate     -0.259538026    0.077195983      -3.36206646   0.003965102   -0.423186162   -0.09589
APIPOP      0.433505145     0.130916687      3.311305499   0.004412929   0.15597423     0.711036
Compete     13.92154404     25.30614013      0.550125146   0.589831583   -39.72506442   67.56815
Example: Sales Forecasting

• The variable Compete is the next variable to drop.
Example: Sales Forecasting

• Excel summary output (Signal and Compete dropped)

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8802681
R Square            0.774871928
Adjusted R Square   0.748386273
Standard Error      136.4197776
Observations        20

ANOVA
             df   SS            MS           F            Significance F
Regression   2    1088939.802   544469.901   29.2562866   3.13078E-06
Residual     17   316376.0474   18610.35573
Total        19   1405315.85

            Coefficients    Standard Error   t Stat         P-value       Lower 95%      Upper 95%
Intercept   96.28121395     50.16415506      1.919322948    0.07188916    -9.556049653   202.1184776
AD_Rate     -0.254280696    0.075014548      -3.389751739   0.003484198   -0.41254778    -0.096013612
APIPOP      0.495481252     0.065306012      7.587069489    7.45293E-07   0.357697418    0.633265086
Example: Sales Forecasting

• All the variables in the model are statistically significant, therefore our final model is:

Final Model

  SUBSCRIBE = 96.28 − 0.25·ADRATE + 0.495·APIPOP
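To make the final equation concrete, here is a small hand-check (the station's ADRATE and APIPOP values below are made up purely for illustration):

```python
# Hedged sketch: using the final fitted equation to predict subscribers
# for a hypothetical station (the input values below are illustrative only).
def predict_subscribers(adrate, apipop):
    # Final model from the slides: SUBSCRIBE = 96.28 - 0.25*ADRATE + 0.495*APIPOP
    return 96.28 - 0.25 * adrate + 0.495 * apipop

print(predict_subscribers(adrate=200.0, apipop=1000.0))   # 96.28 - 50 + 495 = 541.28
```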
Interpreting the Final Model

• What is the interpretation of the estimated parameters?
• Is the association positive or negative?
• Does this make sense intuitively, based on what the data represent?
• What other variables could be confounders?
• Are there other analyses that you might consider doing?
• New questions raised?
Multicollinearity

• In multiple regression analysis, one is often concerned with the nature and significance of the relations between the independent variables and the dependent variable.
• Questions that are frequently asked are:
  • What is the relative importance of the effects of the different independent variables?
  • What is the magnitude of the effect of a given independent variable on the dependent variable?
Multicollinearity

• Can any independent variable be dropped from the model because it has little or no effect on the dependent variable?
• Should any independent variables not yet included in the model be considered for possible inclusion?
• Simple answers can be given to these questions if
  • the independent variables in the model are uncorrelated among themselves, and
  • they are uncorrelated with any other independent variables that are related to the dependent variable but omitted from the model.
Multicollinearity

• When the independent variables are correlated among themselves, multicollinearity among them is said to exist.
• In many non-experimental situations in business, economics, and the social and biological sciences, the independent variables tend to be correlated among themselves.
• For example, in a regression of family food expenditures on the variables family income, family savings, and the age of the head of household, the independent variables will be correlated among themselves.
Multicollinearity

• Further, the independent variables will also be correlated with other socioeconomic variables not included in the model that do affect family food expenditures, such as family size.
Multicollinearity

• Some key problems that typically arise when the independent variables being considered for the regression model are highly correlated among themselves are:
  1. Adding or deleting an independent variable changes the regression coefficients.
  2. The estimated standard deviations of the regression coefficients become large when the independent variables in the regression model are highly correlated with each other.
  3. The estimated regression coefficients individually may not be statistically significant even though a definite statistical relation exists between the dependent variable and the set of independent variables.
Multicollinearity Diagnostics

• A widely used formal method of detecting the presence of multicollinearity is the Variance Inflation Factor (VIF).
  • It measures how much the variances of the estimated regression coefficients are inflated as compared to when the independent variables are not linearly related.

  VIFj = 1 / (1 − Rj²),   j = 1, 2, …, k

  where Rj² is the coefficient of determination from the regression of the jth independent variable on the remaining k − 1 independent variables.
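A hedged Python sketch (not part of the original slides) that computes VIFj directly from the definition above, by running each auxiliary regression; the DataFrame of predictors is an assumption:

```python
# Hedged sketch: VIF_j = 1 / (1 - R_j^2), computed from the auxiliary regression of
# each predictor on the others. 'X' is assumed to be a DataFrame of the predictors
# (e.g., ADRATE, SIGNAL, APIPOP, COMPETE from the cable example).
import pandas as pd
import statsmodels.api as sm

def vif_table(X):
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = sm.OLS(X[col], sm.add_constant(others)).fit().rsquared   # R_j^2
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")
```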
Multicollinearity Diagnostics

• A VIF near 1 suggests that multicollinearity is not a problem for that independent variable.
  • Its estimated coefficient and associated t value will not change much as the other independent variables are added to or deleted from the regression equation.
• A VIF much greater than 1 indicates the presence of multicollinearity. A maximum VIF value in excess of 10 is often taken as an indication that multicollinearity may be unduly influencing the least squares estimates.
  • The estimated coefficient attached to the variable is unstable, and its associated t statistic may change considerably as the other independent variables are added or deleted.
Multicollinearity Diagnostics

• The simple correlation coefficients between all pairs of explanatory variables (i.e., X1, X2, …, Xk) are helpful in selecting appropriate explanatory variables for a regression model and are also useful for examining multicollinearity.
• While a correlation very close to +1 or −1 does suggest multicollinearity, the converse does not hold: unless there are only two explanatory variables, one cannot infer that multicollinearity is absent just because no pair of explanatory variables is highly correlated.
Example: Sales Forecasting

Pearson Correlation Coefficients, N = 20
Prob > |r| under H0: Rho = 0

            SUBSCRIB   ADRATE     KILOWATT   APIPOP     COMPETE
SUBSCRIB    1.00000    -0.02848   0.44762    0.90447    0.79832
                       0.9051     0.0478     <.0001     <.0001
ADRATE      -0.02848   1.00000    -0.01021   0.32512    0.34147
            0.9051                0.9659     0.1619     0.1406
KILOWATT    0.44762    -0.01021   1.00000    0.45303    0.46895
            0.0478     0.9659                0.0449     0.0370
APIPOP      0.90447    0.32512    0.45303    1.00000    0.87592
            <.0001     0.1619     0.0449                <.0001
COMPETE     0.79832    0.34147    0.46895    0.87592    1.00000
            <.0001     0.1406     0.0370     <.0001
Example: Sales Forecasting

• VIF calculation: fit the model

  APIPOP = β0 + β1·SIGNAL + β2·ADRATE + β3·COMPETE
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.878054
R Square            0.770978
Adjusted R Square   0.728036
Standard Error      264.3027
Observations        20

ANOVA
             df   SS        MS        F         Significance F
Regression   3    3762601   1254200   17.9541   2.25472E-05
Residual     16   1117695   69855.92
Total        19   4880295

            Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept   -472.685       139.7492         -3.38238   0.003799   -768.9402258   -176.43
Compete     159.8413       28.29157         5.649786   3.62E-05   99.86587622    219.8168
ADRATE      0.048173       0.149395         0.322455   0.751283   -0.268529713   0.364876
Signal      0.037937       0.083011         0.457012   0.653806   -0.138038952   0.213913
Example: Sales Forecasting

• Fit the model

  COMPETE = β0 + β1·ADRATE + β2·APIPOP + β3·SIGNAL
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.882936
R Square            0.779575
Adjusted R Square   0.738246
Standard Error      1.34954
Observations        20

ANOVA
             df   SS         MS         F          Significance F
Regression   3    103.0599   34.35329   18.86239   1.66815E-05
Residual     16   29.14013   1.821258
Total        19   132.2

            Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept   3.10416        0.520589         5.96278    1.99E-05   2.000559786    4.20776
ADRATE      0.000491       0.000755         0.649331   0.525337   -0.001110874   0.002092
Signal      0.000334       0.000418         0.799258   0.435846   -0.000552489   0.001221
APIPOP      0.004167       0.000738         5.649786   3.62E-05   0.002603667    0.005731
Example: Sales Forecasting

• Fit the model

  SIGNAL = β0 + β1·ADRATE + β2·APIPOP + β3·COMPETE
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.512244
R Square            0.262394
Adjusted R Square   0.124092
Standard Error      790.8387
Observations        20

ANOVA
             df   SS         MS         F          Significance F
Regression   3    3559789    1186596    1.897261   0.170774675
Residual     16   10006813   625425.8
Total        19   13566602

            Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept   5.171093       547.6089         0.009443   0.992582   -1155.707711   1166.05
APIPOP      0.339655       0.743207         0.457012   0.653806   -1.235874129   1.915184
Compete     114.8227       143.6617         0.799258   0.435846   -189.7263711   419.3718
ADRATE      -0.38091       0.438238         -0.86919   0.397593   -1.309935875   0.548109
Example: Sales Forecasting

• Fit the model

  ADRATE = β0 + β1·SIGNAL + β2·APIPOP + β3·COMPETE
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.399084
R Square            0.159268
Adjusted R Square   0.001631
Standard Error      440.8588
Observations        20

ANOVA
             df   SS         MS         F          Significance F
Regression   3    589101.7   196367.2   1.010346   0.413876018
Residual     16   3109703    194356.5
Total        19   3698805

            Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept   253.7304       298.6063         0.849716   0.408018   -379.2865355   886.7474
Signal      -0.11837       0.136186         -0.86919   0.397593   -0.407073832   0.170329
APIPOP      0.134029       0.415653         0.322455   0.751283   -0.747116077   1.015175
Compete     52.3446        80.61309         0.649331   0.525337   -118.5474784   223.2367
Example: Sales Forecasting

• VIF calculation results:

Variable   R-Squared   VIF
ADRATE     0.159268    1.19
COMPETE    0.779575    4.54
SIGNAL     0.262394    1.36
APIPOP     0.770978    4.36

• There is no serious multicollinearity (no VIF exceeds 10).
Qualitative Independent Variables

• Many variables of interest in business, economics, and the social and biological sciences are not quantitative but qualitative.
• Examples of qualitative variables are gender (male, female), purchase status (purchase, no purchase), and type of firm.
• Qualitative variables can also be used in multiple regression.
Qualitative Independent Variables

• An economist wished to relate the speed with which a particular insurance innovation is adopted (y) to the size of the insurance firm (x1) and the type of firm. The dependent variable is measured by the number of months elapsed between the time the first firm adopted the innovation and the time the given firm adopted the innovation. The first independent variable, size of the firm, is quantitative and measured by the amount of total assets of the firm. The second independent variable, type of firm, is qualitative and is composed of two classes: stock companies and mutual companies.
Indicator variables

• Indicator, or dummy, variables are used to determine the relationship between qualitative independent variables and a dependent variable.
• Indicator variables take on the values 0 and 1.
• For the insurance innovation example, where the qualitative variable has two classes, we might define the indicator variable x2 as follows:

  x2 = 1 if stock company
       0 otherwise
Indicator variables

• A qualitative variable with c classes will be represented by c − 1 indicator variables.
• A regression function with an indicator variable with two levels (c = 2) will yield two estimated lines.
Interpretation of Regression Coefficients

• In our insurance innovation example, the regression model is:

  y = β0 + β1 x1 + β2 x2 + ε

  where:
    x1 = size of firm
    x2 = 1 if stock company
         0 otherwise
Interpretation of Regression Coefficients

• To understand the meaning of the regression coefficients in this model, consider first the case of a mutual firm. For such a firm, x2 = 0 and we have:

  ŷ = b0 + b1 x1 + b2(0) = b0 + b1 x1      (mutual firms)

• For a stock firm, x2 = 1 and the response function is:

  ŷ = b0 + b1 x1 + b2(1) = (b0 + b2) + b1 x1      (stock firms)
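A hedged sketch (not from the original slides) of how the two fitted lines come out of a single regression with a 0/1 indicator; the data file and column names are assumptions for illustration:

```python
# Hedged sketch: one regression with a 0/1 indicator yields two fitted lines.
# 'firms' is a hypothetical DataFrame with columns: months (y), size (x1),
# and firm_type ("stock" or "mutual").
import pandas as pd
import statsmodels.api as sm

firms = pd.read_csv("insurance_innovation.csv")                  # hypothetical file
firms["stock"] = (firms["firm_type"] == "stock").astype(int)     # x2 = 1 if stock company

X = sm.add_constant(firms[["size", "stock"]])
fit = sm.OLS(firms["months"], X).fit()

b0, b1, b2 = fit.params["const"], fit.params["size"], fit.params["stock"]
print(f"mutual firms: y-hat = {b0:.2f} + {b1:.4f}*size")
print(f"stock firms:  y-hat = {b0 + b2:.2f} + {b1:.4f}*size")    # same slope, shifted intercept
```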
Interpretation of Regression Coefficients

• The response function for the mutual firms is a straight line, with y-intercept β0 and slope β1.
• For stock firms, this also is a straight line, with the same slope β1 but with y-intercept β0 + β2.
• With reference to the insurance innovation example, the mean time elapsed before the innovation is adopted is a linear function of the size of the firm (x1), with the same slope β1 for both types of firms.
Interpretation of Regression Coefficients

• β2 indicates how much lower or higher the response function for stock firms is than the one for mutual firms.
• β2 measures the differential effect of type of firm.
• In general, β2 shows how much higher (lower) the mean response line is for the class coded 1 than the line for the class coded 0, for any level of x1.
Accounting for Seasonality in a Multiple Regression Model

• Seasonal patterns are not easily accounted for by the typical causal variables that we use in regression analysis.
• An indicator variable can be used effectively to account for seasonality in our time series data.
• The number of seasonal indicator variables to use depends on the data.
• If we have p periods in our data series, we cannot use more than p − 1 seasonal indicator variables.
Example: Private Housing Starts (PHS)

• Housing starts in the United States are measured in thousands of units. These data are plotted for 1990 Q1 through 1999 Q4. There are typically few housing starts during the first quarter of the year (January, February, March); there is usually a big increase in the second quarter (April, May, June), followed by some decline in the third quarter (July, August, September), and a further decline in the fourth quarter (October, November, December).
Example: Private Housing Starts (PHS)

[Figure: Private Housing Starts (PHS) in thousands of units, plotted quarterly over the 1990s; "1" marks the first quarter of each year.]
Example: Private Housing Starts (PHS)

• To account for and measure this seasonality in a regression model, we will use three dummy variables: Q2 for the second quarter, Q3 for the third quarter, and Q4 for the fourth quarter. These will be coded as follows (see the sketch after this list):
  • Q2 = 1 for all second quarters and zero otherwise.
  • Q3 = 1 for all third quarters and zero otherwise.
  • Q4 = 1 for all fourth quarters and zero otherwise.
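A hedged Python sketch of building these quarterly dummies from a date column; the file name and column names are assumptions that mirror the data table shown later:

```python
# Hedged sketch: building the Q2/Q3/Q4 seasonal dummies from a quarterly date column.
# 'phs' is assumed to be a DataFrame with a PERIOD column of quarter-end dates
# and the PHS and MR columns used in this example.
import pandas as pd

phs = pd.read_csv("private_housing_starts.csv", parse_dates=["PERIOD"])   # hypothetical file

quarter = phs["PERIOD"].dt.quarter
for q in (2, 3, 4):                         # Q1 is the base quarter, so no dummy for it
    phs[f"Q{q}"] = (quarter == q).astype(int)
```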
Example: Private Housing Starts (PHS)

• Data for private housing starts (PHS), the mortgage rate (MR), and these seasonal indicator variables are shown in the following slide.
• Examine the data carefully to verify your understanding of the coding for Q2, Q3, and Q4.
• Since we have assigned dummy variables for the second, third, and fourth quarters, the first quarter is the base quarter for our regression model.
• Note that any quarter could be used as the base, with indicator variables to adjust for differences in the other quarters.
Example: Private Housing Starts (PHS)

PERIOD      PHS     MR        Q2   Q3   Q4
31-Mar-90   217     10.1202   0    0    0
30-Jun-90   271.3   10.3372   1    0    0
30-Sep-90   233     10.1033   0    1    0
31-Dec-90   173.6   9.9547    0    0    1
31-Mar-91   146.7   9.5008    0    0    0
30-Jun-91   254.1   9.5265    1    0    0
30-Sep-91   239.8   9.2755    0    1    0
31-Dec-91   199.8   8.6882    0    0    1
31-Mar-92   218.5   8.7098    0    0    0
30-Jun-92   296.4   8.6782    1    0    0
30-Sep-92   276.4   8.0085    0    1    0
31-Dec-92   238.8   8.2052    0    0    1
31-Mar-93   213.2   7.7332    0    0    0
30-Jun-93   323.7   7.4515    1    0    0
30-Sep-93   309.3   7.0778    0    1    0
31-Dec-93   279.4   7.0537    0    0    1
31-Mar-94   252.6   7.2958    0    0    0
30-Jun-94   354.2   8.4370    1    0    0
30-Sep-94   325.7   8.5882    0    1    0
31-Dec-94   265.9   9.0977    0    0    1
31-Mar-95   214.2   8.8123    0    0    0
30-Jun-95   296.7   7.9470    1    0    0
30-Sep-95   308.2   7.7012    0    1    0
31-Dec-95   257.2   7.3508    0    0    1
31-Mar-96   240     7.2430    0    0    0
30-Jun-96   344.5   8.1050    1    0    0
30-Sep-96   324     8.1590    0    1    0
31-Dec-96   252.4   7.7102    0    0    1
31-Mar-97   237.8   7.7905    0    0    0
30-Jun-97   324.5   7.9255    1    0    0
30-Sep-97   314.6   7.4692    0    1    0
31-Dec-97   256.8   7.1980    0    0    1
31-Mar-98   258.4   7.0547    0    0    0
30-Jun-98   360.4   7.0938    1    0    0
30-Sep-98   348     6.8657    0    1    0
31-Dec-98   304.6   6.7633    0    0    1
31-Mar-99   294.1   6.8805    0    0    0
30-Jun-99   377.1   7.2037    1    0    0
30-Sep-99   355.6   7.7990    0    1    0
31-Dec-99   308.1   7.8338    0    0    1
Example: Private Housing Starts (PHS)

• The regression model for private housing starts (PHS) is:

  PHS = β0 + β1(MR) + β2(Q2) + β3(Q3) + β4(Q4) + ε

• In this model we expect b1 to have a negative sign, and we would expect b2, b3, and b4 all to have positive signs. Why?
• Regression results for this model are shown in the next slide.
Example: Private Housing Starts (PHS)

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.885398221
R Square            0.78393001
Adjusted R Square   0.759236296
Standard Error      26.4498851
Observations        40

ANOVA
             df   SS            MS            F             Significance F
Regression   4    88837.93624   22209.48406   31.74613731   3.33637E-11
Residual     35   24485.87476   699.5964217
Total        39   113323.811

            Coefficients    Standard Error   t Stat         P-value       Lower 95%      Upper 95%
Intercept   473.0650749     35.54169837      13.31014264    2.93931E-15   400.9115031    545.2186467
MR          -30.04838192    4.257226391      -7.058206249   3.21421E-08   -38.69102153   -21.40574231
Q2          95.74106935     11.84748487      8.081130334    1.6292E-09    71.689367      119.7927717
Q3          73.92904763     11.82881519      6.249911462    3.62313E-07   49.91524679    97.94284847
Q4          20.54778131     11.84139803      1.73524961     0.091495355   -3.491564078   44.5871267
Example: Private Housing Starts (PHS)

• Use the prediction equation to make a forecast for each of the four quarters of 1999.
• Prediction equation:

  PHS^ = 473.06 − 30.05(MR) + 95.74(Q2) + 73.93(Q3) + 20.55(Q4)
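As a hedged sketch (not from the original slides), the fitted equation can be evaluated for 1999 by plugging in the 1999 mortgage rates from the data table above:

```python
# Hedged sketch: evaluating the fitted PHS equation for the four quarters of 1999,
# using the 1999 mortgage rates (MR) from the data table above.
def forecast_phs(mr, q2=0, q3=0, q4=0):
    return 473.06 - 30.05 * mr + 95.74 * q2 + 73.93 * q3 + 20.55 * q4

print(forecast_phs(6.8805))                 # 1999 Q1
print(forecast_phs(7.2037, q2=1))           # 1999 Q2
print(forecast_phs(7.7990, q3=1))           # 1999 Q3
print(forecast_phs(7.8338, q4=1))           # 1999 Q4 (actual PHS was 308.1)
```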
Example: Private Housing Starts (PHS)

[Figure: Private Housing Starts (PHS) with a Simple Regression Forecast (PHSF1) and a Multiple Regression Forecast (PHSF2), in thousands of units, plotted quarterly over the 1990s.]
Regression Diagnostics and Residual Analysis

• It is important to check the adequacy of the model before it becomes part of the decision-making process.
• Residual plots can be used to check the model assumptions.
• It is important to study outlying observations to decide whether they should be retained or eliminated.
• If they are retained, one must decide whether their influence should be reduced in the fitting process or whether the regression function should be revised.
Time Series Data and the Problem of Serial Correlation

• In regression models we assume that the errors εi are independent.
• In business and economics, many regression applications involve time series data.
• For such data, the assumption of uncorrelated or independent error terms is often not appropriate.
Problems of Serial Correlation

• If the error terms in the regression model are autocorrelated, the use of ordinary least squares procedures has a number of important consequences:
  • MSE underestimates the variance of the error terms.
  • The confidence intervals and tests using the t and F distributions are no longer strictly applicable.
  • The standard errors of the regression coefficients underestimate the variability of the estimated regression coefficients. Spurious regression can result.
First order serial correlation

• The error term in the current period is directly related to the error term in the previous time period.
• Let the subscript t represent time; then the simple linear regression model is:

  yt = β0 + β1 xt + εt,   with   εt = ρ εt−1 + νt

  where
    εt = error at time t,
    ρ = the parameter that measures the correlation between adjacent error terms, and
    νt = normally distributed error terms with mean zero and variance σ².
Example

• The effects of positive serial correlation in a simple linear regression model:
  • Misleading forecasts of future y values.
  • The standard error of the estimate, Sy.x, will underestimate the variability of the y's about the true regression line.
  • Strong autocorrelation can make two unrelated variables appear to be related.
Durbin-Watson Test for Serial Correlation

• Recall the first-order serial correlation model:

  yt = β0 + β1 xt + εt,   εt = ρ εt−1 + νt

• The hypotheses to be tested are:

  H0: ρ = 0
  Ha: ρ > 0

• The alternative hypothesis is ρ > 0 since business and economic time series tend to show positive autocorrelation.
Durbin-Watson Test for Serial Correlation

• The Durbin-Watson statistic is defined as

  DW = Σ_{t=2}^{n} (et − et−1)² / Σ_{t=1}^{n} et²

  where
    et = yt − ŷt = the residual for time period t, and
    et−1 = yt−1 − ŷt−1 = the residual for time period t − 1.
Durbin-Watson Test for Serial Correlation

• The autocorrelation coefficient ρ can be estimated by the lag-1 residual autocorrelation r1(e):

  r1(e) = Σ_{t=2}^{n} et et−1 / Σ_{t=1}^{n} et²

• It can be shown that

  DW ≈ 2(1 − r1(e))
Durbin-Watson Test for Serial Correlation

• Since −1 < r1(e) < 1, it follows that 0 < DW < 4.
• If r1(e) = 0, then DW = 2 (no correlation).
• If r1(e) > 0, then DW < 2 (positive correlation).
• If r1(e) < 0, then DW > 2 (negative correlation).
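A hedged Python sketch (not part of the original slides) of computing the Durbin-Watson statistic from a vector of residuals:

```python
# Hedged sketch: computing the Durbin-Watson statistic from a vector of residuals.
import numpy as np

def durbin_watson(e):
    # DW = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2
    e = np.asarray(e, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# statsmodels ships the same computation:
#   from statsmodels.stats.stattools import durbin_watson
# Applied to the Blaisdell residuals tabulated below, it gives DW ≈ 0.735.
```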
Durbin-Watson Test for Serial Correlation

• Decision rule:
  • If DW > U, do not reject H0.
  • If DW < L, reject H0.
  • If L ≤ DW ≤ U, the test is inconclusive.
• The critical upper (U) and lower (L) bounds can be found in the Durbin-Watson table of your textbook. To use this table you need to know the significance level (α), the number of independent parameters in the model (k), and the sample size (n).
Example

• The Blaisdell Company wished to predict its sales by using industry sales as a predictor variable. The following table gives seasonally adjusted quarterly data on company sales and industry sales for the period 1983-1987.
Example

Year   Quarter   t    CompSale   InduSale
1983   1         1    20.96      127.3
1983   2         2    21.4       130
1983   3         3    21.96      132.7
1983   4         4    21.52      129.4
1984   1         5    22.39      135
1984   2         6    22.76      137.1
1984   3         7    23.48      141.2
1984   4         8    23.66      142.8
1985   1         9    24.1       145.5
1985   2         10   24.01      145.3
1985   3         11   24.54      148.3
1985   4         12   24.3       146.4
1986   1         13   25         150.2
1986   2         14   25.64      153.1
1986   3         15   26.36      157.3
1986   4         16   26.98      160.7
1987   1         17   27.52      164.2
1987   2         18   27.78      165.6
1987   3         19   28.24      168.7
1987   4         20   28.78      171.7
Example

[Figure: Blaisdell Company example; scatter plot of company sales ($ millions) against industry sales ($ millions).]
Example

• The scatter plot suggests that a linear regression model is appropriate.
• The least squares method was used to fit a regression line to the data.
• The residuals were plotted against the fitted values.
• The plot shows that the residuals are consistently above or below the fitted values for extended periods.

Example

[Figure: residuals plotted against the fitted values for the Blaisdell regression.]
Example

• To confirm this graphical diagnosis we will use the Durbin-Watson test of:

  H0: ρ = 0
  Ha: ρ > 0

• The test statistic is:

  DW = Σ_{t=2}^{n} (et − et−1)² / Σ_{t=1}^{n} et²
Example

Year   Qtr   t    Company sales (y)   Industry sales (x)   et         et − et−1   (et − et−1)²   et²
1983   1     1    20.96               127.3                -0.02605                              0.000679
1983   2     2    21.4                130                  -0.06202   -0.03596    0.001293       0.003846
1983   3     3    21.96               132.7                0.022021   0.084036    0.007062       0.000485
1983   4     4    21.52               129.4                0.163754   0.141733    0.020088       0.026815
1984   1     5    22.39               135                  0.04657    -0.11718    0.013732       0.002169
1984   2     6    22.76               137.1                0.046377   -0.00019    3.76E-08       0.002151
1984   3     7    23.48               141.2                0.043617   -0.00276    7.61E-06       0.001902
1984   4     8    23.66               142.8                -0.05844   -0.10205    0.010415       0.003415
1985   1     9    24.1                145.5                -0.0944    -0.03596    0.001293       0.008911
1985   2     10   24.01               145.3                -0.14914   -0.05474    0.002997       0.022243
1985   3     11   24.54               148.3                -0.14799   0.001152    1.33E-06       0.021901
1985   4     12   24.3                146.4                -0.05305   0.094937    0.009013       0.002815
1986   1     13   25                  150.2                -0.02293   0.030125    0.000908       0.000526
1986   2     14   25.64               153.1                0.105852   0.12878     0.016584       0.011205
1986   3     15   26.36               157.3                0.085464   -0.02039    0.000416       0.007304
1986   4     16   26.98               160.7                0.106102   0.020638    0.000426       0.011258
1987   1     17   27.52               164.2                0.029112   -0.07699    0.005927       0.000848
1987   2     18   27.78               165.6                0.042316   0.013204    0.000174       0.001791
1987   3     19   28.24               168.7                -0.04416   -0.08648    0.007478       0.00195
1987   4     20   28.78               171.7                -0.03301   0.011152    0.000124       0.00109
                                                                      Sums:       0.097941       0.133302
Example

  DW = 0.09794 / 0.13330 = 0.735

• Using the Durbin-Watson table of your textbook, for k = 1, n = 20, and α = .01, we find U = 1.15 and L = 0.95.
• Since DW = 0.735 falls below L = 0.95, we reject the null hypothesis and conclude that the error terms are positively autocorrelated.
Remedial Measures for Serial Correlation

• Addition of one or more independent variables to the regression model.
  • One major cause of autocorrelated error terms is the omission from the model of one or more key variables that have time-ordered effects on the dependent variable.
• Use of transformed variables.
  • The regression model is specified in terms of changes rather than levels.
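A hedged sketch of the "changes rather than levels" idea using first differences; the DataFrame and column names are assumptions for illustration:

```python
# Hedged sketch of the "changes rather than levels" idea: regress first differences
# of the dependent variable on first differences of the predictor.
# 'df' is a hypothetical DataFrame with time-ordered columns y and x.
import pandas as pd
import statsmodels.api as sm

def fit_in_differences(df):
    changes = df[["y", "x"]].diff().dropna()          # Δy_t = y_t − y_{t−1}, Δx_t = x_t − x_{t−1}
    return sm.OLS(changes["y"], sm.add_constant(changes["x"])).fit()
```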
Extensions of the Multiple Regression Model

• In some situations, nonlinear terms may be needed as independent variables in a regression analysis.
  • Business or economic logic may suggest that non-linearity is expected.
  • A graphic display of the data may be helpful in determining whether non-linearity is present.
• One common economic cause of non-linearity is diminishing returns.
  • For example, the effect of advertising on sales may diminish as increased advertising is used.
Extensions of the Multiple Regression Model

• Some common forms of nonlinear functions are:

  Y = β0 + β1(X) + β2(X²)
  Y = β0 + β1(X) + β2(X²) + β3(X³)
  Y = β0 + β1(1/X)
  Y = e^(β0) X^(β1)
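A hedged Python sketch (not from the original slides) of fitting the quadratic form above by adding a squared column to the design matrix; the DataFrame and column names are assumptions:

```python
# Hedged sketch: fitting the quadratic form Y = b0 + b1*X + b2*X^2 by adding a
# squared column to the design matrix. 'df' is a hypothetical DataFrame with X and Y.
import pandas as pd
import statsmodels.api as sm

def fit_quadratic(df):
    design = pd.DataFrame({"X": df["X"], "X_sq": df["X"] ** 2})
    return sm.OLS(df["Y"], sm.add_constant(design)).fit()
```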
Extensions of the Multiple Regression Model

• To illustrate the use and interpretation of a nonlinear term, we return to the problem of developing a forecasting model for private housing starts (PHS).
• So far we have looked at the following model:

  PHS = β0 + β1(MR) + β2(Q2) + β3(Q3) + β4(Q4) + ε

  where MR is the mortgage rate and Q2, Q3, and Q4 are indicator variables for quarters 2, 3, and 4.
Example: Private Housing Starts

• First we add real disposable personal income per capita (DPI) as an independent variable. Our new model for this data set is:

  PHS = β0 + β1(MR) + β2(Q2) + β3(Q3) + β4(Q4) + β5(DPI) + ε

• Regression results for this model are shown in the next slide.
Example: Private Housing Starts

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.943791346
R Square            0.890742104
Adjusted R Square   0.874187878
Standard Error      19.05542121
Observations        39

ANOVA
             df   SS            MS         F          Significance F
Regression   5    97690.01942   19538      53.80753   6.51194E-15
Residual     33   11982.59955   363.1091
Total        38   109672.619

            Coefficients    Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept   -31.06403714    105.1938477      -0.2953    0.769613   -245.0826992   182.9546249
MR          -20.1992545     4.124906847      -4.8969    2.5E-05    -28.59144723   -11.80706176
Q2          97.03478074     8.900711541      10.90191   1.78E-12   78.9261326     115.1434289
Q3          75.40017073     8.827185877      8.541813   7.17E-10   57.44111179    93.35922967
Q4          20.35306822     8.83373887       2.304015   0.027657   2.380677107    38.32545934
DPI         0.022407799     0.004356973      5.142974   1.21E-05   0.013543464    0.031272134
Example: Private Housing Starts

• The prediction model is

  PHS^ = −31.06 − 20.19(MR) + 97.03(Q2) + 75.40(Q3) + 20.35(Q4) + 0.02(DPI)

• In comparison with the previous model, we see that the R-squared has improved. It has changed from 78% to 89%.
• The standard error of the estimate has decreased from 26.49 for the previous model to 19.05 for the new model.
Example: Private Housing Starts

• The value of the DW statistic has changed from 0.88 for the previous model to 0.78 for the new model.
• At the 5% level the critical values for the DW test, from the Durbin-Watson table, for k = 5 and n = 39 are L = 1.22 and U = 1.79.
• Since the value of the DW statistic is smaller than L = 1.22, we reject the null hypothesis H0: ρ = 0.
• This implies that there is serial correlation in both models; the assumption of independence of the error terms is not valid.
Example: Private Housing Starts

• The plot of PHS against DPI shows a curvilinear relation.
• Next we introduce a nonlinear term into the regression: the square of disposable personal income per capita (DPI²) is included in the regression model.
[Figure: Private Housing Starts and Disposable Personal Income; scatter plot of PHS against DPI.]
Example: Private Housing Starts

• We also add the dependent variable, lagged one quarter (LPHS), as an independent variable in order to help reduce serial correlation.
• The third model that we fit to our data set is:

  PHS = β0 + β1(MR) + β2(Q2) + β3(Q3) + β4(Q4) + β5(DPI) + β6(DPI²) + β7(LPHS) + ε

• Regression results for this model are shown in the next slide; a small sketch of how the DPI² and LPHS columns can be constructed follows.
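A hedged Python sketch of constructing the squared and lagged terms for this third model; the 'phs' DataFrame and its column names are assumed to match the data table shown at the end of this example:

```python
# Hedged sketch: constructing the DPI-squared and lagged-PHS columns used in the
# third model. 'phs' is the quarterly DataFrame from earlier, with PHS, MR, Q2-Q4,
# and DPI columns.
import pandas as pd
import statsmodels.api as sm

phs["DPI_SQ"] = phs["DPI"] ** 2          # squared DPI term (the slides' DPI SQUARED column may use its own scaling)
phs["LPHS"] = phs["PHS"].shift(1)        # PHS lagged one quarter

model3 = phs.dropna(subset=["LPHS"])     # losing the first quarter leaves 39 observations
X = sm.add_constant(model3[["MR", "Q2", "Q3", "Q4", "DPI", "DPI_SQ", "LPHS"]])
fit3 = sm.OLS(model3["PHS"], X).fit()
```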
Example: Private Housing Starts

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.97778626
R Square            0.956065971
Adjusted R Square   0.946145384
Standard Error      12.46719572
Observations        39

ANOVA
             df   SS            MS            F          Significance F
Regression   7    104854.2589   14979.17985   96.37191   3.07085E-19
Residual     31   4818.360042   155.4309691
Total        38   109672.619

              Coefficients    Standard Error   t Stat         P-value    Lower 95%      Upper 95%
Intercept     716.5926532     1017.664989      0.704153784    0.486593   -1358.949934   2792.13524
MR            -13.65521724    3.093504134      -4.414158396   0.000114   -19.96446404   -7.345970448
Q2            106.9813297     6.069780998      17.62523718    1.04E-17   94.60192287    119.3607366
Q3            27.72122303     9.111432565      3.042465916    0.004748   9.138323433    46.30412262
Q4            -13.37855186    7.653050858      -1.748133144   0.09034    -28.98706069   2.22995698
DPI           -0.060399279    0.104412354      -0.578468704   0.567127   -0.273349798   0.15255124
DPI SQUARED   0.000335974     0.000536397      0.626354647    0.535668   -0.000758014   0.001429963
LPHS          0.655786939     0.097265424      6.742241114    1.51E-07   0.457412689    0.854161189
Example: Private Housing Starts

• The inclusion of DPI² and lagged PHS has increased the R-squared to 96%.
• The standard error of the estimate has decreased to 12.47.
• The value of the DW statistic has increased to 2.32, which is greater than U = 1.79 and therefore rules out positive serial correlation.
• You can see that the third model worked best for this data set.
• The following slide gives the data set.
Example: Private Housing Starts

PERIOD      PHS     MR        LPHS    Q2   Q3   Q4   DPI     DPI SQUARED
30-Jun-90   271.3   10.3372   217     1    0    0    18063   1,631,359.85
30-Sep-90   233     10.1033   271.3   0    1    0    18031   1,625,584.81
31-Dec-90   173.6   9.9547    233     0    0    1    17856   1,594,183.68
31-Mar-91   146.7   9.5008    173.6   0    0    0    17748   1,574,957.52
30-Jun-91   254.1   9.5265    146.7   1    0    0    17861   1,595,076.61
30-Sep-91   239.8   9.2755    254.1   0    1    0    17816   1,587,049.28
31-Dec-91   199.8   8.6882    239.8   0    0    1    17811   1,586,158.61
31-Mar-92   218.5   8.7098    199.8   0    0    0    18000   1,620,000.00
30-Jun-92   296.4   8.6782    218.5   1    0    0    18085   1,635,336.13
30-Sep-92   276.4   8.0085    296.4   0    1    0    18036   1,626,486.48
31-Dec-92   238.8   8.2052    276.4   0    0    1    18330   1,679,944.50
31-Mar-93   213.2   7.7332    238.8   0    0    0    17975   1,615,503.13
30-Jun-93   323.7   7.4515    213.2   1    0    0    18247   1,664,765.05
30-Sep-93   309.3   7.0778    323.7   0    1    0    18246   1,664,582.58
31-Dec-93   279.4   7.0537    309.3   0    0    1    18413   1,695,192.85
31-Mar-94   252.6   7.2958    279.4   0    0    0    18154   1,647,838.58
30-Jun-94   354.2   8.4370    252.6   1    0    0    18409   1,694,456.41
30-Sep-94   325.7   8.5882    354.2   0    1    0    18493   1,709,955.25
31-Dec-94   265.9   9.0977    325.7   0    0    1    18667   1,742,284.45
31-Mar-95   214.2   8.8123    265.9   0    0    0    18834   1,773,597.78
30-Jun-95   296.7   7.9470    214.2   1    0    0    18798   1,766,824.02
30-Sep-95   308.2   7.7012    296.7   0    1    0    18871   1,780,573.21
31-Dec-95   257.2   7.3508    308.2   0    0    1    18942   1,793,996.82
31-Mar-96   240     7.2430    257.2   0    0    0    19071   1,818,515.21
30-Jun-96   344.5   8.1050    240     1    0    0    19081   1,820,422.81
30-Sep-96   324     8.1590    344.5   0    1    0    19161   1,835,719.61
31-Dec-96   252.4   7.7102    324     0    0    1    19152   1,833,995.52
31-Mar-97   237.8   7.7905    252.4   0    0    0    19331   1,868,437.81
30-Jun-97   324.5   7.9255    237.8   1    0    0    19315   1,865,346.13
30-Sep-97   314.6   7.4692    324.5   0    1    0    19385   1,878,891.13
31-Dec-97   256.8   7.1980    314.6   0    0    1    19478   1,896,962.42
31-Mar-98   258.4   7.0547    256.8   0    0    0    19632   1,927,077.12
30-Jun-98   360.4   7.0938    258.4   1    0    0    19719   1,944,194.81
30-Sep-98   348     6.8657    360.4   0    1    0    19905   1,980,963.41
31-Dec-98   304.6   6.7633    348     0    0    1    20194   2,038,980.00
31-Mar-99   294.1   6.8805    304.6   0    0    0    20377   2,076,010.87
30-Jun-99   377.1   7.2037    294.1   1    0    0    20472   2,095,440.74
30-Sep-99   355.6   7.7990    377.1   0    1    0    20756   2,153,982.23
31-Dec-99   308.1   7.8338    355.6   0    0    1    21124   2,231,020.37