CHAPTER 5
PREDICTION, GOODNESS-OF-FIT AND MODELING ISSUES
1. Confidence Interval for the Mean Value of y for a Given Value x, and Prediction Interval for the Individual Value of y for a Given x
   1.1. Confidence Interval for the Mean Value of y for a Given Value x
   1.2. Prediction Interval for the Individual Value of y for a Given x
2. Reporting Regression Results
   2.1. Computer Output
   2.2. Reporting the Summary Results
3. The F-Test of Goodness of Fit
4. Modeling Issues
   4.1. The Effects of Scaling the Data
        4.1.1. Changing the scale of x
        4.1.2. Changing the scale of y
        4.1.3. Changing the scale of x and y by the same factor c
5. Choosing a Functional Form
   Linear Functional Form
   Linear-Log (Semi-log) Functional Form
   Log-Log Functional Form
6. Examples for Functional Forms
7. Prediction in the Log-Linear Model
1. Prediction Interval for the Individual Value of π’š for a Given 𝒙
In Chapter 4, Section 3.1, we learned how to build an interval estimate for the mean value of the dependent
variable y for a given or specified value of the independent variable x. For that we used the example of weekly
food expenditure as a function of weekly income.
𝐹𝑂𝑂𝐷𝐸𝑋𝑃 = 𝛽1 + 𝛽2 𝐼𝑁𝐢𝑂𝑀𝐸 + 𝑒
There we obtained the estimated regression equation, 𝑦̂ = 𝑏1 + 𝑏2 π‘₯,
𝑦̂ = 83.416 + 10.21π‘₯
and built an interval estimate for 𝑦̂ when π‘₯ = 10 (weekly income = $1,000). Since 𝑦̂ is a point on the
estimated regression line, it represents an estimate of the mean value 𝑦, food expenditure, in the population
for the given weekly income. Therefore, when we build an interval estimate for 𝑦̂, we are building an interval
estimate for the mean or expected value of 𝑦 for a given π‘₯: πœ‡π‘¦|π‘₯0 = E(𝑦|π‘₯0 ).
We know, however, for each value of π‘₯ in the population, there are many different values of 𝑦 corresponding
to that π‘₯. How do we build an interval estimate for an individual value, rather than the mean value, of 𝑦? Let
π‘₯0 denote the given value of π‘₯. Then in the population regression function, 𝑦0 is an individual value of 𝑦 for a
given value of π‘₯, which deviates from the mean value by the disturbance term 𝑒0 ,
𝑦0 = E(𝑦|π‘₯0 ) + 𝑒0 = 𝛽1 + 𝛽2 π‘₯0 + 𝑒0
When we obtain the estimated sample regression equation, the observed value of 𝑦 for each value of π‘₯
deviates from the predicted value by the prediction error 𝑒.
𝑦0 = 𝑦̂0 + 𝑒0
𝑦0 = 𝑏1 + 𝑏2 π‘₯0 + 𝑒0
The objective is now to build an interval estimate for this y₀. The interval estimate takes the form
L, U = ŷ₀ ± t_(α/2,(n−2)) se(y₀)
Here we need a point predictor for y₀ and its standard error se(y₀). The best point predictor of y₀ is the corresponding point on the estimated regression line, ŷ₀, so the prediction interval is built around ŷ₀ from the sample. To determine se(y₀), first find var(y₀) and then take its square root.
var(𝑦0 ) = var(𝑦̂0 + 𝑒0 )
Given the assumption of independence of 𝑦 values and the disturbance term, and the assumption of
homoscedasticity, we can write,
var(𝑦0 ) = var(𝑦̂0 ) + var(𝑒)
var(𝑦0 ) = var(𝑏1 + 𝑏2 π‘₯0 ) + var(𝑒)
var(𝑦0 ) = var(𝑏1 ) + π‘₯02 var(𝑏2 ) + 2π‘₯0 cov(𝑏1 , 𝑏2 ) + var(𝑒)
The interval estimate, or the prediction interval for an individual value of 𝑦 is then,
𝐿, π‘ˆ = 𝑦̂0 ± 𝑑𝛼⁄2,(π‘›βˆ’2) se(𝑦0 )
For comparison, let’s build an interval estimate for the mean value of food expenditure for the weekly income
of $2,000 (π‘₯0 = 20) and a prediction interval for the individual value of food expenditure for the same weekly
income.
The estimated regression equation is
𝑦̂ = 83.416 + 10.21π‘₯
For π‘₯0 = 20, 𝑦̂ = 83.416 + 10.21(20) = 287.61.
The variance of error term is
var(𝑒) = 8013.294
Using the inverse matrix 𝑋 βˆ’1 , we can obtain the covariance matrix:
covariance matrix = var(e)·X⁻¹
= 8013.294 × [0.2352, −0.0107; −0.0107, 0.00055]
= [var(b₁), cov(b₁, b₂); cov(b₁, b₂), var(b₂)]
= [1884.442, −85.903; −85.903, 4.382]
Confidence Interval for the Mean Value of y
L, U = ŷ₀ ± t_(α/2, df) se(ŷ₀)
var(ŷ₀) = var(b₁ + b₂x₀) = var(b₁) + x₀²var(b₂) + 2x₀cov(b₁, b₂)
var(ŷ₀) = 201.017
se(ŷ₀) = 14.178
t(0.025, 38) = 2.024
MOE = (2.024)(14.178) = 28.70
L = 287.61 − 28.70 = 258.91
U = 287.61 + 28.70 = 316.31

Prediction Interval for the Individual Value of y
L, U = ŷ₀ ± t_(α/2, df) se(y₀)
var(y₀) = var(ŷ₀) + var(e)
var(y₀) = 201.017 + 8013.29 = 8214.31
se(y₀) = 90.633
t(0.025, 38) = 2.024
MOE = (2.024)(90.633) = 183.48
L = 287.61 − 183.48 = 104.13
U = 287.61 + 183.48 = 471.09
Note the chart below. The bands representing prediction intervals for individual values of 𝑦 are wider than
the bands for the confidence intervals for mean values of 𝑦. Also both bands are narrower in the middle. The
further away the π‘₯ values are from the π‘₯Μ… , the bigger the squared deviation (π‘₯ βˆ’ π‘₯Μ… )2, hence the bigger the
variance. The greater the variance, the less reliable is the prediction.
[Figure: the estimated regression line for the food expenditure example, with confidence-interval bands for the mean value of y and wider prediction-interval bands for individual values of y, plotted over weekly income from 0 to 35.]
2. Coefficient of Determination, R2
The closeness of fit of the regression line (to the scatter plot) is a measure of the closeness of the relationship
between π‘₯ and 𝑦. The less scattered the observed 𝑦 values are around the regression line, the closer the
relationship between π‘₯ and 𝑦. As explained above, se(𝑒) is such a measure of the fit. However, se(𝑒) has a
major drawback. It is an absolute measure and, therefore, is affected by the absolute size of the data. The
larger the values of the data set, the larger the 𝑠𝑒(𝑒).
To explain this drawback, consider the data in the food expenditure example. Suppose the dependent
variable data, weekly food expenditure figures, were also in $100s. That is, for example, instead of showing
the weekly food expenditure as $155, we show it as $1.55. As the calculations in the tab β€œfood2” in the Excel
file "CH5 DATA" show, the standard error of estimate is reduced from 89.517 to 0.895. This reduction in
se(e) is due solely to the change in the scale of the dependent variable data.
This should make it clear that using se(𝑒) as a measure of closeness of the fit suffers from the misleading
impact of the absolute size or scale of the data used in the model. An alternative measure of the closeness of
fit, which is not affected by the scale of the data, is the coefficient of determination, denoted by R² (R-square).
R-square is a relative measure. It is, therefore, not affected by the scale of data. It measures the proportion of
total variations in 𝑦 explained by the regression (that is, by π‘₯). Basically, 𝑅2 involves the comparison of the
variations or deviations of the observed 𝑦 around the regression (𝑦̂) line against the variations of the same 𝑦
values around the mean (𝑦̅) line. The diagram below shows the comparison of these deviations.
[Figure: scatter plot of the observed y values with the estimated regression line ŷ and the horizontal mean line ȳ, for x from 0 to 35.]
Mathematically, R² is the proportion of the total squared deviation of the y values from ȳ that is explained by the
regression (𝑦̂) line. To understand this statement consider the following diagram.
[Figure: for the selected observation at x = 27.14, the observed value is y = 483, the predicted value on the regression line is ŷ = 361, and the mean is ȳ = 284, illustrating the total, explained, and unexplained deviations.]
In the diagram, the horizontal line represents the mean of all the observed 𝑦 values: 𝑦̅ = 284. The regression
line is represented by the regression equation 𝑦̂ = 83.416 + 10.21π‘₯. A single observed value of 𝑦 = 483 for a
given π‘₯ = 27.14 value is selected. The vertical distance between this 𝑦 value and 𝑦̅ is called β€œtotal deviation”.
π‘‡π‘œπ‘‘π‘Žπ‘™ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘› = 𝑦 βˆ’ 𝑦̅ = 483 βˆ’ 284 = 199
The vertical distance between ŷ on the regression line and ȳ is called the "explained deviation".
𝐸π‘₯π‘π‘™π‘Žπ‘–π‘›π‘’π‘‘ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘› = 𝑦̂ βˆ’ 𝑦̅ = 361 βˆ’ 284 = 77
As the diagram indicates, clearly this portion of the total deviation is due to (or explained by) the regression
model. That is, this deviation is explained by the independent variable π‘₯.
The vertical distance between 𝑦 and 𝑦̂, the residual 𝑒, is called β€œunexplained deviation”.
π‘ˆπ‘›π‘’π‘₯π‘π‘™π‘Žπ‘–π‘›π‘’π‘‘ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘› = 𝑦 βˆ’ 𝑦̂ = 483 βˆ’ 361 = 122
Thus,
π‘‡π‘œπ‘‘π‘Žπ‘™ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘› = 𝐸π‘₯π‘π‘™π‘Žπ‘–π‘›π‘’π‘‘ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘› + π‘ˆπ‘›π‘’π‘₯π‘π‘™π‘Žπ‘–π‘›π‘’π‘‘ π·π‘’π‘£π‘–π‘Žπ‘‘π‘–π‘œπ‘›
(𝑦 βˆ’ 𝑦̅) = (𝑦̂ βˆ’ 𝑦̅) + (𝑦 βˆ’ 𝑦̂)
199 = 77 + 122
Repeating the same process for all values of 𝑦, squaring the resulting deviations, and summing the squared
values, we have the following sums of squared deviations:
1. Sum of Squared Total Deviations, Sum of Squares Total (SST): Σ(y − ȳ)²
2. Sum of Squared Explained Deviations, Sum of Squares Regression (SSR): Σ(ŷ − ȳ)²
3. Sum of Squared Unexplained Deviations, Sum of Squares Error (SSE): Σe² = Σ(y − ŷ)²
It can be shown that
Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)²   (see footnote 1)

Footnote 1: Start with (y − ȳ) = (ŷ − ȳ) + (y − ŷ). Square both sides,
(y − ȳ)² = (ŷ − ȳ)² + (y − ŷ)² + 2(y − ŷ)(ŷ − ȳ)
and sum:
Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)² + 2Σe(ŷ − ȳ)
We must show that Σe(ŷ − ȳ) = 0:
Σe(ŷ − ȳ) = Σeŷ − ȳΣe = Σeŷ,   since Σe = 0
Now show that Σeŷ = 0:
Σeŷ = Σe(b₁ + b₂x) = Σe(ȳ − b₂x̄ + b₂x) = ȳΣe + b₂Σe(x − x̄) = b₂Σex − b₂x̄Σe = b₂Σex
Σex = Σx(y − ŷ) = Σx(y − b₁ − b₂x)
The right-hand side is the normal equation obtained in the development of the least squares coefficients, ∂Σe²/∂b₂ = −2Σx(y − b₁ − b₂x) = 0, so Σex = 0. With this, Σeŷ = 0 and Σe(ŷ − ȳ) = 0, and hence
Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)²
That is,
𝑆𝑆𝑇 = 𝑆𝑆𝑅 + 𝑆𝑆𝐸
See the Excel file β€œCH5 DATA” tab β€œRSQ” for the calculations of sum of squares.
Σ(y − ȳ)² = 495132.16
Σ(ŷ − ȳ)² = 190626.98
Σ(y − ŷ)² = 304505.18
Note that:
495132.16 = 190626.98 + 304505.18
As stated at the beginning of this discussion, 𝑅2 measures the proportion of total deviations in y explained by
the regression. Thus,
R² = Σ(ŷ − ȳ)² / Σ(y − ȳ)² = SSR/SST
For our example:
R² = SSR/SST = 190626.98/495132.16 = 0.385
Also note:
SSE/SST = 304505.18/495132.16 = 0.615
Thus, when R² = 0.385, 38.5 percent of the variations or deviations in y, food expenditure, are explained by
the regression model, that is, by the independent variable x, weekly income. The remaining 61.5 percent of the
variations are due to other, unexplained factors. Note that if all the variations in y were explained by income,
then R² = 1. Thus, the values of R² vary from 0 to 1:
0 ≀ 𝑅2 ≀ 1
Also note that the value of 𝑅2 is not affected by the scale of the data. You can check this in the model using
Excel with the food expenditure figures in hundreds of dollars.
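The decomposition SST = SSR + SSE and the resulting R² can be checked with a few lines of Python. The arrays below are made-up illustration values, not the food expenditure data in "CH5 DATA"; only the formulas follow the text.

```python
# Sketch of the sum-of-squares decomposition and R-square for a simple regression.
import numpy as np

x = np.array([3.7, 5.0, 8.0, 10.0, 12.0, 15.0, 20.0, 25.0, 28.0, 33.4])
y = np.array([120., 150., 200., 210., 260., 300., 310., 390., 410., 500.])

b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b1 = y.mean() - b2 * x.mean()
y_hat = b1 + b2 * x

sst = np.sum((y - y.mean())**2)      # total deviations
ssr = np.sum((y_hat - y.mean())**2)  # explained by the regression
sse = np.sum((y - y_hat)**2)         # unexplained (residual)

print(np.isclose(sst, ssr + sse))    # True: SST = SSR + SSE
print(ssr / sst, 1 - sse / sst)      # both equal R-square
```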
3. Correlation Analysis
In Chapter 1 and Chapter 2 the concept of covariance was explained as a measure of the extent of association
between two variables π‘₯ and 𝑦, and Οƒπ‘₯𝑦 was used as symbol for the population covariance and 𝑠π‘₯𝑦 for the
sample covariance. It was also explained that to avoid the distorting impact of the scale of the data on
covariance, the correlation coefficient was obtained by dividing the covariance by the product of the standard
deviations of x and y:
βˆ‘π‘’π‘¦Μ‚ = 𝑏2 βˆ‘π‘’π‘₯
βˆ‘π‘’π‘₯ = βˆ‘π‘₯(𝑦 βˆ’ 𝑦̂)
βˆ‘π‘’π‘₯ = βˆ‘π‘₯(𝑦 βˆ’ 𝑏1 βˆ’ 𝑏2 π‘₯)
The right-hand-side above is the normal equation obtained in development of the least squares coefficients.
πœ•βˆ‘π‘’ 2 β„πœ•π‘2 = βˆ’2βˆ‘π‘₯(𝑦 βˆ’ 𝑏1 βˆ’ 𝑏2 π‘₯) = 0
With this, then βˆ‘π‘’(𝑦̂ βˆ’ 𝑦̅) = 0, and hence
βˆ‘(𝑦 βˆ’ 𝑦̅)2 = βˆ‘(𝑦̂ βˆ’ 𝑦̅)2 + βˆ‘(𝑦̂ βˆ’ 𝑦̅)2
5-Prediction, Goodness-of-Fit, and Modeling Issues
6 of 27
ρ = σ_xy / (σ_x σ_y)        (population correlation coefficient)
r = s_xy / (s_x s_y)        (sample correlation coefficient)
In the sample formula,
s_xy = Σ(x − x̄)(y − ȳ) / (n − 1)
s_x = √[Σ(x − x̄)² / (n − 1)]
s_y = √[Σ(y − ȳ)² / (n − 1)]
From which we obtain,
r_xy = Σ(x − x̄)(y − ȳ) / [√Σ(x − x̄)² √Σ(y − ȳ)²]
Since π‘Ÿ is a relative measure, then
βˆ’1 ≀ π‘Ÿ ≀ 1
The closer the coefficient of correlation is to βˆ’1 or 1, the stronger the association between the variations in 𝑦
and the variations in π‘₯.
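As a quick sketch, the sample correlation formula can be checked against numpy's built-in corrcoef; the data are the same illustration arrays used earlier, not the chapter's data.

```python
# Sample correlation coefficient from its formula, compared with numpy's corrcoef.
import numpy as np

x = np.array([3.7, 5.0, 8.0, 10.0, 12.0, 15.0, 20.0, 25.0, 28.0, 33.4])
y = np.array([120., 150., 200., 210., 260., 300., 310., 390., 410., 500.])

r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
print(r, np.corrcoef(x, y)[0, 1])   # identical values, always between -1 and 1
```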
3.1. The Relationship Between R² and r
We can show that the coefficient of determination in regression, R², which shows how closely the variations
in the dependent variable 𝑦 are associated with the variations in the explanatory variable π‘₯, is equal to
correlation coefficient squared.
R² = Σ(ŷ − ȳ)² / Σ(y − ȳ)² = r²_xy   (see footnote 2)
Footnote 2: In the numerator of R², substitute ŷ = b₁ + b₂x, and then b₁ = ȳ − b₂x̄:
Σ(ŷ − ȳ)² = Σ(b₁ + b₂x − ȳ)² = Σ(ȳ − b₂x̄ + b₂x − ȳ)² = Σ[b₂(x − x̄)]² = b₂²Σ(x − x̄)²
so that
R² = b₂²Σ(x − x̄)² / Σ(y − ȳ)²
Using
b₂ = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
and substituting for b₂² in the numerator above, we have
R² = [Σ(x − x̄)(y − ȳ)]² Σ(x − x̄)² / ([Σ(x − x̄)²]² Σ(y − ȳ)²)
R² = [Σ(x − x̄)(y − ȳ)]² / (Σ(x − x̄)² Σ(y − ȳ)²) = r²_xy
3.2. Another Point About R2
When it is said that 𝑅2 is a measure of β€œgoodness of fit”, this simply refers to the correlation between the
observed and predicted value of 𝑦. This correlation can be expressed as π‘Ÿπ‘¦π‘¦Μ‚ . Like any other measure of
correlation between two variables,
π‘Ÿπ‘¦π‘¦Μ‚ =
βˆ‘(𝑦 βˆ’ 𝑦̅)(𝑦̂ βˆ’ 𝑦̅̂ )
βˆšβˆ‘(𝑦 βˆ’ 𝑦̅)2 βˆ‘(𝑦̂ βˆ’ 𝑦̅̂ )2
That π‘Ÿπ‘¦π‘₯ = π‘Ÿπ‘¦π‘¦Μ‚ is easily explained by the fact that 𝑦̂ is a linear transformation of the variable π‘₯: 𝑦̂ = 𝑏1 + 𝑏2 π‘₯.
Therefore, the correlation between 𝑦 and π‘₯ is the same as the correlation between 𝑦 and the linear
2
2
transformation of π‘₯. Also, it is easily proved mathematically that It can be shown that π‘Ÿπ‘¦π‘¦
Μ‚ = 𝑅 (see
footnote3),
2
π‘Ÿπ‘¦π‘¦
Μ‚
2
[βˆ‘(𝑦 βˆ’ 𝑦̅)(𝑦̂ βˆ’ 𝑦̅̂ )]
βˆ‘(𝑦̂ βˆ’ 𝑦̅)2
=
=
= 𝑅2
βˆ‘(𝑦 βˆ’ 𝑦̅)2 βˆ‘(𝑦̂ βˆ’ 𝑦̅̂ )2 βˆ‘(𝑦 βˆ’ 𝑦̅)2
Thus, 𝑅2 is also a measure of how well the estimated regression fits the data.
4. Reporting Regression Results
4.1. Computer Output
Several statistical software packages are available to produce regression results. We will use Excel's
regression output for illustration. In Excel, we find Regression under Tools, Data Analysis. Following the simple
instructions in the drop box presented by Excel, the following output is generated for the food expenditure
example:
SUMMARY OUTPUT

Regression Statistics
  Multiple R           0.6205
  R Square             0.3850
  Adjusted R Square    0.3688
  Standard Error       89.517
  Observations         40

ANOVA
              df    SS           MS          F        Significance F
  Regression   1    190626.98    190627      23.789   1.95E-05
  Residual    38    304505.18    8013.294
  Total       39    495132.16

             Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
  Intercept  83.416         43.410           1.922    0.0622    -4.463      171.295
  income     10.210         2.093            4.877    0.0000    5.972       14.447
The following table contains all the different symbols and formulas that are used in generating this output:

SUMMARY OUTPUT

Regression Statistics
  Multiple R           Not relevant to simple regression
  R Square             R² = SSR/SST
  Adjusted R Square    Not relevant to simple regression
  Standard Error       se(e) = √(Σe²/(n − 2)) = √(Σ(y − ŷ)²/(n − 2)) ≡ √(SSE/(n − 2)) = √MSE
  Observations         n

ANOVA*
              df**    SS                  MS                  F*             Significance F
  Regression  k − 1   SSR = Σ(ŷ − ȳ)²     MSR = SSR/(k − 1)   F = MSR/MSE    Tail area of F distribution
  Residual    n − k   SSE = Σ(y − ŷ)²     MSE = SSE/(n − k)
  Total       n − 1   SST = Σ(y − ȳ)²

               Coefficients   Standard Error†   t Stat              P-value      Lower 95%††      Upper 95%
  Intercept    b₁             se(b₁)            |t| = b₁/se(b₁)     P(t > |t|)   L = b₁ − MOE₁    U = b₁ + MOE₁
  X Variable 1 b₂             se(b₂)            |t| = b₂/se(b₂)     P(t > |t|)   L = b₂ − MOE₂    U = b₂ + MOE₂

Notes:
*   ANOVA and the F distribution are explained below.
**  k denotes the number of parameters estimated. Here, k = 2.
†   se(b₁) = se(e)√(Σx²/(nΣ(x − x̄)²));   se(b₂) = se(e)/√Σ(x − x̄)²
††  MOE₁ = t_(α/2,(n−2)) se(b₁);   MOE₂ = t_(α/2,(n−2)) se(b₂)
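Excel is only one way to generate this output. As an illustration, the sketch below produces the same kinds of quantities with Python's statsmodels package; the income and food_exp arrays are placeholders rather than the actual "CH5 DATA" values, so the printed numbers will not match the table above.

```python
# Sketch of reproducing the regression summary output with statsmodels.
import numpy as np
import statsmodels.api as sm

income = np.array([3.7, 5.0, 8.0, 10.0, 12.0, 15.0, 20.0, 25.0, 28.0, 33.4])
food_exp = np.array([120., 150., 200., 210., 260., 300., 310., 390., 410., 500.])

X = sm.add_constant(income)          # adds the intercept column
results = sm.OLS(food_exp, X).fit()

print(results.summary())             # coefficients, standard errors, t stats, F, p-values
print(results.rsquared)              # R Square
print(np.sqrt(results.mse_resid))    # Standard Error, i.e. se(e) = sqrt(MSE)
print(results.conf_int(alpha=0.05))  # Lower 95% / Upper 95% bounds
```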
4.2. Reporting the Summary Results
In many cases, rather than providing the whole computer output, the regression output is reported in a
summary form. The following are two ways in which summary results are reported.

ŷ = 83.416 + 10.210x        R² = 0.385
    (43.41)   (2.093)       (s.e.)
This summary report provides the value of the standard error of the regression coefficients, se(𝑏1 ) and
se(𝑏2 ). This information allows us to obtain the confidence intervals for the parameters of the regression.
You just need to compute the 𝑀𝑂𝐸 using 𝑑α⁄2,(π‘›βˆ’π‘˜) and the standard error of each coefficient. You can also
divide the coefficient value by its standard error to obtain the test statistic for the hypothesis test about the
parameter.
Alternatively, the summary result is reported as follows:
ŷ = 83.416 + 10.210x        R² = 0.385
    (1.922)   (4.877)       (t)
Here, you can use the 𝑑 stat and either compute the probability value (you must use a computer) or compare
it to the critical 𝑑: 𝑑α⁄2,(π‘›βˆ’π‘˜) to test for the null hypothesis, 𝐻0 : 𝛽2 = 0.
5. The 𝑭 Test of Goodness of Fit
The goodness of fit of regression is measured by 𝑅2 .
R² = Σ(ŷ − ȳ)² / Σ(y − ȳ)²
The more the observed values of y are clustered around the regression line, the better the goodness of fit, or
the greater the linear association between the dependent variable 𝑦 and the explanatory variable π‘₯. Also, we
saw above that 𝑅2 can also be computed as the square of the correlation coefficient π‘Ÿ: 𝑅2 = π‘Ÿ 2
π‘Ÿ=
βˆ‘(π‘₯ βˆ’ π‘₯Μ… )(𝑦 βˆ’ 𝑦̅)
βˆšβˆ‘(π‘₯ βˆ’ π‘₯Μ… )2 βˆšβˆ‘(𝑦 βˆ’ 𝑦̅)2
π‘Ÿ2 =
[βˆ‘(π‘₯ βˆ’ π‘₯Μ… )(𝑦 βˆ’ 𝑦̅)]2
= 𝑅2
βˆ‘(π‘₯ βˆ’ π‘₯Μ… )2 βˆ‘(𝑦 βˆ’ 𝑦̅)2
Theoretically, two variables π‘₯ and 𝑦 are independent if the population correlation coefficient 𝜌 is zero.
Within the simple linear regression context, absence of linear relationship between π‘₯ and 𝑦 would imply the
slope parameter 𝛽2 is zero. Thus, none of the total deviation of 𝑦 from the mean 𝑦 would be accounted for by
the regression. As explained, in the sample, 𝑅2 measures this deviation relative to the total. However, even if
there is no relationship between π‘₯ and 𝑦 in the population, the probability that an 𝑅2 computed from a
random sample would be zero is practically nil. Therefore, 𝑅2 will always be a number greater than zero.
In simple regression analysis, therefore, to conclude that there is a relationship between π‘₯ and 𝑦, we perform
the test of hypothesis that
𝐻0 : 𝛽2 = 0
versus
𝐻1 : 𝛽2 β‰  0
This has already been done using 𝑏2 as the test statistic and performing the β€œt test”. We may consider 𝑅2 , in a
way, an alternative test statistic testing the same hypothesis. If 𝑅2 is significantly different from zero, then we
will reject the above null hypothesis. To determine if 𝑅2 is significantly different from zero we need a critical
value, for a given significance level Ξ±, to which we compare the test statistic. The problem here is that there is
no statistical critical value directly related to 𝑅2 .
𝑅2 is obtained as a ratio of two squared deviations, 𝑆𝑆𝑅 over 𝑆𝑆𝑇, and, as such, it does not generate any
probability distribution such as 𝑍, 𝑇, or Chi-square. The way around this problem is the indirect approach of
measuring the mean 𝑆𝑆𝑅 relative to the mean 𝑆𝑆𝐸. This way we are comparing two measures of variance of
𝑦—variance due to regression versus variance due to unexplained factors. Hence the term ANOVAβ€”analysis
of variance. If explained deviations outweigh the unexplained deviations, then the variance measures in the
numerator of the variance ratio will be greater than that in the denominator. Since the variance ratio is that
of squared terms, the ratio measure is always positive. The larger the variation due to the regression, the
further the ratio rises above 1, indicating a better fit.
To obtain any variance measure from sample data we divide the sum of square deviations by the degrees of
freedom to determine the mean square. The two mean squares in the regression ANOVA are the mean square
regression (𝑀𝑆𝑅) and mean square error (𝑀𝑆𝐸):
MSR = SSR/df = Σ(ŷ − ȳ)²/(k − 1)
where k is the number of the parameters in the regression, 𝛽1 and 𝛽2 .
MSE = SSE/df = Σ(y − ŷ)²/(n − k)
The ratio of 𝑀𝑆𝑅 to 𝑀𝑆𝐸 is called the 𝐹-ratio.
F = MSR/MSE
The 𝐹-ratio is a test statistic with a specific probability distribution called the 𝐹 distribution.4 The 𝐹
distribution is the ratio of two independent Chi-square random variables each divided by its own degrees of
freedom, 𝑑1 being the degrees of freedom of the numerator and 𝑑2 that of the denominator:
F(d₁, d₂) = (χ₁²/d₁) / (χ₂²/d₂)
The F distribution is used in testing the equality of two population variances, as is being done in the case
being discussed here.
𝐹(π‘˜βˆ’1,π‘›βˆ’π‘˜) =
βˆ‘(𝑦̂ βˆ’ 𝑦̅)2 ⁄(π‘˜ βˆ’ 1)
βˆ‘(𝑦 βˆ’ 𝑦̂)2 ⁄(𝑛 βˆ’ π‘˜)
The numerator variance measure represents the average squared deviation of predicted values from the
mean of 𝑦. This is the mean square deviation explained by the regression. The denominator is the mean
square deviation of the observed values from the regression lineβ€”the unexplained mean square deviation.
To observe the impact of these mean squares on the value of 𝐹 consider the following two models and
corresponding figures. In (𝐴) test scores on an exam are observed against the explanatory variable hours
studied. In (𝐡) the same test scores are observed against randomly generated numbers as the explanatory
variable.
Footnote 4: Named after the English statistician Sir Ronald A. Fisher.
(A) Score against Hours Studied; (B) the same Scores against Random Numbers

Data:
  Score:               56   52   72   56   92   72   88   96   80   100
  Hours studied (A):   1.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  7.0
  Random numbers (B):  15   63   42   51   85   93   32   43   31   65

Regression results:
                    (A) Hours Studied    (B) Random Numbers
  R²                0.6564               0.0329
  b₂                8.0140               0.1299
  |t|               3.9091               0.5213
  prob value        0.0045               0.6163
  MSR               1836.8056            91.9413
  MSE               120.1993             338.3073
  F                 15.2813              0.2718
  Significance F    0.0045               0.6163
The regression model (𝐴) shows the relationship between test scores and the hours studied. Model (B)
regresses the same test scores against a set of numbers randomly selected from 1-100. Note the regression
line in Model (B) is practically flat, with a slope of only about 0.13. Correspondingly, the |t| statistic for H₀: β₂ = 0
results in a probability value of 0.6163, leading us to conclude convincingly that the population slope
parameter is zero.
Now pay attention to MSR. The regression line in panel (B) of the diagram below is very close to the ȳ line,
leaving very little room for the deviations ŷ − ȳ, and thus making MSR = Σ(ŷ − ȳ)²/(k − 1) very small relative
to MSE = Σ(y − ŷ)²/(n − k). The F statistic is hence the small value 91.9413/338.3073 = 0.2718. The probability
value (the tail area under the F curve to the right of 0.2718, computed with =F.DIST.RT(0.2718,1,8) in Excel)
is 0.6163, clearly leading us not to reject H₀: β₂ = 0.
In contrast, the regression line in panel (𝐴) indicates a pronounced slope, making the deviations 𝑦̂ βˆ’ 𝑦̅, hence
𝑀𝑆𝑅, relatively significant compared to 𝑀𝑆𝐸. The 𝐹 statistic is thus a large value 1836.8056⁄120.1993 =
15.2813. The probability value is 0.0045, clearly leading us to reject 𝐻0 : 𝛽2 = 0.
[Figure: panel (A) plots score against hours studied; the fitted line ŷ has a clear positive slope relative to the flat ȳ line. Panel (B) plots score against the random numbers; the fitted line ŷ is nearly flat and lies close to ȳ.]
After explaining all this about the F test and 𝐴𝑁𝑂𝑉𝐴, it would sound quite anticlimactic to say that in simple
regression we need not perform the F test at all because it is redundant! With careful attention you would
recognize that the F statistic with the numerator degrees of freedom k – 1 is the same value as t statistic
squared:
𝐹 = 𝑑 2 = (3.9091)2 = 15.2813
And,
𝑝­π‘£π‘Žπ‘™π‘’𝑒 = π‘†π‘–π‘”π‘›π‘–π‘“π‘–π‘π‘Žπ‘›π‘π‘’ 𝐹 = 0.0045
See footnote 5 for the mathematical proof that F = t². Note, however, that the F test plays a different and
important role in statistical inference in multiple regression, to be taken up in later chapters.

Footnote 5: Show that
F = MSR/MSE = b₂²/var(b₂) = t²
MSR = Σ(ŷ − ȳ)²   (note: df = k − 1 = 1)
MSR = Σ(b₁ + b₂x − ȳ)² = Σ(ȳ − b₂x̄ + b₂x − ȳ)² = Σ(b₂x − b₂x̄)² = b₂²Σ(x − x̄)²
MSE = var(e), and since var(b₂) = var(e)/Σ(x − x̄)², we have MSE = var(e) = var(b₂)Σ(x − x̄)².
Thus,
F = b₂²Σ(x − x̄)² / [var(b₂)Σ(x − x̄)²] = b₂²/var(b₂)
Note that the test statistic for H₀: β₂ = 0 is t = b₂/se(b₂). Therefore,
F = b₂²/var(b₂) = t²
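The redundancy of the F test in simple regression is easy to verify numerically. The sketch below refits model (A), score against hours studied, using the data listed earlier in this section, and confirms that F equals the squared t statistic and that Significance F equals the slope's p-value. The use of statsmodels is our choice of tool, not the chapter's.

```python
# Quick check of F = t^2 for the Score vs. Hours Studied data in panel (A).
import numpy as np
import statsmodels.api as sm

hours = np.array([1.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 7.0])
score = np.array([56, 52, 72, 56, 92, 72, 88, 96, 80, 100], dtype=float)

res = sm.OLS(score, sm.add_constant(hours)).fit()
t_slope = res.tvalues[1]             # t statistic for H0: beta2 = 0
print(res.fvalue, t_slope**2)        # both about 15.28
print(res.f_pvalue, res.pvalues[1])  # Significance F equals the slope's p-value, about 0.0045
```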
6. Modeling Issues
6.1. The Effects of Scaling the Data
6.1.1. Changing the scale of x
Consider the general form of the estimated simple linear regression equation: ŷ = b₁ + b₂x. We want to find
out what happens to the regression model if we change the scale of x by multiplying it by a constant c. First,
determine the impact on the slope coefficient b₂. Denote the resulting new coefficient as b₂*.

Impact on b₂:  b₂* = b₂/c
b₂ = (Σxy − n x̄ ȳ) / (Σx² − n x̄²)
Multiply x by a constant c. Let b₂* be the resulting new coefficient. Then,
b₂* = [Σ(cx)y − n(cx̄)ȳ] / [Σ(cx)² − n(cx̄)²] = c(Σxy − n x̄ ȳ) / [c²(Σx² − n x̄²)] = b₂/c
Thus, when π‘₯ is scaled by a constant 𝑐, the new slope coefficient is equal to the pre-scaled coefficient divided
by 𝑐.
Impact on π’ƒπŸ :
𝑏1βˆ— = 𝑏1
𝑏1 = 𝑦̅ βˆ’ 𝑏2 π‘₯Μ…
𝑏1βˆ— = 𝑦̅ βˆ’ 𝑏2βˆ— (𝑐π‘₯Μ… ) = 𝑦̅ βˆ’ (𝑏2 ⁄𝑐 )(𝑐π‘₯Μ… ) = 𝑦̅ βˆ’ 𝑏2 π‘₯Μ… = 𝑏1
Thus, scaling π‘₯ does not change the intercept.
Μ‚
Impact on the predicted values π’š
𝑦̂ βˆ— = 𝑦̂
𝑦̂ = 𝑏1 + 𝑏2 π‘₯
𝑦̂ βˆ— = 𝑏1 + 𝑏2βˆ— (𝑐π‘₯) = 𝑏1 + (𝑏2 ⁄𝑐 )(𝑐π‘₯Μ… ) = 𝑏1 + 𝑏2 π‘₯ = 𝑦̂
There is no impact.
Impact on 𝐯𝐚𝐫(𝒆):
var(𝑒) =
var(𝑒 βˆ— ) = var(𝑒)
βˆ‘(𝑦 βˆ’ 𝑦̂)2
π‘›βˆ’2
var(𝑒 βˆ— ) =
βˆ‘(𝑦 βˆ’ 𝑦̂ βˆ— )2 βˆ‘(𝑦 βˆ’ 𝑦̂)2
=
= var(𝑒)
π‘›βˆ’2
π‘›βˆ’2
There is no impact.
Impact on π‘ΉπŸ :
𝑅2 =
(𝑅2 )βˆ— = 𝑅2
𝑆𝑆𝑅 βˆ‘(𝑦̂ βˆ’ 𝑦̅)2
=
𝑆𝑆𝑇 βˆ‘(𝑦 βˆ’ 𝑦̅)2
(𝑅2 )βˆ— =
𝑆𝑆𝑅 βˆ‘(𝑦̂ βˆ— βˆ’ 𝑦̅)2 βˆ‘(𝑦̂ βˆ’ 𝑦̅)2
=
=
= 𝑅2
𝑆𝑆𝑇
βˆ‘(𝑦 βˆ’ 𝑦̅)2
βˆ‘(𝑦 βˆ’ 𝑦̅)2
There is no impact.
Impact on 𝐯𝐚𝐫(π’ƒπŸ ):
var(𝑏2βˆ— ) =
1
𝑐2
var(𝑏2 )
𝑏2
1
var(𝑏2βˆ— ) = var ( ) = 2 var(𝑏2 )
𝑐
𝑐
se(𝑏2βˆ— ) =
1
se(𝑏2 )
𝑐
Impact on 𝐯𝐚𝐫(π’ƒπŸ ):
var(𝑏1βˆ— ) = var(𝑏1 )
Note: 𝑏1βˆ— = 𝑏1
5-Prediction, Goodness-of-Fit, and Modeling Issues
14 of 27
Impact on t statistic:
𝑑=
π‘‘βˆ— = 𝑑
𝑏2
se(𝑏2 )
π‘‘βˆ— =
𝑏2βˆ—
𝑏2 ⁄𝑐
=
=𝑑
se(𝑏2βˆ— ) se(𝑏2 )⁄𝑐
There is no change.
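These results can be confirmed numerically. The sketch below fits the same illustration data with x measured in original units and in units multiplied by c, and compares the two sets of estimates; the data and the choice of statsmodels are ours.

```python
# Numerical check: scaling x by c divides b2 and se(b2) by c, and leaves b1, R-square,
# and the t statistic unchanged.
import numpy as np
import statsmodels.api as sm

x = np.array([3.7, 5.0, 8.0, 10.0, 12.0, 15.0, 20.0, 25.0, 28.0, 33.4])
y = np.array([120., 150., 200., 210., 260., 300., 310., 390., 410., 500.])
c = 100.0                                  # e.g. income in $ instead of $100s

m1 = sm.OLS(y, sm.add_constant(x)).fit()
m2 = sm.OLS(y, sm.add_constant(c * x)).fit()

print(m1.params[1] / c, m2.params[1])      # b2* = b2 / c
print(m1.params[0], m2.params[0])          # intercept unchanged
print(m1.rsquared, m2.rsquared)            # R-square unchanged
print(m1.tvalues[1], m2.tvalues[1])        # t statistic unchanged
```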
6.1.2. Changing the scale of y
We want to find out what happens to the regression model if we change the scale of y by multiplying it by a
constant c. First, determine the impact on the slope coefficient b₂.

Impact on b₂:  b₂* = c b₂
b₂ = (Σxy − n x̄ ȳ) / (Σx² − n x̄²)
Multiply y by a constant c. Let b₂* be the resulting new coefficient. Then,
b₂* = [Σx(cy) − n x̄(cȳ)] / (Σx² − n x̄²) = c(Σxy − n x̄ ȳ) / (Σx² − n x̄²) = c b₂
Thus, when y is scaled by a constant c, the new slope coefficient is equal to the pre-scaled coefficient
multiplied by c.

Impact on b₁:  b₁* = c b₁
b₁ = ȳ − b₂x̄
b₁* = cȳ − b₂*x̄ = cȳ − cb₂x̄ = c(ȳ − b₂x̄) = cb₁
Thus, scaling y changes the intercept by a multiple of c.

Impact on the predicted values ŷ:  ŷ* = cŷ
ŷ = b₁ + b₂x
ŷ* = cb₁ + cb₂x = c(b₁ + b₂x) = cŷ
Predicted values also change by a multiple of c.

Impact on var(e):  var(e*) = c²var(e)
var(e) = Σ(y − ŷ)² / (n − 2)
var(e*) = Σ(cy − ŷ*)² / (n − 2) = Σ(cy − cŷ)² / (n − 2) = c²Σ(y − ŷ)² / (n − 2) = c²var(e)

Impact on R²:  (R²)* = R²
R² = SSR/SST = Σ(ŷ − ȳ)² / Σ(y − ȳ)²
(R²)* = Σ(ŷ* − cȳ)² / Σ(cy − cȳ)² = Σ(cŷ − cȳ)² / Σ(cy − cȳ)² = c²Σ(ŷ − ȳ)² / [c²Σ(y − ȳ)²] = R²
There is no impact.

Impact on var(b₂):  var(b₂*) = c²var(b₂)
var(b₂) = var(e) / Σ(x − x̄)²
var(b₂*) = c²var(e) / Σ(x − x̄)² = c²var(b₂)
se(b₂*) = c·se(b₂)

Impact on var(b₁):  var(b₁*) = c²var(b₁)
var(b₁) = [Σx² / (nΣ(x − x̄)²)] var(e)
var(b₁*) = [Σx² / (nΣ(x − x̄)²)] var(e*) = [Σx² / (nΣ(x − x̄)²)] c²var(e) = c²var(b₁)

Impact on the t statistic:  t* = t
t = b₂/se(b₂)
t* = b₂*/se(b₂*) = cb₂ / (c·se(b₂)) = t
6.1.3. Changing the scale of x and y by the same factor c
Consider the general form of the estimated simple linear regression equation: ŷ = b₁ + b₂x. We want to find
out what happens to the regression model if we change the scale of x and y by multiplying both by a
constant c. First, determine the impact on the slope coefficient b₂.

Impact on b₂:  b₂* = b₂
b₂* = [Σ(cx)(cy) − n(cx̄)(cȳ)] / [Σ(cx)² − n(cx̄)²] = c²(Σxy − n x̄ ȳ) / [c²(Σx² − n x̄²)] = b₂
There is no change in the slope coefficient.

Impact on b₁:  b₁* = cb₁
b₁ = ȳ − b₂x̄
b₁* = cȳ − b₂*(cx̄) = cȳ − b₂(cx̄) = c(ȳ − b₂x̄) = cb₁
The intercept will change by a multiple of c.

Impact on the predicted values ŷ:  ŷ* = cŷ
ŷ* = b₁* + b₂*(cx) = cb₁ + b₂(cx) = c(b₁ + b₂x) = cŷ

Impact on var(e):  var(e*) = c²var(e)
var(e*) = Σ(cy − cŷ)² / (n − 2) = c²var(e)

Impact on R²:  (R²)* = R²
(R²)* = Σ(cŷ − cȳ)² / Σ(cy − cȳ)² = Σ(ŷ − ȳ)² / Σ(y − ȳ)² = R²
There is no impact.

Impact on var(b₂):  var(b₂*) = var(b₂)
var(b₂) = var(e) / Σ(x − x̄)²
var(b₂*) = c²var(e) / Σ(cx − cx̄)² = var(b₂)
se(b₂*) = se(b₂)

Impact on var(b₁):  var(b₁*) = c²var(b₁)
var(b₁) = [Σx² / (nΣ(x − x̄)²)] var(e)
var(b₁*) = [c²Σx² / (nc²Σ(x − x̄)²)] var(e*) = [Σx² / (nΣ(x − x̄)²)] c²var(e) = c²var(b₁)

Impact on the t statistic:  t* = t
t* = b₂*/se(b₂*) = b₂/se(b₂) = t
There is no change.
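A similar numerical check covers the remaining two cases, scaling y alone and scaling x and y by the same factor c (again with made-up illustration data).

```python
# Numerical check of the scaling results for y alone and for x and y together.
import numpy as np
import statsmodels.api as sm

x = np.array([3.7, 5.0, 8.0, 10.0, 12.0, 15.0, 20.0, 25.0, 28.0, 33.4])
y = np.array([120., 150., 200., 210., 260., 300., 310., 390., 410., 500.])
c = 0.01

base = sm.OLS(y, sm.add_constant(x)).fit()
y_scaled = sm.OLS(c * y, sm.add_constant(x)).fit()        # scale y only
both = sm.OLS(c * y, sm.add_constant(c * x)).fit()        # scale x and y by the same c

print(y_scaled.params, c * base.params)                   # b1* = c*b1 and b2* = c*b2
print(both.params[1], base.params[1])                     # slope unchanged when both scaled
print(both.params[0], c * base.params[0])                 # intercept scaled by c
print(base.rsquared, y_scaled.rsquared, both.rsquared)    # R-square unchanged in all cases
```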
7. Choosing a Functional Form
In explaining the simple linear regression model we have assumed that the model is linear in the population
parameters β₁ and β₂ (that is, the parameters are not expressed as, say, β₂², 1/β₂, or any form other than β₂)
and also that the impact of changes in the independent variable on y works directly through x rather than
through expressions such as x² or ln(x).
In this section, we will continue assuming that the regression is linear in parameters, but relax the
assumption of linearity of the variables. In many economic models the relationship between the dependent
and independent variables is not a straight line relationship. That is the change in 𝑦 does not follow the same
pattern for all values of π‘₯. Consider for example an economic model explaining the relationship between
expenditure on food (or housing) and income. As income rises, we do expect expenditure on food to rise, but
not at a constant rate. In fact, we should expect the rate of increase in expenditure on food to decrease as
income rises. Therefore the relationship between income and food expenditure is not a straight-line
relationship.
In Chapter 3 we considered two alternative regression models, the quadratic model and the log-linear model.
Here we will consider two more alternatives: the linear-log and log-log models.
7.1. Linear-Log (Semi-log) Model
The independent variable is in logarithms, but the explained variable is not.
y = β₁ + β₂ln(x)
Slope:
dy/dx = β₂(1/x)
Elasticity:
ε = (dy/dx)(x/y) = β₂(1/x)(x/y) = β₂(1/y)
The following diagram is the plot of functions 𝑦 = 1 + 1.3ln(π‘₯) and 𝑦 = 1 βˆ’ 1.3ln(π‘₯). For example, for π‘₯ = 2,
the function with β₂ > 0 gives y = 1 + 1.3ln(2) = 1.901, and the one with β₂ < 0 gives
y = 1 − 1.3ln(2) = 0.099.
[Figure: plots of y = 1 + 1.3ln(x) (β₂ > 0, increasing) and y = 1 − 1.3ln(x) (β₂ < 0, decreasing) for x between 0 and 6.]
The slope of the function y = 1 + 1.3ln(x) at a given point, say x₀ = 2, is
dy/dx = β₂(1/x) = 1.3(1/2) = 0.65
Let’s interpret the meaning of slope = 0.65 in a linear-log function. First, the coefficient of ln(π‘₯), 1.3, is not
the slope of the function. The value 1.3 implies that for each 1% increase in π‘₯, the dependent variable rises by
approximately 0.013 units.
π‘₯0 = 2
βˆ†π‘₯%
0.01
π‘₯1
2.02
𝑦0 = 1.9011
𝑦1
βˆ†π‘¦ = 𝑦1 βˆ’ 𝑦0
1.9140
0.0129
The slope of 0.65 at π‘₯0 = 2 means that for a very small change in π‘₯ in the immediate vicinity of π‘₯0 = 2 , 𝑦 rises
by 0.65 units. The table below shows that as the increment in π‘₯ is reduced from 1 to 0.001, the difference
quotient approaches 0.65, the slope of the function at π‘₯0 = 2.
π‘₯0 = 2
π‘₯1
3
2.5
2.1
2.01
2.001
𝑦0 = 1.9011
𝑦1
βˆ†π‘¦β„βˆ†π‘₯ β‰ˆ 𝑑𝑦⁄𝑑π‘₯
2.4282
0.5271
2.1912
0.5802
1.9645
0.6343
1.9076
0.6484
1.9017
0.6498
Thus, β€œslope” in the linear-log model implies the β€œchange in 𝑦 in response to a small change in π‘₯”. The
coefficient of ln(π‘₯), on the other hand, implies the change in 𝑦 for a percentage change in π‘₯.
Example
Use the data in the food expenditure model (see the Excel file β€œCH5 DATA”). The variables are 𝑦 = π‘“π‘œπ‘œπ‘‘_𝑒π‘₯𝑝
(weekly food expenditure in $) and π‘₯ = π‘–π‘›π‘π‘œπ‘šπ‘’ (weekly income in $). The output is the result of running the
regression 𝑦̂ = 𝑏1 + 𝑏2 ln(π‘₯). To run the regression first transform the π‘₯ values to ln(π‘₯). The estimated
regression equation is:
𝑦̂ = βˆ’97.1864 + 132.1658ln(π‘₯)
The coefficient of ln(π‘₯), 𝑏2 = 132.1658, implies that the weekly food expenditure will increase by
approximately $1.32 for each 1% increase in weekly income, regardless of the income level, as the following
calculations show.
π‘₯0 =
π‘₯1 =
βˆ†π‘₯% =
10
1.01
0.01
𝑦0 =
𝑦1 =
βˆ†π‘¦ =
207.137
208.452
1.3151
π‘₯0 =
π‘₯1 =
βˆ†π‘₯% =
20
2.02
0.01
𝑦0 =
𝑦1 =
βˆ†π‘¦ =
298.747
300.062
1.3151
However, the change in food-expenditure for each additional dollar increase in weekly income will differ
based on the income level, as the calculations in the following table show. This means that the slope of the
regression equation depends on the value of π‘₯, the weekly income level.
π‘₯0 =
π‘₯1 =
βˆ†π‘₯ =
1000
1001
1
𝑦0 =
𝑦1 =
βˆ†π‘¦β„βˆ†π‘₯ =
207.137
207.269
0.1321
π‘₯0 =
π‘₯1 =
βˆ†π‘₯ =
2000
2001
1
𝑦0 =
𝑦1 =
βˆ†π‘¦β„βˆ†π‘₯ =
298.747
298.813
0.0661
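The two interpretations of the linear-log coefficient, the response to a 1% change in income and the slope at a given income level, can be reproduced directly from the estimated equation. The sketch below uses the coefficients reported above; the helper function name is ours.

```python
# Sketch of the linear-log food expenditure equation y = b1 + b2*ln(x), x in $100s.
import numpy as np

b1, b2 = -97.1864, 132.1658          # estimated coefficients from the text

def food_exp(income):                # income measured in $100s, as in the example
    return b1 + b2 * np.log(income)

for x0 in (10, 20):
    dy_pct = food_exp(x0 * 1.01) - food_exp(x0)   # response to a 1% rise in income
    slope = b2 / x0                               # dy/dx at x0, per unit of x ($100)
    print(x0, round(dy_pct, 4), round(slope, 3))  # dy_pct is about 1.315 at every income level
```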
7.2. Log-Linear Model
The log-linear model was introduced in Chapter 3. There are additional points with respect to this model that
we need to pay attention to. The Log-Linear Model in regression takes the following form
ln(y) = β₁ + β₂x
To determine the slope, take the exponent of both sides of the equation:
y = e^(β₁ + β₂x)
Then,
Slope:
dy/dx = β₂e^(β₁ + β₂x) = β₂e^(ln(y)) = β₂y
Elasticity:
ε = (dy/dx)(x/y) = β₂y(x/y) = β₂x
The coefficient β₂ in ln(y) = β₁ + β₂x implies that for each additional unit increase in x, y increases by a
proportion β₂ (approximately 100β₂ percent). Using the slope expression above, we have
(dy/y)/dx = β₂
Example
Consider the model in which the π‘π‘Ÿπ‘–π‘π‘’ of a house is related to the house size measured in square feet (π‘ π‘žπ‘“π‘‘).
Let 𝑦 = π‘π‘Ÿπ‘–π‘π‘’ and π‘₯ = π‘ π‘žπ‘“π‘‘. The log-linear equation is,
ln(y) = b₁ + b₂x
The data and the summary regression output for this example are in the Excel file "CH5 DATA". The estimated
regression equation is
ln(y) = 10.8386 + 0.000411x
Consider a house size of x = 2000 sqft. The impact of an additional square foot is shown in the calculations in
the following table:

  x₀ = 2000             x₁ = 2001              Δx = 1
  ln(y₀) = 11.66113     ln(y₁) = 11.66155
  y₀ = 115975.5         y₁ = 116023.2          Δy = 47.7       Δy/y₀ = 0.000411

Note that when x₀ = 2000 sqft, for each additional square foot the price of the house increases by Δy = $47.7.
The proportional increase is Δy/y₀ = 0.000411, or 0.04%.
Now consider a house size of x = 4000 sqft:

  x₀ = 4000             x₁ = 4001              Δx = 1
  ln(y₀) = 12.48367     ln(y₁) = 12.48408
  y₀ = 263991.4         y₁ = 264100.0          Δy = 108.6      Δy/y₀ = 0.000411
For a larger house, here x = 4000 sqft, each additional square foot adds a larger amount, Δy = $108.6, to the
price of the house. However, the percentage change in the price of the house is the same.
7.2.1. Adjustment to the Predicted Value in Log-Linear Models
In the calculations in the previous two tables the predicted value of 𝑦 for a given value of π‘₯ was obtained by
taking the exponent (anti-log) of the predicted log of 𝑦.
π‘₯ = 2000
Μ‚ = 10.8386 + 0.000411(2000) = 11.66113
ln(𝑦)
𝑦 ≑ 𝑦𝑛 = exp(11.66113) = 115975.5
Here 𝑦𝑛 denotes β€œnatural predictor”. In most cases (for large samples) a β€œcorrected” predicted value is
obtained by multiplying the β€œnatural” predictor by the quantity 𝑒 var(𝑒)⁄2 . In the regression summary output
var(𝑒) is shown as 𝑀𝑆𝐸, or mean square error.
𝑦̂𝑐 = 𝑦̂𝑛 𝑒 var(𝑒)⁄2
In the above example, the regression summary output shows that var(𝑒) = 𝑀𝑆𝐸 = 0.10334. Thus, for π‘₯ =
2000,
𝑦̂𝑐 = 115975.5𝑒 0.10334⁄2 = 122125.4
The natural predictor tends to systematically under-predict the value of 𝑦 in a log-linear model. The
corrected predictor balances this downward bias in large samples.
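As a small sketch, the natural and corrected predictors for the house price model can be computed from the estimates reported above (b₁ = 10.8386, b₂ = 0.000411, MSE = 0.10334); the function name is ours.

```python
# Natural vs. corrected predictor in the log-linear house price model.
import numpy as np

b1, b2, mse = 10.8386, 0.000411, 0.10334

def predict_price(sqft):
    ln_y = b1 + b2 * sqft
    y_n = np.exp(ln_y)               # natural predictor
    y_c = y_n * np.exp(mse / 2)      # corrected predictor for large samples
    return y_n, y_c

print(predict_price(2000))           # roughly $116,000 and $122,000, close to the values above
```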
Example
A Growth Model
The Excel file β€œCH5 DATA” tab β€œwheat” contains data describing average wheat yield (tons per hectare) for a
region in Australia against time (t), which runs from 1950 to 1997. The rise in yield over time is attributed to
improvements in technology, where 𝑑 is used as a proxy for technology. The objective here is to obtain an
estimate of average rate of growth in yield.
[Figure: scatter plot of average wheat yield (tons per hectare) against time, 1950-1997.]
Let 𝑦 stand for π‘ŒπΌπΈπΏπ·, where 𝑦0 is the yield in the base year and 𝑦𝑑 is the yield in year 𝑑. Also, let π‘Ÿ stand for
the rate of growth. Then,
𝑦𝑑 = 𝑦0 (1 + π‘Ÿ)𝑑
Taking the natural log of both sides and using the properties of logarithms, we have,
ln(𝑦𝑑 ) = ln(𝑦0 ) + 𝑑 ln(1 + π‘Ÿ)
We can write this as a log-linear regression model with 𝑏1 = ln(𝑦0 ) and 𝑏2 = ln(1 + π‘Ÿ).
ln(y_t) = b₁ + b₂t
The estimated regression equation is
ln(y_t) = −0.3434 + 0.01784t
From the estimated coefficients we can determine the base year yield and the growth rate:
Base year yield
𝑏1 = ln(𝑦0 ) = βˆ’0.3434
𝑦0 = exp(βˆ’0.3434) = 0.709 tons per hectare
Growth rate
𝑏2 = ln(1 + π‘Ÿ) = 0.01784
1 + π‘Ÿ = exp(0.01784) = 1.018
π‘Ÿ = 1.018 βˆ’ 1 = 0.018
The estimated average annual growth rate is then approximately 1.8%.
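The same calculation can be carried out in Python. Because the "wheat" data are not reproduced here, the sketch below simulates a yield series with a known 1.8% growth rate and then recovers the base-year yield and growth rate from the log-linear regression, exactly as in the example.

```python
# Growth model sketch: regress ln(yield) on t and recover y0 and r (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
t = np.arange(1, 49)                                              # 48 years
yield_t = 0.7 * (1.018)**t * np.exp(rng.normal(0, 0.1, t.size))   # simulated yields

res = sm.OLS(np.log(yield_t), sm.add_constant(t)).fit()
b1, b2 = res.params
print(np.exp(b1))          # base year yield, close to 0.7 tons per hectare
print(np.exp(b2) - 1)      # estimated annual growth rate, close to 0.018
```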
Example
A Wage Equation
The Excel file β€œCH5 DATA” tab β€œwage” contains data describing hourly wage (π‘Šπ΄πΊπΈ) against years of
education (πΈπ·π‘ˆπΆ). In this example the objective is to determine the estimated average rate of increase in the
wage rate for each additional year of schooling. Using the same methodology as in the previous example, we
have
π‘Šπ΄πΊπΈ = π‘Šπ΄πΊπΈ0 (1 + π‘Ÿ)πΈπ·π‘ˆπΆ
ln(π‘Šπ΄πΊπΈ) = ln(π‘Šπ΄πΊπΈ0 ) + ln(1 + π‘Ÿ)πΈπ·π‘ˆπΆ
We obtain the following estimated regression equation:
ln(WAGE) = 1.6094 + 0.0904 EDUC
WAGE₀ = exp(1.6094) = 5.00
1 + r = exp(0.0904) = 1.095, so r = 0.095
Thus, the estimated rate of increase for an additional year of education is 9.5%.
7.2.2. Predicted Value in the Log-Linear Wage-Education Model
What is the predicted value of WAGE for a person with 12 years of education?
ln(WAGE) = 1.6094 + 0.0904(12) = 2.694
WAGE = exp(2.694) = 14.7958 ≈ $14.80
According to the text (p. 154), in a log-linear model this figure is the "natural" predictor, ŷ_n. We need to find the
corrected predictor, ŷ_c, which is obtained by
ŷ_c = ŷ_n e^(var(e)/2)
ŷ_c = 14.7958 × exp(0.2773/2) = 16.996 ≈ $17.00
In large samples the natural predictor 𝑦̂𝑛 tends to systematically under-predict the value of the dependent
variable. The correction offsets this downward bias.
7.2.3. Generalized R² Measure for Log-Linear Models
When considering R² in a log-linear model, we need to keep two things in mind: (1) the R² measure shown in
the regression output involves ln(y), while we want R² to measure the explained variation in y rather than in
ln(y); and (2) when we take the antilog of the predicted values, the result is a set of natural predictors, which
we need to convert into corrected predictors. Thus, we obtain the generalized R² as follows. Recall from above
that R² is equal to the square of the coefficient of correlation between y and ŷ. Here we find the correlation
coefficient between y and ŷ_c and then square it:
R² = r²_yŷc
r²_yŷc = 0.1859
The calculations are shown in the Excel file.
7.2.4. Prediction Interval in the Log-Linear Model
Here we want to build a prediction interval for WAGE when EDUC = 12. This is an interval for the individual
value of y for a given x.
First, we need to determine se(y₀) when x₀ = 12. From the regression equation
ln(y) = b₁ + b₂x
we have the covariance matrix
var(e)X⁻¹ = [var(b₁), cov(b₁, b₂); cov(b₁, b₂), var(b₂)] = [0.0075, −0.00052; −0.00052, 0.000038]
var[ln(y₀)] = var(b₁) + x₀²var(b₂) + 2x₀cov(b₁, b₂) + var(e)
var[ln(y₀)] = 0.0075 + 12² × 0.000038 − 2 × 12 × 0.00052 + 0.2773 = 0.2777
se[ln(𝑦0 )] = 0.5270
𝑀𝑂𝐸 = 𝑑0.025,998 × se[ln(𝑦0 )] = 1.962 × 0.5270 = 1.034
ln(𝑦̂|π‘₯0 ) = 2.694
𝐿ln(𝑦0) = 2.694 βˆ’ 1.034 = 1.660
π‘ˆln(𝑦0) = 2.694 + 1.034 = 3.728
𝐿𝑦𝑛 = exp(1.660) = 5.2604
π‘ˆπ‘¦π‘› = exp(3.728) = 41.6158
𝐿𝑦𝑐 = 5.2604 × exp(0.2773/2) = $6.04
π‘ˆπ‘¦π‘ = 41.6158 × exp(0.2773/2) = $41.77
The prediction interval [$6.04, $41.77] is so wide that it is practically useless. This indicates that our model is not
an accurate predictor of the range of dependent-variable values for a given x. To develop a better predictor we
need to add more variables to the model and approach the problem through a multiple regression model. This
will be done in the next chapter.
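For completeness, the prediction-interval arithmetic above can be scripted as follows, using the variance components reported in the text (the numbers, not the underlying data, come from the chapter).

```python
# Sketch of the log-linear prediction interval for WAGE at EDUC = 12 (df = 998).
import numpy as np
from scipy import stats

b1, b2, var_e = 1.6094, 0.0904, 0.2773
var_b1, var_b2, cov_b12 = 0.0075, 0.000038, -0.00052
x0, df = 12, 998

ln_y0 = b1 + b2 * x0
var_ln_y0 = var_b1 + x0**2 * var_b2 + 2 * x0 * cov_b12 + var_e
moe = stats.t.ppf(0.975, df) * np.sqrt(var_ln_y0)

low, high = np.exp(ln_y0 - moe), np.exp(ln_y0 + moe)    # natural (uncorrected) endpoints
print(round(low, 2), round(high, 2))                     # roughly 5.26 to 41.62
# The text further multiplies the endpoints by exp(var_e / 2), the same correction
# factor used for the point predictor.
print(round(low * np.exp(var_e / 2), 2))                 # about 6.04, the lower bound above
```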
7.3. Log-Log Models
The log-log model is used in describing demand equations and production functions. Generally,
ln(𝑦) = 𝛽1 + 𝛽2 ln(π‘₯)
To determine the slope dy/dx, first take the exponent of both sides:
y = e^(β₁ + β₂ln(x))
The slope is then
dy/dx = β₂(1/x)e^(β₁ + β₂ln(x)) = β₂(y/x)
The elasticity is
ε = (dy/dx)(x/y) = β₂
Thus, in the log-log model, the coefficient 𝛽2 represents the percentage change in 𝑦 in response to a
percentage change in π‘₯.
Example
A Log-Log Poultry Demand Equation
The Excel file "CH5 DATA" tab "chicken" contains data describing the per capita consumption of chicken (in
pounds) against the real (inflation-adjusted) price. Using the log-log model
ln(Q) = β₁ + β₂ln(P)
the estimated regression equation is
ln(Q) = 3.7169 − 1.1214 ln(P)
Here the coefficient of ln(P) is the estimated elasticity of demand, 1.121 in absolute value. This implies that for
a 1% increase in the real price of chicken, the quantity demanded is reduced by 1.121%. To obtain the predicted
value of per capita consumption when P = $2.00,
ln(Q) = 3.7169 − 1.1214 ln(2) = 2.94
Q_n = exp(2.94) = 18.91
Q_c = 18.91 × e^(var(e)/2) = 18.91 × exp(0.01392/2) = 19.042
The generalized R² is
R²_G = r²_yŷc = 0.8818
See the Excel file for the calculations.
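A short sketch of the log-log prediction, using the reported estimates b₁ = 3.7169, b₂ = −1.1214, and MSE = 0.01392; the function name is ours.

```python
# Log-log demand sketch: corrected prediction of per capita chicken consumption.
import numpy as np

b1, b2, mse = 3.7169, -1.1214, 0.01392

def predicted_consumption(price):
    q_n = np.exp(b1 + b2 * np.log(price))   # natural predictor
    return q_n * np.exp(mse / 2)            # corrected predictor

print(predicted_consumption(2.00))          # about 19.04 pounds per capita
# b2 is the price elasticity: a 1% price rise lowers quantity demanded by about 1.12%.
```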
Appendix
The variance formula
var(ŷ₀) = σ_e² [1/n + (x₀ − x̄)² / Σ(x − x̄)²]
is obtained as follows:
Start with the estimated regression equation and the predicted value of y for the given x₀:
ŷ₀ = b₁ + b₂x₀
Taking the variance of both sides of the equation, we have
var(ŷ₀) = var(b₁ + b₂x₀)
var(ŷ₀) = var(b₁) + x₀²var(b₂) + 2x₀cov(b₁, b₂)
On the right-hand side, substituting
var(b₁) = [Σx² / (nΣ(x − x̄)²)] σ_e²,   var(b₂) = σ_e² / Σ(x − x̄)²,   and   cov(b₁, b₂) = [−x̄ / Σ(x − x̄)²] σ_e²
we have
var(ŷ₀) = [Σx² / (nΣ(x − x̄)²)] σ_e² + [x₀² / Σ(x − x̄)²] σ_e² − [2x₀x̄ / Σ(x − x̄)²] σ_e²
var(ŷ₀) = σ_e² [Σx² + nx₀² − 2nx₀x̄ + nx̄² − nx̄²] / [nΣ(x − x̄)²]
var(ŷ₀) = σ_e² [Σx² − nx̄² + n(x₀ − x̄)²] / [nΣ(x − x̄)²]
var(ŷ₀) = σ_e² [Σ(x − x̄)² + n(x₀ − x̄)²] / [nΣ(x − x̄)²]
var(ŷ₀) = σ_e² [1/n + (x₀ − x̄)² / Σ(x − x̄)²]