ECON20003 – QUANTITATIVE METHODS 2
TUTORIAL 11
Download the t11e3, t11e4 and t11e6 Excel data files from the subject website and save them
to your computer or USB flash drive. Read this handout and try to complete the tutorial
exercises before your tutorial class, so that you can ask your tutor for help during the Zoom
session if necessary.
After you have completed the tutorial exercises attempt the “Exercises for assessment”. You
must submit your answers to these exercises in the Tutorial 11 Homework Canvas
Assignment Quiz by the next tutorial in order to get the tutorial mark. For each assessment
exercise type your answer in the relevant box available in the Quiz or type your answer
separately in Microsoft Word and upload it in PDF format as an attachment. In either case,
if the exercise requires you to use R, save the relevant R/RStudio script and printout in a
Word document and upload it together with your written answer in PDF format.
Using the Sample Regression Equation
Once you have estimated a regression model and ensured that the sample regression equation
is acceptable, you can use it to predict either an element of the sub-population of the
dependent variable that is generated by some given set of values of the independent
variables (individual prediction) or the mean of this sub-population (mean prediction). In the
first case the aim is to predict
y_0 = \beta_0 + \beta_1 x_{0,1} + \beta_2 x_{0,2} + \dots + \beta_k x_{0,k} + \varepsilon_0
where x0,i denotes a possible value of the ith independent variable and y0 is a random
variable because it depends on the ε0 random error, while in the second case the aim is to
predict
E(y_0) = E(Y \mid x_{0,1}, x_{0,2}, \dots, x_{0,k}) = \beta_0 + \beta_1 x_{0,1} + \beta_2 x_{0,2} + \dots + \beta_k x_{0,k}
which is constant.
In both cases the prediction can be either a single value or an interval.
As for the point predictions, numerically there is no difference between the point prediction
of an individual element, y0, and that of the conditional expected value of dependent
variable, E(y0). Namely, both are equal to the value of the sample regression function
evaluated at the given set of independent variable values, i.e.
\hat{y}_0 = \hat{E}(Y \mid \mathbf{x}_0) = \hat\beta_0 + \hat\beta_1 x_{0,1} + \hat\beta_2 x_{0,2} + \dots + \hat\beta_k x_{0,k}
L. Kónya, 2020, Semester 2
However, the interval predictions are different because the standard error depends on
whether the aim is to predict y0 or E(y0).
For example, if there is only one independent variable in the model (k = 1), the estimated
standard error for an individual prediction is
s_{\hat{y}_0} = s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}
while for the mean prediction it is
s_{\hat{E}(y_0)} = s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}
where s is the estimated standard error of regression.
Based on these estimated standard errors and assuming that the classical assumptions
behind linear regression hold, the prediction interval for an individual value y0 and the
confidence interval for the expected value E(y0) are
\hat{y}_0 \pm t_{\alpha/2,\,n-2}\, s_{\hat{y}_0}
and
\hat{y}_0 \pm t_{\alpha/2,\,n-2}\, s_{\hat{E}(y_0)}
A comparison of the standard error of the interval predictor of y0 to that of E(y0) reveals that,
due to the extra term under the square root (i.e. “1”),
s_{\hat{y}_0} > s_{\hat{E}(y_0)}
Hence, the mean of a sub-population can always be predicted with a smaller standard error
than any of its elements. Consequently, given the same confidence level, the confidence
interval estimate for E(y0) is always narrower, i.e. more precise, than the corresponding
prediction interval estimate for y0.
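As a quick numerical check of this inequality, the two standard-error formulas can be typed straight into R. The values below (s, n, x0, the sample mean and the sum of squared deviations) are made up purely for illustration and do not come from any exercise in this handout:

```r
# Made-up illustrative values, not from any exercise in this handout
s    <- 10      # estimated standard error of regression
n    <- 50      # sample size
x0   <- 5       # value of X at which we predict
xbar <- 4       # sample mean of X
ssx  <- 200     # sum of squared deviations of X from its mean

# Standard error for an individual prediction (note the extra "1" under the root)
se_ind  <- s * sqrt(1 + 1/n + (x0 - xbar)^2 / ssx)
# Standard error for a mean prediction
se_mean <- s * sqrt(1/n + (x0 - xbar)^2 / ssx)

se_ind > se_mean    # the individual prediction always has the larger standard error
```

Whatever values are substituted, se_ind exceeds se_mean because of the extra "1" under the square root.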
Finally, note that for large sample sizes t_{0.025} ≈ 2, and the standard error of the individual
prediction is just marginally bigger than the standard error of regression. Therefore,
\hat{y}_0 \pm 2s
provides an approximate 95% individual prediction interval.¹
¹ We do not consider the standard error formulas for the more general cases when k > 1 because they are
more complicated, and you will not need to use them in manual calculations. However, the inequality between
the two standard errors and the formula for the approximate 95% individual prediction interval remain valid.
Exercise 1
Let’s return to Exercise 2 of Tutorial 10. In that exercise you regressed sons’ heights on their
fathers’ heights using data on 400 father-son pairs’ heights in centimetres and evaluated the
results. Based on that regression, answer the following questions.
a) Predict the height of a son whose father is 175cm tall and the average height of all sons
whose fathers are 175cm tall.
From part (a), Exercise 2 of Tutorial 10, the R regression printout is
Hence, the sample regression equation is
  91.353  0.479 Father
Son
i
i
In this case x0 = 175 and the corresponding point prediction is
\widehat{Son}_0 = \hat{E}(Son \mid Father_0) = \hat\beta_0 + \hat\beta_1 Father_0 = 91.353 + 0.479 \times 175 = 175.178
Hence, the height of one randomly selected son whose father is 175cm tall and the
average height of all sons whose fathers are 175cm tall are predicted to be 175.178cm.
b) Predict with 90% confidence the height of a son whose father is 175cm tall.
To develop this prediction interval, we need the point prediction, the t reliability factor,
and the estimated standard error. From part (a), the point prediction is 175.178. From the t
table, the reliability factor for α/2 = 0.05 and df = n – 2 = 398 is
t_{\alpha/2,\,df} = t_{0.05,\,398} \approx t_{0.05,\,\infty} = z_{0.05} = 1.645
The estimated standard error can be calculated with the following formula:
s_{\hat{y}_0} = s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}
In this formula s is the estimated standard error of regression. From the R printout (see
the Residual standard error) it is about 8.062. We also need the sample mean and the
sum of squared deviations for X (i.e. Father), which are 167.856 and 40980.480,
respectively.2
Putting all these together, the estimated standard error is
s_{\hat{y}_0} = 8.062 \sqrt{1 + \frac{1}{400} + \frac{(175 - 167.856)^2}{40980.480}} = 8.077
and the 90% prediction interval for the height of a son whose father is 175cm tall is
\hat{y}_0 \pm t_{\alpha/2,\,n-2}\, s_{\hat{y}_0} = 175.178 \pm 1.645 \times 8.077 = (161.891,\ 188.465)
Hence, with 90% confidence the height of a son whose father is 175cm tall is between
161.891cm and 188.465cm.
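As a check, the manual calculation above can be reproduced in R from the quoted summary values (the coefficient estimates, s, the mean of Father, and the sum of squared deviations):

```r
# Reproduce the manual 90% prediction interval from the quoted summary statistics
s     <- 8.062                                    # residual standard error of the regression
n     <- 400
x0    <- 175
xbar  <- 167.856                                  # mean(Father)
ssx   <- 40980.480                                # var(Father) * (length(Father) - 1)
yhat0 <- 91.353 + 0.479 * x0                      # point prediction
se    <- s * sqrt(1 + 1/n + (x0 - xbar)^2 / ssx)  # SE for an individual prediction
t_rel <- 1.645                                    # t(0.05, 398), approximated by z(0.05)
c(fit = yhat0, lwr = yhat0 - t_rel * se, upr = yhat0 + t_rel * se)
```

The result matches the interval (161.891, 188.465) obtained by hand.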
To generate these results with R, launch RStudio, create a new project and script, and
name them t11e1. Import the data from the t10e2 Excel file, and re-estimate the
regression model by running the following commands:
attach(t10e2)
m = lm(Son ~ Father)
summary(m)
The R function for point and interval predictions from a fitted model is
predict(model, newdata, interval, level, …)
where newdata is an optional data frame in which to look for variables with which to predict
(if omitted, the fitted values are used), interval is the type of interval calculation ("none",
"confidence" or "prediction"), and level is the confidence level (the default value is 0.95).
In this example newdata is generated with the
newdata = data.frame(Father = 175)
command and then the required point prediction and the 90% prediction interval can be
obtained by executing the
² LK: To save time I calculated them with R by executing the mean(Father) and var(Father) * (length(Father) - 1) commands.
predict(m, newdata, level = 0.90, interval = "prediction")
command. It returns:
On this printout fit is the point estimate and lwr and upr are the lower and upper limits
of the corresponding 90% prediction interval.
c) Predict with 90% confidence the average height of all sons whose fathers are 175cm
tall.
In this case the estimated standard error is
s_{\hat{E}(y_0)} = s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}} = 8.062 \sqrt{\frac{1}{400} + \frac{(175 - 167.856)^2}{40980.480}} = 0.493
and the 90% confidence interval for the average height of all sons whose fathers are
175cm tall is
\hat{y}_0 \pm t_{\alpha/2,\,n-2}\, s_{\hat{E}(y_0)} = 175.178 \pm 1.645 \times 0.493 = (174.367,\ 175.989)
Hence, with 90% confidence the average height of all sons whose fathers are 175cm
tall is between 174.367cm and 175.989cm.
To obtain this interval with R, execute the
predict(m, newdata, level = 0.90, interval = "confidence")
command. It returns:
The fit value is the same as before. However, the 90% prediction interval developed
for the height of a son whose father is 175cm tall is much wider, and hence provides a
less precise prediction, than this 90% confidence interval developed for the average
height of all sons whose fathers are 175cm tall.
Dummy Independent Variables in Regression Models
In regression models qualitative variables, such as gender, race, qualification, preference
etc., can be captured by dummy variables (D), also known as indicator or binary variables.
They have two possible values, usually 1 for “success” and 0 for “failure”.
Since a dummy variable has two different possible values, it can be used to distinguish two
different categories. However, if a qualitative variable has more than 2 categories, one
dummy variable is insufficient to represent all those categories. In general, m different
categories (m > 1) can be represented by a set of m – 1 dummy variables.
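When a qualitative variable is stored as a factor, R can also generate the m – 1 dummies automatically via model.matrix(); a small sketch with made-up data (the variable name and its values are hypothetical):

```r
# Made-up three-category variable (m = 3); the first factor level, "A", is the base
brand <- factor(c("A", "B", "C", "A", "C", "B"))
model.matrix(~ brand)   # intercept column plus m - 1 = 2 dummy columns (brandB, brandC)
```

Each row gets a 1 in the dummy column matching its category, and the base category is the row with zeros in both dummy columns.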
A dummy independent variable, D, can be introduced in a regression model in two different
ways. Either as a standalone independent variable or in interaction with some quantitative
independent variable, X. Assuming that there are only two independent variables in the
model, X and D, in the first case,
Y = \beta_0 + \beta_1 X + \beta_2 D + \varepsilon
and D has an effect on the y-intercept, which is β0 if D = 0 or β0 + β2 if D = 1, but not on the
slope parameter of X, which is always β1, irrespective of D. For this reason, D is called an
intercept dummy variable.
In the second case, the interaction between D and X is captured by their product, DX.
Consequently,
Y = \beta_0 + \beta_1 X + \beta_2 DX + \varepsilon
This time the y-intercept does not depend on D, it is always β0, but the slope parameter of
X is β1 if D = 0 or β1 + β2 if D = 1. For this reason, the DX interaction variable is called a
slope dummy variable.
These two specifications can be combined, i.e. a binary qualitative variable can be
represented in a regression model with an intercept dummy variable and with a slope
dummy variable at the same time:
Y = \beta_0 + \beta_1 X + \beta_{2,1} D + \beta_{2,2} DX + \varepsilon
This time the y-intercept is β0 if D = 0 or β0 + β2,1 if D = 1, while the slope parameter of X is
β1 if D = 0 or β1 + β2,2 if D = 1.
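A combined specification like this can be estimated in R with the interaction operator *; the data below are simulated purely for illustration:

```r
# Simulated data generated from Y = 2 + 1.5*X + 3*D + 0.5*D*X + noise
set.seed(123)
n <- 200
X <- runif(n, 0, 10)
D <- rbinom(n, 1, 0.5)
Y <- 2 + 1.5 * X + 3 * D + 0.5 * D * X + rnorm(n)

# In an R formula, X * D expands to X + D + X:D, i.e. the intercept dummy D
# and the slope dummy X:D enter the model together
coef(lm(Y ~ X * D))
```

The four reported coefficients estimate the intercept, the slope of X, the intercept shift due to D, and the slope shift due to the X:D interaction.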
Exercise 2
(Selvanathan et al., p. 846, ex. 19.8)
In a study of computer applications, a survey asked which computer a number of companies
used. The following dummy variables were created:
D1 = 1 if Lenovo
= 0 if not Lenovo
D2 = 1 if Macintosh
= 0 if not Macintosh
What computer is being referred to by each of the following pairs of values?
a)
D1 = 0, D2 = 1
D1 = 0 means that the computer is not a Lenovo and D2 = 1 means that it is a Macintosh.
b)
D1 = 1, D2 = 0
D1 = 1 means that the computer is a Lenovo and D2 = 0 means that it is not a Macintosh.
c)
D1 = 0, D2 = 0
D1 = 0 means that the computer is not a Lenovo and D2 = 0 means that it is not a
Macintosh either. Hence, it must be something else, for example, a HP.
Exercise 3
A drug manufacturer wishes to compare three drugs (Drug: A, B, and C), which it can
produce for the treatment of severe depression. The investigator would also like to study the
relationship between the age of the patients (Age) and the effectiveness of the drugs (Effect
measured on a scale from 1: low to 100: high). The investigator takes a random sample of
36 patients who are comparable with respect to diagnosis and severity of depression and
assigns them randomly to receive drug A, B, or C. This sample data is saved in the t11e3
Excel file.
The investigator intends to use a multiple regression to model Effect as a function of Age
and Drug.
a) The dependent variable (Effect) and the first independent variable (Age) are quantitative
variables. The second independent variable (Drug), however, is a qualitative variable
that has three categories. It can be represented in the regression model with two dummy
variables defined, for example, as
D1 = 1 if Drug = A and D1 = 0 otherwise,
D2 = 1 if Drug = B and D2 = 0 otherwise.
These two dummy variables are sufficient to distinguish the three different drugs, since
for A D1 = 1 and D2 = 0, for B D1 = 0 and D2 = 1, and for C, which is the base category
this time, D1 = 0 and D2 = 0.
Launch RStudio, create a new project and script, name them t11e3, import the data from
the t11e3 Excel file and execute the
attach(t11e3)
command.
With R, dummy variables can be created by the
ifelse(condition, 1, 0)
function, where condition is a logical expression; the result is 1 if the condition is true and
0 if the condition is false.
In this case, to generate the D1 and D2 dummy variables the condition is that Drug
equals A and B, respectively. Hence, execute the
D1 = ifelse(Drug == "A", 1, 0)
D2 = ifelse(Drug == "B", 1, 0)
commands.
b) Estimate the following multiple linear regression model:
Effect_i = \beta_0 + \beta_1 Age_i + \beta_2 D1_i + \beta_3 D2_i + \varepsilon_i
Execute the following commands,
m1 = lm(Effect ~ Age + D1 + D2)
summary(m1)
to obtain:
Call:
lm(formula = Effect ~ Age + D1 + D2)

Residuals:
     Min       1Q   Median       3Q      Max
-12.5165  -3.5373   0.8309   3.9782   9.6501

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.02158    3.46634   6.353 3.92e-07 ***
Age          0.67063    0.06901   9.718 4.55e-11 ***
D1          10.24504    2.43816   4.202 0.000198 ***
D2           0.60979    2.43674   0.250 0.803995
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.969 on 32 degrees of freedom
Multiple R-squared: 0.7896, Adjusted R-squared: 0.7699
F-statistic: 40.03 on 3 and 32 DF, p-value: 6.095e-11
The sample regression equation is
  22.022  0.671Age  10.245D1  0.610 D 2
Effect
i
i
i
i
c) Test the overall utility of this multiple regression model.
From the printout in part (b), the F-statistic is 40.03 and the corresponding p-value is
practically zero. Consequently, we can reject the null hypothesis that all three slope
parameters are simultaneously zero at any reasonable significance level, so at least one of
Age, D1 and D2 has a significant effect on Effect. Hence, this regression model is useful.
d) Interpret the unadjusted and adjusted coefficients of determination. Why do they have
different values?
Again, from the printout, the unadjusted coefficient of determination is R2 = 0.7896. It
means that about 79% of the sample variation in the effectiveness of the drugs is
accounted for by this regression, i.e. by the age of the patients and by the type of the
drug.
The adjusted coefficient of determination is slightly smaller, 0.7699, so after having
taken the sample size and the number of independent variables into consideration,
about 77% of the sample variation in the effectiveness of the drugs is accounted for by
the age of the patients and by the type of the drug.
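The link between the two statistics can be verified from the printout values; with n = 36 observations and k = 3 independent variables:

```r
# Adjusted R-squared from the unadjusted one:
# adj.R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
r2 <- 0.7896   # unadjusted R-squared from the printout
n  <- 36       # sample size
k  <- 3        # number of independent variables
1 - (1 - r2) * (n - 1) / (n - k - 1)   # about 0.7699, as on the printout
```

The adjustment penalises additional regressors, which is why the adjusted value is always no larger than the unadjusted one.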
e) Interpret the coefficients.
In this case the y-intercept estimate does not have a logical interpretation because Age
cannot be zero.
The first slope estimate suggests that keeping D1 and D2 constant, i.e. for any given
drug, with every additional year of Age the effectiveness of the drug increases by about
0.671.
As regards the second and the third slope estimates, since they belong to the two
intercept dummy variables used to distinguish the three different drugs, they cannot be
considered separately from each other. Instead, assuming that Age is kept constant, we
can consider the values of Effect predicted for the three drugs and compare them to
each other.
Recalling that for drug A D1 = 1 and D2 = 0 while for drug C D1 = 0 and D2 = 0, the
difference between the estimated effectiveness of drugs A and C for any given Age is
(\hat\beta_0 + \hat\beta_1 Age + \hat\beta_2 \times 1 + \hat\beta_3 \times 0) - (\hat\beta_0 + \hat\beta_1 Age + \hat\beta_2 \times 0 + \hat\beta_3 \times 0) = \hat\beta_2 = 10.245
Similarly, since for drug B D1 = 0 and D2 = 1 while for drug C D1 = 0 and D2 = 0, the
difference between the estimated effectiveness of drugs B and C for any given Age is
(\hat\beta_0 + \hat\beta_1 Age + \hat\beta_2 \times 0 + \hat\beta_3 \times 1) - (\hat\beta_0 + \hat\beta_1 Age + \hat\beta_2 \times 0 + \hat\beta_3 \times 0) = \hat\beta_3 = 0.610
Finally, we can compare the estimated effectiveness of drugs A and B for any given
Age:
(\hat\beta_0 + \hat\beta_1 Age + \hat\beta_2 \times 1 + \hat\beta_3 \times 0) - (\hat\beta_0 + \hat\beta_1 Age + \hat\beta_2 \times 0 + \hat\beta_3 \times 1) = \hat\beta_2 - \hat\beta_3 = 10.245 - 0.610 = 9.635
Hence, for any given Age, drug A is expected to be 10.245 more efficient than drug C,
drug B is expected to be 0.610 more efficient than drug C, and drug A is expected to be
9.635 more efficient than drug B.
f) In the multiple regression model estimated in part (b), the D1 and D2 dummy variables
are intercept dummy variables as they alter the y-intercept only. Consequently, the
model assumes that the marginal impact of Age on Effect is β1, no matter which
particular Drug is used. Let’s now relax this restriction.
In order to let the marginal impact of Age on Effect to change by Drug, the D1 and D2
dummy variables need to be introduced to the model as slope dummy variables as well,
implying the following multiple regression model:
Effect_i = \beta_0 + \beta_1 Age_i + \beta_2 D1_i + \beta_3 D1_i Age_i + \beta_4 D2_i + \beta_5 D2_i Age_i + \varepsilon_i
In this new model Age interacts with D1 and D2, i.e. with Drug.
Execute the following commands,
m2 = lm(Effect ~ Age + D1 + D1*Age + D2 + D2*Age)
summary(m2)
to obtain:
Call:
lm(formula = Effect ~ Age + D1 + D1 * Age + D2 + D2 * Age)

Residuals:
   Min     1Q Median     3Q    Max
-6.584 -2.668  0.099  2.780  6.416

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.21138    3.30103   1.882 0.069619 .
Age          1.03339    0.07128  14.498 4.31e-15 ***
D1          41.30421    5.01075   8.243 3.35e-09 ***
D2          21.95811    5.01709   4.377 0.000134 ***
Age:D1      -0.70288    0.10738  -6.546 3.06e-07 ***
Age:D2      -0.48886    0.10879  -4.494 9.69e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.868 on 30 degrees of freedom
Multiple R-squared: 0.9172, Adjusted R-squared: 0.9034
F-statistic: 66.43 on 5 and 30 DF, p-value: 2.59e-15
The sample regression equation is
  6.211  1.033 Age  41.304 D1  0.703D1 Age  21.958D 2  0.489 D 2 Age
Effect
i
i
i
i
i
i
i
i
This new regression is also significant (F-test) and has a much higher adjusted
coefficient of determination (0.9034) than the original regression. In addition, every slope
estimate is significantly different from zero (two-tail t-tests), supporting the new
specification.
g) What do the slope coefficients of the new model suggest to you?
To answer this question, the best is to consider the three drugs one by one.
Starting with the base category, for drug C (i.e. for D1 = 0 and D2 = 0) the sample
regression equation is
  6.211  1.033 Age
Effect
i
i
For drug A (D1 = 1 and D2 = 0) the sample regression equation is
  6.211  1.033 Age  41.304  0.703 Age  47.515  0.330 Age
Effect
i
i
i
i
while for drug B (D1 = 0 and D2 = 1) it is
  6.211  1.033 Age  21.958  0.489 Age  28.169  0.544 Age
Effect
i
i
i
i
The corresponding sample regression lines have different intercepts and slopes, which
can be best compared to each other visually. In order to do so, first we need to create
an artificial data series for Age and calculate the corresponding estimated Effect for
each drug separately. We are going to denote these new series as age, effect_A,
effect_B and effect_C so as to distinguish them from the original Age and Effect series.
Execute the following commands:
age = 0:100
effect_A = 47.515 + 0.330*age
effect_B = 28.169 + 0.544*age
effect_C = 6.211 + 1.033*age
The first command creates age and sets it equal to 0, 1, 2, …, 100, and subsequently
the other three commands calculate the corresponding estimated effect for the three
drugs.
Next, we can depict effect_C with the plot() function and then add two lines for effect_A
and effect_B with the lines() function.³ To do so, execute the following commands:
plot(age, effect_C, type="l", col="red", lwd=2, xlab="Age", ylab="Estimated effect")
lines(age, effect_A, col = "blue", lwd = 2)
lines(age, effect_B, col = "green", lwd = 2)
title("Estimated drug effectiveness")
legend("topleft", c("Drug A", "Drug B", "Drug C"),
lwd = c(2,2,2), col = c("blue", "green", "red"))
³ You can refresh your knowledge about the plot() and lines() functions by reviewing Tutorial 2.
In these commands, the lwd option specifies the line width relative to the default width,
which is 1.
The last two commands add a title and a legend to the plot. The first argument of the
legend command specifies its position, the second is the legend text, and the following
two just echo the same arguments of the previous plot and lines commands, as R
requires them to be specified again for the legend.
You should now see the following in your Plots panel:
Visual inspection of this graph suggests that drug A is more efficient than drug B for
almost every age, but the difference between them gets smaller with age. As regards
drug C, it appears to be far less efficient for younger patients than the other two drugs,
but its efficiency increases faster with age and it is the most efficient for middle-aged and
elderly people.
h) Do the casual observations we made on the plot in part (g) reflect significant differences
between the effectiveness of drugs A and C and between B and C?
Recall the population regression model is
Effect_i = \beta_0 + \beta_1 Age_i + \beta_2 D1_i + \beta_3 D1_i Age_i + \beta_4 D2_i + \beta_5 D2_i Age_i + \varepsilon_i
Given the definitions of the D1 and D2 dummy variables, for the three drugs it collapses
to
C: Effect_i = \beta_0 + \beta_1 Age_i + \varepsilon_i
A: Effect_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) Age_i + \varepsilon_i
B: Effect_i = (\beta_0 + \beta_4) + (\beta_1 + \beta_5) Age_i + \varepsilon_i
This suggests that whether the three drugs indeed differ in terms of efficiency depends
on the β2, β3, β4 and β5 parameters. In part (f), based on the reported t-statistics and
p-values, we already concluded that the estimates of these parameters are significantly
different from zero. Let’s now elaborate on this issue.
Our figure above shows that drugs A and B have higher y-intercepts than
drug C. To see whether these differences are significantly positive, we need to perform
two right-tail t-tests with the following hypotheses:
H_0: \beta_2 = 0, \quad H_A: \beta_2 > 0
and
H_0: \beta_4 = 0, \quad H_A: \beta_4 > 0
The estimates of β2 and β4 are positive (41.304 and 21.958) and half of their
reported Pr(>|t|) values are practically zero, so we can reject both null hypotheses and
conclude that drugs A and B have larger y-intercepts than drug C.
As regards the slopes, drug C has a steeper sample regression line than drugs A and B.
We can again perform two t-tests, this time left-tail t-tests with the following
hypotheses:
H_0: \beta_3 = 0, \quad H_A: \beta_3 < 0
and
H_0: \beta_5 = 0, \quad H_A: \beta_5 < 0
The slope estimates of β3 and β5 are negative (-0.703 and -0.489) and half of their
reported Pr(>|t|) values are practically zero, so we can reject both null hypotheses and
conclude that drugs A and B have smaller slopes than drug C.
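The "half of the reported p-value" shortcut can be checked directly: the one-tail p-values follow from the t statistics on the part (f) printout, with df = 30:

```r
# One-tail p-values from the t statistics reported on the m2 printout (df = 30)
pt(8.243, df = 30, lower.tail = FALSE)   # right tail, H_A: beta2 > 0 (D1)
pt(4.377, df = 30, lower.tail = FALSE)   # right tail, H_A: beta4 > 0 (D2)
pt(-6.546, df = 30)                      # left tail,  H_A: beta3 < 0 (Age:D1)
pt(-4.494, df = 30)                      # left tail,  H_A: beta5 < 0 (Age:D2)
```

All four one-tail p-values are practically zero, confirming the conclusions above.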
Exercise 4
(Selvanathan et al., p. 850, ex. 19.16)
Absenteeism is a serious employment problem in most countries. It is estimated that
absenteeism reduces potential output by more than 10%. Two economists launched a
research project to learn more about the problem. They randomly selected 100
organizations to participate in a one-year study. For each organization, they recorded the
average number of days absent per employee (Absent) and several variables thought to
affect absenteeism. File t11e4 contains the observations on the following variables:
Absent: average number of days absent per employee;
Wage: average employee wage (annual, $);
PctPT: percentage of part-time employees;
PctU: percentage of unionised employees;
AvShift: availability of shift work (1 = yes; 0 = no);
UMRel: union-management relationship (1 = good; 0 = not good).
a) Estimate a multiple regression model of Absent on the other variables in the economists’
data set.
Launch RStudio, create a new project and script, name them t11e4, import the data from
the t11e4 Excel file and execute the following commands:
attach(t11e4)
m = lm(Absent ~ Wage + PctPT + PctU + AvShift + UMRel)
summary(m)
You should get the printout shown on the next page.
b) Is this regression useful in explaining the variation in absenteeism among the
organisations?
To answer this question, we need to perform the F-test of overall significance and to
consider the adjusted coefficient of determination.
From this printout, the F-statistic is 21.4 and it is significant at any level since its p-value
is practically zero. Consequently, the null hypothesis that all slope parameters are
simultaneously zero can be rejected at any reasonable significance level. This means
that at least one independent variable has a significant effect on absenteeism, so this
regression model is useful.
The adjusted R2 is 0.5075, so after having taken the sample size and the number of
independent variables into consideration, this multiple regression model can account for
almost 51% of the total variation in absenteeism.
c) What do you think about relationships between the dependent variable and each
independent variable? Do you think that the slope estimates have logical signs?
Given the definitions of the variables, I would expect PctU and AvShift to have positive
impacts on Absent (i.e. more union members and shift work likely increase absenteeism)
and Wage, PctPT and UMRel to have negative impacts on Absent (i.e. higher wages,
more part time work and good union-management relationship likely decrease
absenteeism). Hence, all slope estimates have the logical sign.
d) Are the slope coefficients significant in the logical directions?
Every slope coefficient has the logical/expected sign and since the p-values for the
t-tests are all smaller than 0.0025, we conclude that every slope estimate is significant in
the logical direction practically even at the 0.5% significance level.⁴
e) Interpret the slope coefficients.
The slope estimates suggest that, keeping all other independent variables in the model
constant,
When the average wage increases by one dollar, average absenteeism is
expected to drop by about 0.0002 days (i.e. by about 0.2 days per $1000 increase).
When the proportion of part time employees increases by one percentage point,
average absenteeism is expected to drop by about 0.1069 days.
When the proportion of unionised employees increases by one percentage point,
average absenteeism is expected to rise by about 0.0599 days.
Availability of shift work is expected to increase average absenteeism by about
1.5619 days.
Good union-management relationship is expected to reduce average absenteeism
by about 2.6366 days.
f) Can we infer that the availability of shift work is related to absenteeism?
If shift work is related to absenteeism, then the coefficient of the AvShift dummy variable
is expected to be significant. This implies a two-tail t-test with
H_0: \beta_4 = 0, \quad H_A: \beta_4 \neq 0
The Pr(>|t|) value for AvShift is 0.0025, so H0 can be rejected even at the 0.3%
significance level. Hence, we conclude that shift work is indeed related to absenteeism.
g) Is there enough evidence to infer that in organisations where the union-management
relationship is good, absenteeism is lower?
⁴ When you consider the reported p-values of the t-tests, i.e. Pr(>|t|), recall that on the printout e stands for
exponent, i.e. the power of ten a number is multiplied by. For example, 8.12e-14 = 8.12 × 10⁻¹⁴ =
8.12 / 10¹⁴ = 8.12 / 100000000000000 = 0.0000000000000812.
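R itself can expand the scientific e-notation used on its printouts:

```r
# e-notation: 8.12e-14 means 8.12 multiplied by 10 to the power -14
x <- 8.12e-14
format(x, scientific = FALSE)   # writes the number out in full decimal form
```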
Since the UMRel dummy variable is equal to 1 if the relationship is good and 0 otherwise,
this question implies a left-tail t-test with
H_0: \beta_5 = 0, \quad H_A: \beta_5 < 0
Based on the expected signs of the slope parameters, we have already performed this
test in part (d). We rejected H0, so we can conclude that in organisations where the
union-management relationship is good, absenteeism is likely lower.
h) Does it appear that the normality requirement is violated? If it does not, is non-normality
a serious issue this time? Explain.
As in previous exercises, save the residuals, illustrate them with a histogram and a QQ
plot, obtain the usual descriptive statistics and perform the SW test by executing
res = residuals(m)
hist(res, freq = FALSE, ylim = c(0, 0.2), col = "lightblue")
lines(seq(-6, 8, by = 0.05),
dnorm(seq(-6, 8, by = 0.05), mean(res), sd(res)),
col="red")
qqnorm(res, main = "Normal Q-Q Plot",
xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
pch = 19, col = "lightgreen")
qqline(res, col = "royalblue4")
library(pastecs)
round(stat.desc(res, basic = FALSE, norm = TRUE), 4)
You should get the following printouts.
The histogram of the residuals is slightly skewed to the right and on the QQ plot many
points are above the straight line.
The mean and the median are similar, and the skewness and excess kurtosis statistics are
both close to zero, but skew.2SE > 1 while kurt.2SE < 1. Finally, the p-value of the SW
test is normtest.p = 0.0389, so the null hypothesis of normality can be rejected at the
4% significance level. All things considered, the ε error variables might not be normally
distributed.
Note, however, that due to the relatively large sample size, the normality assumption is
not crucial this time. Even if it is violated, we can still rely on the t and F-tests.
i) Is multicollinearity a problem? Explain.
Like in Exercise 1 of Tutorial 10, execute
library(Hmisc)
rcorr(as.matrix(t11e4), type = "pearson")
library(car)
round(vif(m), 4)
to get
and
As you can see,
i. the coefficient of determination is only about 0.53, though each independent
variable is strongly significant individually;
ii. every correlation coefficient between the independent variables is below 0.1 in
absolute value;
iii. the VIF values are all much smaller than 5.
These all indicate that multicollinearity is of no concern in this model.
j) Is heteroskedasticity likely in this model? Plot the residuals against the estimated
Absent (i.e. y-hat) series and perform White’s test for heteroskedasticity.
Like in Exercise 3 of Tutorial 10, execute
yhat = fitted.values(m)
plot(yhat, res,
main = "OLS residuals versus yhat",
col = "red", pch = 19, cex = 0.75)
library(lmtest)
bptest(m, ~ Wage + PctPT + PctU + AvShift + UMRel +
I(Wage^2) + I(PctPT^2) + I(PctU^2) + I(AvShift^2) + I(UMRel^2) +
I(Wage * PctPT) + I(Wage * PctU) + I(Wage * AvShift) + I(Wage * UMRel) +
I(PctPT * PctU) + I(PctPT * AvShift) + I(PctPT * UMRel) +
I(PctU * AvShift) + I(PctU * UMRel) + I(AvShift * UMRel))
to get
and
The residual plot does not reveal any clear pattern and White’s test maintains the
null hypothesis of homoskedasticity (p-value = 0.6461), so we do not need to worry about
heteroskedasticity.
Exercises for Assessment
Exercise 5
(Selvanathan et al., p. 846, ex. 19.7)
Create and identify indicator variables to represent the following nominal variables.
a) Religious affiliation (Catholic, Protestant and other).
b) Working shift (9 a.m.–5 p.m., 5 p.m.–1 a.m., and 1 a.m.–9 a.m.).
c) Supervisor (David Jones, Mary Brown, Rex Ralph and Kathy Smith).
Exercise 6
(Selvanathan et al., p. 846, ex. 19.9)
The director of a graduate school of business wanted to find a better way of deciding which
students should be accepted into the MBA program. Currently, the records of the applicants
are examined by the admissions committee, which looks at the undergraduate grade point
average (UGPA) and the MBA admission score (MBAA). The director believed that the type
of undergraduate degree also influenced the student’s MBA grade point average
(MBAGPA).
The most common undergraduate degrees of students attending the graduate school of
business are BCom, BEng, BSc and BA. Because the type of degree is a qualitative variable,
the following three dummy variables were created:
D1 = 1 if the degree is BCom and 0 if the degree is not BCom
D2 = 1 if the degree is BEng and 0 if the degree is not BEng
D3 = 1 if the degree is BSc and 0 if the degree is not BSc.
The director took a random sample of 100 students who entered the program two years ago,
and recorded for each student the MBAGPA, UGPA and MBAA scores and the values of
the D1, D2, D3 dummy variables. These data are saved in the t11e6 Excel file.
a) Using these data, estimate the following model
MBAGPA = \beta_0 + \beta_1 UGPA + \beta_2 MBAA + \beta_3 D1 + \beta_4 D2 + \beta_5 D3 + \varepsilon
Does the model seem to perform satisfactorily? How do you interpret the slope
coefficients?
b) Test to determine whether individually each of the independent variables is linearly
related to MBAGPA.
c) Is every slope estimate significantly positive?
d) Can we conclude that, on average, a BCom graduate performs better than a BA
graduate?
e) Predict the MBAGPA of a BEng graduate with 3.0 undergraduate GPA and 700 MBAA
score, first manually and then with R.
f) Repeat part (e) for a BA graduate with the same undergraduate GPA and MBAA score.