ECON20003 – QUANTITATIVE METHODS 2

TUTORIAL 11

Download the t11e3, t11e4 and t11e6 Excel data files from the subject website and save them to your computer or USB flash drive. Read this handout and try to complete the tutorial exercises before your tutorial class, so that you can ask your tutor for help during the Zoom session if necessary.

After you have completed the tutorial exercises, attempt the “Exercises for assessment”. You must submit your answers to these exercises in the Tutorial 11 Homework Canvas Assignment Quiz by the next tutorial in order to get the tutorial mark. For each assessment exercise, type your answer in the relevant box available in the Quiz or type your answer separately in Microsoft Word and upload it in PDF format as an attachment. In either case, if the exercise requires you to use R, save the relevant R/RStudio script and printout in a Word document and upload it together with your written answer in PDF format.

Using the Sample Regression Equation

Once you have estimated a regression model and ensured that the sample regression equation is acceptable, you can use it to predict either an element of the sub-population of the dependent variable that is generated by some given set of values of the independent variables (individual prediction) or the mean of this sub-population (mean prediction). In the first case the aim is to predict

$$y_0 = \beta_0 + \beta_1 x_{0,1} + \beta_2 x_{0,2} + \dots + \beta_k x_{0,k} + \varepsilon_0$$

where $x_{0,i}$ denotes a possible value of the $i$th independent variable and $y_0$ is a random variable because it depends on the random error $\varepsilon_0$, while in the second case the aim is to predict

$$E(y_0) = E(Y \mid x_{0,1}, x_{0,2}, \dots, x_{0,k}) = \beta_0 + \beta_1 x_{0,1} + \beta_2 x_{0,2} + \dots + \beta_k x_{0,k}$$

which is a constant. In both cases the prediction can be either a single value or an interval.

As for the point predictions, numerically there is no difference between the point prediction of an individual element, $y_0$, and that of the conditional expected value of the dependent variable, $E(y_0)$.
Namely, both are equal to the value of the sample regression function evaluated at the given set of independent variable values, i.e.

$$\hat{y}_0 = \hat{E}(Y \mid x_0) = \hat\beta_0 + \hat\beta_1 x_{0,1} + \hat\beta_2 x_{0,2} + \dots + \hat\beta_k x_{0,k}$$

L. Kónya, 2020, Semester 2, ECON20003 - Tutorial 11

However, the interval predictions are different because the standard error depends on whether the aim is to predict $y_0$ or $E(y_0)$. For example, if there is only one independent variable in the model ($k = 1$), the estimated standard error for an individual prediction is

$$s_{\hat{y}_0} = s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

while for the mean prediction it is

$$s_{\hat{E}(y_0)} = s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

where $s$ is the estimated standard error of regression.

Based on these estimated standard errors and assuming that the classical assumptions behind linear regression hold, the prediction interval for an individual value $y_0$ and the confidence interval for the expected value $E(y_0)$ are

$$\hat{y}_0 \pm t_{\alpha/2,\,n-2} \, s_{\hat{y}_0} \quad \text{and} \quad \hat{y}_0 \pm t_{\alpha/2,\,n-2} \, s_{\hat{E}(y_0)}$$

A comparison of the standard error of the interval predictor of $y_0$ to that of $E(y_0)$ reveals that, due to the extra term under the square root (i.e. “1”),

$$s_{\hat{y}_0} > s_{\hat{E}(y_0)}$$

Hence, the mean of a sub-population can always be predicted with a smaller standard error than any of its elements. Consequently, given the same confidence level, the confidence interval estimate for $E(y_0)$ is always narrower, i.e. more precise, than the corresponding prediction interval estimate for $y_0$.

Finally, note that for large sample sizes $t_{0.025} \approx 2$, and the standard error of the individual prediction is just marginally bigger than the standard error of regression. Therefore, $\hat{y}_0 \pm 2s$ provides an approximate 95% individual prediction interval.[1]

[1] We do not consider the standard error formulas for the more general cases when k > 1 because they are more complicated, and you will not need to use them in manual calculations.
However, the inequality between the two standard errors and the formula for the approximate 95% individual prediction interval remain valid.

Exercise 1

Let’s return to Exercise 2 of Tutorial 10. In that exercise you regressed sons’ heights on their fathers’ heights using data on the heights of 400 father-son pairs in centimetres and evaluated the results. Based on that regression, answer the following questions.

a) Predict the height of a son whose father is 175cm tall and the average height of all sons whose fathers are 175cm tall.

From part (a), Exercise 2 of Tutorial 10, the R regression printout is

[R regression printout]

Hence, the sample regression equation is

$$\widehat{Son}_i = 91.353 + 0.479 \, Father_i$$

In this case $x_0 = 175$ and the corresponding point prediction is

$$\widehat{Son}_0 = \hat{E}(Son_0 \mid Father_0) = \hat\beta_0 + \hat\beta_1 Father_0 = 91.353 + 0.479 \times 175 = 175.178$$

Hence, the height of one randomly selected son whose father is 175cm tall and the average height of all sons whose fathers are 175cm tall are both predicted to be 175.178cm.

b) Predict with 90% confidence the height of a son whose father is 175cm tall.

To develop this prediction interval, we need the point prediction, the t reliability factor, and the estimated standard error. From part (a), the point prediction is 175.178. From the t table, the reliability factor for $\alpha/2 = 0.05$ and $df = n - 2 = 398$ is

$$t_{\alpha/2,\,df} = t_{0.05,\,398} \approx t_{0.05,\,\infty} = z_{0.05} = 1.645$$

The estimated standard error can be calculated with the following formula:

$$s_{\hat{y}_0} = s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

In this formula $s$ is the estimated standard error of regression. From the R printout (see the Residual standard error) it is about 8.062. We also need the sample mean and the sum of squared deviations for X (i.e.
Father), which are 167.856 and 40980.480, respectively.[2] Putting all these together, the estimated standard error is

$$s_{\hat{y}_0} = 8.062 \sqrt{1 + \frac{1}{400} + \frac{(175 - 167.856)^2}{40980.480}} = 8.077$$

and the 90% prediction interval for the height of a son whose father is 175cm tall is

$$\hat{y}_0 \pm t_{\alpha/2,\,n-2} \, s_{\hat{y}_0} = 175.178 \pm 1.645 \times 8.077 = (161.891,\ 188.465)$$

Hence, with 90% confidence the height of a son whose father is 175cm tall is between 161.891cm and 188.465cm.

[2] LK: To save time I calculated them with R by executing the mean(Father) and var(Father) * (length(Father) - 1) commands.

To generate these results with R, launch RStudio, create a new project and script, and name them t11e1. Import the data from the t10e2 Excel file, and re-estimate the regression model by running the following commands:

    attach(t10e2)
    m = lm(Son ~ Father)
    summary(m)

The R function for point and interval predictions from a fitted model is

    predict(model, newdata, interval, level, …)

where newdata is an optional data frame in which to look for variables with which to predict (if omitted, the fitted values are used), interval is the type of interval calculation ("none", "confidence" or "prediction"), and level is the confidence level (the default value is 0.95).

In this example newdata is generated with the

    newdata = data.frame(Father = 175)

command and then the required point prediction and the 90% prediction interval can be obtained by executing the

    predict(m, newdata, level = 0.90, interval = "prediction")

command. It returns:

[R printout: fit, lwr, upr]

On this printout fit is the point estimate and lwr and upr are the lower and upper limits of the corresponding 90% prediction interval.

c) Predict with 90% confidence the average height of all sons whose fathers are 175cm tall.
In this case the estimated standard error is

$$s_{\hat{E}(y_0)} = s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} = 8.062 \sqrt{\frac{1}{400} + \frac{(175 - 167.856)^2}{40980.480}} = 0.493$$

and the 90% confidence interval for the average height of all sons whose fathers are 175cm tall is

$$\hat{y}_0 \pm t_{\alpha/2,\,n-2} \, s_{\hat{E}(y_0)} = 175.178 \pm 1.645 \times 0.493 = (174.367,\ 175.989)$$

Hence, with 90% confidence the average height of all sons whose fathers are 175cm tall is between 174.367cm and 175.989cm.

To obtain this interval with R, execute the

    predict(m, newdata, level = 0.90, interval = "confidence")

command. It returns:

[R printout: fit, lwr, upr]

The fit value is the same as before. However, the 90% prediction interval developed for the height of a son whose father is 175cm tall is much wider, and hence provides a less precise prediction, than this 90% confidence interval developed for the average height of all sons whose fathers are 175cm tall.

Dummy Independent Variables in Regression Models

In regression models qualitative variables, such as gender, race, qualification, preference etc., can be captured by dummy variables (D), also known as indicator or binary variables. They have two possible values, usually 1 for “success” and 0 for “failure”.

Since a dummy variable has two different possible values, it can be used to distinguish two different categories. However, if a qualitative variable has more than two categories, one dummy variable is insufficient to represent all those categories. In general, m different categories (m > 1) can be represented by a set of m – 1 dummy variables.

A dummy independent variable, D, can be introduced in a regression model in two different ways: either as a standalone independent variable or in interaction with some quantitative independent variable, X.
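The two ways of introducing a dummy variable can be sketched in R as follows. This is only an illustration on simulated data: the variable names X, D and Y and all numerical values are hypothetical and do not come from any of the tutorial data files.

```r
# Simulated (hypothetical) data in which both the intercept and the slope depend on D
set.seed(1)
n <- 200
X <- runif(n, 0, 10)                 # quantitative independent variable
D <- rbinom(n, 1, 0.5)               # dummy variable: 1 = "success", 0 = "failure"
Y <- 2 + 1.5 * X + 3 * D - 0.5 * D * X + rnorm(n)

m_int   <- lm(Y ~ X + D)             # D as a standalone (intercept) dummy
m_slope <- lm(Y ~ X + X:D)           # D only in interaction with X (slope dummy)
m_both  <- lm(Y ~ X + D + X:D)       # intercept and slope dummies together

# Coefficients of the combined specification
coef(m_both)
```

In an R formula, `X:D` denotes the product of X and D, so the third model contains D on its own as well as the DX product.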
Assuming that there are only two independent variables in the model, X and D, in the first case,

$$Y = \beta_0 + \beta_1 X + \beta_2 D + \varepsilon$$

and D has an effect on the y-intercept, which is $\beta_0$ if D = 0 or $\beta_0 + \beta_2$ if D = 1, but not on the slope parameter of X, which is always $\beta_1$, irrespective of D. For this reason, D is called an intercept dummy variable.

In the second case, the interaction between D and X is captured by their product, DX. Consequently,

$$Y = \beta_0 + \beta_1 X + \beta_2 DX + \varepsilon$$

This time the y-intercept does not depend on D, it is always $\beta_0$, but the slope parameter of X is $\beta_1$ if D = 0 or $\beta_1 + \beta_2$ if D = 1. For this reason, the DX interaction variable is called a slope dummy variable.

These two specifications can be combined, i.e. a binary qualitative variable can be represented in a regression model with an intercept dummy variable and with a slope dummy variable at the same time:

$$Y = \beta_0 + \beta_1 X + \beta_{2,1} D + \beta_{2,2} DX + \varepsilon$$

This time the y-intercept is $\beta_0$ if D = 0 or $\beta_0 + \beta_{2,1}$ if D = 1, while the slope parameter of X is $\beta_1$ if D = 0 or $\beta_1 + \beta_{2,2}$ if D = 1.

Exercise 2 (Selvanathan et al., p. 846, ex. 19.8)

In a study of computer applications, a survey asked which computer a number of companies used. The following dummy variables were created:

    D1 = 1 if Lenovo
       = 0 if not Lenovo
    D2 = 1 if Macintosh
       = 0 if not Macintosh

What computer is being referred to by each of the following pairs of values?

a) D1 = 0, D2 = 1

D1 = 0 means that the computer is not a Lenovo and D2 = 1 means that it is a Macintosh.

b) D1 = 1, D2 = 0

D1 = 1 means that the computer is a Lenovo and D2 = 0 means that it is not a Macintosh.

c) D1 = 0, D2 = 0

D1 = 0 means that the computer is not a Lenovo and D2 = 0 means that it is not a Macintosh either. Hence, it must be something else, for example, an HP.

Exercise 3

A drug manufacturer wishes to compare three drugs (Drug: A, B, and C), which it can produce for the treatment of severe depression.
The investigator would also like to study the relationship between the age of the patients (Age) and the effectiveness of the drugs (Effect, measured on a scale from 1: low to 100: high). The investigator takes a random sample of 36 patients who are comparable with respect to diagnosis and severity of depression and assigns them randomly to receive drug A, B, or C. This sample data is saved in the t11e3 Excel file. The investigator intends to use a multiple regression to model Effect as a function of Age and Drug.

a) The dependent variable (Effect) and the first independent variable (Age) are quantitative variables. The second independent variable (Drug), however, is a qualitative variable that has three categories. It can be represented in the regression model with two dummy variables defined, for example, as

    D1 = 1 if Drug = A and D1 = 0 otherwise,
    D2 = 1 if Drug = B and D2 = 0 otherwise.

These two dummy variables are sufficient to distinguish the three different drugs, since for A D1 = 1 and D2 = 0, for B D1 = 0 and D2 = 1, and for C, which is the base category this time, D1 = 0 and D2 = 0.

Launch RStudio, create a new project and script, name them t11e3, import the data from the t11e3 Excel file and execute the

    attach(t11e3)

command.

With R, dummy variables can be created by the

    ifelse(condition, 1, 0)

function, where condition is a logical expression and the result is 1 if the condition is true and 0 if the condition is false. In this case, to generate the D1 and D2 dummy variables the condition is that Drug equals A and B, respectively. Hence, execute the following commands:

    D1 = ifelse(Drug == "A", 1, 0)
    D2 = ifelse(Drug == "B", 1, 0)
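The coding of the three drugs by two dummy variables can be checked on a small example. The following sketch uses a made-up six-element Drug vector (an assumption for illustration only, not the t11e3 data) and also shows that R's own factor coding, with C set as the reference level, produces the same two indicator columns. The names Drug_f and X_mat are hypothetical helpers.

```r
# Hypothetical mini sample; in the exercise Drug comes from the t11e3 file
Drug <- c("A", "B", "C", "A", "C", "B")

D1 <- ifelse(Drug == "A", 1, 0)   # 1 for drug A, 0 otherwise
D2 <- ifelse(Drug == "B", 1, 0)   # 1 for drug B, 0 otherwise

# Each drug maps to a unique (D1, D2) pair; C is the base category (0, 0)
unique(data.frame(Drug, D1, D2))

# Alternative: a factor with C as the reference level yields the same indicators
Drug_f <- relevel(factor(Drug), ref = "C")
X_mat  <- model.matrix(~ Drug_f)  # columns: (Intercept), Drug_fA, Drug_fB
all(X_mat[, "Drug_fA"] == D1)
all(X_mat[, "Drug_fB"] == D2)
```

Because of this equivalence, a model with a releveled factor on the right-hand side should give the same fit as one with D1 and D2 entered by hand, although the handout's manual dummies make the interpretation more explicit.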
b) Estimate the following multiple linear regression model:

$$Effect_i = \beta_0 + \beta_1 Age_i + \beta_2 D1_i + \beta_3 D2_i + \varepsilon_i$$

Execute the following commands,

    m1 = lm(Effect ~ Age + D1 + D2)
    summary(m1)

to obtain:

    Call:
    lm(formula = Effect ~ Age + D1 + D2)

    Residuals:
         Min       1Q   Median       3Q      Max
    -12.5165  -3.5373   0.8309   3.9782   9.6501

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 22.02158    3.46634   6.353 3.92e-07 ***
    Age          0.67063    0.06901   9.718 4.55e-11 ***
    D1          10.24504    2.43816   4.202 0.000198 ***
    D2           0.60979    2.43674   0.250 0.803995
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 5.969 on 32 degrees of freedom
    Multiple R-squared:  0.7896,    Adjusted R-squared:  0.7699
    F-statistic: 40.03 on 3 and 32 DF,  p-value: 6.095e-11

The sample regression equation is

$$\widehat{Effect}_i = 22.022 + 0.671 \, Age_i + 10.245 \, D1_i + 0.610 \, D2_i$$

c) Test the overall utility of this multiple regression model.

From the printout in part (b), the F-statistic is 40.03 and the corresponding p-value is practically zero. Consequently, we can reject the null hypothesis that all three slope parameters are simultaneously zero at any reasonable significance level. At least one of Age, D1 and D2 has a significant effect on Effect, so this regression model is useful.

d) Interpret the unadjusted and adjusted coefficients of determination. Why do they have different values?

Again, from the printout, the unadjusted coefficient of determination is $R^2 = 0.7896$. It means that about 79% of the sample variation in the effectiveness of the drugs is accounted for by this regression, i.e. by the age of the patients and by the type of the drug. The adjusted coefficient of determination is slightly smaller, 0.7699, so after having taken the sample size and the number of independent variables into consideration, about 77% of the sample variation in the effectiveness of the drugs is accounted for by the age of the patients and by the type of the drug.

e) Interpret the coefficients.
In this case the y-intercept estimate does not have a logical interpretation because Age cannot be zero.

The first slope estimate suggests that keeping D1 and D2 constant, i.e. for any given drug, with every additional year of Age the effectiveness of the drug increases by about 0.671 points.

As regards the second and the third slope estimates, since they belong to the two intercept dummy variables used to distinguish the three different drugs, they cannot be considered separately from each other. Instead, assuming that Age is kept constant, we can consider the values of Effect predicted for the three drugs and compare them to each other.

Recalling that for drug A D1 = 1 and D2 = 0 while for drug C D1 = 0 and D2 = 0, the difference between the estimated effectiveness of drugs A and C for any given Age is

$$(\hat\beta_0 + \hat\beta_1 Age + \hat\beta_2 \times 1 + \hat\beta_3 \times 0) - (\hat\beta_0 + \hat\beta_1 Age + \hat\beta_2 \times 0 + \hat\beta_3 \times 0) = \hat\beta_2 = 10.245$$

Similarly, since for drug B D1 = 0 and D2 = 1 while for drug C D1 = 0 and D2 = 0, the difference between the estimated effectiveness of drugs B and C for any given Age is

$$(\hat\beta_0 + \hat\beta_1 Age + \hat\beta_2 \times 0 + \hat\beta_3 \times 1) - (\hat\beta_0 + \hat\beta_1 Age + \hat\beta_2 \times 0 + \hat\beta_3 \times 0) = \hat\beta_3 = 0.610$$

Finally, we can compare the estimated effectiveness of drugs A and B for any given Age:

$$(\hat\beta_0 + \hat\beta_1 Age + \hat\beta_2 \times 1 + \hat\beta_3 \times 0) - (\hat\beta_0 + \hat\beta_1 Age + \hat\beta_2 \times 0 + \hat\beta_3 \times 1) = \hat\beta_2 - \hat\beta_3 = 10.245 - 0.610 = 9.635$$

Hence, for any given Age, drug A is expected to be 10.245 points more effective than drug C, drug B is expected to be 0.610 points more effective than drug C, and drug A is expected to be 9.635 points more effective than drug B.

f) In the multiple regression model estimated in part (b), the D1 and D2 dummy variables are intercept dummy variables as they alter the y-intercept only. Consequently, the model assumes that the marginal impact of Age on Effect is $\beta_1$, no matter which particular Drug is used. Let’s now relax this restriction.
To let the marginal impact of Age on Effect vary by Drug, the D1 and D2 dummy variables need to be introduced into the model as slope dummy variables as well, implying the following multiple regression model:

$$Effect_i = \beta_0 + \beta_1 Age_i + \beta_2 D1_i + \beta_3 D1_i Age_i + \beta_4 D2_i + \beta_5 D2_i Age_i + \varepsilon_i$$

In this new model Age interacts with D1 and D2, i.e. with Drug. Execute the following commands,

    m2 = lm(Effect ~ Age + D1 + D1*Age + D2 + D2*Age)
    summary(m2)

to obtain:

    Call:
    lm(formula = Effect ~ Age + D1 + D1 * Age + D2 + D2 * Age)

    Residuals:
       Min     1Q Median     3Q    Max
    -6.584 -2.668  0.099  2.780  6.416

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  6.21138    3.30103   1.882 0.069619 .
    Age          1.03339    0.07128  14.498 4.31e-15 ***
    D1          41.30421    5.01075   8.243 3.35e-09 ***
    D2          21.95811    5.01709   4.377 0.000134 ***
    Age:D1      -0.70288    0.10738  -6.546 3.06e-07 ***
    Age:D2      -0.48886    0.10879  -4.494 9.69e-05 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3.868 on 30 degrees of freedom
    Multiple R-squared:  0.9172,    Adjusted R-squared:  0.9034
    F-statistic: 66.43 on 5 and 30 DF,  p-value: 2.59e-15

The sample regression equation is

$$\widehat{Effect}_i = 6.211 + 1.033 \, Age_i + 41.304 \, D1_i - 0.703 \, D1_i Age_i + 21.958 \, D2_i - 0.489 \, D2_i Age_i$$

This new regression is also significant (F-test) and has a much higher adjusted coefficient of determination (0.9034) than the original regression (0.7699). In addition, every slope estimate is significantly different from zero (two-tail t-tests), supporting the new specification.

g) What do the slope coefficients of the new model suggest to you?

To answer this question, it is best to consider the three drugs one by one. Starting with the base category, for drug C (i.e.
for D1 = 0 and D2 = 0) the sample regression equation is

$$\widehat{Effect}_i = 6.211 + 1.033 \, Age_i$$

For drug A (D1 = 1 and D2 = 0) the sample regression equation is

$$\widehat{Effect}_i = 6.211 + 1.033 \, Age_i + 41.304 - 0.703 \, Age_i = 47.515 + 0.330 \, Age_i$$

while for drug B (D1 = 0 and D2 = 1) it is

$$\widehat{Effect}_i = 6.211 + 1.033 \, Age_i + 21.958 - 0.489 \, Age_i = 28.169 + 0.544 \, Age_i$$

The corresponding sample regression lines have different intercepts and slopes, which can be best compared to each other visually. In order to do so, first we need to create an artificial data series for Age and calculate the corresponding estimated Effect for each drug separately. We are going to denote these new series as age, effect_A, effect_B and effect_C so as to distinguish them from the original Age and Effect series. Execute the following commands:

    age = 0:100
    effect_A = 47.515 + 0.330*age
    effect_B = 28.169 + 0.544*age
    effect_C = 6.211 + 1.033*age

The first command creates age and sets it equal to 0, 1, 2, …, 100, and the other three commands then calculate the corresponding estimated effect for the three drugs.

Next, we can depict effect_C with the plot() function and then add two lines for effect_A and effect_B with the lines() function.[3] To do so, execute the following commands:

    plot(age, effect_C, type = "l", col = "red", lwd = 2,
         xlab = "Age", ylab = "Estimated effect")
    lines(age, effect_A, col = "blue", lwd = 2)
    lines(age, effect_B, col = "green", lwd = 2)
    title("Estimated drug effectiveness")
    legend("topleft", c("Drug A", "Drug B", "Drug C"), lwd = c(2,2,2),
           col = c("blue", "green", "red"))

[3] You can refresh your knowledge about the plot() and lines() functions by reviewing Tutorial 2.

In these commands, the lwd option specifies the line width relative to the default width, which is 1. The last two commands add a title and a legend to the plot.
The first argument to the legend command specifies its position, the second is the legend text, and the following two repeat the lwd and col settings of the previous plot and lines commands, because R requires them to be specified again for the legend. You should now see the following in your Plots panel:

[Plot: "Estimated drug effectiveness"; x-axis: Age; y-axis: Estimated effect; lines for Drug A (blue), Drug B (green) and Drug C (red)]

Visual inspection of this graph suggests that drug A is more effective than drug B for almost every age, but the difference between them gets smaller with age. As regards drug C, it appears to be far less effective for younger patients than the other two drugs, but its effectiveness increases faster with age and it is the most effective for middle-aged and elderly people.

h) Do the casual observations we made on the plot in part (g) reflect significant differences between the effectiveness of drugs A and C and between B and C?

Recall that the population regression model is

$$Effect_i = \beta_0 + \beta_1 Age_i + \beta_2 D1_i + \beta_3 D1_i Age_i + \beta_4 D2_i + \beta_5 D2_i Age_i + \varepsilon_i$$

Given the definitions of the D1 and D2 dummy variables, for the three drugs it collapses to

$$C: \quad Effect_i = \beta_0 + \beta_1 Age_i + \varepsilon_i$$
$$A: \quad Effect_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) Age_i + \varepsilon_i$$
$$B: \quad Effect_i = (\beta_0 + \beta_4) + (\beta_1 + \beta_5) Age_i + \varepsilon_i$$

This suggests that whether the three drugs indeed differ in terms of effectiveness depends on the $\beta_2$, $\beta_3$, $\beta_4$ and $\beta_5$ parameters. In part (f), based on the reported t-statistics and p-values, we already concluded that the estimates of these parameters are significantly different from zero. Let’s now elaborate on this issue.

Our figure shows that drugs A and B have higher y-intercepts than drug C. To see whether these differences are significantly positive, we need to perform two right-tail t-tests with the following hypotheses:

$$H_0: \beta_2 = 0, \ H_A: \beta_2 > 0 \quad \text{and} \quad H_0: \beta_4 = 0, \ H_A: \beta_4 > 0$$

The estimates of $\beta_2$ and $\beta_4$ are positive (41.304 and 21.958) and half of their reported Pr(>|t|) values are practically zero, so we can reject both null hypotheses and conclude that drugs A and B have larger y-intercepts than drug C.
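The halving of the reported two-tail p-values can be verified directly from the t distribution. The sketch below copies the two t statistics and the residual degrees of freedom from the summary(m2) printout; the object names (t_D1, p_right_D1, etc.) are hypothetical and chosen only for this illustration.

```r
# t statistics and residual degrees of freedom copied from the summary(m2) printout
t_D1 <- 8.243    # t value of the D1 coefficient (beta_2)
t_D2 <- 4.377    # t value of the D2 coefficient (beta_4)
df   <- 30

# Two-tail p-values, i.e. what the Pr(>|t|) column reports
p_two_D1 <- 2 * pt(abs(t_D1), df, lower.tail = FALSE)
p_two_D2 <- 2 * pt(abs(t_D2), df, lower.tail = FALSE)

# Right-tail p-values: because both estimates are positive,
# they are exactly half of the two-tail p-values
p_right_D1 <- pt(t_D1, df, lower.tail = FALSE)
p_right_D2 <- pt(t_D2, df, lower.tail = FALSE)

c(p_right_D1, p_right_D2)
```

For a left-tail test on a negative estimate the same logic applies with `lower.tail = TRUE`. If an estimate had the "wrong" sign for the alternative hypothesis, the one-tail p-value would instead be one minus half of the reported value, so halving is only valid when the sign agrees with the alternative.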
As regards the slopes, drug C has a steeper sample regression line than drugs A and B. We can again perform two t-tests, this time left-tail tests, with the following hypotheses:

$$H_0: \beta_3 = 0, \ H_A: \beta_3 < 0 \quad \text{and} \quad H_0: \beta_5 = 0, \ H_A: \beta_5 < 0$$

The estimates of $\beta_3$ and $\beta_5$ are negative (-0.703 and -0.489) and half of their reported Pr(>|t|) values are practically zero, so we can reject both null hypotheses and conclude that drugs A and B have smaller slopes than drug C.

Exercise 4 (Selvanathan et al., p. 850, ex. 19.16)

Absenteeism is a serious employment problem in most countries. It is estimated that absenteeism reduces potential output by more than 10%. Two economists launched a research project to learn more about the problem. They randomly selected 100 organisations to participate in a one-year study. For each organisation, they recorded the average number of days absent per employee (Absent) and several variables thought to affect absenteeism. File t11e4 contains the observations on the following variables:

    Absent:   average number of days absent per employee;
    Wage:     average employee wage (annual, $);
    PctPT:    percentage of part-time employees;
    PctU:     percentage of unionised employees;
    AvShift:  availability of shift work (1 = yes; 0 = no);
    UMRel:    union-management relationship (1 = good; 0 = not good).

a) Estimate a multiple regression model of Absent on the other variables in the economists’ data set.

Launch RStudio, create a new project and script, name them t11e4, import the data from the t11e4 Excel file and execute the following commands:

    attach(t11e4)
    m = lm(Absent ~ Wage + PctPT + PctU + AvShift + UMRel)
    summary(m)

You should get the following printout:

[R regression printout]

b) Is this regression useful in explaining the variation in absenteeism among the organisations?

To answer this question, we need to perform the F-test of overall significance and to consider the adjusted coefficient of determination.
From this printout, the F-statistic is 21.4 and it is significant at any level since its p-value is practically zero. Consequently, the null hypothesis that all slope parameters are simultaneously zero can be rejected at any reasonable significance level. This means that at least one independent variable has a significant effect on absenteeism, so this regression model is useful.

The adjusted $R^2$ is 0.5075, so after having taken the sample size and the number of independent variables into consideration, this multiple regression model can account for almost 51% of the total variation in absenteeism.

c) What do you think about the relationships between the dependent variable and each independent variable? Do you think that the slope estimates have logical signs?

Given the definitions of the variables, I would expect PctU and AvShift to have positive impacts on Absent (i.e. more union members and shift work likely increase absenteeism) and Wage, PctPT and UMRel to have negative impacts on Absent (i.e. higher wages, more part-time work and a good union-management relationship likely decrease absenteeism). Hence, all slope estimates have the logical sign.

d) Are the slope coefficients significant in the logical directions?

Every slope coefficient has the logical/expected sign and since the p-values for the t-tests are all smaller than 0.0025, we conclude that every slope estimate is significant in the logical direction practically even at the 0.5% significance level.[4]

e) Interpret the slope coefficients.

The slope estimates suggest that, keeping all other independent variables in the model constant,

- When the average wage increases by one dollar (or by $1000), average absenteeism is expected to drop by about 0.0002 (or 0.2) days.
- When the proportion of part-time employees increases by one percentage point, average absenteeism is expected to drop by about 0.1069 days.
- When the proportion of unionised employees increases by one percentage point, average absenteeism is expected to rise by about 0.0599 days.
- Availability of shift work is expected to increase average absenteeism by about 1.5619 days.
- A good union-management relationship is expected to reduce average absenteeism by about 2.6366 days.

[4] When you consider the reported p-values of the t-tests, i.e. Pr(>|t|), recall that on the printout e stands for exponent, which means the number of tens you multiply a number by. For example, 8.12e-14 = 8.12 × 10^(-14) = 8.12 / 10^14 = 8.12 / 100000000000000 = 0.0000000000000812.

f) Can we infer that the availability of shift work is related to absenteeism?

If shift work is related to absenteeism, then the coefficient of the AvShift dummy variable is expected to be significant. This implies a two-tail t-test with

$$H_0: \beta_4 = 0, \ H_A: \beta_4 \neq 0$$

The Pr(>|t|) value of AvShift is 0.0025, so $H_0$ can be rejected even at the 0.3% significance level. Hence, we conclude that shift work is indeed related to absenteeism.

g) Is there enough evidence to infer that in organisations where the union-management relationship is good, absenteeism is lower?

Since the UMRel dummy variable is equal to 1 if the relationship is good and 0 otherwise, this question implies a left-tail t-test with

$$H_0: \beta_5 = 0, \ H_A: \beta_5 < 0$$

Based on the expected signs of the slope parameters, we have already performed this test in part (d). We rejected $H_0$, so we can conclude that in organisations where the union-management relationship is good, absenteeism is likely lower.

h) Does it appear that the normality requirement is violated? If it is, is non-normality a serious issue this time? Explain.
As in previous exercises, save the residuals, illustrate them with a histogram and a QQ plot, obtain the usual descriptive statistics and perform the SW test by executing

    res = residuals(m)
    hist(res, freq = FALSE, ylim = c(0, 0.2), col = "lightblue")
    lines(seq(-6, 8, by = 0.05),
          dnorm(seq(-6, 8, by = 0.05), mean(res), sd(res)), col = "red")
    qqnorm(res, main = "Normal Q-Q Plot", xlab = "Theoretical Quantiles",
           ylab = "Sample Quantiles", pch = 19, col = "lightgreen")
    qqline(res, col = "royalblue4")
    library(pastecs)
    round(stat.desc(res, basic = FALSE, norm = TRUE), 4)

You should get the following printouts.

[Histogram of the residuals with fitted normal density; Normal Q-Q Plot; stat.desc printout]

The histogram of the residuals is slightly skewed to the right and on the QQ plot many points are above the straight line. The mean and the median are similar, the skewness and excess kurtosis statistics are both close to zero, but skew.2SE > 1 while kurt.2SE < 1. Finally, the p-value of the SW test is normtest.p = 0.0389, so the null hypothesis of normality can be rejected at the 4% significance level. All things considered, the error variables might not be normally distributed.

Note, however, that due to the relatively large sample size, the normality assumption is not crucial this time. Even if it is violated, we can still rely on the t- and F-tests.

i) Is multicollinearity a problem? Explain.

As in Exercise 1 of Tutorial 10, execute

    library(Hmisc)
    rcorr(as.matrix(t11e4), type = "pearson")
    library(car)
    round(vif(m), 4)

to get

[rcorr printout]

and

[vif printout]

As you can see,

i. The coefficient of determination is only about 0.53 though each independent variable is strongly significant individually;
ii. Every correlation coefficient between the independent variables is below 0.1 in absolute value;
iii. The VIF values are all much smaller than 5.

These all indicate that multicollinearity is of no concern in this model.

j) Is heteroskedasticity likely in this model?
Plot the residuals against the estimated Absent (i.e. y-hat) series and perform White’s test for heteroskedasticity.

As in Exercise 3 of Tutorial 10, execute

    yhat = fitted.values(m)
    plot(yhat, res, main = "OLS residuals versus yhat",
         col = "red", pch = 19, cex = 0.75)
    library(lmtest)
    bptest(m, ~ Wage + PctPT + PctU + AvShift + UMRel +
           I(Wage^2) + I(PctPT^2) + I(PctU^2) + I(AvShift^2) + I(UMRel^2) +
           I(Wage * PctPT) + I(Wage * PctU) + I(Wage * AvShift) + I(Wage * UMRel) +
           I(PctPT * PctU) + I(PctPT * AvShift) + I(PctPT * UMRel) +
           I(PctU * AvShift) + I(PctU * UMRel) + I(AvShift * UMRel))

to get

[Plot: "OLS residuals versus yhat"; x-axis: yhat; y-axis: res]

and

[bptest printout]

The residual plot does not reveal any clear pattern and White’s test maintains the null hypothesis of homoskedasticity (p-value = 0.6461), so we do not need to worry about heteroskedasticity.

Exercises for Assessment

Exercise 5 (Selvanathan et al., p. 846, ex. 19.7)

Create and identify indicator variables to represent the following nominal variables.

a) Religious affiliation (Catholic, Protestant and other).
b) Working shift (9 a.m.–5 p.m., 5 p.m.–1 a.m., and 1 a.m.–9 a.m.).
c) Supervisor (David Jones, Mary Brown, Rex Ralph and Kathy Smith).

Exercise 6 (Selvanathan et al., p. 846, ex. 19.9)

The director of a graduate school of business wanted to find a better way of deciding which students should be accepted into the MBA program. Currently, the records of the applicants are examined by the admissions committee, which looks at the undergraduate grade point average (UGPA) and the MBA admission score (MBAA). The director believed that the type of undergraduate degree also influenced the student’s MBA grade point average (MBAGPA). The most common undergraduate degrees of students attending the graduate school of business are BCom, BEng, BSc and BA.
Because the type of degree is a qualitative variable, the following three dummy variables were created:

    D1 = 1 if the degree is BCom and 0 if the degree is not BCom
    D2 = 1 if the degree is BEng and 0 if the degree is not BEng
    D3 = 1 if the degree is BSc and 0 if the degree is not BSc.

The director took a random sample of 100 students who entered the program two years ago, and recorded for each student the MBAGPA, UGPA and MBAA scores and the values of the D1, D2, D3 dummy variables. These data are saved in the t11e6 Excel file.

a) Using these data, estimate the following model

$$MBAGPA = \beta_0 + \beta_1 UGPA + \beta_2 MBAA + \beta_3 D1 + \beta_4 D2 + \beta_5 D3 + \varepsilon$$

Does the model seem to perform satisfactorily? How do you interpret the slope coefficients?

b) Test to determine whether individually each of the independent variables is linearly related to MBAGPA.

c) Is every slope estimate significantly positive?

d) Can we conclude that, on average, a BCom graduate performs better than a BA graduate?

e) Predict the MBAGPA of a BEng graduate with 3.0 undergraduate GPA and 700 MBAA score, first manually and then with R.

f) Repeat part (e) for a BA graduate with the same undergraduate GPA and MBAA score.