Regression and Correlation In-Class Demonstration

Overview

Multiple regression is a very flexible methodology that is used widely in the social and behavioral sciences. Using SPSS and data from the 2002 GSS, I will demonstrate some of the uses of simple and multiple regression.

Operationalization of Status Attainment Theory

Let's suppose we were interested in researching the determinants of education. Imagine that we developed a status attainment theory of advantage in educational achievement across generations of family members. We could look at whether a respondent's educational attainment was related to his or her father's status attainment.

[Figure: scatter diagram of respondent's highest year of school completed (Y axis) against father's highest year of school completed (X axis).]

Notice that there is a positive relationship between these two interval-ratio variables. High values of respondent's education are associated with high values of father's education, and low values of respondent's education are associated with low values of father's education. The relationship looks roughly linear. We can estimate a regression equation that models the linear relationship between respondent's education and father's education. A respondent's education is best treated as the dependent variable, and father's education as the independent variable. The simple regression for the effect of father's education on respondent's education yields the following statistics:

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .421a   .177       .177                2.636

a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER

ANOVAb

Model 1      Sum of Squares   df     Mean Square   F         Sig.
Regression   1647.384         1      1647.384      237.114   .000a
Residual     7635.462         1099   6.948
Total        9282.847         1100

a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED

Coefficientsa

Model 1                                  B        Std. Error   Beta   t        Sig.
(Constant)                               10.351   .229                45.241   .000
HIGHEST YEAR SCHOOL COMPLETED, FATHER    .293     .019         .421   15.399   .000

a. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED

Notice that the results include a number of statistics that we have already talked about, including Pearson's correlation coefficient (r), the coefficient of determination (r2), the residual sum of squares, the total sum of squares, an estimated intercept (labeled "Constant"), and an estimated slope for father's education. How do we interpret each of these statistics?

There are also a number of other statistics that we have not yet discussed. Among these are the standard error and the t statistic. These are used to determine whether each independent variable in the regression model has a statistically significant effect on the dependent variable. We will discuss them in more detail in later lectures.

Multiple Regression

The real world is complex, and a single variable is often not sufficient to explain variation in another variable. Regression analysis is useful because it is flexible enough to be expanded to simultaneously estimate the independent effects of a number of independent variables on a single dependent variable. This type of extension of the simple regression model is referred to as multiple regression.

Suppose that we thought that other factors, in addition to father's educational attainment, affected the respondent's educational attainment. Let's say that we thought that education was related to the age of the respondent. We can simultaneously look at the effects of respondent's age and father's educational attainment on respondent's educational attainment.
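Before turning to the multiple-regression output, the quantities in the simple-regression output above (slope, intercept, and Pearson's r) can be reproduced by hand from their textbook formulas. The sketch below uses six made-up father/respondent pairs, not the GSS sample:

```python
from math import sqrt

def simple_ols(x, y):
    """Least-squares slope b, intercept a, and Pearson's r for paired data.

    b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
    a = mean_y - b * mean_x
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    r = sxy / sqrt(sxx * syy)
    return b, a, r

# Hypothetical illustration only -- not the GSS data reported above.
father_ed = [8, 10, 12, 12, 14, 16]
resp_ed = [10, 12, 12, 14, 16, 16]
b, a, r = simple_ols(father_ed, resp_ed)
print(b, a, r * r)  # slope, intercept, coefficient of determination
```

Note that the "R Square" column in the SPSS Model Summary is simply the square of Pearson's r computed this way.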
The scatter plot of all three variables would look like the following:

[Figure: three-dimensional scatter plot of respondent's highest year of school completed against father's highest year of school completed and age of respondent.]

This plot shows the educational attainment of the respondent plotted on the Y axis, and the father's educational attainment and the age of the respondent plotted on the X and Z axes, respectively. Notice that a regression line is no longer adequate for modeling these data. Since the data are in three dimensions, we need a regression plane. The dimensionality of the data changes with each additional independent variable, and beyond three dimensions it is difficult for us to imagine the dimensionality of the data. In general, the regression reduces error in a (p + 1)-dimensional space, where p is the number of independent variables in the model.

We can estimate the multiple regression equation for the effects of respondent's age and father's educational attainment on respondent's educational attainment:

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .420a   .177       .176                2.637
2       .422b   .179       .177                2.635

a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT

ANOVAc

Model 1      Sum of Squares   df     Mean Square   F         Sig.
Regression   1637.111         1      1637.111      235.405   .000a
Residual     7629.016         1097   6.954
Total        9266.127         1098

Model 2      Sum of Squares   df     Mean Square   F         Sig.
Regression   1654.004         2      827.002       119.072   .000b
Residual     7612.123         1096   6.945
Total        9266.127         1098

a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT
c. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED

Coefficientsa

Model 1                                  B           Std. Error   Beta   t        Sig.
(Constant)                               10.354      .229                45.205   .000
HIGHEST YEAR SCHOOL COMPLETED, FATHER    .293        .019         .420   15.343   .000

Model 2                                  B           Std. Error   Beta   t        Sig.
(Constant)                               9.860       .391                25.214   .000
HIGHEST YEAR SCHOOL COMPLETED, FATHER    .304        .020         .437   14.881   .000
AGE OF RESPONDENT                        7.826E-03   .005         .046   1.560    .119

a. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED

The output shows two regression equations. The first is the original regression showing the effect of father's educational attainment on respondent's educational attainment. The second shows the effects of both respondent's age and father's educational attainment on respondent's educational attainment.

Notice the changes in the statistics. What happened to the value of R-Square when we added the effect of age? Did the intercept or the slope of father's education change? Why is that?

Nonlinear Effects

Regression models measure the linear association between independent variables and a single dependent variable. However, sometimes the relationship between two variables is not linear. Let's look at the relationship between respondent's education and respondent's age:

[Figure: scatter diagram of respondent's highest year of school completed against age of respondent.]

Notice that for younger respondents (recall that the GSS collects data on anyone 18 years or older) there is an apparent positive relationship between age and education. For older respondents, there is an apparent negative relationship. Why do you think that this is the case?

Regression models are flexible enough to handle nonlinear effects. For instance, one can include a second-order polynomial (a squared term) to model a curvilinear relationship. In this case we can construct an age square term by creating a new variable that equals the square of the age variable. We can then estimate the effect of age including the age square term:

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .420a   .177       .176                2.637
2       .422b   .179       .177                2.635
3       .445c   .198       .196                2.604

a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT
c. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT, AGE SQUARE TERM

ANOVAd

Model 1      Sum of Squares   df     Mean Square   F         Sig.
Regression   1637.111         1      1637.111      235.405   .000a
Residual     7629.016         1097   6.954
Total        9266.127         1098

Model 2      Sum of Squares   df     Mean Square   F         Sig.
Regression   1654.004         2      827.002       119.072   .000b
Residual     7612.123         1096   6.945
Total        9266.127         1098

Model 3      Sum of Squares   df     Mean Square   F         Sig.
Regression   1838.377         3      612.792       90.338    .000c
Residual     7427.750         1095   6.783
Total        9266.127         1098

a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT
c. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT, AGE SQUARE TERM
d. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED

Coefficientsa

Model 1                                  B           Std. Error   Beta    t        Sig.
(Constant)                               10.354      .229                 45.205   .000
HIGHEST YEAR SCHOOL COMPLETED, FATHER    .293        .019         .420    15.343   .000

Model 2                                  B           Std. Error   Beta    t        Sig.
(Constant)                               9.860       .391                 25.214   .000
HIGHEST YEAR SCHOOL COMPLETED, FATHER    .304        .020         .437    14.881   .000
AGE OF RESPONDENT                        7.826E-03   .005         .046    1.560    .119

Model 3                                  B           Std. Error   Beta    t        Sig.
(Constant)                               7.039       .665                 10.589   .000
HIGHEST YEAR SCHOOL COMPLETED, FATHER    .300        .020         .431    14.857   .000
AGE OF RESPONDENT                        .138        .026         .810    5.421    .000
AGE SQUARE TERM                          -.001       .000         -.779   -5.213   .000

a. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED

The output shows results for three equations. The first two have already been discussed. The third shows a model that adds an age square term. How does the addition of an age square term change the model estimates? To understand the effect of age we need to consider both the age and age square terms.
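One way to see what the age and age-square slopes imply jointly is to trace their combined contribution, b1*age + b2*age^2, across the observed age range. The sketch below uses the rounded Model 3 slopes printed above (.138 and -.001); because the age-square slope is reported to only three decimals, the turning point it implies is approximate:

```python
# Rounded slopes from Model 3 above; the unrounded age-square coefficient
# would shift the peak somewhat, so treat this as illustrative only.
b_age, b_age2 = 0.138, -0.001

def age_contribution(age):
    """Joint contribution of the age and age-square terms to predicted education."""
    return b_age * age + b_age2 * age ** 2

# Trace the curve over the observed GSS age range (18 to 89).
curve = {age: age_contribution(age) for age in range(18, 90)}
peak_age = max(curve, key=curve.get)  # vertex of the parabola: -b_age / (2 * b_age2)
print(peak_age)
```

The positive linear term pulls predicted education up at younger ages while the negative squared term eventually dominates, which is exactly the rise-then-decline pattern visible in the scatter diagram.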
If we made a table of age and age square, it would look like the following (recall that age ranges from 18 to 89):

Age   Age Square
18    324
19    361
20    400
21    441
22    484
23    529
24    576
25    625
26    676
27    729
28    784
29    841
30    900
31    961
32    1024
33    1089
34    1156
35    1225
36    1296
…     …
89    7921

If we weighted the age and age-squared data by their slope estimates and graphed the result, we would get the following graph:

[Figure: graph of the curvilinear effect of age on education, ages 18 to 88; the curve rises through the younger ages, levels off, and then declines.]

Notice that in this way we can model nonlinear effects in a regression model.

Ordinal and Nominal Independent Variables

Regression analysis is flexible enough to include the effects of interval-ratio as well as nominal and ordinal variables. However, nominal and ordinal variables can only be included in the regression model as a series of mutually exclusive and exhaustive dichotomous variables. Why do you think this is the case?

Let's suppose that we wanted to use a measure of social class to examine the determinants of educational attainment. The GSS includes data on a respondent's social class, which is measured as an ordinal variable:

SUBJECTIVE CLASS IDENTIFICATION (N valid = 1492, missing = 8)

                          Frequency   Percent   Valid Percent   Cumulative Percent
Valid    LOWER CLASS      89          5.9       6.0             6.0
         WORKING CLASS    678         45.2      45.4            51.4
         MIDDLE CLASS     676         45.1      45.3            96.7
         UPPER CLASS      49          3.3       3.3             100.0
         Total            1492        99.5      100.0
Missing  DK               6           .4
         NA               2           .1
         Total            8           .5
Total                     1500        100.0

Examining the scatter diagram of social class and educational attainment would be difficult to interpret, so it would be better to split the file by the class variable and to look at separate histograms of educational attainment for each class:

[Figure: histograms of highest year of school completed, by class.
 CLASS 1, LOWER CLASS:   Mean = 10.3, Std. Dev = 3.18, N = 89.00
 CLASS 2, WORKING CLASS: Mean = 12.6, Std. Dev = 2.50, N = 675.00
 CLASS 3, MIDDLE CLASS:  Mean = 14.2, Std. Dev = 2.93, N = 675.00
 CLASS 4, UPPER CLASS:   Mean = 15.0, Std. Dev = 3.27, N = 49.00]

Notice that there are differences in the distribution and in the average educational attainment for different classes. Roughly, the higher the respondent's social class, the more education the respondent is likely to have achieved.

In order to include a measure of social class in the model, we have to turn the class variable into a series of dichotomous variables. Since there are four classes, we can make four separate variables for class (LOWER CLASS, WORKING CLASS, MIDDLE CLASS, and UPPER CLASS). Using the recode procedure in SPSS, I created these four variables:

LOWER CLASS
                             Frequency   Percent   Valid Percent   Cumulative Percent
Valid    NOT LOWER CLASS     1403        93.5      94.0            94.0
         LOWER CLASS         89          5.9       6.0             100.0
         Total               1492        99.5      100.0
Missing  System              8           .5
Total                        1500        100.0

WORKING CLASS
                             Frequency   Percent   Valid Percent   Cumulative Percent
Valid    NOT WORKING CLASS   814         54.3      54.6            54.6
         WORKING CLASS       678         45.2      45.4            100.0
         Total               1492        99.5      100.0
Missing  System              8           .5
Total                        1500        100.0

MIDDLE CLASS
                             Frequency   Percent   Valid Percent   Cumulative Percent
Valid    NOT MIDDLE CLASS    816         54.4      54.7            54.7
         MIDDLE CLASS        676         45.1      45.3            100.0
         Total               1492        99.5      100.0
Missing  System              8           .5
Total                        1500        100.0

UPPER CLASS
                             Frequency   Percent   Valid Percent   Cumulative Percent
Valid    NOT UPPER CLASS     1443        96.2      96.7            96.7
         UPPER CLASS         49          3.3       3.3             100.0
         Total               1492        99.5      100.0
Missing  System              8           .5
Total                        1500        100.0

These variables collectively can be used in the regression to measure the effect of social class.
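The recode step described above (done with the SPSS recode procedure) amounts to mapping each class category to a set of 0/1 indicators, exactly one of which equals 1 for each case. A minimal sketch, using the GSS category labels:

```python
# Category labels follow the GSS subjective class variable shown above.
CLASSES = ("LOWER CLASS", "WORKING CLASS", "MIDDLE CLASS", "UPPER CLASS")

def to_dummies(class_label):
    """Recode one ordinal class value into four mutually exclusive,
    exhaustive 0/1 dummy variables (exactly one equals 1)."""
    if class_label not in CLASSES:
        raise ValueError(f"unknown class: {class_label!r}")
    return {c: int(c == class_label) for c in CLASSES}

row = to_dummies("WORKING CLASS")
print(row)
```

Because the indicators are exhaustive, the four dummies for any case always sum to 1 — which is precisely why all four cannot enter the regression at once, as discussed next.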
However, having information on just three of these four variables is sufficient to predict the value of the fourth variable. Entering all four variables into the regression model would introduce redundant information. Because the variables are a perfect linear combination, including all of them in a model would make it impossible for the model to estimate properly. Therefore, we purposely exclude one category, which becomes a reference category. The effects of the non-excluded variables are interpreted in reference to the omitted category.

In the following example, I exclude "middle class", and I interpret the effects of the non-excluded variables as the average difference between each variable and the reference category (in this case "middle class"). The regression results are the following:

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .422a   .178       .177                2.616
2       .425b   .181       .179                2.613
3       .444c   .197       .195                2.587
4       .528d   .278       .274                2.456

a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT
c. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT, AGE SQUARE TERM
d. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT, AGE SQUARE TERM, UPPER CLASS, LOWER CLASS, WORKING CLASS

ANOVAe

Model 1      Sum of Squares   df     Mean Square   F         Sig.
Regression   1616.343         1      1616.343      236.196   .000a
Residual     7479.654         1093   6.843
Total        9095.996         1094

Model 2      Sum of Squares   df     Mean Square   F         Sig.
Regression   1642.109         2      821.055       120.285   .000b
Residual     7453.887         1092   6.826
Total        9095.996         1094

Model 3      Sum of Squares   df     Mean Square   F         Sig.
Regression   1792.789         3      597.596       89.273    .000c
Residual     7303.208         1091   6.694
Total        9095.996         1094

Model 4      Sum of Squares   df     Mean Square   F         Sig.
Regression   2531.930         6      421.988       69.945    .000d
Residual     6564.066         1088   6.033
Total        9095.996         1094

a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT
c. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT, AGE SQUARE TERM
d. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT, AGE SQUARE TERM, UPPER CLASS, LOWER CLASS, WORKING CLASS
e. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED

Coefficientsa

Model 1                                  B           Std. Error   Beta    t        Sig.
(Constant)                               10.381      .228                 45.569   .000
HIGHEST YEAR SCHOOL COMPLETED, FATHER    .291        .019         .422    15.369   .000

Model 2                                  B           Std. Error   Beta    t        Sig.
(Constant)                               9.769       .388                 25.147   .000
HIGHEST YEAR SCHOOL COMPLETED, FATHER    .305        .020         .442    15.060   .000
AGE OF RESPONDENT                        9.713E-03   .005         .057    1.943    .052

Model 3                                  B           Std. Error   Beta    t        Sig.
(Constant)                               7.201       .664                 10.842   .000
HIGHEST YEAR SCHOOL COMPLETED, FATHER    .302        .020         .437    15.031   .000
AGE OF RESPONDENT                        .129        .026         .756    5.035    .000
AGE SQUARE TERM                          -.001       .000         -.712   -4.744   .000

Model 4                                  B           Std. Error   Beta    t        Sig.
(Constant)                               9.157       .657                 13.947   .000
HIGHEST YEAR SCHOOL COMPLETED, FATHER    .239        .020         .345    11.982   .000
AGE OF RESPONDENT                        .118        .024         .693    4.857    .000
AGE SQUARE TERM                          -.001       .000         -.715   -5.012   .000
LOWER CLASS                              -2.952      .368         -.216   -8.016   .000
WORKING CLASS                            -1.381      .161         -.238   -8.570   .000
UPPER CLASS                              .719        .409         .046    1.759    .079

a. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED

What happened to the values of the fit statistics and to the slopes and intercept when the effect of social class was included in the model?

Model Building

The effect of any variable that is not included in the regression equation becomes part of the error term, an unobserved variable that is estimated in the regression. Just as important as what is estimated in the model is what is not estimated in the model. Given this, how do we decide what to include in the model?
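Returning to the class dummies in Model 4 above: each dummy slope is the adjusted difference in expected years of schooling relative to the omitted middle-class reference category, so the gap between any two classes follows by subtraction. A small sketch using the rounded slopes from the table above:

```python
# Rounded Model 4 dummy slopes from the coefficients table above; the
# omitted reference category ("middle class") implicitly has an effect of 0.
class_effect = {
    "LOWER CLASS": -2.952,
    "WORKING CLASS": -1.381,
    "MIDDLE CLASS": 0.0,  # reference category: no dummy entered
    "UPPER CLASS": 0.719,
}

def adjusted_gap(class_a, class_b):
    """Expected schooling difference (years) between two classes,
    holding father's education, age, and age square fixed."""
    return class_effect[class_a] - class_effect[class_b]

print(adjusted_gap("UPPER CLASS", "LOWER CLASS"))    # 0.719 - (-2.952)
print(adjusted_gap("WORKING CLASS", "MIDDLE CLASS"))
```

Choosing a different reference category would shift every dummy coefficient by a constant but leave all such pairwise gaps, and the model's fit, unchanged.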