Regression and Correlation In-Class Demonstration
Overview
Multiple regression is a very flexible methodology that is used widely in the social and
behavioral sciences. Using SPSS and data from the 2002 GSS, I will demonstrate some of
the uses of simple and multiple regression.
Operationalization of Status Attainment Theory
Let’s suppose we were interested in researching the determinants of education. Imagine
that we developed a status attainment theory of advantage in educational achievement
across generations of family members. We could look at whether a respondent’s
education attainment was related to his or her father’s status attainment:
The following figure shows a scatter diagram of a respondent's highest year of school completed against the respondent's father's highest year of school completed.

[Scatter plot: HIGHEST YEAR OF SCHOOL COMPLETED (Y axis, -10 to 30) against HIGHEST YEAR SCHOOL COMPLETED, FATHER (X axis, -10 to 30)]
Notice that there is a positive relationship between these two interval-ratio variables.
High values of respondent’s education are associated with high values of father’s
education and low values of respondent’s education are associated with low values of
father’s education. The relationship looks roughly linear.
We can estimate a regression equation which models the linear relationship between respondent's education and father's education. The respondent's education is the outcome we are trying to explain, so it is treated as the dependent variable, with father's education as the independent variable. The simple regression for the effect of father's education on respondent's education yields the following statistics:
Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .421(a)   .177       .177                2.636

a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER
ANOVA(b)

Model   Source       Sum of Squares   df     Mean Square   F         Sig.
1       Regression   1647.384         1      1647.384      237.114   .000(a)
        Residual     7635.462         1099   6.948
        Total        9282.847         1100

a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED
Coefficients(a)

                                           Unstandardized        Standardized
Model                                      B        Std. Error   Beta    t        Sig.
1   (Constant)                             10.351   .229                 45.241   .000
    HIGHEST YEAR SCHOOL
    COMPLETED, FATHER                      .293     .019         .421    15.399   .000

a. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED
Notice that the results include a number of statistics that we have already talked about, including Pearson's correlation coefficient (r), the coefficient of determination (r²), the residual sum of squares, the total sum of squares, an estimated intercept (labeled "Constant"), and an estimated slope for father's education.
How do we interpret each of these statistics?
There are also a number of other statistics that we have not yet discussed. Among these are the standard error and the t statistic, which are used to determine whether each independent variable in the regression model has a statistically significant effect on the dependent variable. We will discuss them in more detail in later lectures.
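The statistics in these tables can be reproduced by hand. The following Python sketch (my own illustration, not part of the SPSS demonstration; the five data points are invented) computes the slope, intercept, and coefficient of determination for a simple regression:

```python
# Simple (bivariate) least-squares regression from first principles.
# Toy data invented for illustration; in practice the GSS variables
# would be read from the data file.
father_ed = [8, 10, 12, 12, 16]   # X: father's years of schooling
resp_ed   = [10, 12, 12, 14, 17]  # Y: respondent's years of schooling

n = len(father_ed)
mean_x = sum(father_ed) / n
mean_y = sum(resp_ed) / n

# Slope b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2);
# intercept a = y_bar - b * x_bar.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(father_ed, resp_ed))
sxx = sum((x - mean_x) ** 2 for x in father_ed)
b = sxy / sxx
a = mean_y - b * mean_x

# r^2 = 1 - SS_residual / SS_total (the coefficient of determination).
ss_tot = sum((y - mean_y) ** 2 for y in resp_ed)
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(father_ed, resp_ed))
r_squared = 1 - ss_res / ss_tot

print(f"slope={b:.3f} intercept={a:.3f} r2={r_squared:.3f}")
```

The same residual and total sums of squares appear in the ANOVA table, and r² is the "R Square" in the Model Summary.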
Multiple Regression
The real world is complex and a single variable is often not sufficient to explain variation
in another variable. Regression analysis is useful because it is flexible enough to be
expanded to simultaneously estimate the independent effects of a number of independent variables on a single dependent variable. This type of extension of the simple regression
model is referred to as multiple regression.
Suppose that we thought that other factors, in addition to the effect of father's educational attainment, affected the respondent's educational attainment. Let's say that we thought
that education was related to the age of the respondent. We can simultaneously look at
the effects of respondent’s age and father’s education attainment on respondent’s
educational attainment. The scatter plot of all three variables would look like the
following:
[3-D scatter plot: HIGHEST YEAR OF SCHOOL COMPLETED (vertical axis, 0 to 30) against HIGHEST YEAR SCHOOL COMPLETED, FATHER (0 to 30) and AGE OF RESPONDENT (20 to 100)]
This plot shows the educational attainment of the respondent plotted on the Y axis and
the father’s educational attainment and age of the respondent plotted on the X and Z axes,
respectively. Notice that a single regression line is no longer adequate for modeling these data. Because the data now occupy three dimensions, we need a regression plane.
The dimensionality of the data will change with each additional independent variable.
Beyond three dimensions, it is difficult for us to imagine the dimensionality of the data.
In general, the regression fits a p-dimensional surface in a (p + 1)-dimensional space, where p is the number of independent variables in the model. We can estimate the multiple regression equation for the effects of respondent's age and father's educational attainment on respondent's educational attainment:
Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .420(a)   .177       .176                2.637
2       .422(b)   .179       .177                2.635

a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT
ANOVA(c)

Model   Source       Sum of Squares   df     Mean Square   F         Sig.
1       Regression   1637.111         1      1637.111      235.405   .000(a)
        Residual     7629.016         1097   6.954
        Total        9266.127         1098
2       Regression   1654.004         2      827.002       119.072   .000(b)
        Residual     7612.123         1096   6.945
        Total        9266.127         1098

a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT
c. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED
Coefficients(a)

                                           Unstandardized        Standardized
Model                                      B          Std. Error Beta    t        Sig.
1   (Constant)                             10.354     .229               45.205   .000
    HIGHEST YEAR SCHOOL
    COMPLETED, FATHER                      .293       .019       .420    15.343   .000
2   (Constant)                             9.860      .391               25.214   .000
    HIGHEST YEAR SCHOOL
    COMPLETED, FATHER                      .304       .020       .437    14.881   .000
    AGE OF RESPONDENT                      7.826E-03  .005       .046    1.560    .119

a. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED
The output shows two regression equations. The first is the original regression showing
the effect of father’s education attainment on respondent’s educational attainment. The
second shows the effects of both respondent’s age and father’s education attainment on
respondent’s educational attainment.
Notice the changes in the statistics. What happened to the value of R-Square when we
added the effect of age? Did the effect of the intercept or the slope of father’s education
change? Why is that?
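For readers curious about the mechanics, here is a sketch in Python (invented toy data, not the GSS file) of how a two-predictor regression is estimated by solving the normal equations (X'X)b = X'y, which is the same least-squares algebra behind the SPSS output:

```python
# Multiple regression with two predictors, solved via the normal
# equations (X'X)b = X'y. Toy data invented for illustration.
father_ed = [8, 10, 12, 12, 16, 14]
age       = [25, 40, 33, 58, 47, 62]
resp_ed   = [10, 12, 12, 14, 17, 15]

# Design matrix with a leading column of 1s for the intercept.
X = [[1.0, x1, x2] for x1, x2 in zip(father_ed, age)]
y = resp_ed

# Form X'X (3x3) and X'y (3-vector).
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    A = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    n = len(A)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

intercept, b_father, b_age = solve(XtX, Xty)
print(f"intercept={intercept:.3f} b_father={b_father:.3f} b_age={b_age:.3f}")
```

The fitted coefficients leave the residuals orthogonal to each predictor, which is exactly the least-squares condition.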
Nonlinear Effects
Regression models measure the linear association between independent variables and a
single dependent variable. However, sometimes the relationship between two variables is
not linear. Let’s look at the relationship between respondent’s education and respondent’s
age:
[Scatter plot: HIGHEST YEAR OF SCHOOL COMPLETED (Y axis, -10 to 30) against AGE OF RESPONDENT (X axis, 0 to 100)]
Notice that for younger respondents (recall that the GSS collects data on anyone 18 years or older) there is an apparent positive relationship between age and education. For
older respondents, there is an apparent negative relationship.
Why do you think that this is the case?
Regression models are flexible enough to handle nonlinear effects. For instance, one can
include a second order polynomial (a squared term) to model a curvilinear relationship.
In this case we can construct an age square term by creating a new variable that equals
the square of the age variable.
We can then estimate the effect of age including the age square term:
Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .420(a)   .177       .176                2.637
2       .422(b)   .179       .177                2.635
3       .445(c)   .198       .196                2.604

a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT
c. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT, AGE SQUARE TERM
ANOVA(d)

Model   Source       Sum of Squares   df     Mean Square   F         Sig.
1       Regression   1637.111         1      1637.111      235.405   .000(a)
        Residual     7629.016         1097   6.954
        Total        9266.127         1098
2       Regression   1654.004         2      827.002       119.072   .000(b)
        Residual     7612.123         1096   6.945
        Total        9266.127         1098
3       Regression   1838.377         3      612.792       90.338    .000(c)
        Residual     7427.750         1095   6.783
        Total        9266.127         1098

a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT
c. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT, AGE SQUARE TERM
d. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED
Coefficients(a)

                                           Unstandardized        Standardized
Model                                      B          Std. Error Beta    t        Sig.
1   (Constant)                             10.354     .229               45.205   .000
    HIGHEST YEAR SCHOOL
    COMPLETED, FATHER                      .293       .019       .420    15.343   .000
2   (Constant)                             9.860      .391               25.214   .000
    HIGHEST YEAR SCHOOL
    COMPLETED, FATHER                      .304       .020       .437    14.881   .000
    AGE OF RESPONDENT                      7.826E-03  .005       .046    1.560    .119
3   (Constant)                             7.039      .665               10.589   .000
    HIGHEST YEAR SCHOOL
    COMPLETED, FATHER                      .300       .020       .431    14.857   .000
    AGE OF RESPONDENT                      .138       .026       .810    5.421    .000
    AGE SQUARE TERM                        -.001      .000       -.779   -5.213   .000

a. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED
The output shows results for three equations. The first two were discussed above. The third shows a model that adds an age square term. How does the addition of the age square term change the model estimates?
To understand the effect of age, we need to consider both the age and age square terms. If we made a table of age and age square, it would look like the following (recall that age ranges from 18 to 89):
Age   Age Square
18    324
19    361
20    400
21    441
22    484
23    529
24    576
25    625
26    676
27    729
28    784
29    841
30    900
31    961
32    1024
33    1089
34    1156
35    1225
36    1296
…     …
89    7921
If we weighted the age and age squared data by their slope estimates and we graphed the
result, we would get the following graph:
Graph of the Curvilinear Effect of Age on Education

[Line graph: the weighted age effect (vertical axis, 0 to 5) plotted against Age in Years (18 to 88)]
Notice that in this way we can model nonlinear effects in a regression model.
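The curve can be reproduced from the rounded coefficients printed in the table above (.138 for age, -.001 for the age square term; SPSS carries more decimal places internally, so the exact turning point may differ slightly):

```python
# Combined (curvilinear) effect of age implied by the model:
# effect(age) = b_age * age + b_agesq * age**2.
# Coefficients are the rounded values from the SPSS coefficient table.
b_age = 0.138
b_agesq = -0.001

effect = {age: b_age * age + b_agesq * age ** 2 for age in range(18, 90)}

peak_age = max(effect, key=effect.get)
print(f"effect at 18: {effect[18]:.2f}")       # rising portion
print(f"peak of the curve at age {peak_age}")  # turning point, -b_age / (2 * b_agesq)
print(f"effect at 89: {effect[89]:.2f}")       # declining portion
```

With these rounded values the effect rises through younger ages, turns over in later life, and declines thereafter, matching the pattern seen in the scatter diagram of age and education.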
Ordinal and Nominal Independent Variables
Regression analysis is flexible enough to include the effects of interval-ratio as well as nominal and ordinal variables. However, nominal and ordinal variables can only be included in the regression model as a series of mutually exclusive and exhaustive dichotomous variables. Why do you think this is the case?
Let’s suppose that we wanted to use a measure of social class to examine the
determinants of educational attainment. The GSS includes data on a respondent’s social
class, which is measured as an ordinal variable:
Statistics

SUBJECTIVE CLASS IDENTIFICATION
N   Valid     1492
    Missing   8

SUBJECTIVE CLASS IDENTIFICATION

                           Frequency   Percent   Valid Percent   Cumulative Percent
Valid     LOWER CLASS      89          5.9       6.0             6.0
          WORKING CLASS    678         45.2      45.4            51.4
          MIDDLE CLASS     676         45.1      45.3            96.7
          UPPER CLASS      49          3.3       3.3             100.0
          Total            1492        99.5      100.0
Missing   DK               6           .4
          NA               2           .1
          Total            8           .5
Total                      1500        100.0
Examining the scatter diagram of social class and educational attainment would be
difficult to interpret, so it would be better to split the file by the class variable and to look
at separate histograms of educational attainment for each class:
[Histograms of HIGHEST YEAR OF SCHOOL COMPLETED, one per class:
CLASS: 1 LOWER CLASS (Mean = 10.3, Std. Dev = 3.18, N = 89.00)
CLASS: 2 WORKING CLASS (Mean = 12.6, Std. Dev = 2.50, N = 675.00)
CLASS: 3 MIDDLE CLASS (Mean = 14.2, Std. Dev = 2.93, N = 675.00)
CLASS: 4 UPPER CLASS (Mean = 15.0, Std. Dev = 3.27, N = 49.00)]
Notice that there are differences in the distribution and in the average educational
attainment for different classes. Roughly, the higher the respondent’s social class the
more education the respondent is likely to have achieved.
In order to include a measure of social class in the model, we have to turn the class
variables into a series of dichotomous variables. Since there are four classes, we can
make four separate variables for class (LOWER CLASS, WORKING CLASS, MIDDLE
CLASS, and UPPER CLASS). Using the recode procedure in SPSS, I created these four
variables:
LOWER CLASS

                            Frequency   Percent   Valid Percent   Cumulative Percent
Valid     NOT LOWER CLASS   1403        93.5      94.0            94.0
          LOWER CLASS       89          5.9       6.0             100.0
          Total             1492        99.5      100.0
Missing   System            8           .5
Total                       1500        100.0

WORKING CLASS

                            Frequency   Percent   Valid Percent   Cumulative Percent
Valid     NOT WORKING CLASS 814         54.3      54.6            54.6
          WORKING CLASS     678         45.2      45.4            100.0
          Total             1492        99.5      100.0
Missing   System            8           .5
Total                       1500        100.0

MIDDLE CLASS

                            Frequency   Percent   Valid Percent   Cumulative Percent
Valid     NOT MIDDLE CLASS  816         54.4      54.7            54.7
          MIDDLE CLASS      676         45.1      45.3            100.0
          Total             1492        99.5      100.0
Missing   System            8           .5
Total                       1500        100.0

UPPER CLASS

                            Frequency   Percent   Valid Percent   Cumulative Percent
Valid     NOT UPPER CLASS   1443        96.2      96.7            96.7
          UPPER CLASS       49          3.3       3.3             100.0
          Total             1492        99.5      100.0
Missing   System            8           .5
Total                       1500        100.0
These variables collectively can be used in the regression to measure the effect of social class. However, knowing the values of any three of the four variables is sufficient to predict the value of the fourth. Entering all four variables into the regression model would therefore introduce redundant information. Because the variables are a perfect linear combination, including all of them would make it impossible to estimate the model properly.
Therefore, we purposely exclude one category, which becomes the reference category. The effects of the non-excluded variables are interpreted relative to the omitted category. In the following example, I exclude "middle class," and I interpret the effect of each non-excluded variable as the average difference between that category and the reference category (in this case "middle class").
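A minimal Python sketch of this dummy coding (the actual recode was done through the SPSS recode procedure; the sample values here are invented for illustration):

```python
# Turn a four-category class variable into dichotomous (dummy)
# variables, omitting MIDDLE CLASS as the reference category.
# Category labels follow the GSS coding; sample values are invented.
CLASSES = ["LOWER CLASS", "WORKING CLASS", "MIDDLE CLASS", "UPPER CLASS"]
REFERENCE = "MIDDLE CLASS"

def dummy_code(value):
    """Return a dict of 0/1 indicators for every non-reference class."""
    return {c: int(value == c) for c in CLASSES if c != REFERENCE}

respondents = ["WORKING CLASS", "MIDDLE CLASS", "UPPER CLASS"]
for r in respondents:
    print(r, dummy_code(r))
```

A respondent in the reference category scores 0 on all three dummies, so that group's average is absorbed into the intercept, and each dummy's slope is that category's average difference from the reference group.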
The regression results are the following:
Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .422(a)   .178       .177                2.616
2       .425(b)   .181       .179                2.613
3       .444(c)   .197       .195                2.587
4       .528(d)   .278       .274                2.456

a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT
c. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT, AGE SQUARE TERM
d. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT, AGE SQUARE TERM, UPPER CLASS, LOWER CLASS, WORKING CLASS
ANOVA(e)

Model   Source       Sum of Squares   df     Mean Square   F         Sig.
1       Regression   1616.343         1      1616.343      236.196   .000(a)
        Residual     7479.654         1093   6.843
        Total        9095.996         1094
2       Regression   1642.109         2      821.055       120.285   .000(b)
        Residual     7453.887         1092   6.826
        Total        9095.996         1094
3       Regression   1792.789         3      597.596       89.273    .000(c)
        Residual     7303.208         1091   6.694
        Total        9095.996         1094
4       Regression   2531.930         6      421.988       69.945    .000(d)
        Residual     6564.066         1088   6.033
        Total        9095.996         1094

a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT
c. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT, AGE SQUARE TERM
d. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, FATHER, AGE OF RESPONDENT, AGE SQUARE TERM, UPPER CLASS, LOWER CLASS, WORKING CLASS
e. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED
Coefficients(a)

                                           Unstandardized        Standardized
Model                                      B          Std. Error Beta    t        Sig.
1   (Constant)                             10.381     .228               45.569   .000
    HIGHEST YEAR SCHOOL
    COMPLETED, FATHER                      .291       .019       .422    15.369   .000
2   (Constant)                             9.769      .388               25.147   .000
    HIGHEST YEAR SCHOOL
    COMPLETED, FATHER                      .305       .020       .442    15.060   .000
    AGE OF RESPONDENT                      9.713E-03  .005       .057    1.943    .052
3   (Constant)                             7.201      .664               10.842   .000
    HIGHEST YEAR SCHOOL
    COMPLETED, FATHER                      .302       .020       .437    15.031   .000
    AGE OF RESPONDENT                      .129       .026       .756    5.035    .000
    AGE SQUARE TERM                        -.001      .000       -.712   -4.744   .000
4   (Constant)                             9.157      .657               13.947   .000
    HIGHEST YEAR SCHOOL
    COMPLETED, FATHER                      .239       .020       .345    11.982   .000
    AGE OF RESPONDENT                      .118       .024       .693    4.857    .000
    AGE SQUARE TERM                        -.001      .000       -.715   -5.012   .000
    LOWER CLASS                            -2.952     .368       -.216   -8.016   .000
    WORKING CLASS                          -1.381     .161       -.238   -8.570   .000
    UPPER CLASS                            .719       .409       .046    1.759    .079

a. Dependent Variable: HIGHEST YEAR OF SCHOOL COMPLETED
What happened to the values of the fit statistics and to the slopes and intercept when the effect of social class was included in the model?
Model Building
The effect of any variable that is not included in the regression equation becomes part of the error term, an unobserved variable that is estimated in the regression. What is left out of the model is just as important as what is estimated in it. Given this, how do we decide what to include in the model?