Ronald H. Heck
EDEA 606 (F2015): Multivariate Methods
The University of Hawai‘i at Mānoa
November 16, 2015
Review of Multiple Regression
Let’s begin with a little review of multiple regression this week. Linear models [e.g., correlation,
t-tests, analysis of variance (ANOVA), multiple regression, path analysis, multivariate analysis
of variance (MANOVA)] have a long tradition in the social and behavioral sciences for
examining a variety of different data structures. For our first example, let’s consider a case where
the goal is to examine whether student achievement can be explained by a set of student
background variables.
One way to think about statistical modeling is in terms of an attempt to account for variation in a
dependent variable such as student achievement—variance that is believed to be associated with
one or more explanatory variables such as gender and other demographic categories (e.g.,
socioeconomic status, race/ethnicity) or personal attributes measured as continuous variables
(e.g., motivation, previous learning). This is analogous to thinking about how much variance in
student achievement (R²) is accounted for by a given set of explanatory variables.
Let’s assume we start with a small sample of 40 students who are measured on a reading test.
We might first try to determine whether there is a difference in their scores related to gender.
When we actually split the sample reading scores by gender (with 20 males and 20 females), we
can see there is only a small difference in the average scores of males and females.
Descriptive Statistics for Example

              Male               Female
Variable      Mean     SD        Mean     SD
Reading       66.53    2.57      67.00    3.20
As we have learned previously, we could use a t-test to investigate whether the difference in
male and female reading means would be considered “statistically significant” in the population
from which this small random sample is drawn. If we subtract the two means we can see that the
difference is 0.47 (with females scoring 0.47 points higher than males). Another option would
be one-way ANOVA to investigate whether this difference is statistically significant (even
though there are only two groups). For example, if we used a simple one-way ANOVA, we
would be testing the similarity of group means for males and females by partitioning the total
sum of squares for reading into a portion describing variability in reading due to
groups (i.e., gender) and a portion describing variability due to individuals. In this
case, the F-ratio is the ratio of between-groups variability (i.e., the between-groups mean
square) to within-groups variability (i.e., the within-groups mean square). We can see that the
F-ratio is not large enough to be statistically significant (F = 2.256/8.437 = 0.267, p = .608).
ANOVA (read)

                  Sum of Squares   df   Mean Square   F      Sig.
Between Groups             2.256    1         2.256   .267   .608
Within Groups            320.618   38         8.437
Total                    322.874   39
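As a quick check on the table, we can recompute the mean squares and the F-ratio from the sums of squares and degrees of freedom. This is only a sketch of the arithmetic using the rounded values printed above; the variable names are ours, not SPSS output.

```python
# Sums of squares and degrees of freedom copied from the one-way ANOVA table.
ss_between, df_between = 2.256, 1
ss_within, df_within = 320.618, 38

# A mean square is a sum of squares divided by its degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# The F-ratio compares between-groups to within-groups variability.
f_ratio = ms_between / ms_within

print(f"MS between = {ms_between:.3f}")
print(f"MS within  = {ms_within:.3f}")
print(f"F          = {f_ratio:.3f}")
```

The printed values should agree with the table to three decimals.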
Multiple Regression
We can also use multiple regression to investigate whether there is a statistically significant
effect due to gender on students’ reading scores. Multiple regression is a good choice when we are
interested in the effects of a set of predictors (e.g., gender, socioeconomic status, motivation,
previous skills) on a single outcome like a reading score. It will isolate the effect of each
predictor separately while controlling for the effects of the others. In this way, it can provide a
summary of the relative effects of several variables on the Y outcome simultaneously.
We will summarize the results below. As you can see, the first output of interest is an ANOVA
table which summarizes the sum of squares information you should be familiar with from our
previous work with ANOVA.
If you look closely, you will see that the terminology is slightly different between the two
approaches. In multiple regression, the between-groups Sum of Squares is referred to as
Regression Sum of Squares and the within-groups Sum of Squares is referred to as Residual Sum
of Squares. Closer inspection of the ANOVA table indicates that very little of the sum of
squares is due to the predictor (gender) and, therefore, most of the variation in the reading outcome
in this first model is due to individuals (the residual sum of squares). As in the one-way ANOVA
table, the F-ratio is the same (F = 0.267), as is the significance level (p = 0.608). We can see that
while the terminology is slightly different, the results are the same (as we would expect).
ANOVA(a)

Model 1       Sum of Squares   df   Mean Square   F      Sig.
Regression             2.256    1         2.256   .267   .608(b)
Residual             320.618   38         8.437
Total                322.874   39

a. Dependent Variable: read
b. Predictors: (Constant), female
We also obtain an estimate of the variance accounted for in the reading outcome due to gender.
We can estimate the variance accounted for in reading by gender as the regression sum of squares
over the total sum of squares (2.256/322.874 = 0.007), which suggests gender only accounts
for about 0.7% (less than 1%) of the variance in students’ reading scores. The adjusted r-square
coefficient (which takes into consideration the sample size, the number of variables in the model,
and strength of relationships) is actually negative (-0.019). Since a proportion of variance cannot
be negative, we would interpret this as indicating that gender explains essentially none of the
variance in reading.
Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .084(a)   .007       -.019               2.90470

a. Predictors: (Constant), female
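The R Square and Adjusted R Square values can be reproduced from the sums of squares in the ANOVA table. The sketch below uses the standard adjustment formula, 1 - (1 - R²)(n - 1)/(n - k - 1), with n = 40 students and k = 1 predictor; the function names are ours and are only for illustration.

```python
def r_squared(ss_regression, ss_total):
    """Proportion of total variation accounted for by the model."""
    return ss_regression / ss_total


def adjusted_r_squared(r2, n, k):
    """Adjust R-square for sample size n and number of predictors k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Values from the ANOVA table for the gender-only model.
r2 = r_squared(2.256, 322.874)
adj_r2 = adjusted_r_squared(r2, n=40, k=1)

print(f"R-square          = {r2:.3f}")
print(f"Adjusted R-square = {adj_r2:.3f}")
```

Because R-square is so close to 0 here, the adjustment pushes the value slightly below 0.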
We also obtain a table describing the effect of the predictor (female) on students’ reading scores.
If the predictor is dichotomous (as in this case), the effect is summarized as a difference in means
between the group coded 0 (males) and the group coded 1 (females).
Coefficients(a)

              Unstandardized Coefficients   Standardized Coefficients
Model 1       B         Std. Error          Beta                        t         Sig.
(Constant)    66.525    .650                                            102.423   .000
female        .475      .919                .084                        .517      .608

a. Dependent Variable: read
First, we can see the intercept is 66.525, which is the reading score for males (who are coded 0).
In a multiple regression analysis, the intercept (or constant) is the expected Y estimate (the
reading score) when all the variables in the model are 0. In this case, this would refer to males,
since they are coded 0.
Second, we can see that the unstandardized beta coefficient (B) is 0.475, which is interpreted
as the change in Y for a one-unit change in the independent variable (female). In this case, this
suggests that as gender changes from male (0) to female (1) the “estimated” reading score would
increase from 66.525 by 0.475 (to 67.00), which is the mean reading test score for females. We can
write a prediction equation as follows:
$Y = \beta_0 + \beta_1 \cdot \text{female} + e$
where $\beta_0$ is the intercept (where the line crosses the Y-axis), $\beta_1$ is the slope (which in this
case is the estimated difference in means), and $e$ is an error term suggesting we cannot make
perfect predictions. When we substitute in the estimates from the table we obtain the following:
$\hat{Y} = 66.525 + 0.475 \cdot \text{female}$
Sometimes people use $\hat{Y}$ (Y-hat) to indicate a predicted score. In this case, we can see that if we
substitute in 0 (since males are coded 0) in the equation for “female” and multiply 0 by 0.475, we
will obtain the intercept score [66.525 + 0.475(0) = 66.525], which is the mean reading score for
males. Similarly, if we substitute in 1 into the equation for female (since females are coded 1)
and then add the result to the intercept, we will obtain the mean for females of 67.00
[66.525 + 0.475(1) = 67.00].
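The substitution can be written as a tiny function. This is just a sketch of the prediction equation with the estimates from the coefficients table; the function name predict_reading is ours.

```python
# Estimates from the coefficients table (gender-only model).
INTERCEPT = 66.525   # predicted reading score when female = 0 (males)
B_FEMALE = 0.475     # difference in means between females and males


def predict_reading(female):
    """Predicted reading score for female = 0 (male) or 1 (female)."""
    return INTERCEPT + B_FEMALE * female


print(f"Predicted score, males:   {predict_reading(0):.3f}")
print(f"Predicted score, females: {predict_reading(1):.3f}")
```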
Third, we can see in the table a standardized beta (Beta) coefficient which provides an estimate
of the relative size of the coefficient in standard deviation units. This is the same as the
correlation if there is only one predictor in the model. In this case we would describe the
standardized Beta as small (0.084). We can obtain the standardized beta in the table by
multiplying the unstandardized beta (0.475) by the ratio of the standard deviation of female
(0.51) to the standard deviation of Y (2.88).
Descriptive Statistics

                     N    Minimum   Maximum   Mean      Std. Deviation
Read                 40   59.10     73.40     66.7625   2.87729
Female               40   0         1         .50       .506
Valid N (listwise)   40
When we obtain the ratio of the standard deviations (0.177), we multiply it by 0.475 and obtain
0.084, which matches the regression coefficient table.
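The same conversion can be checked with the unrounded standard deviations from the Descriptive Statistics table. This is a sketch of the arithmetic only; the variable names are ours.

```python
# Unstandardized coefficient and standard deviations from the tables above.
b_female = 0.475
sd_female = 0.506      # standard deviation of the predictor
sd_read = 2.87729      # standard deviation of the outcome

# Standardized beta = B * (SD of predictor / SD of outcome).
sd_ratio = sd_female / sd_read
beta_female = b_female * sd_ratio

print(f"Ratio of standard deviations = {sd_ratio:.3f}")
print(f"Standardized beta            = {beta_female:.3f}")
```

With the unrounded standard deviations the ratio is about 0.176 rather than 0.177, but the standardized beta still rounds to 0.084.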
Testing the Null Hypothesis for Gender
A final important piece of information in the table is the standard error for the predictor. In this
case, the standard error for female is relatively large (0.919) compared with the estimated
coefficient of 0.475. This can be used to develop a test of the null hypothesis that reading scores
for males and females are the same in the population. The ratio of the unstandardized beta
estimate to its standard error (B/SE) provides a t-test of statistical significance for
each predictor. In this case the ratio is 0.475/0.919, or 0.517, which is the t-ratio in the table. We
can see that this t-ratio is not large enough to be statistically significant (p = .608). In this case,
therefore, we would fail to reject the null hypothesis that gender does not affect reading outcomes.
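We can verify the t-ratio directly, and also see why the regression and ANOVA results agree: with a single predictor, the square of the t-ratio equals the F-ratio. The sketch below uses the rounded estimates from the table, so agreement is to about three decimals.

```python
# Estimate and standard error for female from the coefficients table.
b_female = 0.475
se_female = 0.919

# t-ratio = estimate divided by its standard error.
t_ratio = b_female / se_female

# With one predictor, t squared reproduces the F-ratio from the ANOVA table.
f_from_t = t_ratio ** 2

print(f"t   = {t_ratio:.3f}")
print(f"t^2 = {f_from_t:.3f}")
```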
Examining a Continuous Predictor
Now let’s see what happens if we investigate a continuous predictor. In this case we will use a
measure of student socioeconomic status (SES). We can start with a scatterplot to observe how
strong the relationship appears. We can see in the scatter plot below that there is a
tendency for higher SES background to be related to higher reading
scores in this small sample. You can imagine putting a regression line in between the pairs of
scores for the 40 individuals in the sample. You might imagine a line starting in the lower left of
the graph (somewhere around Y = 60) and having a positive slope passing through the highest
concentration of points.
We can actually add such a regression line and display the variance in reading scores accounted
for by SES. We can also obtain the equation for the “best fitting” regression line, which we will
discuss subsequently. We can see the equation in the figure matches the estimates for the
intercept (constant) and SES-reading slope in the regression coefficients table that follows.
We can see from the ANOVA table that the relationship between SES and reading is much
stronger than for gender (since the regression sum of squares is 133.824 out of 322.874). The
estimated r-square would therefore be 0.414 (133.824/322.874). If we take the square root, we can see
the correlation (r) between SES and reading is about 0.64.
ANOVA(a)

Model 1       Sum of Squares   df   Mean Square   F        Sig.
Regression           133.824    1       133.824   26.899   .000(b)
Residual             189.050   38         4.975
Total                322.874   39

a. Dependent Variable: read
b. Predictors: (Constant), SES
We can confirm this in the Model Summary table, with the r-square being 0.414 (and the adjusted
r-square being slightly less at 0.399).
Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .644(a)   .414       .399                2.23047

a. Predictors: (Constant), SES
The table with the regression coefficients provides the relationship between SES and reading.
Here we can see that the intercept is much lower than the previous model (30.826). This would
be interpreted as the reading score for an individual who had an SES level of 0. In this case, the
actual SES scores in the data set (which we might think of as a scale from 0 to 10) range from
about 6 on the scatter plot to a maximum of 7.50 (with a mean of about 6.7). So we have a
situation where the intercept (i.e., the predicted reading score when an individual has an SES
score of 0) does not actually describe the reading level of anyone in the sample.
Coefficients(a)

              Unstandardized Coefficients   Standardized Coefficients
Model 1       B         Std. Error          Beta                        t       Sig.
(Constant)    30.826    6.938                                           4.443   .000
SES           5.347     1.031               .644                        5.186   .000

a. Dependent Variable: read
There are ways to recode the data so the intercept describes an individual with the “lowest” SES
score in the sample, or even the “average” SES score in the sample. But we can also just leave it
as it is. We can write out the mathematical equation for predicting student reading scores
according to their levels of SES.
$Y = \beta_0 + \beta_1 \cdot \text{SES} + e$
In this case if we substitute in the information from the regression coefficient table, we obtain the
following:
$\hat{Y} = 30.826 + 5.347 \cdot \text{SES}$
We can see in the table that the unstandardized beta is 5.347. We can use this coefficient to
estimate what would be the predicted reading score for individuals with selected levels of SES.
For example, if a person had an SES score of 6.0, we would estimate that her or his predicted
reading score would be the following:
$\hat{Y} = 30.826 + 5.347 \cdot 6.0$.
If we multiply 5.347*6 (32.082) and add it to 30.826, we obtain a predicted score of 62.908. If
you look on the graph, you will see that this corresponds to the level of Y when SES is equal to 6.
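The same calculation can be repeated for any SES value. The sketch below applies the prediction equation at 6.0 and at two other values mentioned earlier (the approximate mean of 6.7 and the maximum of 7.50); the function name is ours.

```python
# Estimates from the coefficients table (SES-only model).
INTERCEPT = 30.826
B_SES = 5.347


def predict_reading(ses):
    """Predicted reading score for a given SES score."""
    return INTERCEPT + B_SES * ses


# SES values taken from the text: low end, approximate mean, maximum.
for ses in (6.0, 6.7, 7.5):
    print(f"SES = {ses:.1f} -> predicted reading = {predict_reading(ses):.3f}")
```

Note that predictions are only meaningful within the observed SES range (roughly 6 to 7.5 in this sample).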
Combining Both Predictors in a Model
Finally, let’s see what happens when we include both predictors in the model. The proposed
equation will be the following:
$Y = \beta_0 + \beta_1 \cdot \text{female} + \beta_2 \cdot \text{SES} + e$
We obtain the following information. We can see in the ANOVA table that the significance level
for the whole model (p < .001) is below the commonly used p = 0.05 level of statistical
significance. This evaluates the overall set of predictors in explaining reading achievement.
ANOVA(a)

Model 1       Sum of Squares   df   Mean Square   F        Sig.
Regression           151.742    2        75.871   16.404   .000(b)
Residual             171.132   37         4.625
Total                322.874   39

a. Dependent Variable: read
b. Predictors: (Constant), SES, female
Similarly, the r-square (and adjusted r-square) increase when both predictors are in the model.
This is helpful in evaluating the fit of the model to the data (higher values indicating better fit).
The two variables account for about 47% of the variance in reading scores.
Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .686(a)   .470       .441                2.15062

a. Predictors: (Constant), SES, female
Finally, we can evaluate the effects of each predictor on the reading outcome. What would you
conclude in the table below?
First, regarding gender, should we reject or fail to reject the null hypothesis at the suggested p
value of 0.05?
Second, should we reject or fail to reject the null hypothesis that SES is unrelated to student
achievement in reading? For SES, a one-unit increase in SES (say from 0 to 1) would result in a
5.8-point increase in reading score (holding gender constant). Using the last equation, the
predicted score for a male with an SES score of 1 would be 27.080 + 5.802(1) = 32.882. For a
female with the same SES score, we would add 1.374 (32.882 + 1.374 = 34.256).
Coefficients(a)

              Unstandardized Coefficients   Standardized Coefficients
Model 1       B         Std. Error          Beta                        t       Sig.
(Constant)    27.080    6.955                                           3.894   .000
female        1.374     .698                .242                        1.968   .057
SES           5.802     1.021               .699                        5.685   .000

a. Dependent Variable: read
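To see what “holding constant” means in practice, we can plug values into the two-predictor equation. This sketch uses the estimates from the table; the function name is ours. At any fixed SES value, the female-male gap is always 1.374, and for either gender a one-unit increase in SES adds 5.802.

```python
# Estimates from the coefficients table (female + SES model).
INTERCEPT = 27.080
B_FEMALE = 1.374
B_SES = 5.802


def predict_reading(female, ses):
    """Predicted reading score for gender (0 = male, 1 = female) and SES."""
    return INTERCEPT + B_FEMALE * female + B_SES * ses


# The worked example from the text (SES = 1).
male_at_1 = predict_reading(0, 1)
female_at_1 = predict_reading(1, 1)
print(f"Male,   SES = 1: {male_at_1:.3f}")
print(f"Female, SES = 1: {female_at_1:.3f}")

# A value inside the observed SES range (about 6 to 7.5).
print(f"Male,   SES = 6.7: {predict_reading(0, 6.7):.3f}")
print(f"Female, SES = 6.7: {predict_reading(1, 6.7):.3f}")
```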
Examining Residuals
Besides examining the r-square coefficient as a measure of the model’s fit to the data, another
way of considering model fit is to examine the residuals (or errors) resulting from the model. For
example, we can either plot the predicted values from the model and the residuals or the
standardized predicted values and standardized residuals. Residuals are defined as the observed
values in the model minus the predicted values (observed – predicted = residual). So if the
observed score of an individual is 42 and the predicted value is 42, the residual would be 0, and
the individual would lie right on the “regression line.” The resulting plot should show no
relationship between predicted and residual values. In the figure below you can see the residuals
are scattered across the standardized predicted values, indicating no relationship.
Another way to examine the normality of the residuals is to plot the observed cumulative distribution of
residuals against the expected cumulative distribution (referred to as a P-P plot). This provides a
summary of how well the model predicts for various values of the dependent variable. If the two
distributions are identical, the points would all lie on the line.
In this case, in the normal probability plot below initially the observed residuals are below the
line (suggesting a smaller number of large negative residuals than expected). Near the middle,
the observed points are above the line, suggesting the observed cumulative proportion exceeds
the expected proportion. Then at the top, there are more below the line again. You can note that
the closer the circles are to the line overall, the more plausible it is that the data were sampled
from a normally distributed population.
Finally, another common examination is the distribution of the standardized residuals. Often a
standardized residual greater than +/-3 is considered an outlier. The standardized residuals
should have a mean of 0 and the standard deviation should be close to 1.0 (which they do).
Descriptive Statistics

                        N    Minimum    Maximum   Mean       Std. Deviation
Standardized Residual   40   -1.88907   3.09323   .0000000   .97402153
Valid N (listwise)      40
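The residual calculation can be sketched as follows. The three students below are hypothetical (they are not from the course data set), and we assume the standardized residual is the raw residual divided by the standard error of the estimate from the Model Summary (2.15062); the function and variable names are ours.

```python
# Estimates from the two-predictor model and its standard error of estimate.
INTERCEPT, B_FEMALE, B_SES = 27.080, 1.374, 5.802
STD_ERROR_OF_ESTIMATE = 2.15062


def predict_reading(female, ses):
    return INTERCEPT + B_FEMALE * female + B_SES * ses


# HYPOTHETICAL students: (female, SES, observed reading score).
students = [(0, 6.0, 61.9), (1, 6.8, 74.5), (0, 7.2, 66.0)]

flags = []
for female, ses, observed in students:
    predicted = predict_reading(female, ses)
    residual = observed - predicted              # observed - predicted
    z_resid = residual / STD_ERROR_OF_ESTIMATE   # assumed standardization
    is_outlier = abs(z_resid) > 3                # common +/-3 rule of thumb
    flags.append(is_outlier)
    print(f"observed={observed:.1f} predicted={predicted:.2f} "
          f"residual={residual:.2f} standardized={z_resid:.2f} outlier={is_outlier}")
```

In this made-up example only the second student would be flagged, since that observed score is more than three standard errors above the predicted score.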
If we print a histogram of the residuals we can see there is one residual larger than +3 in
this small data set. We can say the residuals are relatively normally distributed (i.e., the skewness
is 0.45 and the kurtosis is about 1.36, not tabled). Overall we can conclude the model fits the data
reasonably well.
Investigating an Interaction
Finally, we might test whether there is an interaction present between SES and gender. The test
of an interaction may be considered as a test of parallel lines; that is, whether the effect of SES
on reading is the same for males and females. We can start by viewing a scatter plot of reading
by SES with separate lines indicated for males and females. Here we can see that the slopes of
the lines for males and females are different, with the SES-reading slope being steeper for
females (dotted line) than the SES-reading slope for males. The null hypothesis would be that the
SES-reading slope does not depend on gender.
We can create the interaction term (using compute and multiplying female by SES) and save it in
the data set. When we multiply female (coded 1) by SES we will obtain values of the interaction
for females but since males are coded 0, the interaction value for males will be 0. So we can
interpret the interaction as the possible “advantage or disadvantage” of SES on reading for
females.
When we estimate the model we obtain the following results.
ANOVA(a)

Model 1       Sum of Squares   df   Mean Square   F        Sig.
Regression           154.614    3        51.538   11.027   .000(b)
Residual             168.260   36         4.674
Total                322.874   39

a. Dependent Variable: read
b. Predictors: (Constant), femaleSES, SES, female
We can see in the ANOVA table above the model is significant (F = 11.027, p < .001). More
information, however, is provided by the table of unstandardized and standardized coefficients.
This table reveals that neither female (p = .497) nor the interaction of female*SES (p = .438) is
significant in accounting for students’ reading scores. When interactions are added to the model,
they often change the coefficients of the main effects.
We can also notice the coefficients look a little “odd” compared with the last table. They are not
incorrect, but the interaction can complicate the calculations a little bit. We would interpret the
intercept in this case as the predicted score of a male (coded 0) who has an SES score of 0
(31.435). A female (coded 1) with an SES score of 0 would be predicted to have a score of
31.435 - 9.687 or 21.748 (which accounts for the fact that in the graph the male and female
regression lines cross). The effect of SES for males is simple to estimate from the coefficients.
For a male with an SES score of 1 (a one-unit increase from 0), the estimated reading score would be
the following:
$\hat{Y} = 31.435 - 9.687(0) + 5.161(1) + 1.650(0) = 31.435 + 5.161$
The estimated score would then be 36.596. For a female with an SES score of 1, the estimated
score would be as follows:
$\hat{Y} = 31.435 - 9.687(1) + 5.161(1) + 1.650(1) = 31.435 - 2.876$
The estimated score would then be 28.559. The estimated score reflects the slight positive
advantage for the interaction effect (femaleSES) in reducing the gap in reading scores for
females. It should be noted in passing that at high levels of SES, there would be an advantage
associated with the interaction term for females. We can observe in the graph that the advantage
in reading scores for females becomes positive and increases at about SES = 6.0 or above.
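The interaction model implies a separate SES slope for each gender and a point where the two lines cross. The sketch below works these out from the estimates quoted above (31.435, -9.687, 5.161, 1.650); the names are ours, and the crossing point is simply where the female main effect and the interaction cancel.

```python
# Estimates from the interaction model.
INTERCEPT = 31.435
B_FEMALE = -9.687
B_SES = 5.161
B_FEMALE_SES = 1.650


def predict_reading(female, ses):
    """Predicted reading score; the interaction term is female * SES."""
    return (INTERCEPT + B_FEMALE * female + B_SES * ses
            + B_FEMALE_SES * female * ses)


# SES-reading slope for each gender.
male_slope = B_SES
female_slope = B_SES + B_FEMALE_SES

# The lines cross where B_FEMALE + B_FEMALE_SES * SES = 0.
crossing_ses = -B_FEMALE / B_FEMALE_SES

print(f"Male,   SES = 1: {predict_reading(0, 1):.3f}")
print(f"Female, SES = 1: {predict_reading(1, 1):.3f}")
print(f"Male slope = {male_slope:.3f}, female slope = {female_slope:.3f}")
print(f"Lines cross at SES = {crossing_ses:.2f}")
```

The crossing point falls just below 6.0, which agrees with the graph: above that point the predicted scores for females are higher than for males.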
In this case, however, even though the previous graph looks like the lines are not parallel, we
must fail to reject the null hypothesis. In other words, we assume that the effect of SES on
reading achievement does not depend on gender.
Coefficients(a)

              Unstandardized Coefficients   Standardized Coefficients
Model 1       B         Std. Error          Beta                        t       Sig.
(Constant)    31.435    8.930                                           3.520   .001
female        -9.687    14.130              -1.705                      -.686   .497
SES           5.161     1.312               .621                        3.935   .000
femaleSES     1.650     2.105               1.933                       .784    .438

a. Dependent Variable: read
When the test of the interaction is not statistically significant, we can revert back to the model
without the interaction as the final model, unless there is some theoretical reason why we might
want to keep the model with the interaction effect as the final model.
Centering SES on the Individual with lowest SES
We may be able to improve the look of the coefficients in the interaction model by centering
SES on a more meaningful value in the data set. For example, we might choose to re-center SES
on the individual with lowest SES in the sample. We will select 6.0. We can compute the value
by giving the variable a new name (SESlow) and then computing the new value as follows: SES
– 6.0. This subtracts 6.0 from everyone’s score so the new “intercept” individual will have an
SES value of 0 for the original value of 6.0 on the SES scale.
Notice how this changes the interpretation of the new interaction model. We can see the effect
of the new variable SESlow is still the same (5.161) and the interaction for female*SESlow is
also the same (1.650). The intercept, however, is now 62.404 (interpreted as the value for a male
with SESlow of 0, which corresponds to the old SES value of 6.0). The effect for female at this point on
the original graph is now 0.213, which is not significant (p = 0.897). So substantively the model
does not change, but if you look at the graph of the male and female regression lines, the model is
now centered right near the point where the two regression lines cross (at SES of about
6.0). It may be easier for readers to understand if we
use a more meaningful place to center SES.
Coefficients(a)

               Unstandardized Coefficients   Standardized Coefficients
Model 1        B         Std. Error          Beta                        t        Sig.
(Constant)     62.404    1.154                                           54.099   .000
female         .213      1.639               .038                        .130     .897
SESlow         5.161     1.312               .621                        3.935    .000
femaleSESlow   1.650     2.105               .222                        .784     .438

a. Dependent Variable: read
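We can check that centering only moves the intercept and the female coefficient, by the amounts the uncentered model implies. This is a sketch of the algebra using the rounded uncentered estimates, so the new intercept agrees with the table (62.404) only to about two decimals; the names are ours.

```python
# Uncentered interaction-model estimates.
INTERCEPT = 31.435
B_FEMALE = -9.687
B_SES = 5.161
B_FEMALE_SES = 1.650

CENTER = 6.0  # SESlow = SES - 6.0

# Evaluating the uncentered model at SES = 6.0 gives the centered coefficients.
centered_intercept = INTERCEPT + B_SES * CENTER        # male at SES = 6.0
centered_female = B_FEMALE + B_FEMALE_SES * CENTER     # gender gap at SES = 6.0

# The slopes are unaffected by subtracting a constant from SES.
centered_ses_slope = B_SES
centered_interaction = B_FEMALE_SES

print(f"Centered intercept     = {centered_intercept:.3f}")
print(f"Centered female effect = {centered_female:.3f}")
```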
How to Recode a Variable in SPSS
In our example, the lowest value of SES is 6.0 (actually 5.98). We can re-code SES so that the
intercept in the regression model refers to a person with an SES score
of 6.0 (approximately the lowest score in the data set). That is, we recode individuals’ SES scores
such that 0 represents the person with the lowest SES score in the data set. In this case, we
will use 6.0. This can be accomplished by subtracting 6.0 from everyone’s current SES score. So for
example, the first individual with an SES score of 6.76 will, after subtracting 6.0, have a score of
0.76.
To do so, open TRANSFORM and then open “Compute Variable.” You will
see “Target Variable” on the left top. We can type the name of the new variable (SESlow) there.
This will save the new variable in the data set. Next we select “SES” and click it into the
Numeric Expression box. Then we will subtract 6 from everyone’s SES score. So the Numeric
Expression box should look like this:
SES - 6
Then click on OK. You can check that the first person now has a score of 0.76 (since we have
subtracted 6 from the original score of 6.76). If that is the case, then you have performed the
transformation accurately. Now you can run your regression using Female and SESlow as the
two predictors. You will obtain the following results. The meaning of the intercept is now a
male student with an SES score of 6.0 (which has been recoded to be 0). You can see that other
coefficients are unaffected by the recode.
Coefficients(a)

              Unstandardized Coefficients   Standardized Coefficients
Model 1       B         Std. Error          Beta                        t        Sig.
(Constant)    61.892    .946                                            65.409   .000
female        1.374     .698                .242                        1.968    .057
SESlow        5.802     1.021               .699                        5.685    .000

a. Dependent Variable: read
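For readers who want to check the recode outside SPSS, the same transformation can be sketched in Python. The two SES values below are the ones mentioned in the text (6.76 for the first individual and 5.98 for the lowest score); the list and variable names are ours. The last lines confirm that the new intercept is what the original two-predictor model predicts for a male at SES = 6.0.

```python
# SES scores mentioned in the text: first individual and the lowest score.
ses_scores = [6.76, 5.98]

# Equivalent of the SPSS Compute Variable step: SESlow = SES - 6.
ses_low = [round(ses - 6.0, 2) for ses in ses_scores]
print(ses_low)

# Original model: read = 27.080 + 1.374*female + 5.802*SES.
# At female = 0 and SES = 6.0 this should equal the recoded intercept.
recoded_intercept = 27.080 + 5.802 * 6.0
print(f"Intercept after recode = {recoded_intercept:.3f}")
```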
Creating an Interaction Term in SPSS
To create the interaction term, we open TRANSFORM and “Compute Variable.” We will give
the target variable the name femaleSESlow. We place female in the Numeric Expression Box.
Then we click in an asterisk. Finally, we place SESlow in the Numeric Expression Box. The
equation should look like this:
female*SESlow
We then click OK. You should see the interaction in the data set. Now you can run the new
regression model. It should match the interaction model in the handout.
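The Compute Variable step is just a row-by-row product. In the sketch below the two rows are hypothetical (the genders are made up; only the SESlow value of 0.76 comes from the text), and the dictionary keys mirror the SPSS variable names.

```python
# HYPOTHETICAL rows: one male and one female, both with SESlow = 0.76.
rows = [
    {"female": 0, "SESlow": 0.76},
    {"female": 1, "SESlow": 0.76},
]

# Equivalent of the SPSS expression female*SESlow.
for row in rows:
    row["femaleSESlow"] = row["female"] * row["SESlow"]

for row in rows:
    print(row)
```

The interaction is 0 for every male (coded 0) and equals SESlow for every female (coded 1), which is why its coefficient is read as the additional SES slope for females.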