ED 793 LAB #10

Multiple Regression: Determining how well a model fits the data and how well each
independent variable in the model predicts the dependent variable.
MULTIPLE REGRESSION
Assumptions:
1. Independence – Scores for any subject are independent of all other subjects.
2. Normality – In the population, the scores on the dependent variable are normally distributed.
3. Homoscedasticity (Homogeneity) – In the population, the variances of the dependent variable are
equal for all levels of the independent variables.
4. Linearity – In the population, the relationship between the dependent and independent variable is
linear.
For now we can assess normality with a histogram (syntax embedded in the FREQUENCIES command), and
linearity with a scatterplot of the dependent variable against each independent variable separately. Assume
homoscedasticity holds for now; we will pick up assessing and troubleshooting these assumptions in
more detail next term.
Things to know/keep in mind about multiple regression:
1. Regression analysis explores how independent variables explain the variation of a dependent
variable.
2. There are several different methods of estimating coefficients. The most common method is OLS
(Ordinary Least Squares) regression, which we’ll use today and for your projects.
3. There are several different ways for adding the independent variables into the regression equation.
Today we are using a way SPSS calls “Enter.”
4. Generally, you get two coefficients listed on the output:
• The unstandardized coefficient, B: based on the unit of measure that the independent variable comes
in. It is good for discussing the effect that variable has on the dependent variable.
• The standardized coefficient, β (Beta): a standardized version of B. It is good for comparing the
relative strength of the effects across independent variables.
5. When judging your model and how well it explains the variance in the dependent variable, you want
your R-squared (R²) to be close to 1. If R² = 1, the model is perfect (explains all of the variance). In
educational research, it is not uncommon to see values as low as 0.15 or 0.20.
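For reference, one standard way to write R² (using the sums of squares that appear in the Analysis of
Variance table in the output below) is:
R² = SS Regression / SS Total = 1 – (SS Residual / SS Total)
where SS Total = SS Regression + SS Residual.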
An example
Let’s say we wanted to see how well we could predict a student’s score on the SATM. We might
theorize that the student’s high school GPA (HSGPA), father’s education (FATHEDUC), and self-rated
mathematical ability (RATE9810) were possible predictors. My dependent variable is SATM, and my
independent variables are HSGPA, FATHEDUC, and RATE9810. The regression model looks like this:
SATMi = β0 + β1 * HSGPAi + β2 * FATHEDUCi + β3 * RATE9810i + ei
THE COMMANDS
FREQUENCIES VAR=SATM
/HISTOGRAM.
PLOT
/PLOT SATM WITH HSGPA FATHEDUC RATE9810.
REGRESSION
/VARIABLES=SATM HSGPA FATHEDUC RATE9810
/DESCRIPTIVES=MEAN STDDEV CORR
/DEPENDENT=SATM
/METHOD=ENTER.
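If you are working in a newer version of SPSS where the old PLOT command has been retired, the
following is a rough modern equivalent of the same steps (a sketch using the same variable names;
GRAPH /SCATTERPLOT replaces PLOT, and the variable list can be given directly on /METHOD=ENTER):

FREQUENCIES VARIABLES=SATM
  /HISTOGRAM.
GRAPH /SCATTERPLOT(BIVAR)=HSGPA WITH SATM.
GRAPH /SCATTERPLOT(BIVAR)=FATHEDUC WITH SATM.
GRAPH /SCATTERPLOT(BIVAR)=RATE9810 WITH SATM.
REGRESSION
  /DESCRIPTIVES=MEAN STDDEV CORR
  /DEPENDENT=SATM
  /METHOD=ENTER HSGPA FATHEDUC RATE9810.

Either version should produce the same regression results; check the syntax reference for your SPSS
release if a command is not recognized.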
REGRESSION tells SPSS you want to perform a regression analysis.
/VARIABLES=list of variables identifies for SPSS the variables in your analysis.
/DESCRIPTIVES=MEAN STDDEV CORR tells SPSS to put the mean, standard deviation, and correlations
for the variables in your output.
/DEPENDENT=variable name identifies for SPSS which of the variables in the /VARIABLES= line is your
dependent variable.
/METHOD=ENTER is a command that tells SPSS how to enter the variables into the analysis. This needs to
be here but is not of concern until ED 795.
THE OUTPUT
* * * *  M U L T I P L E   R E G R E S S I O N  * * * *

Listwise Deletion of Missing Data  ①

               Mean   Std Dev   Label
SATM        667.316    73.407   SAT MATH SCORE
HSGPA         7.129      .916   AVERAGE HIGH SCHOOL GRADE
FATHEDUC      6.824     1.563   FATHER'S EDUCATION
RATE9810      4.089      .845   MATHEMATICAL ABILITY

N of Cases =  2342

Correlation:

             SATM   HSGPA   FATHEDUC   RATE9810
SATM        1.000    .286       .167       .524
HSGPA        .286   1.000       .042       .283
FATHEDUC     .167    .042      1.000       .026
RATE9810     .524    .283       .026      1.000

* * * *  M U L T I P L E   R E G R E S S I O N  * * * *

Equation Number 1    Dependent Variable..  SATM    SAT MATH SCORE  ②

Descriptive Statistics are printed on Page 4

Block Number  1.   Method:  Enter  ③

Variable(s) Entered on Step Number
   1..  RATE9810   MATHEMATICAL ABILITY
   2..  FATHEDUC   FATHER'S EDUCATION
   3..  HSGPA      AVERAGE HIGH SCHOOL GRADE

Multiple R            .56318
R Square              .31717  ④
Adjusted R Square     .31629
Standard Error      60.69789

Analysis of Variance
                 DF   Sum of Squares     Mean Square
Regression        3    4000992.27012   1333664.09004
Residual       2338    8613738.28027      3684.23365

F =  361.99227    Signif F =  .0000   ⑤  <- Our global F-test for the significance of
                                           the model on the whole.

------------------ Variables in the Equation ------------------

Variable            B ⑥       SE B    Beta ⑦       T ⑧   Sig T
HSGPA         11.505318   1.429056   .143552     8.051   .0000
FATHEDUC       6.979713    .803221   .148655     8.690   .0000
RATE9810      41.644284   1.547726   .479490    26.907   .0000
(Constant)   367.370031  11.573436             31.743   .0000

End Block Number  1    All requested variables entered.
A KEY TO THE OUTPUT

Multiple R         This number is squared to get R Square. It is the correlation between the
                   dependent variable and the set of independent variables collectively.

R Square           Proportion of variance in the dependent variable explained by the set of
                   independent variables collectively. An estimate of how well the model fits
                   the data. It ranges from 0 to 1, just like the square of the correlation
                   coefficient (Pearson’s r).

Adjusted R Square  It is the R² value, mathematically ‘adjusted’ for the number of predictors
                   in the model.

F                  It tests H0: β1 = β2 = … = βk = 0, or equivalently H0: R² = 0. We call this
                   a goodness-of-fit test. Does the model fit the data?

B                  This column gives you the estimated unstandardized coefficients.

SE B               This is the standard error of the estimated coefficient (B).

Beta               This column gives you the standardized coefficients.

T                  A test statistic for whether there really is a linear relationship between
                   the variable and the dependent variable in the population. It tests
                   H0: βi = 0.

Sig T              The probability that we would see such a t-value (see above) if the real
                   linear relationship between the variable and the dependent variable was
                   nonexistent (null hypothesis true).
INTERPRETING THE OUTPUT
The regression output consists of three blocks of information. The first block we see is a listing of the
descriptive statistics and the correlation matrix that we asked for. The second block has the “Goodness
of Fit” measures. The last block is about the statistics for the independent variables in the model. See
the output above for the matching symbols (e.g., ①) that go with the discussion below.
① Listwise deletion of missing data. This means that only cases which have values for all variables are
included in the analysis.
For now we will ignore the rest of the first block. It contains info we are comfortable with by now.
② Tells you what your chosen dependent variable is.
③ It shows the method you are using to enter the variables into the analysis. Other options are
STEPWISE, FORWARD, BACKWARD, and so on. Not important today.
④ This means that the three independent variables account for about 32% of the variance in the
dependent variable, the student’s SATM score in our case. That’s pretty good, though it tells us that there is
plenty of variation in SATM scores that these independent variables do not explain. Based on this, we
could predict someone’s SATM score with some accuracy from these measures, but we couldn’t predict
it perfectly.
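As a quick check, you can recover R² by hand from the Analysis of Variance table above:
R² = SS Regression / SS Total = 4000992.27 / (4000992.27 + 8613738.28) ≈ .3172
which matches the printed R Square of .31717.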
⑤ The F-statistic tells us whether our model fits the data well. In this case we can say, “There is
evidence to suggest the model fits our data (F = 361.99, sig F < 0.001).”
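The F value itself is just the ratio of the two mean squares in the Analysis of Variance table:
F = Mean Square Regression / Mean Square Residual = 1333664.09 / 3684.23 ≈ 361.99.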
⑥ These are the estimated coefficients; that is, the amount that SATM will change for every one-unit
change in the independent variable, holding all else constant. In our model, for example, every 1-point
change in HSGPA will produce, on average, an 11.5-point increase in SATM, holding FATHEDUC and
RATE9810 constant.
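Putting the estimated coefficients together gives the prediction equation. One sanity check you can do
by hand: an OLS regression line always passes through the means of the variables, so plugging the
means from the descriptives table into the equation should return the mean of SATM:
predicted SATM = 367.370 + 11.505(7.129) + 6.980(6.824) + 41.644(4.089) ≈ 667.3
which matches the SATM mean of 667.316 (up to rounding).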
⑦ Beta tells us what the coefficient is in standardized form. Using beta, we can compare the relative
“power” of the independent variables in explaining variation in the dependent variable. In our model,
we see that RATE9810 is the “strongest” variable for predicting SATM (another way to say this is that
RATE9810 is the strongest predictor of SATM). To get technical, beta indicates the expected change in
the dependent variable, in standard deviation units, that is associated with a one standard deviation
change in an independent variable while holding the remaining variables constant.
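Although SPSS computes Beta for you, it is simply B rescaled by the sample standard deviations (a
standard relationship, stated here for reference):
Beta = B × (Std Dev of the independent variable / Std Dev of the dependent variable)
For HSGPA: 11.505 × (.916 / 73.407) ≈ .1436, which matches the printed Beta of .143552.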
⑧ We have seen these values before. They are the B value divided by its standard error. Why?
The significance of a t-value tells us whether our independent variables are considered significant
predictors of our dependent variable. In our model, HSGPA, FATHEDUC, and RATE9810 are all
significant predictors of SATM.
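For example, for HSGPA: t = B / SE B = 11.505318 / 1.429056 ≈ 8.051, which is exactly the T value
printed in the output.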
FINAL POINTS ABOUT MULTIPLE REGRESSION
• B v. Beta
1. Beta is sample-specific because it has been standardized using the standard deviations of the
variables, which come from the sample itself. Therefore, it cannot be used for the purpose of
generalizing across settings and populations. The B, in contrast, can be thought of as an estimate of
the population parameter, and with a representative sample can be used to generalize to the population.
• R-squared v. Adjusted R-squared
1. R-squared: As the number of predictors (IVs) goes up, R-squared almost invariably increases and
NEVER decreases. So, it’s somewhat optimistic.
2. Adjusted R-squared: As the number of predictors goes up, Adjusted R-squared can decrease
because it is “adjusted” for the number of predictors in the model (see the formula below).
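For reference, the usual adjustment (with n cases and k predictors) is:
Adjusted R² = 1 – (1 – R²) × (n – 1) / (n – k – 1)
For our output: 1 – (1 – .31717) × (2341 / 2338) ≈ .31629, which matches the printed Adjusted R Square.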
• Maximizing Adjusted R-squared
From the previous discussion, it seems self-evident that the higher the Adjusted R-square, the better.
Generally speaking, this is true. However, the following is a warning against fully adopting that idea:
“Sometimes researchers play the game of maximizing Adjusted R-square, that is, choosing the model that gives the
highest Adjusted R-square. But this may be dangerous, for in regression analysis our objective is not to obtain a
high Adjusted R-square per se but rather to obtain dependable estimates of the true population regression
coefficients and draw statistical inferences about them… Therefore, the researcher should be more concerned about
the logical or theoretical relevance of the explanatory variables to the dependent variable and their statistical
significance. If in this process we obtain a high Adjusted R-square, well and good; on the other hand, if Adjusted
R-square is low, it does not mean the model is necessarily bad.” (Pedhazur, 1997)