ED 793 LAB #10

Multiple Regression: Determining how well a model fits the data and how well each independent variable in the model predicts the dependent variable.

MULTIPLE REGRESSION

Assumptions:
1. Independence – Scores for any subject are independent of all other subjects.
2. Normality – In the population, the scores on the dependent variable are normally distributed.
3. Homoscedasticity (Homogeneity) – In the population, the variances of the dependent variable are equal for all levels of the independent variables.
4. Linearity – In the population, the relationship between the dependent and independent variables is linear.

For now we can assess normality with a histogram (syntax embedded in the FREQUENCIES command), and linearity with a scatterplot of the dependent variable against each independent variable separately. Assume homoscedasticity to be true; we will pick up assessing and troubleshooting these assumptions in more detail next term.

Things to know/keep in mind about multiple regression:
1. Regression analysis explores how independent variables explain the variation of a dependent variable.
2. There are several different methods of estimating coefficients. The most common method is OLS (Ordinary Least Squares) regression, which we'll use today and for your projects.
3. There are several different ways of adding the independent variables into the regression equation. Today we are using the one SPSS calls "Enter."
4. Generally, you get two coefficients listed on the output:
   The unstandardized coefficient, B: based on the unit of measure that the independent variable comes in. It is good for discussing the effect that variable has on the dependent variable.
   The standardized coefficient, β (Beta): a standardized version of B. It is good for comparing the relative strength of the effects across independent variables.
5. When judging your model and how well it explains the variance in the dependent variable, you want your R-squared (R²) to be close to 1.
If R² = 1, the model is perfect (explains all of the variance). In educational research, it is not uncommon to see values as low as 0.15 or 0.20.

An example

Let's say we wanted to see how well we could predict a student's score on the SATM. We might theorize that the student's high school GPA (HSGPA), father's education (FATHEDUC), and self-rated mathematical ability (RATE9810) were possible predictors. My dependent variable is SATM, and my independent variables are HSGPA, FATHEDUC, and RATE9810. The regression model looks like this:

SATMi = β0 + β1·HSGPAi + β2·FATHEDUCi + β3·RATE9810i + ei

6/29/17  ED 793 Lab 1

THE COMMANDS

FREQUENCIES VAR=SATM
  /HISTOGRAM.
PLOT
  /PLOT SATM WITH HSGPA FATHEDUC RATE9810.
REGRESSION
  /VARIABLES=SATM HSGPA FATHEDUC RATE9810
  /DESCRIPTIVES=MEAN STDDEV CORR
  /DEPENDENT=SATM
  /METHOD=ENTER.

REGRESSION tells SPSS you want to perform a regression analysis.
/VARIABLES=list of variables identifies for SPSS the variables in your analysis.
/DESCRIPTIVES=MEAN STDDEV CORR tells SPSS to put the mean, standard deviation, and correlations for the variables in your output.
/DEPENDENT=variable name identifies for SPSS which of the variables in the /VARIABLES= line is your dependent variable.
/METHOD=ENTER is a command that tells SPSS how to enter the variables into the analysis. This needs to be here but is not of concern until ED 795.

THE OUTPUT

* * * *  M U L T I P L E   R E G R E S S I O N  * * * *

Listwise Deletion of Missing Data

              Mean     Std Dev   Label
SATM        667.316    73.407    SAT MATH SCORE
HSGPA         7.129      .916    AVERAGE HIGH SCHOOL GRADE
FATHEDUC      6.824     1.563    FATHER'S EDUCATION
RATE9810      4.089      .845    MATHEMATICAL ABILITY

N of Cases = 2342

Correlation:
            SATM    HSGPA   FATHEDUC  RATE9810
SATM       1.000     .286     .167      .524
HSGPA       .286    1.000     .042      .283
FATHEDUC    .167     .042    1.000      .026
RATE9810   .524     .283     .026     1.000

Equation Number 1    Dependent Variable..  SATM     SAT MATH SCORE

Descriptive Statistics are printed on Page 4.

Block Number 1.  Method:  Enter

Variable(s) Entered on Step Number
   1..  RATE9810   MATHEMATICAL ABILITY
   2..  FATHEDUC   FATHER'S EDUCATION
   3..  HSGPA      AVERAGE HIGH SCHOOL GRADE

Multiple R           .56318
R Square             .31717
Adjusted R Square    .31629
Standard Error     60.69789

Analysis of Variance
                DF     Sum of Squares      Mean Square
Regression       3     4000992.27012     1333664.09004
Residual      2338     8613738.28027        3684.23365

F = 361.99227      Signif F = .0000

(This is our global F-test for the significance of the model as a whole.)

------------------ Variables in the Equation ------------------
Variable           B         SE B       Beta        T    Sig T
HSGPA         11.505318    1.429056   .143552    8.051   .0000
FATHEDUC       6.979713     .803221   .148655    8.690   .0000
RATE9810      41.644284    1.547726   .479490   26.907   .0000
(Constant)   367.370031   11.573436             31.743   .0000

End Block Number 1   All requested variables entered.

A KEY TO THE OUTPUT

Multiple R – The correlation between the dependent variable and the set of independent variables collectively. This number is squared to get R square.

R square – Percent of variance in the dependent variable explained by the set of independent variables collectively. An estimate of how well the model fits the data. It ranges from 0 to 1, just like the square of the correlation coefficient (Pearson's r).

Adjusted R square – The R² value, mathematically "adjusted" for the number of predictors in the model.

F – It tests H0: β1 = β2 = … = βk = 0, or equivalently H0: R² = 0. We call this a goodness-of-fit test. Does the model fit the data?

B – This column gives you the estimated unstandardized coefficients.

SE B – The standard error of the estimated coefficient (B).

Beta – This column gives you the standardized coefficients.

T – A test statistic for whether there really is a linear relationship between the variable and the dependent variable in the population. It tests H0: βi = 0.
Sig T – The probability that we would see such a t-value (see above) if the real linear relationship between the variable and the dependent variable were nonexistent (i.e., if the null hypothesis were true).

INTERPRETING THE OUTPUT

The regression output consists of three blocks of information. The first block is a listing of the descriptive statistics and the correlation matrix that we asked for. The second block has the "Goodness of Fit" measures. The last block gives the statistics for the independent variables in the model. The notes below walk through the key pieces of the output.

Listwise deletion of missing data means that only cases which have values for all variables are included in the analysis. For now we will ignore the rest of the first block; it contains information we are comfortable with by now.

Dependent Variable.. tells you what your chosen dependent variable is.

Method: Enter shows the method you are using to enter the variables into the analysis. Other options are STEPWISE, FORWARD, BACKWARD, and so on. Not important today.

R Square = .31717 means that the three independent variables account for about 32% of the variance in the dependent variable, the student's SATM score in our case. That's pretty good, though it tells us that there is plenty of variation in SATM scores that these independent variables do not explain. Based on this, we could predict someone's SATM score with some accuracy from these measures, but we couldn't predict it perfectly.

The F-statistic tells us whether our model fits the data well. In this case we can say, "There is evidence to suggest the model fits our data (F = 361.99, sig F < 0.001)."

The B column gives the estimated coefficients: the amount that SATM will change for every one-unit change in the independent variable, holding all else constant. In our model, for example, every 1-point change in HSGPA will produce, on average, an 11.5-point increase in SATM, holding FATHEDUC and RATE9810 constant.
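The goodness-of-fit numbers and the B interpretation above are easy to reproduce by hand. Here is a short Python sketch (not part of the SPSS lab) that recomputes R², Adjusted R², and F from the sums of squares printed in the output, and shows that a one-point change in HSGPA moves the prediction by exactly B. The predictor values 7 and 4 passed to the function are arbitrary illustration inputs.

```python
# Reproducing the goodness-of-fit numbers from the printed SPSS output.
# All inputs below are copied from the output; nothing is re-estimated here.

n = 2342                  # N of Cases
k = 3                     # number of predictors: HSGPA, FATHEDUC, RATE9810
ss_reg = 4000992.27012    # Regression Sum of Squares
ss_res = 8613738.28027    # Residual Sum of Squares

r_square = ss_reg / (ss_reg + ss_res)
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)
f_stat = (ss_reg / k) / (ss_res / (n - k - 1))   # Mean Square Reg / Mean Square Res

print(round(r_square, 5))       # 0.31717  (matches R Square)
print(round(adj_r_square, 5))   # 0.31629  (matches Adjusted R Square)
print(round(f_stat, 2))         # 361.99   (matches F)

# B is the predicted change in SATM per one-unit change in a predictor,
# holding the others constant.
def predict_satm(hsgpa, fatheduc, rate9810):
    return (367.370031 + 11.505318 * hsgpa
            + 6.979713 * fatheduc + 41.644284 * rate9810)

# Raising HSGPA from 7 to 8 raises the prediction by B for HSGPA:
print(round(predict_satm(8.0, 7, 4) - predict_satm(7.0, 7, 4), 2))  # 11.51
```

Notice that Adjusted R² comes out slightly below R², exactly as the output shows, because it penalizes the model for its three predictors.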
Beta tells us what the coefficient is in standardized form. Using Beta, we can compare the relative "power" of the independent variables in explaining variation in the dependent variable. In our model, we see that RATE9810 is the "strongest" variable for predicting SATM (another way to say this is that RATE9810 is the strongest predictor of SATM). To get technical, Beta indicates the expected change in the dependent variable, in standard deviation units, that is associated with a one standard deviation change in an independent variable while holding the remaining variables constant.

The T values we have seen before: they are the B value divided by its standard error. The significance of a t-value tells us whether an independent variable is a significant predictor of our dependent variable. In our model, HSGPA, FATHEDUC, and RATE9810 are all significant predictors of SATM.

FINAL POINTS ABOUT MULTIPLE REGRESSION

B v. Beta
1. Beta is sample-specific because it has been standardized using the standard deviations of the variables, which come from the sample itself. Therefore, it cannot be used for the purpose of generalizing across settings and populations. B, in contrast, can be thought of as an estimate of the population parameter, and with a representative sample can be used to generalize to the population.

R-squared v. Adjusted R-squared
1. R-squared: As the number of predictors (IVs) goes up, R-squared almost invariably increases and NEVER decreases. So it is somewhat optimistic.
2. Adjusted R-squared: As the number of predictors goes up, Adjusted R-squared can decrease because it is "adjusted" for the number of predictors in the model.

Maximizing Adjusted R-squared
From the previous discussion, it seems self-evident that the higher the Adjusted R-square, the better. Generally speaking, this is true. However, the following is a warning against fully adopting that idea.
Sometimes researchers play the game of maximizing Adjusted R-square, that is, choosing the model that gives the highest Adjusted R-square. But this may be dangerous, for in regression analysis our objective is not to obtain a high Adjusted R-square per se but rather to obtain dependable estimates of the true population regression coefficients and draw statistical inferences about them… Therefore, the researcher should be more concerned about the logical or theoretical relevance of the explanatory variables to the dependent variable and their statistical significance. If in this process we obtain a high Adjusted R-square, well and good; on the other hand, if Adjusted R-square is low, it does not mean the model is necessarily bad. (Pedhazur, 1997)
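As a closing check on the B v. Beta and T discussions above: Beta can be recovered from B and the sample standard deviations (Beta = B × SD of the predictor ÷ SD of the dependent variable), and T from B and its standard error. The Python sketch below (not part of the SPSS lab) does this using only values printed in the output; because SPSS printed rounded numbers, the reproduced figures agree with the Beta and T columns only to about three decimal places.

```python
# Reproducing the Beta and T columns from B, SE B, and the standard
# deviations printed in the SPSS output above.

sd_satm = 73.407   # Std Dev of the dependent variable, SATM

# variable: (B, SE B, Std Dev of the variable)
predictors = {
    "HSGPA":    (11.505318, 1.429056, 0.916),
    "FATHEDUC": ( 6.979713, 0.803221, 1.563),
    "RATE9810": (41.644284, 1.547726, 0.845),
}

for name, (b, se_b, sd_x) in predictors.items():
    beta = b * sd_x / sd_satm   # Beta standardizes B with the sample SDs
    t = b / se_b                # T is B divided by its standard error
    print(f"{name:9s} Beta={beta:.3f}  T={t:.2f}")
```

Note the payoff: HSGPA has a larger B than FATHEDUC (11.5 vs. 7.0), yet a slightly smaller Beta (.144 vs. .149), which is exactly why raw B values, tied to each variable's unit of measure, cannot be compared across predictors the way Beta can.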