Lecture 6: Simple Regression

Regression Analysis
The development of a formula for weighting or combining the values of 1 or more independent variables to predict or explain variation in values of a dependent variable.

Always 1 DV. It'll be labeled Y.

Simple Regression: 1 independent variable. It'll be labeled X.

Multiple Regression: 2 or more independent variables. The first will be labeled X1. The second will be labeled X2. And so forth.

Linear Regression: The formula involves only the first power of the X(s) and no higher powers or products, e.g., Y = a + b1*X1 + b2*X2.

Nonlinear Regression: The formula involves powers or products of the X(s), or transformations of the X(s) such as logarithmic or exponential transformations, e.g., Y = a + b*X1^2 + c*X1*X2 + d*log(X4).

Simple Linear Regression: The formula has the form Predicted Y = a + bX.

Multiple Linear Regression: The formula has the form Predicted Y = a(y.12) + b(y1.2)*X1 + b(y2.1)*X2. It will also be written as Predicted Y = a + b1*X1 + b2*X2. Later on, we'll write this as Predicted Y = B0 + B1*X1 + B2*X2. (B0 + B1X1 + B2X2 if I forget to subscript the subscripts.)

Symbols that will be used to stand for predicted Y
Predicted Y is written as Y-hat, Y', or Ŷ.

Regression Analysis vs. Correlation Analysis
Correlation: An analysis which assesses the strength and direction of the relationship between X and Y.
Regression: An analysis that allows you to predict or explain Y from X. Requires a prior correlation analysis confirming a relationship between X and Y.
So correlation analysis is a step on the way to regression analysis. More on this later.

Lecture 6 – Simple Regression - 1 5/8/2017

Why perform regression analysis?

1. Convenience. The formula serves as a convenient way to generate predicted Y values for persons for whom we have only X or X's. See the following data matrix of test performance and 1st year sales. What would be the predicted SALES for a person who scored 30 on the test?
PERSON  TEST  SALES
   1     32     890
   2     23     790
   3     36    1330
   4     34     855
   5     31     990
   6     32    1285
   7     26     865
   8     21     900
   9     27     725
  10     38    1115
  11     32    1135
  12     29    1060
  13     28    1160
  14     32    1195
  15     36    1100
  16     37    1165
  17     36    1295
  18     22     720
  19     29    1090
  20     33    1040
  21     31     805
  22     28     925
  23     22     520
  24     29    1070
  25     24     975

2. Objectivity. The formula serves as an objective way to generate predicted Y values for persons for whom we have only X or X's. Suppose the boss's daughter or son scored 25 on the test. Should he/she be hired? The formula is a way of generating predictions which depend only on the data in an objective fashion. You can say, “Gosh, Boss. I’d really like to hire your kid, but the formula won’t allow it.”

3. Regression Extras. A byproduct of the analysis allows us to determine the accuracy of our predictions. That is, besides the fact that the formula gives us a convenient, objective prediction, we are also able to say how accurate that prediction is (by way of a confidence interval).

4. Theory. The form of the relationship (linear vs. nonlinear) may be of theoretical interest. Some theories predict linear relationships between variables. Others predict specific forms of nonlinear relationships. Regression analysis affords tests of those predictions.
LaHuis, D. M., Martin, N. R., & Avis, J. M. (2005). Investigating Nonlinear Conscientiousness–Job Performance Relations for Clerical Employees. Human Performance, 18, 199-212.

5. Statistical Control. Multiple regression analysis allows us to investigate the effects of variables while statistically controlling for the effects of other variables. These statistical controls are often the only kinds which are possible in many real-life data analytic situations.

SIMPLE REGRESSION LECTURE EXAMPLE - Start here on 2/21/17.

Consider an insurance company's desire to predict performance of its sales persons.
Clearly it would be of benefit to the company to know which prospective employees would be good salespersons and which would be expected to be poor. The company could hire only those expected to do well in the sales position.

Suppose that a test of sales ability is being given to all current employees. In addition, a record of sales in the first year on the job is available for all employees.

The interest here is in using the relationship between test scores and first year sales performance of previously hired employees to predict first year sales of prospective employees.

Suppose data on 25 current employees are available. Test scores can range from 0 to 50, with 50 representing the best possible test score. Sales figures can range from 0 to $2,000,000. For the sake of exposition, sales figures are expressed in thousands of dollars, ranging from 0 to 2000. The data are the PERSON / TEST / SALES data matrix for the 25 employees shown earlier.

The formula method for computing a and b

The formula method computes b and a such that the sum of squares of the differences between the Y's and the Y-hats is smaller than it would be for any other values. This is called a least squares solution.

The mean is a least squares measure. The variance (and standard deviation) about the mean is smaller than it is about any other value. The mean is “closer” to all the scores (in the least squares sense) than any other value.

Estimate of b
The estimate of b can be obtained using several formulas. b(YX) is the covariance of X and Y divided by the variance of the X's:

$$ b_{YX} = \frac{\sum (X - \bar{X})(Y - \bar{Y}) / n}{\sum (X - \bar{X})^2 / n} = r \cdot \frac{S_Y}{S_X} $$

The y-intercept
Once b has been computed, a is computed from

$$ a = \bar{Y} - b \bar{X} $$

When r = 0
Note that b depends on r.
If r is 0, b is 0, and all predictions will be the same value, Y-bar. That is, the prediction equation would be

Predicted Y = a + b*X = a + 0*X = a
Predicted Y = a = Y-bar.

So, if there is no correlation between the Xs and the Ys, the predicted Y for every person is simply the mean of the Ys.

The stat package method. (Formula method in disguise.)

There’ll be an extended SPSS example later in the lecture.

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .684a   .468       .445                149.203
a. Predictors: (Constant), TEST

Coefficientsa
                     Unstandardized Coefficients   Standardized Coefficients
Model 1              B          Std. Error         Beta                        t       Sig.
(Constant)  (a)      176.474    185.606                                        .951    .352
TEST        (b)      27.524     6.123              .684                        4.495   .000
a. Dependent Variable: SALES

Standard error: The amount that the estimate would be expected to vary if multiple regressions on different samples were conducted. It’s the estimated standard deviation of values across multiple samples.

Sig: p-value for a test of the null hypothesis that in the population the coefficient is zero.

Predicted Y = 176.474 + 27.524*TEST

For TEST = 20, Predicted Y = 176.474 + 27.524*20 = 726.954

Visual representation of the regression analysis

A scatterplot with observed points, predicted points, a couple of residuals and the best fitting straight line for this analysis. (Created by hand.)

[Figure: scatterplot of Sales (vertical axis, 0 to 1500) against Test (horizontal axis, 0 to 50), with labels for an actual point, a predicted point, a positive residual and a negative residual.]

Residual: Observed Y - Predicted Y.
Positive Residual: Y is bigger than predicted. Observed was better than predicted.
Negative Residual: Y is smaller than predicted. Observed was worse than predicted.

Interpretation of the regression parameters

a: Expected value of Y when X = 0. In this example, as in most psychological examples, the X=0 situation is of little interest.
What is 0 on most psychological scales is arbitrary.

b: Interpretation 1: Expected difference in Y between two people who differ by 1 on X.
   Interpretation 2: Expected change in Y when X increases by 1, if X is manipulable.

Here we would expect about a $27,500 difference in sales between two people who differed by 1 point on the test.

Regression as a model for a relationship

Model: Something not as complex or detailed as the original that performs similarly to the original.

The above shows how a regression equation serves as a model of a relationship. The filled-in points are an idealization of the Test~Sales relationship. They show how Sales would relate to Test if it weren’t for the errors introduced by idiosyncrasies of individuals.

You will hear data analysts speak of the regression model of the data. The scatterplot of filled points is what they are referring to.

What to get from the simple regression

1. The r-squared. If it is not significantly different from 0, the regression analysis is of no value in explanation or prediction except to rule out this X as a predictor/explainer of Y.

2. The regression parameters.
a: It may be of interest to know what Y would be expected when X=0. (Probably not, though.)
b: It may be of interest to know how much Y would be expected to change if X increased by 1.

3. Predicted Ys. It may be of interest to know the expected Y value of a person with a particular value of X.

4. The residuals. It may be of interest to know how a person did relative to what he/she was expected to do.

Difference between a Correlation Analysis and a Regression Analysis

1. Correlation: No independent/dependent distinction.
   Regression: There is a dependent variable and an independent variable.
2. Correlation: A binomial effect size table is appropriate.
   Regression: A binomial effect size table is appropriate.
3. Correlation: The correlation coefficient is computed.
   Regression: The correlation coefficient is computed.
4. Correlation: A scatterplot is typically created.
   Regression: A scatterplot with the best fitting straight line (BFSL) is created.
5. Correlation: (nothing corresponding)
   Regression: The prediction equation with regression parameters is reported.
6. Correlation: Interpretation focuses on the relationship.
   Regression: Interpretation focuses on the relationship and on predicted values vs. observed values.

What to watch for in both . . .

1. Is the relationship essentially linear?
2. Are there any points that are particularly poorly predicted?
3. Are there any points that appear to be too influential?
4. Is the relationship what your theory said it would be?

Regression Statistics for Individual Cases (boring, arcane, grit your teeth)

Here for your reference. Be sure to take advantage of it if you must.

Predicted Y's: Predicted Y = a + b*X
Central Tendency: Mean of the predicted Ys is the mean of the Y's.
Variability: Variance of the predicted Ys is r^2 * Variance of the Ys.

1) The fact that the variance of the predicted Ys is less than or equal to the variance of the Ys means that the predicted Ys will be conservative: the predicted Y for a large Y will be slightly smaller than the actual Y, for example. Predicted Ys regress to the mean.

2) The predicted Ys are perfectly linearly related to the Xs. The predicted Ys are simply a restatement or linear recoding of the Xs. The predicted Ys are the Xs, thinly disguised.

3) When you plot predicted Ys vs. Xs, you’ll always get a perfectly straight line of points.

Residuals: Y - Y-hat.
Central Tendency: Mean of the residuals is 0.
Variability: Variance of the residuals = (1 - r^2) * Variance of the Ys.
A positive residual: Y outperformed the prediction. Y overachieved.
A negative residual: Y underperformed the prediction. Y underachieved.

Information about residuals

1) Residuals represent variation in the Ys that is unrelated to the Xs. So the correlation between the residuals and the Xs (and the Y-hats) is exactly 0.

2) The regression analysis has extracted all the variation in Ys that is related to Xs and embodied it in the predicted Ys.
All the variation in Ys that remains is variation that is not related to Xs.

3) This variation may be related to variables other than X. In fact, if the variance of the residuals is large, this is an indication that there is variation in Y remaining to be predicted.

4) So large residuals in a simple regression will cause us to search for other predictors of Y.

5) Residuals are said to represent the unique variation of Y with respect to X.

Partitioning the Variance of the Ys

We often say that regression analysis divides the variation of the Ys into two components:
a) variation that is completely related to X (the variation of the Y-hats), and
b) variation that is completely independent of X (the variation of the residuals).

Summary Statistics for the whole sample

Coefficient of determination, R2 (It may often be symbolized as r2.)

It is interpreted as the percentage of variance of the Y’s which is linearly related to the X’s.

$$ \text{Coefficient of determination} = \frac{\sum (\hat{Y} - \bar{Y})^2 / N}{\sum (Y - \bar{Y})^2 / N} = \frac{\text{Variance of the Y-hats}}{\text{Variance of the Y's}} $$

The coefficient of determination ranges from 0 to 1.
0: Y is not related to X in a linear fashion.
1: Y is perfectly linearly related to X.

The coefficient of determination, i.e., r2, is the most often used measure of goodness of fit of the regression model.

Standard Deviation of the residuals

$$ S_{Y-\hat{Y}} = \sqrt{\frac{\sum (Y - \hat{Y} - 0)^2}{n}} $$

Typically, it is written as

$$ S_{Y-\hat{Y}} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n}} $$

since the mean of the residuals = 0.

Standard Error of Estimate

$$ \hat{S}_{Y-\hat{Y}} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} $$

Again the mean of the residuals, 0, has been dropped from the numerator.

The standard error of estimate measures how much the points vary about the regression line. Roughly, it’s a measure of how close we could expect an actual Y to be to its predicted Y.

A large Standard Error of Estimate means that prediction is poor. A small Standard Error of Estimate means that the prediction equation is doing a good job.
If normal distribution assumptions are met, about 2/3 of the Y’s will be within 1 SEE of Y-hat. About 95% of the Y’s will be within 2 SEE’s of Y-hat.

Talking about regression

We always regress the dependent variable onto the independent variable(s).

Regression with Standardized Variables, Z-scores

If all X's and Y's are converted to their respective Z-scores . . .

Predicted ZY = r * ZX

Since r is invariably less than 1 in absolute value, this equation predicts regression to the mean. The distance of ZY from its mean will be predicted to be less than the distance of ZX from its mean.

Identifying outliers and influential cases

X-outlier
A case whose X-value is way out in the upper or lower tail of the X distribution. Compute ZX. Those values >= 2 in absolute value are suspect.

Regression outlier
A case whose residual is way out in the upper or lower tail of the distribution of residuals. A case whose Y is especially poorly predicted by the regression equation. Compute the Z-score of the residual, Z(Y - Y-hat). Those values >= 2 in absolute value are suspect.

DFBETA
A measure of the extent to which a case affects a parameter of the regression equation.

DFBETAa for a case = a computed from all cases minus a with the case excluded. Measures how much a person’s presence in the analysis affects a.

DFBETAb for a case = b computed from all cases minus b with the case excluded. Measures how much a person’s presence in the analysis affects b.

In each instance, DFBETA is the amount by which the parameter changed when the case was included. That is, adding case i changed a (or b) by DFBETAa (or DFBETAb).

[Figure: two scatterplots, each with one case drawn as a small circle, labeled "Large positive dfbetaa" and "Large positive dfbetab".]

On the left, the case represented by the small circle affects the y-intercept (a) but not the slope. On the right, the case represented by the small circle primarily affects the slope (b).
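The formula method, the whole-sample summary statistics, and the DFBETA definitions above can be sketched in a few lines of Python using the 25-person TEST/SALES data from the lecture example. This is only an illustrative sketch, not how SPSS computes its output; the function names (`fit_line`, `summary_stats`, `dfbetas`) are made up for this example, and the DFBETAs are the raw differences defined above, not any standardized version a stat package might also report.

```python
from math import sqrt
from statistics import mean

# TEST (X) and first-year SALES in $1000s (Y) for the 25 employees in the lecture example.
TEST = [32, 23, 36, 34, 31, 32, 26, 21, 27, 38, 32, 29, 28,
        32, 36, 37, 36, 22, 29, 33, 31, 28, 22, 29, 24]
SALES = [890, 790, 1330, 855, 990, 1285, 865, 900, 725, 1115, 1135, 1060, 1160,
         1195, 1100, 1165, 1295, 720, 1090, 1040, 805, 925, 520, 1070, 975]


def fit_line(xs, ys):
    """Least squares intercept a and slope b for Predicted Y = a + b*X."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # covariance of X and Y over variance of X (the n's cancel)
    a = y_bar - b * x_bar    # y-intercept
    return a, b


def summary_stats(xs, ys):
    """Coefficient of determination and standard error of estimate."""
    a, b = fit_line(xs, ys)
    y_bar = mean(ys)
    y_hats = [a + b * x for x in xs]
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_resid = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))
    r_squared = 1 - ss_resid / ss_total
    see = sqrt(ss_resid / (len(xs) - 2))
    return r_squared, see


def dfbetas(xs, ys):
    """Raw (DFBETAa, DFBETAb) per case: all-cases estimate minus estimate without the case."""
    a_all, b_all = fit_line(xs, ys)
    result = []
    for i in range(len(xs)):
        a_wo, b_wo = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        result.append((a_all - a_wo, b_all - b_wo))
    return result


a, b = fit_line(TEST, SALES)
r_squared, see = summary_stats(TEST, SALES)
influence = dfbetas(TEST, SALES)

print(f"a = {a:.3f}, b = {b:.3f}, r-squared = {r_squared:.3f}, SEE = {see:.3f}")
# Person whose presence changes the slope the most (1-based person number).
most_influential = max(range(len(influence)), key=lambda i: abs(influence[i][1])) + 1
print("Largest |DFBETAb| belongs to person", most_influential)
```

The values of a, b, r-squared and the standard error of estimate printed here can be compared against the Model Summary and Coefficients tables shown earlier for the SALES-on-TEST regression.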
SPSS Worked Out Example

Prediction of P5130 scores from P5100/P5110 scores

We’ll examine the extent to which P5130 scores can be predicted from P5100/P5110 scores. The scores are proportions of total possible points in the course. The data below are real, gathered over the past nearly 20 years.

Here’s a scatterplot of the relationship . . .

The scatterplot, along with SPSS’s best fitting straight line and the r2 value printed in the scatterplot, presents much of what many data analysts would like to know about the situation. It shows that the overall relationship is strong and positive. But there is scatter about that relationship, enough to show that a person who does poorly in the fall course doesn’t necessarily have to do as poorly in the spring course and that a person who does well in the fall course won’t necessarily do as well in the spring course.

Here are univariate statistics on each variable . . .

Statistics
                   p511g     p513g
N    Valid         358       358
     Missing       0         0
Mean               .8763     .8751
Median             .8800     .8900
Std. Deviation     .07591    .08841

Both distributions are unimodal. The distribution of p513g scores is slightly more negatively skewed than the p511g distribution.

Simple regression analysis

The Output

Syntax, if you’re interested . . .

REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT p513g
  /METHOD=ENTER p511g
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID).

Annotation: Boilerplate syntax that represents defaults.

Regression
[DataSet1] G:\MdbO\Dept\Validation\GRADSTUDENTS.sav (P511yr < 2010)

Descriptive Statistics
         Mean     Std. Deviation   N
p513g    .8751    .08841           358
p511g    .8763    .07591           358

Correlations
                               p513g    p511g
Pearson Correlation   p513g    1.000    .739
                      p511g    .739     1.000
Sig. (1-tailed)       p513g    .        .000
                      p511g    .000     .
N                     p513g    358      358
                      p511g    358      358

Variables Entered/Removedb
Model   Variables Entered   Variables Removed   Method
1       p511ga              .                   Enter
a. All requested variables entered.
b. Dependent Variable: p513g
Annotation: Default output.

Model Summaryb
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .739a   .546       .545                .05964
a. Predictors: (Constant), p511g
b. Dependent Variable: p513g
Annotation: Key information. Standard error of estimate: variability of points about the regression line.

ANOVAb
Model 1      Sum of Squares   df    Mean Square   F         Sig.
Regression   1.524            1     1.524         428.526   .000a
Residual     1.266            356   .004
Total        2.790            357
a. Predictors: (Constant), p511g
b. Dependent Variable: p513g
Annotation: This is redundant information for a simple regression analysis, redundant with the information printed below.

Coefficientsa
                   Unstandardized Coefficients   Standardized Coefficients
Model 1            B        Std. Error           Beta                        t        Sig.
(Constant)  (a)    .121     .037                                             3.301    .001
p511g       (b)    .861     .042                 .739                        20.701   .000
a. Dependent Variable: p513g

Residuals Statisticsa
                       Minimum    Maximum   Mean     Std. Deviation   N
Predicted Value        .6544      1.0418    .8751    .06534           358
Residual               -.19796    .22065    .00000   .05955           358
Std. Predicted Value   -3.377     2.551     .000     1.000            358
Std. Residual          -3.319     3.700     .000     .999             358
a. Dependent Variable: p513g

The prediction equation: Predicted P5130 score = 0.121 + 0.861*p511g.

So, if a student got the lowest possible A in 511g, 0.90, the predicted P5130 score would be .121 + .861*.90 = .896 ~ .9, also the lowest possible A.

A student with the lowest possible B would be predicted to earn .121 + .861*.8 = .81 in P5130.

Charts

Annotation: Key information.

[Chart: histogram of standardized residuals.] Should be a classic US distribution.

[Chart: normal probability plot of standardized residuals.] Should be a straight line.

[Chart: scatterplot of standardized residuals vs. standardized predicted values.] This plot should be a classic zero correlation scatterplot: a shotgun blast on a wall.
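Several numbers in the SPSS output above are tied together by the formulas given earlier in the lecture (b = r*SY/SX, a = Y-bar - b*X-bar, and the variance partitioning). As a rough cross-check, the sketch below recomputes them from the rounded values printed in the tables, so small discrepancies in the last digit are expected. The variable names are chosen for this illustration only.

```python
from math import sqrt

# Values copied from the SPSS output above (rounded as printed).
r = 0.739                          # Pearson correlation of p511g and p513g
x_bar, s_x = 0.8763, 0.07591       # p511g mean and standard deviation
y_bar, s_y = 0.8751, 0.08841       # p513g mean and standard deviation
ss_reg, ss_res, ss_total = 1.524, 1.266, 2.790
df_res = 356

# Slope and intercept from the correlation and the descriptive statistics.
b = r * s_y / s_x
a = y_bar - b * x_bar

# Coefficient of determination from the ANOVA sums of squares.
r_squared = ss_reg / ss_total

# Standard error of estimate and F from the ANOVA table.
see = sqrt(ss_res / df_res)
f_ratio = (ss_reg / 1) / (ss_res / df_res)

# Standard deviations of the predicted values and of the residuals.
sd_predicted = r * s_y
sd_residual = sqrt(1 - r ** 2) * s_y

print(f"b = {b:.3f}, a = {a:.3f}")
print(f"R square = {r_squared:.3f}, SEE = {see:.5f}, F = {f_ratio:.1f}")
print(f"SD of predicted = {sd_predicted:.5f}, SD of residuals = {sd_residual:.5f}")
```

Agreement to about three decimal places with the Coefficients, Model Summary, ANOVA and Residuals Statistics tables is what the formulas lead us to expect; the remaining differences come from working with rounded inputs.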
Comparing observed performance to expected performance

The results of a regression analysis can be used to develop an expectation of a person’s performance. The person’s actual performance can then be compared to that expected performance.

Characterizing scores without any expectation

Consider the P5130 scores above. Below is a distribution of the scores with one specific student identified. The student’s score in P5130 was .92. The Z-score corresponding to a .92 was

Z = (Y - Mean of Ys) / SD of Ys = (.92 - .8751) / .08841 = +.51.

So the student was about 1/2 standard deviation above the mean in a pretty rigorous statistics course. Not bad, huh?

This is a characterization of the score without any expectations for that student based on prior knowledge.

Characterizing scores with an expectation

But let’s compute the student’s expected performance in P5130, based on the student’s P5100/5110 performance. That student’s P5100/5110 score was 1.02. (I know this from access to the data set.)

Using the above regression equation, based on the P5100/5110 performance, the student was expected to score:

Y-hat = .121 + .861*1.02 = 1.00 = Y-hat for this person.

The student’s actual Y value, .92, was below the predicted Y value. The difference was .92 - 1.00 = -0.08 = residual.

This difference can be divided by the standard error of estimate to get a Z-score. The standard error of estimate is .05964, from the Model Summary of the above regression. So the student’s Z-score for his/her performance relative to what was expected was . . .

Z of residual = Residual / Standard Error of Estimate = -.08 / .05964 = -1.34

The student did poorly relative to what he/she was expected to do based on P5100/5110 performance.

So the student’s performance was good, based on no information concerning expectation, but poor, based on a refined expectation.
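The two characterizations above can be written as a small Python sketch. The constants are the rounded values from the SPSS output; the function names are hypothetical, chosen only for this illustration. Because the sketch does not round the residual to -.08 before dividing, it gives a Z of about -1.33 instead of the -1.34 obtained by hand above.

```python
# Rounded values from the regression of p513g onto p511g.
A, B = 0.121, 0.861                # intercept and slope
Y_MEAN, Y_SD = 0.8751, 0.08841     # mean and SD of P5130 scores
SEE = 0.05964                      # standard error of estimate


def z_without_expectation(y):
    """Z-score of an observed Y relative to the mean of all Ys."""
    return (y - Y_MEAN) / Y_SD


def z_with_expectation(x, y):
    """Z-score of the residual: observed Y relative to its predicted Y."""
    y_hat = A + B * x
    residual = y - y_hat
    return residual / SEE


# The student in the example: P5100/5110 score 1.02, P5130 score .92.
z_raw = z_without_expectation(0.92)
z_resid = z_with_expectation(1.02, 0.92)

print(f"Z without expectation = {z_raw:+.2f}")
print(f"Z with expectation    = {z_resid:+.2f}")
```

A positive first value with a negative second value is the pattern described above: good in absolute terms, but below what the earlier course performance led us to expect.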
Here’s the scatterplot, with that student’s 5110/5130 point highlighted.

[Figure: scatterplot of P5130 scores against P5100/5110 scores with the student's point highlighted; labels mark the expected 5130 score and the actual 5130 score.]

What does “I did good” mean? Or what does “I did poorly” mean?

This kind of analysis can be applied to every point on a scatterplot. Those points near the best fitting straight line represent persons whose performance on the vertical variable was about what was expected from consideration of their performance on the horizontal variable.

But every point far from the best fitting straight line is weird; the farther it is from the BFSL, the weirder. Some represent good weirdness: performance better than expected. Others represent bad weirdness: performance worse than expected.

This refined evaluation of performance can be carried out whenever you’ve developed an expectation for performance based on a regression analysis.

Examples of Simple Regression Analyses

1) Regression of College GPA onto ACTComp

Biderman, Nguyen, & Cunningham, Common method variance in NEO-FFI and IPIP personality measurement. SIOP, 2009.

Regression
[DataSet1] G:\MdbR\1BiasStudy\BiasStudyLWEQ1_090216.sav

Descriptive Statistics
                                                                 Mean     Std. Deviation   N
UGPA1 GPA obtained from records during sem of participation     3.1014   .76895           115
ACTComp                                                          22.92    3.918            115

Correlations
Pearson Correlation                                              UGPA1    ACTComp
UGPA1 GPA obtained from records during sem of participation     1.000    .526
ACTComp                                                          .526     1.000

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .526a   .277       .270                .65684
a. Predictors: (Constant), ACTComp

ANOVAb
Model 1      Sum of Squares   df    Mean Square   F        Sig.
Regression   18.654           1     18.654        43.236   .000a
Residual     48.753           113   .431
Total        67.407           114
a. Predictors: (Constant), ACTComp
b. Dependent Variable: UGPA1 GPA obtained from records during sem of participation

Coefficientsa
              Unstandardized Coefficients   Standardized Coefficients
Model 1       B        Std. Error           Beta                        t       Sig.
(Constant)    .735     .365                                             2.014   .046
ACTComp       .103     .016                 .526                        6.575   .000
a. Dependent Variable: UGPA1 GPA obtained from records during sem of participation

Predicted Y = 0.735 + 0.103*ACTComp

So, for ACTComp = 20, Predicted Y = 0.735 + .103*20 = 2.80.
For ACTComp = 30, Predicted Y = 0.735 + .103*30 = 3.82.

2) Regression of Core First Year I/O Grades onto Formula Score

The dependent variable is a Z-score measure of performance in the 1st year of the I-O program.

The independent variable is a formula score, prformula, used to guide our selection of I-O students.

Mean prformula = 488.46; SD prformula = 54.60.
Mean core1st = 0.002; SD core1st = 0.77.

Regression

Variables Entered/Removeda
Model   Variables Entered   Variables Removed   Method
1       prformulab          .                   Enter
a. Dependent Variable: core1st
b. All requested variables entered.

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .543a   .295       .290                .65119
a. Predictors: (Constant), prformula

ANOVAa
Model 1      Sum of Squares   df    Mean Square   F        Sig.
Regression   25.528           1     25.528        60.202   .000b
Residual     61.063           144   .424
Total        86.591           145
a. Dependent Variable: core1st
b. Predictors: (Constant), prformula

Coefficientsa
              Unstandardized Coefficients   Standardized Coefficients
Model 1       B         Std. Error          Beta                        t        Sig.
(Constant)    -3.752    .487                                            -7.707   .000
prformula     .008      .001                .543                        7.759    .000
a. Dependent Variable: core1st
Annotation: to more decimal places, the prformula coefficient is .007685.

So Predicted Core1st = -3.752 + .007685*prformula.

For prformula = 400, Predicted Core1st = -3.752 + .007685*400 = -.68, about 1 SD below the mean.
For prformula = 500, Predicted Core1st = -3.752 + .007685*500 = +.09, about average.
For prformula = 600, Predicted Core1st = -3.752 + .007685*600 = +.86, about 1 SD above the mean.
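The predicted values in the two examples above are just the prediction equations evaluated at chosen X values. A minimal Python sketch of that arithmetic follows, using the coefficients as printed in the SPSS tables; `predict` is a hypothetical helper name, not something from SPSS.

```python
def predict(a, b, x):
    """Predicted Y from the simple regression equation Predicted Y = a + b*X."""
    return a + b * x


# Example 1: college GPA regressed onto ACT composite score.
for act in (20, 30):
    print(f"ACTComp = {act}: predicted GPA = {predict(0.735, 0.103, act):.3f}")

# Example 2: first-year core performance (a Z-score measure) regressed onto formula score.
CORE_MEAN, CORE_SD = 0.002, 0.77
for score in (400, 500, 600):
    y_hat = predict(-3.752, 0.007685, score)
    sd_units = (y_hat - CORE_MEAN) / CORE_SD   # distance from the mean in SD units
    print(f"prformula = {score}: predicted core1st = {y_hat:+.3f} ({sd_units:+.2f} SD)")
```

Expressing each prediction in SD units of core1st is one way to make the "about 1 SD below the mean" style of interpretation explicit.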