PSYCH 706: STATS II – Class #6

AGENDA
• Assignment #3 due 4/5
• Correlation (review)
• Simple linear regression
• Review Exam #1 tests
• SPSS tutorial: simple linear regression

CORRELATION
• Pearson's correlation: standardized measure of covariance
  • Bivariate
  • Partial
• Assumptions: linearity and normality (outliers are a big deal here)
• When the assumptions for Pearson's are not met, use another bivariate measure:
  • Spearman's rho – rank-orders the data
  • Kendall's tau – use for small sample sizes or lots of tied ranks
• Testing the significance of correlations:
  • Is one correlation different from zero?
  • Are correlations significantly different between two samples? http://www.quantpsy.org/corrtest/corrtest.htm
  • Are correlations significantly different within one sample? http://quantpsy.org/corrtest/corrtest2.htm

CORRELATION EXAMPLE
• Class #4 on Blackboard: Album Sales.spv
• Do the following predictors share variance with the following outcome?
  • X1 = Advertising budget
  • X2 = Number of plays on the radio
  • X3 = Rated attractiveness of band members (0 = hideous potato heads, to 10 = gorgeous sex objects)
  • Y = Number of albums sold
• Right now we are not going to worry about assumptions (linearity, etc.)
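Outside of SPSS, the three bivariate coefficients above can be sketched in a few lines. This is a minimal Python/SciPy illustration on invented numbers (a hypothetical stand-in for Radio Plays and Album Sales, not the actual Album Sales data set):

```python
# Illustrative sketch: the three bivariate correlations from the slide,
# computed with SciPy on made-up data.
from scipy import stats

# Hypothetical stand-ins for Radio Plays (X2) and Album Sales (Y)
radio_plays = [10, 40, 25, 55, 30, 65, 20, 50]
album_sales = [110, 220, 150, 280, 190, 310, 140, 260]

r, p_r = stats.pearsonr(radio_plays, album_sales)        # parametric
rho, p_rho = stats.spearmanr(radio_plays, album_sales)   # rank-order
tau, p_tau = stats.kendalltau(radio_plays, album_sales)  # small n / tied ranks

print(f"Pearson r = {r:.3f} (p = {p_r:.4f})")
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```

Because these invented data are perfectly monotone but not perfectly linear, the rank-based coefficients come out higher than Pearson's r, which is exactly why the rank-based measures are more robust to non-normality and outliers.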
SPSS BIVARIATE CORRELATIONS
• Analyze → Correlate → Bivariate
• Move the variables you want to correlate into the Variables box
• Click two-tailed and flag significant correlations
• Click Pearson and/or Spearman's and/or Kendall's tau

SPSS PARTIAL CORRELATIONS
• Analyze → Correlate → Partial
• Move the variables you want to correlate (Album Sales and Radio Plays) into the Variables box
• Put Band Attractiveness in the Controlling For box
• Click two-tailed and display actual significance level

SPSS PARTIAL CORRELATIONS
• The correlation between Album Sales and Radio Plays decreased from .599 (bivariate correlation) to .580 when the variance shared with Band Attractiveness was removed, and the correlation is still significant
• Conclusion: Radio Plays shares significant unique variance with Album Sales that is not shared with Band Attractiveness

QUESTIONS ABOUT CORRELATION?

HOW IS REGRESSION RELATED TO CORRELATION?
• Correlation indicates the strength of the relationship between two variables, X and Y
• In regression analyses, you can easily compare the degree to which multiple X variables predict Y within the same statistical model
• With only one X variable, the data in a scatterplot can be quantified either way: as a correlation (standardized) or as a regression equation (unstandardized)

SIMPLE REGRESSION
• Correlation is standardized, but regression is not
• As a result, we include an intercept in the model
• Equation for a straight line ("linear model"):
  Outcome = Intercept + Predictor Variable(s) + Error
  Y = b0 + b1X + E
  (b0 and b1 are the regression coefficients; b1 is the slope)

EQUATION FOR A STRAIGHT LINE
• b0
  • Intercept (expected mean value of Y when X = 0)
  • Point at which the regression line crosses the Y-axis (ordinate)
• b1
  • Regression coefficient for the predictor
  • Gradient (slope) of the regression line
  • Direction/strength of the relationship
• Yi = b0 + b1Xi + Ei

INTERCEPTS AND SLOPES (AKA GRADIENTS)

ASSUMPTIONS OF THE LINEAR MODEL
• Linearity and additivity
• Errors (also called residuals)
  should be independent of each other AND normally distributed
• Homoscedasticity
• Predictors should be uncorrelated with "external variables"
• All predictor variables must be quantitative/continuous or categorical
• Outcome variable must be quantitative/continuous
• No multicollinearity (no perfect correlation between predictor variables if there's more than one)
• BIGGEST CONCERN: Outliers!!!!

METHOD OF LEAST SQUARES

HOW GOOD IS OUR REGRESSION MODEL?
• The regression line is only a model based on the data
• This model might not reflect reality
• We need some way of testing how well the model fits the observed data
• Enter SUMS OF SQUARES!

SUMS OF SQUARES
• SS total = differences between each data point and the mean of Y
• SS model = differences between the mean of Y and the regression line
• SS residual = differences between each data point and the regression line
• R² = SS model / SS total
• F = MS model / MS residual

THIS LOOKS A LOT LIKE OUR ONE-WAY ANOVA CALCULATIONS!
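The sums-of-squares decomposition above can be verified numerically. A minimal NumPy sketch on invented data (not the Album Sales file): fit the least-squares line, then confirm that SS total = SS model + SS residual and build R² and F from the pieces.

```python
# Sums-of-squares decomposition for a simple regression, on made-up data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

b1, b0 = np.polyfit(x, y, 1)           # least-squares slope and intercept
y_hat = b0 + b1 * x                    # predicted values (points on the line)

ss_total = np.sum((y - y.mean()) ** 2)     # each data point vs. mean of Y
ss_model = np.sum((y_hat - y.mean()) ** 2) # regression line vs. mean of Y
ss_resid = np.sum((y - y_hat) ** 2)        # each data point vs. the line

n, k = len(y), 1                       # k = number of predictors
r_squared = ss_model / ss_total
f_ratio = (ss_model / k) / (ss_resid / (n - k - 1))

print(f"R^2 = {r_squared:.3f}, F(1,{n - k - 1}) = {f_ratio:.2f}")
```

The identity SS total = SS model + SS residual is exactly the same partition used in the one-way ANOVA calculations referenced above.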
ONE-WAY ANOVA REVIEW
• Three group means = pink, green, and blue lines
• Grand mean = black line (overall mean of all scores regardless of group)
• Individual scores = pink, green, and blue points
• SS Total = difference between each score and the grand mean
• SS Model = difference between each group mean and the grand mean
• SS Residual = difference between each score and its group mean

REGRESSION
• ANOVA (F test) is used to test the OVERALL regression model:
  • Do all predictor variables together (X1, X2, X3) share significant variance with the outcome variable (Y)?
• T-tests are used to test SIMPLE effects:
  • Is the slope of an individual predictor (X1, X2, or X3) significantly different from zero?
• This is similar to ANOVA testing whether there is an OVERALL difference between groups, with post-hoc comparisons testing SIMPLE effects between specific groups

WHAT IS THE DIFFERENCE BETWEEN ONE-WAY ANOVA AND SIMPLE REGRESSION?
• They are exactly the same calculations, just presented in a different way
• In both you have one dependent variable, Y
• In ANOVA, your independent variable, X, is required to be categorical
• In simple regression, your independent variable, X, can be categorical or continuous
• Would it be helpful to see an example of how they are the same next week at the start of class?

REGRESSION EXAMPLE
• Class #4 on Blackboard: Album Sales.spv
• How do the following predictors separately and together influence the following outcome?
  • X1 = Advertising budget
  • X2 = Number of plays on the radio
  • X3 = Rated attractiveness of band members (0 = hideous potato heads, to 10 = gorgeous sex objects)
  • Y = Number of albums sold

REGRESSION ASSUMPTIONS, PART 1
• Linearity and normality, outliers
  • Skewness/kurtosis z-score calculations
  • Histograms
  • Boxplots
  • Transformations if needed
  • Scatterplots between all variables
• Multicollinearity
  • Bivariate correlations between predictors should be less than perfect (r < .9)
• Non-zero variance
  • Predictors should all have some variance in them (not all the same score)
• Type of variables allowed
  • Predictors must be scale/continuous or categorical
  • Outcome must be scale/continuous
• Homoscedasticity
  • Variance around the regression line should be about the same for all values of the predictor variable (look at scatterplots)

REGRESSION ASSUMPTIONS, PART 2
• Errors (also called residuals) should be independent of each other AND normally distributed
• Predictors should be uncorrelated with "external variables" = DIFFICULT TO CHECK!!!
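Several of the Part 1 screening steps can be scripted. The sketch below uses simulated stand-in data (not the Album Sales file) and the common large-sample approximations for the standard errors of skewness and kurtosis, sqrt(6/n) and sqrt(24/n); SPSS reports exact small-sample versions, so the z-scores will differ slightly:

```python
# Hedged sketch of assumption screening: skewness/kurtosis z-scores,
# a multicollinearity check, and a non-zero-variance check.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
adverts = rng.exponential(scale=500, size=200)  # right-skewed, like X1
radio = rng.normal(loc=30, scale=10, size=200)  # roughly normal, like X2

def skew_kurt_z(x):
    """z = statistic / SE; |z| > 1.96 (p < .05) flags a problem."""
    n = len(x)
    z_skew = stats.skew(x) / np.sqrt(6 / n)      # approximate SE of skewness
    z_kurt = stats.kurtosis(x) / np.sqrt(24 / n) # approximate SE of kurtosis
    return z_skew, z_kurt

z_skew_adv, _ = skew_kurt_z(adverts)
z_skew_radio, _ = skew_kurt_z(radio)

# Multicollinearity: bivariate correlations between predictors should be < .9
r_predictors = np.corrcoef(adverts, radio)[0, 1]

# Non-zero variance: every predictor must actually vary
assert adverts.var() > 0 and radio.var() > 0

print(f"skew z (adverts) = {z_skew_adv:.2f}, skew z (radio) = {z_skew_radio:.2f}")
print(f"r between predictors = {r_predictors:.3f}")
```

The skewed exponential variable blows well past the ±1.96 cutoff while the normal one does not, mirroring the X1-versus-X2 pattern described in the slides that follow.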
CHECKING ASSUMPTIONS
• You could try to figure out assumptions while you're running the regression
• I like to check assumptions as much as possible BEFORE running the regression so that I can more easily focus on what the actual results are telling me
• You can also select extra options in the regression analysis to get a lot of info on assumptions

THIS IS THE PLAN
• We are going to check assumptions for all variables in our Album Sales SPSS file as if we were going to run a multiple regression with three predictors
• However, we're going to save that multiple regression for next week
• Today we'll run a simple linear regression first and interpret the output to get you used to looking at the results

CREATE HISTOGRAMS
[Histograms of X1, X2, X3, and Y shown on slides]

DIVIDE SKEWNESS & KURTOSIS BY THEIR STANDARD ERRORS
• Cutoff: anything beyond z = +/-1.96 (p < .05) is problematic
[Slide table of skewness and kurtosis z-scores for X1, Y, X2, and X3 – values 4.96, 0.26, 0.35, -7.48, 0.69, -1.99, -0.10, 10.95; X1 and X3 exceed the cutoff]

NEXT STEPS
• X2 (No. of Plays on Radio) and Y (Album Sales) look normally distributed
• Problems with normality for X1 (Adverts) and X3 (Band Attractiveness)
• Let's look at boxplots to view outliers/extreme scores
• Let's transform the data and see if that fixes the skewness/outlier problem

BOX PLOTS
[Boxplots of X1 and X3 shown on slides]

TRANSFORMED ADVERTS SO NO LONGER SKEWED
[Histograms of X1 before and after transformation]

BY TRANSFORMING ADVERTS, THE OUTLIERS ARE NO LONGER OUTLIERS!
[Boxplots of transformed X1]

TRANSFORMED BAND ATTRACTIVENESS IS STILL SKEWED W/ OUTLIERS
[Histograms/boxplots of transformed X3]

LET'S TRANSFORM ATTRACTIVENESS SCORES INTO Z-SCORES
• Analyze → Descriptive Statistics → Descriptives
• Put the original Attractiveness variable in the box
• Check Save Standardized Values as Variables
• New variable created: Zscore: Attractiveness of Band
• Plot a histogram of the z-scores
• 4 outliers > 3 SD!!!

OUTLIERS: A COUPLE OF OPTIONS
• You have 200 data points, which is a lot – you could calculate power with the 4 outliers removed and see how much it might affect your ability to find an effect
• You could remove them from the analysis entirely
  • Documenting subject #, etc.
    and the reason for removal
  • Save the data file with a new name (AlbumSales_Minus4outliersOnAttract.sav)
• You could replace the 4 outliers with the next highest score on Attract, which is a '3', or replace them with the mean score (both reduce variability, though)
  • Document this change
  • Save the file with a new name (AlbumSales_4outliersOnAttractmodified.sav)

OUTLIERS: ANOTHER OPTION
• We could leave outliers in the data set and run a bunch of extra tests in our regression to see if any of these data points cause undue influence on our overall model
• We'll get to those tests during next class
• Essentially you could run the regression with and without the outliers included in the model and see what happens
• Data → Select Cases → If condition is satisfied: 3 > ZAttract > -3
  • This means include all data points whose z-score on Attractiveness is within 3 SD

NEXT STEPS
• Let's say we went with deleting the 4 outliers
• Now let's look at other potential outliers using scatterplots
• This will also show us the relationships between the variables (positive versus negative)
• This will also let us check the homoscedasticity assumption: the variance around the regression line should be about the same for all values of the predictor variable

SCATTERPLOTS: HOMOSCEDASTICITY CHECK
[Scatterplots of Y against X1, X2, and X3 shown on slides]

MULTICOLLINEARITY CHECK: BIVARIATE CORRELATIONS
[Pearson's (parametric) and Kendall's tau (non-parametric) correlation matrices for X1, X2, X3, and Y shown on slides, also checking each predictor's relationship to the outcome variable]

NON-ZERO VARIANCE ASSUMPTION
• Analyze → Descriptive Statistics → Frequencies
• Move the variables into the box, click Statistics, select Variance and Range
[Variance and range output for X1, X2, X3, and Y]

PREDICTOR VARIABLES MUST BE QUANTITATIVE/CONTINUOUS OR CATEGORICAL
Look at Label in SPSS
Variable View: are X1, X2, and X3 considered Scale or Nominal variables?

OUTCOME VARIABLE MUST BE QUANTITATIVE/CONTINUOUS
• Look at Label in SPSS Variable View: is Y considered a Scale variable?

LET'S REVIEW ASSUMPTIONS
• Linearity and normality, outliers taken care of ~ X3 is kinda sketchy (skew is gone, but still problems with kurtosis)
• Predictor variables continuous or categorical ~ X3 is kinda sketchy
• Outcome variable continuous = YES!
• Non-zero variances ~ X3 is kinda sketchy
• No multicollinearity between predictors = YES!
• Homoscedasticity ~ X3 is kinda sketchy
• Residuals/errors normally distributed = WE WILL SEE!
• So far, X1, X2, and Y look pretty great in terms of assumptions

SIMPLE REGRESSION IN SPSS: ONE PREDICTOR VARIABLE
• H0: Advertising (X1) does not share variance with Album Sales (Y)
• H1: Advertising (X1) does share variance with Album Sales (Y)
• Y (Album Sales) = B0 (intercept) + B1X1 (Advertising slope) + E (error)

SIMPLE REGRESSION IN SPSS
• Analyze → Regression → Linear
• Independent variable: Sqrt Adverts (X1)
• Dependent variable: Album Sales (Y)
• If you click on the Statistics button, you can get:
  • Residuals: Durbin-Watson – tests for correlations among errors/residuals; values less than 1 or greater than 3 are problematic [an assumption of regression is that your errors are uncorrelated]
  • Residuals: Casewise Diagnostics – shows you the >3 SD outliers in your data

SPSS: SIMPLE REGRESSION OUTPUT
• Tells you what your independent variable (predictor) and your dependent variable (outcome) were

SPSS: SIMPLE REGRESSION OUTPUT (MODEL SUMMARY)
• R: standardized covariation between X1 and Y (r)
• R²: effect size – how much variance in Y is accounted for by our predictor X1 ("Adverts accounts for 30% of the variance in album sales")
• Adjusted R²: effect size – how much variance would be accounted for if the model had been derived from the population from which this sample was taken
• Durbin-Watson: amount of correlation among errors/residuals (less than 1 or greater than 3 = problem)

SPSS: SIMPLE REGRESSION
OUTPUT
• ANOVA table: overall test of all predictors in the model
• Reject H0: less than a .1% chance (p < .001) that an F ratio this large would happen if the null hypothesis were true
• Coefficients table: tells us about individual contributions to the model
  • Intercept: B0 = 96.459
  • Slope: B1 = 4.294

SPSS: SIMPLE REGRESSION OUTPUT
• The sign of the slope tells us the direction of the relationship (positive or negative correlation)
• When no money is spent on advertising, 96.459 thousand records will still be sold
• When 1 unit is spent on advertising, an extra 4.294 thousands of records will be sold
• Y = B0 + B1X1 + E
• Issue when transforming data: this slope is in square-root units!

SPSS: SIMPLE REGRESSION OUTPUT
• Standardized coefficient (beta) = Pearson's r
• T-test: is the beta value significantly different from zero?
• Advertising budget makes a significant contribution to album sales

SPSS: SIMPLE REGRESSION OUTPUT
• Casewise diagnostics: potential problem cases with residual z-scores > 3 SD

SPSS REGRESSION: PLOTS
• Gives you a histogram of the residuals (errors) to see if they are normally distributed (this is one of the assumptions you need to meet when interpreting linear regression results)
• Residuals for our model look normally distributed – we can check off that assumption as valid!
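The output being interpreted above can be mimicked outside SPSS. A hedged Python/SciPy sketch on invented numbers (the real analysis uses the square-root-transformed Adverts variable on 200 cases, so these values will not reproduce B0 = 96.459 or B1 = 4.294):

```python
# Sketch of a simple linear regression like the one in the slides,
# fit with scipy.stats.linregress on made-up data.
from scipy import stats

sqrt_adverts = [5.0, 9.0, 12.0, 15.0, 18.0, 22.0, 25.0, 30.0]
album_sales = [120.0, 140.0, 150.0, 170.0, 160.0, 210.0, 200.0, 230.0]

fit = stats.linregress(sqrt_adverts, album_sales)

# fit.intercept ~ B0: expected sales when the predictor is 0
# fit.slope     ~ B1: change in sales per 1-unit change in sqrt(adverts)
# fit.pvalue    : t-test of whether the slope differs from zero
print(f"Y = {fit.intercept:.2f} + {fit.slope:.2f}*X, "
      f"r = {fit.rvalue:.2f}, p = {fit.pvalue:.4f}")
```

Note that `fit.rvalue` here equals Pearson's r between the two variables, which is why (for one predictor) the standardized beta in the SPSS coefficients table matches the bivariate correlation.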
SPSS REGRESSION: PLOTS
• Gives you a scatterplot of the standardized predicted values of Y against the standardized residuals (errors) of the model (in this case with predictor X1)
• Residuals (errors) should not be correlated with the predicted values in the model
• They should be randomly distributed

PROBLEM
• Since we transformed our Advertising variable, it is now in square-root units
• To more easily interpret the results, we might want to standardize all variables (z-score them) before including them in the regression so that they are all on the same scale

STANDARDIZING ALL VARIABLES
• Analyze → Descriptive Statistics → Descriptives → Save standardized values as variables

REVISED SPSS OUTPUT
• You get the same results for the Model Summary and ANOVA tables
• BUT, because we transformed all of our variables into standardized z-scores, your unstandardized coefficients change to standardized ones, where the constant is zero and beta = the correlation coefficient (r)

INTERPRETATION WITH STANDARDIZED COEFFICIENTS
• Y = B0 + B1X1
• Album Sales = 0 + .55(Advertising)
• As advertising $$ increases by 1 standard deviation, album sales increase by .55 standard deviations

WRITING METHODS/RESULTS IN A PAPER
• A simple linear regression was computed, with square-root-transformed Advertisements as the independent variable and Album Sales (in thousands of records) as the dependent variable.
• Both variables were standardized before being entered into the model.
• Results indicated that our overall regression model was significant, F(1, 198) = 86.45, p < .001.
• Findings showed that as advertising increased by 1 standard deviation, album sales increased by .55 standard deviations, t = 9.18, p < .001 (r = .55, a large effect size).

QUESTIONS ON SIMPLE REGRESSION?
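The standardization point above (constant becomes zero, beta equals r) can be demonstrated numerically. A minimal NumPy sketch on illustrative data:

```python
# After z-scoring both variables, the regression intercept is 0 and the
# slope equals Pearson's r. Illustrative data only.
import numpy as np

x = np.array([5.0, 9.0, 12.0, 15.0, 18.0, 22.0, 25.0, 30.0])
y = np.array([120.0, 140.0, 150.0, 170.0, 160.0, 210.0, 200.0, 230.0])

zx = (x - x.mean()) / x.std(ddof=1)   # z-scores (SPSS uses the n-1 SD)
zy = (y - y.mean()) / y.std(ddof=1)

beta, intercept = np.polyfit(zx, zy, 1)   # least-squares fit on z-scores
r = np.corrcoef(x, y)[0, 1]               # Pearson's r on the raw scores

# The standardized slope matches the correlation coefficient, and the
# intercept is zero (up to floating-point error).
print(f"beta = {beta:.3f}, r = {r:.3f}, intercept = {intercept:.2e}")
```

This is why a one-predictor regression on standardized variables tells you nothing a correlation would not: the payoff of standardized betas comes next week, when multiple predictors must be compared on a common scale.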