PSYCH 706: STATS II
Class #6
AGENDA
• Assignment #3 due 4/5
• Correlation (Review)
• Simple Linear Regression
• Review Exam #1 tests
• SPSS Tutorial:
Simple Linear Regression
CORRELATION
• Pearson’s correlation: Standardized measure of covariance
• Bivariate
• Partial
• Assumptions: Linearity and Normality (outliers are a big deal here)
• When assumptions for Pearson’s are not met, use another bivariate correlation:
• Spearman’s rho – rank-orders data
• Kendall’s tau – use for small sample sizes, lots of tied ranks
• Testing significance of correlations
• Is one correlation different from zero?
• Are correlations significantly different between two samples?
http://www.quantpsy.org/corrtest/corrtest.htm
• Are correlations significantly different within one sample?
http://quantpsy.org/corrtest/corrtest2.htm
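To make the three coefficients concrete, here is a minimal Python sketch (not part of the course materials); scipy’s pearsonr/spearmanr/kendalltau each return the coefficient and a two-tailed p-value. The toy arrays are invented for illustration.

```python
# Minimal sketch: the three bivariate correlations from the slide, in scipy.
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 4.5, 7.0, 9.0, 10.0])
y = np.array([1.0, 3.0, 2.5, 6.0, 8.5, 11.0])

r, p_r = stats.pearsonr(x, y)        # parametric; assumes linearity/normality
rho, p_rho = stats.spearmanr(x, y)   # rank-orders the data first
tau, p_tau = stats.kendalltau(x, y)  # better for small n / many tied ranks

print(f"Pearson r={r:.3f} (p={p_r:.3f}), "
      f"Spearman rho={rho:.3f}, Kendall tau={tau:.3f}")
```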
CORRELATION EXAMPLE
• Class #4 on Blackboard: Album Sales.spv
• Do the following predictors share variance with
the following outcome?
• X1 = Advertising budget
• X2 = Number of plays on the radio
• X3 = Rated attractiveness of band members (0 =
hideous potato heads, to 10 = gorgeous sex
objects)
• Y = Number of albums sold
• Right now we are not going to worry about
assumptions (linearity, etc.)
SPSS BIVARIATE CORRELATIONS
AnalyzeCorrelateBivariate
• Move variables you want to
correlate into Variables box
• Click two-tailed and flag
significant correlations
• Click Pearson and/or Spearman’s
and/or Kendall’s tau
SPSS BIVARIATE CORRELATIONS
[SPSS output: bivariate correlation tables]
SPSS PARTIAL CORRELATIONS
AnalyzeCorrelatePartial
• Move variables you want to
correlate (Album Sales and Radio
Plays) into Variables box
• Put Band Attractiveness in the
Controlling For box
• Click two-tailed and display
actual significance level
SPSS PARTIAL CORRELATIONS
Correlation between Album Sales and Radio Plays decreased
from .599 (bivariate correlation) to .580 when removing shared
variance from Band Attractiveness, and the correlation is still
significant
Conclusion: Radio Plays shares significant unique variance with
Album Sales not shared with Band Attractiveness
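For intuition about what the partial correlation is doing, here is a hedged Python sketch: regress the control variable out of both variables of interest, then correlate the residuals. The variable names (sales, airplay, attract) and the simulated data are mine, not taken from the Album Sales file.

```python
# Sketch of a partial correlation "by hand": correlate the residuals that
# remain after regressing the control variable out of both variables.
import numpy as np
from scipy import stats

def residualize(y, control):
    """Residuals of y after a simple regression on control."""
    b1 = np.cov(control, y, ddof=1)[0, 1] / np.var(control, ddof=1)
    b0 = y.mean() - b1 * control.mean()
    return y - (b0 + b1 * control)

rng = np.random.default_rng(0)
attract = rng.normal(5, 2, 200)                    # control variable
airplay = 2 * attract + rng.normal(0, 5, 200)      # predictor
sales = 3 * airplay + attract + rng.normal(0, 10, 200)  # outcome

r_partial, p = stats.pearsonr(residualize(sales, attract),
                              residualize(airplay, attract))
print(f"partial r(sales, airplay | attract) = {r_partial:.3f}, p = {p:.3f}")
```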
QUESTIONS ABOUT
CORRELATION?
HOW IS REGRESSION RELATED TO
CORRELATION?
• Correlation indicates the strength of the relationship between two variables, X and Y.
• In regression analyses, you can easily compare the degree to which multiple X
variables predict Y within the same statistical model
In this graph, since there is only one X variable, the data in the scatterplot can be
quantified either way: as a correlation (standardized) or as a regression
equation (unstandardized)
SIMPLE REGRESSION
• Correlation is standardized, but regression is not
• As a result, we include an intercept in the model
• Equation for a straight line (“linear model”)
• Outcome = Intercept + Predictor Variable(s) + Error
• Y = b0 + b1X + E
(b0 and b1 are the regression coefficients; b1 is the slope)
EQUATION FOR A STRAIGHT LINE
• b0
• Intercept (expected mean value of Y when X = 0)
• Point at which the regression line crosses the Y-axis (ordinate)
• b1
• Regression coefficient for the predictor
• Gradient (slope) of the regression line
• Direction/strength of the relationship
Yi = b0 + b1Xi + εi
INTERCEPTS AND
SLOPES (AKA GRADIENTS)
ASSUMPTIONS OF THE LINEAR
MODEL
• Linearity and Additivity
• Errors (also called residuals) should be independent of each other AND
normally distributed
• Homoscedasticity
• Predictors should be uncorrelated with “external variables”
• All predictor variables must be quantitative/continuous or categorical
• Outcome variable must be quantitative/continuous
• No multicollinearity (no perfect correlation between predictor variables if
there’s more than one)
• BIGGEST CONCERN: Outliers!!!!
METHOD OF LEAST SQUARES
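The original slide here is a figure; as a rough substitute, this Python sketch shows the closed-form least-squares solution for one predictor: the slope is cov(X, Y) / var(X), and the intercept forces the line through the point (mean of X, mean of Y). The toy data are invented.

```python
# Sketch: ordinary least squares for one predictor, in closed form.
import numpy as np

def least_squares(x, y):
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope
    b0 = y.mean() - b1 * x.mean()                        # intercept
    return b0, b1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b0, b1 = least_squares(x, y)
print(f"Y = {b0:.2f} + {b1:.2f}X")  # the line minimizing squared residuals
```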
HOW GOOD IS OUR REGRESSION
MODEL?
• The regression line is only a model based
on the data.
• This model might not reflect reality.
• We need some way of testing how well
the model fits the observed data.
• Enter SUMS OF SQUARES!
SUMS OF SQUARES
SS total = differences between each data point and the mean of Y
SS model = differences between the mean of Y and the regression line (the predicted values)
SS residual = differences between each data point and the regression line
R² = SS model / SS total
F = MS model / MS residual
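A short, self-contained Python sketch of the decomposition above (toy data, one predictor); note that SS total = SS model + SS residual.

```python
# Sketch: the sums-of-squares decomposition for a simple regression.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, 1)   # least-squares slope and intercept
y_hat = b0 + b1 * x            # predicted values (the regression line)

ss_total = np.sum((y - y.mean()) ** 2)      # data vs. mean of Y
ss_model = np.sum((y_hat - y.mean()) ** 2)  # line vs. mean of Y
ss_resid = np.sum((y - y_hat) ** 2)         # data vs. line

n, k = len(y), 1                            # sample size, no. of predictors
r_squared = ss_model / ss_total
f_stat = (ss_model / k) / (ss_resid / (n - k - 1))  # MS model / MS residual
print(f"R^2 = {r_squared:.3f}, F(1, {n - k - 1}) = {f_stat:.1f}")
```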
THIS LOOKS A LOT LIKE OUR ONE-WAY ANOVA CALCULATIONS!
One-Way ANOVA
Review
Three Group Means = pink, green, and blue lines
Grand Mean = black line (overall mean of all scores, regardless of group)
Individual scores = pink, green, and blue points
SS Total: difference between each score and the grand mean
SS Model: difference between each group mean and the grand mean
SS Residual: difference between each score and its group mean
REGRESSION
ANOVA (F test) is used to test the OVERALL regression model:
• Whether all predictor variables together (X1, X2, X3) share significant variance with
the outcome variable (Y)
T-tests are used to test SIMPLE effects:
• Whether individual predictors (the slopes of X1, X2, or X3) are significantly different
from zero
This is similar to ANOVA testing whether there is an OVERALL difference
between groups and post-hoc comparisons testing SIMPLE effects between
specific groups
WHAT IS THE DIFFERENCE BETWEEN
ONE-WAY ANOVA
AND
SIMPLE REGRESSION?
• They are exactly the same calculations but presented in a different way
• In both you have one dependent variable, Y
• In ANOVA, your independent variable, X, is required to be categorical
• In simple regression, your independent variable, X, can be categorical or continuous
• Would it be helpful to see an example of how they are the same next week at the start of class?
REGRESSION EXAMPLE
• Class #4 on Blackboard: Album Sales.spv
• How do the following predictors separately and
together influence the following outcome?
• X1 = Advertising budget
• X2 = Number of plays on the radio
• X3 = Rated attractiveness of band members (0 =
hideous potato heads, to 10 = gorgeous sex
objects)
• Y = Number of albums sold
REGRESSION ASSUMPTIONS, PART 1
• Linearity and Normality, Outliers
• Skewness/Kurtosis z-score calculations
• Histograms
• Boxplots
• Transformations if needed
• Scatterplots between all variables
• Multicollinearity
• Bivariate correlations between predictors should be less than perfect (r < .9)
• Non-Zero Variance
• Predictors should all have some variance in them (not all the same score)
• Type of Variables Allowed
• Predictors must be scale/continuous or categorical
• Outcome must be scale/continuous
• Homoscedasticity
• Variance around the regression line should be about the same for all values of the predictor
variable (look at scatterplots)
REGRESSION ASSUMPTIONS, PART 2
• Errors (also called residuals)
should be independent of
each other AND normally
distributed
• Predictors should be
uncorrelated with “external
variables” = DIFFICULT TO
CHECK!!!
CHECKING ASSUMPTIONS
• You could try to figure out assumptions
while you’re running the regression
• I like to check assumptions as much as
possible BEFORE running the regression so
that I can more easily focus on what the
actual results are telling me
• You can also select extra options in the
regression analysis to get a lot of info on
assumptions
THIS IS THE PLAN
• We are going to check assumptions for
all variables in our Album Sales SPSS file
as if we were going to run a multiple
regression with three predictors
• However, we’re going to save that
multiple regression for next week
• Today we’ll run a simple linear regression
first and interpret the output to get you
used to looking at the results
CREATE HISTOGRAMS
[Histograms of X1 (Adverts), X2 (Radio Plays), X3 (Band Attractiveness), and Y (Album Sales)]
DIVIDE SKEWNESS & KURTOSIS BY THEIR STANDARD ERRORS
CUTOFF: ANYTHING BEYOND Z = +/-1.96 (P < .05) IS PROBLEMATIC

Variable   z (Skewness)   z (Kurtosis)
X1              4.96           0.69
Y               0.26          -1.99
X2              0.35          -0.10
X3             -7.48          10.95
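A hedged Python sketch of this check; it uses the large-sample approximations SE(skew) ≈ √(6/n) and SE(kurtosis) ≈ √(24/n), so the z-values will differ slightly from SPSS’s exact small-sample standard errors. The data are simulated.

```python
# Sketch: skewness/kurtosis z-scores via approximate standard errors.
import numpy as np
from scipy import stats

def normality_z(x):
    n = len(x)
    z_skew = stats.skew(x) / np.sqrt(6.0 / n)
    z_kurt = stats.kurtosis(x) / np.sqrt(24.0 / n)  # excess kurtosis
    return z_skew, z_kurt

rng = np.random.default_rng(1)
adverts = rng.exponential(2.0, 200)  # deliberately skewed toy variable
z_s, z_k = normality_z(adverts)
print(f"z(skew) = {z_s:.2f}, z(kurt) = {z_k:.2f}; |z| > 1.96 is problematic")
```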
NEXT STEPS
• X2 (No. of Plays on Radio) and Y (Album Sales) look normally distributed
• Problems with Normality for X1 (Adverts) and X3 (Band Attractiveness)
• Let’s look at boxplots to view outliers/extreme scores
• Let’s transform the data and see if that fixes the skewness/outlier problem
BOX PLOTS
[Boxplots of X1 (Adverts) and X3 (Band Attractiveness), showing outliers]
TRANSFORMED ADVERTS SO NO LONGER SKEWED
[Histograms of X1 before and after the square-root transformation]
BY TRANSFORMING ADVERTS, THE OUTLIERS ARE NO LONGER OUTLIERS!
[Boxplots of X1 before and after the transformation]
TRANSFORMED BAND ATTRACTIVENESS IS STILL SKEWED W/ OUTLIERS
[Histogram and boxplot of transformed X3]
LET’S TRANSFORM ATTRACTIVENESS
SCORES INTO Z-SCORES
• AnalyzeDescriptive
StatisticsDescriptives
• Put original Attractiveness
variable in box
• Check Save Standardized
Values as Variables
• New variable created:
Zscore: Attractiveness of
Band
• Plot Histogram of z-scores
• 4 Outliers > 3SD!!!
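A Python sketch of the same check (toy data; the variable name attract is mine): standardize the scores and flag anything beyond 3 SD.

```python
# Sketch: z-score a variable and flag |z| > 3 outliers.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# 196 ordinary ratings plus 4 planted extreme scores of 1
attract = np.concatenate([rng.normal(7.0, 1.0, 196), [1, 1, 1, 1]])

z = stats.zscore(attract, ddof=1)      # like SPSS's saved ZAttract variable
outliers = np.where(np.abs(z) > 3)[0]  # row indices of extreme scores
print(f"{len(outliers)} cases beyond 3 SD: rows {outliers}")
```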
OUTLIERS: A COUPLE
OF OPTIONS
• You have 200 data points which is a lot – you could
calculate power with the 4 outliers removed and see
how much it might affect your ability to find an
effect…
• You could remove them from analysis entirely
• Documenting subject #, etc. and reason for removal
• Save data file with new name
(AlbumSales_Minus4outliersOnAttract.sav)
• You could replace the 4 outliers with the next highest
score on Attract, which is a ‘3’ or you could replace
with the mean score (both reduce variability though)
• Document this change
• Saving file with new name
(AlbumSales_4outliersOnAttractmodified.sav)
OUTLIERS: ANOTHER OPTION
• We could leave outliers in the data set and run a bunch of extra tests in our
regression to see if any of these data points cause undue influence on our overall
model
• We’ll get to those tests during next class
• Essentially you could run the regression with and without the outliers included in the
model and see what happens
• DataSelect CasesIf condition is satisfied: 3 > ZAttract > -3
• This means include all data points if the z-score value of Attractiveness is within 3 SD
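In Python, reusing the toy attract array and z-scores from the sketch above, the Select Cases filter amounts to a boolean mask:

```python
# Sketch: keep only rows whose z-score is within 3 SD (like Select Cases).
keep = (z > -3) & (z < 3)       # boolean filter, like SPSS's "If condition"
attract_trimmed = attract[keep]
print(f"kept {keep.sum()} of {len(attract)} cases")
```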
NEXT STEPS
• Let’s say we went with deleting the 4 outliers
• Now let’s look at other potential outliers using scatterplots
• This will also show us the relationships between the variables (positive versus negative)
• This will also let us check the homoscedasticity assumption:
the variance around the regression line should be about the same for all
values of the predictor variable
SCATTERPLOTS: HOMOSCEDASTICITY CHECK
[Scatterplots of Album Sales (Y) against each predictor: X1, X2, and X3]
MULTICOLLINEARITY CHECK: BIVARIATE CORRELATIONS
[Pearson (parametric) correlation matrix among the predictors X1, X2, and X3]
JUST CHECKING OUT THE RELATIONSHIP OF PREDICTORS TO THE OUTCOME VARIABLE
[Pearson (parametric) correlations of X1, X2, and X3 with Y]
MULTICOLLINEARITY CHECK: BIVARIATE CORRELATIONS
[Kendall’s tau (non-parametric) correlation matrix among the predictors]
JUST CHECKING OUT THE RELATIONSHIP OF PREDICTORS TO THE OUTCOME VARIABLE
[Kendall’s tau (non-parametric) correlations of X1, X2, and X3 with Y]
NON-ZERO VARIANCE
ASSUMPTION
AnalyzeDescriptive StatisticsFrequencies
Move variables in box, click Statistics, select Variance and Range
X1
X2
X3
Y
PREDICTOR VARIABLES MUST BE
QUANTITATIVE/CONTINUOUS OR
CATEGORICAL
Look at the Measure column in SPSS Variable View: Are X1, X2, and
X3 set to Scale or Nominal?
OUTCOME VARIABLE MUST BE
QUANTITATIVE/CONTINUOUS
Look at the Measure column in SPSS Variable View: Is Y set to Scale?
LET’S REVIEW ASSUMPTIONS
• Linearity and normality, outliers taken care of ~ X3 is kinda sketchy
(skew is gone, but still problems with kurtosis)
• Predictor variables continuous or categorical ~ X3 is kinda sketchy
• Outcome variable continuous = YES!
• Non-zero variances ~ X3 is kinda sketchy
• No multicollinearity between predictors = YES!
• Homoscedasticity ~ X3 is kinda sketchy
• Residuals/errors normally distributed = WE WILL SEE!
• So far, X1, X2 and Y look pretty great in terms of assumptions
SIMPLE REGRESSION IN SPSS:
ONE PREDICTOR VARIABLE
• H0: Advertising (X1) does not share variance with Album Sales (Y)
• H1: Advertising (X1) does share variance with Album Sales (Y)
Y = B0 + B1X1 + E
(Y = Album Sales; B0 = intercept of the line; B1 = slope of the line; X1 = Advertising; E = error)
SIMPLE REGRESSION IN SPSS
• Analyze → Regression → Linear
• Independent variable: Sqrt Adverts (X1)
• Dependent variable: Album Sales (Y)
• If you click on the Statistics button, you can get:
• Residuals: Durbin-Watson tests for correlations among errors/residuals;
values less than 1 or greater than 3 are problematic [an assumption for
running regression is that your errors are uncorrelated] (a sketch of the
computation follows this list)
• Residuals: Casewise Diagnostics: shows you >3 SD outliers in your data
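Durbin-Watson is simple to compute from raw residuals; as a sketch, it is the ratio of squared successive differences to the residual sum of squares, landing near 2 when errors are uncorrelated.

```python
# Sketch: the Durbin-Watson statistic from a vector of residuals.
import numpy as np

def durbin_watson(resid):
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(3)
resid = rng.normal(0, 1, 200)              # independent toy residuals
print(f"DW = {durbin_watson(resid):.2f}")  # should land near 2
```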
SPSS: SIMPLE REGRESSION OUTPUT
Tells you what your
independent variable
(predictor) and your
dependent variable
(outcome) were
SPSS: SIMPLE REGRESSION OUTPUT
• R: standardized covariation between X1 and Y (Pearson’s r)
• R Square: effect size; how much variance in Y is accounted for by our predictor X1
(“Adverts accounts for 30% of the variance in album sales”)
• Adjusted R Square: effect size; how much variance would be accounted for if the
model had been derived from the population from which this sample was taken
• Durbin-Watson: amount of correlation among errors/residuals
(less than 1 or greater than 3 = a problem)
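As a worked example of the two effect sizes, using the numbers the slides report (r ≈ .55, n = 200, one predictor); adjusted R² shrinks R² toward what would be expected in the population.

```python
# Sketch: R-squared and adjusted R-squared from r, n, and k.
n, k = 200, 1        # sample size and number of predictors (from the slides)
r = 0.55             # beta/r reported later in the output
r_squared = r ** 2                                       # ~= .30
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(f"R^2 = {r_squared:.3f}, adjusted R^2 = {adj_r_squared:.3f}")
```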
SPSS: SIMPLE REGRESSION OUTPUT
• ANOVA table: overall test of all predictors in the model
• Reject H0: there is less than a .1% chance (p < .001) that an F ratio this large
would occur if the null hypothesis were true
• Coefficients table: tells us about individual contributions to the model
• Intercept: B0 = 96.459
• Slope: B1 = 4.294
SPSS: SIMPLE REGRESSION OUTPUT
• The sign of B1 tells us the direction of the relationship (positive or negative correlation)
• When no money is spent on advertising, 96.459 thousand records will still be sold
• When the square root of the advertising budget increases by 1 unit, an extra
4.294 thousand records will be sold
• Y = B0 + B1X1 + E
• Issue when transforming data: this coefficient is in square-root units!
• The table also tells us about the individual contributions of predictors to the model
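A small sketch of using these coefficients for prediction, remembering the square-root units; the advertising figure below is hypothetical.

```python
# Sketch: predict album sales from the fitted coefficients, back-applying
# the square-root transformation to the raw advertising value.
import numpy as np

b0, b1 = 96.459, 4.294                   # intercept and slope from the output
adverts = 100.0                          # hypothetical advertising spend
pred_sales = b0 + b1 * np.sqrt(adverts)  # in thousands of albums
print(f"predicted sales ~= {pred_sales:.1f} thousand albums")
```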
SPSS: SIMPLE REGRESSION OUTPUT
Pearson’s r
T-test: Is the beta value
significantly different from zero?
Advertising budget makes a significant contribution to album sales.
SPSS: SIMPLE REGRESSION OUTPUT
Potential problem cases
Residual z-scores >3 SD
SPSS REGRESSION: PLOTS
Gives you a histogram of
the residuals (errors) to
see if they are normally
distributed (this is one of
the assumptions you
need to meet if
interpreting linear
regression results)
SPSS REGRESSION: PLOTS
Residuals for our model
look normally distributed!
We can check off that
assumption as valid!
SPSS REGRESSION: PLOTS
Gives you a scatterplot
of the standardized
predicted values of Y
and the standardized
residuals (errors) of
the model (in this case, with X1 as the predictor)
SPSS REGRESSION: PLOTS
Residuals (errors) should
not be correlated with
predicted values in the
model
They should be randomly
distributed
PROBLEM
• Since we transformed our Advertising variable,
it is now in square root units
• To more easily interpret results, we might want
to standardize all variables (z-score them)
before including them in regression so that
they will all be on the same scale
STANDARDIZING ALL VARIABLES
Analyze → Descriptive Statistics → Descriptives → check Save Standardized Values as Variables
REVISED SPSS OUTPUT
You get the same results for the Model Summary and ANOVA tables.
BUT, because we transformed all of our variables into standardized z-scores,
the unstandardized coefficients change to standardized ones, where the
constant is zero and beta = the correlation coefficient (r).
INTERPRETATION WITH
STANDARDIZED COEFFICIENTS
Y = B0 + B1X1
Album Sales = 0 + .55(Advertising)
As advertising $$ increases by 1 standard deviation, album
sales increase by .55 standard deviations
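A quick Python sketch confirming this property on simulated data: z-score both variables, and the least-squares slope equals Pearson’s r while the intercept collapses to zero.

```python
# Sketch: with standardized X and Y, intercept = 0 and slope = Pearson's r.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(0, 1, 200)
y = 0.55 * x + rng.normal(0, 1, 200)     # toy relationship

zx, zy = stats.zscore(x, ddof=1), stats.zscore(y, ddof=1)
slope, intercept, r, p, se = stats.linregress(zx, zy)
print(f"beta = {slope:.3f}, r = {r:.3f}, intercept = {intercept:.2e}")
```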
WRITING METHODS/RESULTS
IN A PAPER
• A simple linear regression was computed, with square-root-transformed
Advertisements as the independent variable and Album Sales (in thousands
of copies) as the dependent variable.
• Both variables were standardized before being entered into the model.
• Results indicated that our overall regression model was significant,
F(1,198)=86.45, p<.001.
• Findings showed that as advertising increased by 1 standard deviation,
album sales increased by .55 standard deviations, t=9.18, p<.001 (r=.55, large
effect size).
QUESTIONS ON SIMPLE
REGRESSION?