ORDINARY LEAST SQUARES REGRESSION
NICHOLAS CHARRON
ASSOCIATE PROF.
DEPT. OF POLITICAL SCIENCE
Section Outline
Day 1: Overview of OLS regression (Wooldridge, chap. 1-4)
Day 2 (25 Jan): Assumptions of OLS regression (Wooldridge, chap. 5-7)
Day 3 (27 Jan): Alternative estimation, interaction models (Wooldridge, chap. 8-9 + Brambor et al. article)
Next topic: Limited dependent variables
Section Goals
• To understand the basic ideas and formulas behind linear regression
• Calculate (by hand) a simple bivariate regression coefficient
• Working with 'real' data: apply knowledge, perform regression & interpret results, compare effects of variables in multiple regression
• Understand the basic assumptions of OLS estimation
• How to check for violations, and what to do (more in later lectures also)
• What to do when the X and Y relationship is not directly linear – interaction effects, variable transformation (logged variables)
• Apply knowledge in STATA!
Introductions!
- Name
- Department
- Year as PhD student
- Where you are from (country)
- How much statistics have you had?
- What is your research topic?
Linear Regression: brief history
A bit of background:
Sir Francis Galton – interested in the heredity of plants; coined 'regression toward mediocrity', meaning in his time the median (now known more as regression toward the mean..)
Emphasis on 'on average': what can we expect? He was not a mathematician, however..
Karl Pearson (Galton's biographer) took Galton's work and developed several statistical measures.
Together with the earlier 'least squares' method (Gauss 1812), regression analysis was born.
Simple statistical methods: cross tabulations & correlations
Used widely, especially in survey research; probably many of you are familiar with this…
At least one categorical variable – nominal or ordinal
If we want to know how two variables are related in terms of strength, direction & effect, we can use various tests:
Nominal level – only strength (Cramér's V)
Ordinal level – strength and direction (tau-b & tau-c, gamma)
Interval (ratio) level – strength, direction (and effect) (Pearson's r, regression)
WHY DO WE USE REGRESSION?
• To test hypotheses about causal relationships for a continuous/ordinal outcome
• To make predictions
• The preferred method when your statistical model has more than two explanatory variables and you want to elaborate a causal relationship
• However, always remember that correlation is not causation!
• We test hypotheses about causal relationships, but the regression doesn't express causal direction (why theory is important!)
Key advantages of linear regression
• Simplicity: a linear relationship is the most simple, non-trivial relationship. Plus, most people can even do the math by hand (as opposed to other estimation techniques..)
• Flexibility: even if the relationship between X and Y is not really linear, the variables can be transformed (more later)
• Interpretability: we get strength, direction & effect in a simple, easy-to-understand package
Some essential terminology
Regression: the mean of the outcome variable (Y) as a function of one or more independent variables (X):

μ(Y|X)

Regression Model: explaining Y in the 'real world' is very complicated. A model is our APPROXIMATION (simplification) of the relationship.
Simple (bivariate) regression model:

Y = β0 + β1X

Y: the dependent variable
X: the independent variable
β0: the intercept or 'constant' (in other words??), also notated as α (alpha)
β1: the slope
The β's are called 'coefficients'
More terminology
• Dependent variable (Y): a.k.a. explained variable, response variable
• Independent variable (X): a.k.a. explanatory variable, control variable
Two types of models, broadly speaking:
1. Deterministic model:
– Y = a + bX (the equation of a straight line)
2. Probabilistic model (e.g. what we are most interested in..):
– Y = a + bX + e
A deterministic model: visual
[Figure: total expenses (Y) plotted against # beers (X), with all points falling exactly on a straight line]
A deterministic model: simple example

Person      # beers (X)   Total expenses (Y)
Stefan      0             20
Martin      2             50
Thomas      4             80
Rasmus      5             95
Christian   6             110

Calculation of the slope:
β = (110 − 20) / (6 − 0) = 15

Calculation of the intercept:
α = 50 − (2 × 15) = 20, or
α = 80 − (4 × 15) = 20

The equation for the relationship:
Y = α + βX = 20 + 15X
The probabilistic model: with ‘error’
Even more terminology
• and, most often, we're dealing with 'probabilistic' models:
Fitted values ('Y hat'): for any observation 'i' our model gives an expected mean:

Ŷᵢ = fitᵢ = β0 + β1Xᵢ

Residual: the error (how much our model is 'off') for observation 'i':

resᵢ = Yᵢ − fitᵢ = Yᵢ − Ŷᵢ

where resᵢ is normally written as eᵢ
Least Squares: our method to find the estimates that MINIMIZE the SUM of SQUARED RESIDUALS (SSE):

Σᵢ₌₁ⁿ (yᵢ − β0 − β1xᵢ)² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ eᵢ²
OLS Regression

          X (age)   Y (income)
Pelle     20        21
Lisa      19        22.4
Kalle     54        47.3
Ester     42        17
Ernst     39        35
Stian     67        23.8
Lise      40        39.3

Relationship between income and age
[Scatterplot: each dot represents a respondent's income (Y) at a given age (X)]
OLS Regression
[Scatterplot with fitted line: the relationship between income and age shows strength, direction & effect – but causality?]
How to estimate the coefficients?
• We use the ’least squares method’ of course!
To calculate the slope coefficient (β1) of X:

β1 = Σᵢ₌₁ⁿ (xᵢ − X̄)(yᵢ − Ȳ) / Σᵢ₌₁ⁿ (xᵢ − X̄)²

The slope coefficient is the covariance between X and Y over the variance of X, or the rate of change of Y relative to change in X.

And to calculate the constant:

β0 = Ȳ − β1X̄

Simply speaking, the constant is just the mean of Y minus the mean of X times β1.
In-class exercise –
Calculation of the beta and alpha values in OLS by hand!

          X   Y
Pelle     2   4
Lisa      1   5
Kalle     5   2
Ester     4   1
Ernst     3   3

b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²
a = Ȳ − bX̄
Calculation of the b-value in OLS

          X    Y    X̄    Ȳ    X−X̄   Y−Ȳ   (X−X̄)²   (X−X̄)(Y−Ȳ)
Pelle     2    4    3    3    −1     1     1         −1
Lisa      1    5    3    3    −2     2     4         −4
Kalle     5    2    3    3     2    −1     4         −2
Ester     4    1    3    3     1    −2     1         −2
Ernst     3    3    3    3     0     0     0          0
Sum:     15   15   15   15     0     0    10         −9

b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = −9 / 10 = −0.90

a = Ȳ − bX̄ = 3 − (−0.90 × 3) = 5.7
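As a quick check of the hand calculation, a minimal Stata sketch (the five observations are typed in by hand; the variable names are just illustrative):

* enter the five observations from the table above
clear
input str8 person x y
"Pelle" 2 4
"Lisa"  1 5
"Kalle" 5 2
"Ester" 4 1
"Ernst" 3 3
end
regress y x    // should return a slope of -0.90 and a constant of 5.7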
• Now for every value of X, we have an expected value of Y
• A cool property of least squares estimation is that the regression line will always cross the mean of X and the mean of Y
[Scatterplot with fitted line: corruption (Y) vs. pub. sec. meritocracy (X); mean of X = 4.073, mean of Y = 0.176; from Charron et al 2017]
β compared to Pearson's r
• The effect in a linear regression = β

b = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²

• Correlation – Pearson's r
• Same numerator, but it also takes the variance of Y into account in the denominator:

r = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / √[ Σ(Xᵢ − X̄)² · Σ(Yᵢ − Ȳ)² ]

Q: When will these two be equal?
Interpretation of Pearson's r
[Figure: guideline interpretations of Pearson's r values; source: Wikipedia]
Correlation and regression, a comparison
• Pearson's r is standardized and varies between -1 (perfect neg. relationship) and 1 (perfect pos. relationship); 0 = no relationship; n is not taken into account

b = r(xy) · (S_Y / S_X)

• The regression coefficient (β) has no given minimum or maximum values, and the interpretation of the coefficient depends on the range of the scale
• Unlike the correlation r, the regression is used to predict values of one variable, given values of another variable
Objectives and goals of linear regression
• We want to know the probability distribution of Y as a function of X (or several X's)
• Y is a straight-line (i.e. linear) function of X, plus some random noise (the error term)
• The goal is to find the 'best line' that explains the variation of Y with X
• Important! The marginal effect of X on Y is assumed to be CONSTANT across all values of X. What does this mean??
Applied bivariate example
• Data: QoG Basic, two variables from the World Values Survey
• Dependent variable (Y): Life happiness (1-4, lower = better)
• Independent variable (X): State of health (1-5, lower = better)
• Units of analysis: countries (aggregated from survey data), 20 randomly selected
Our Model:
Y(Happiness) = α + β1(health) + e
• H1: the healthier a country feels, the happier it is on average
• Let's estimate α and β1 based on our data!
• Open the file in STATA from GUL: health_happy ex.dta
***To do what I've done in the slides, see the do-file
Some basic statistics

1. Summary stats
sum health happiness

    Variable |    Obs      Mean    Std. Dev.    Min    Max
-------------+---------------------------------------------
      health |     20    2.1925     .241004    1.84   2.65
   happiness |     20     1.948    .2783901    1.59   2.58

2. Pairwise correlations (Pearson's r)
pwcorr health happiness

             |  health happin~s
-------------+------------------
      health |  1.0000
   happiness |  0.6752   1.0000

3. Scatterplot w/line – in STATA:
twoway (scatter happiness health) (lfit happiness health)
[Scatterplot of happiness against health with a fitted line]
Now for the regression

. reg happiness health

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =   14.99
       Model |  .670585718     1  .670585718           Prob > F      =  0.0011
    Residual |  .805119395    18  .044728855           R-squared     =  0.4544
-------------+------------------------------           Adj R-squared =  0.4241
       Total |  1.47570511    19   .07766869           Root MSE      =  .21149

   happiness |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
      health |   .7780828   .2009521     3.87   0.001     .3558981    1.200267
       _cons |   .2431537   .4429099     0.55   0.590    -.6873656    1.173673

Annotations: the health row gives the coefficient (b) with its standard error, t-test (of sig.), and 95% confidence interval; the _cons row gives the constant (a). Top right: number of observations, F-test of the model, R-squared, and the root mean squared error.
Now for the regression (cont.)

. reg happiness health
(same output as above: b = .7780828, a = .2431537)

These estimates correspond to the formulas:

b = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²
a = Ȳ − bX̄
Ŷᵢ = a + bXᵢ
Some interpretation

[Scatterplot of happiness against health with the fitted values line]

What is the predicted mean of happiness for a country with a mean in health of 2.3?

Ŷ = .243 + .778(2.3) = 2.02

Answer: 2.02..
Ok, now what??
• Ok great, but the calculation of beta and alpha is just simple math…
• Now, we want to see how much we can INFER from this relationship – as we do not have ALL the observations (e.g. a 'universe') with perfect data, we are only making an inference
• A key to doing this is to evaluate how "off" our model predictions are relative to the actual observations of Y
• We can do this both for our model as a whole and for the individual coefficients (both betas and alpha). We'll start with calculating the SUM of SQUARES
• Two questions:
a. How 'sure' are we of our estimates, e.g. significance, or the probability that the relationship we see is not just 'white noise'?
b. Is this (OLS) actually the most valid estimation method?
Assumptions of OLS (more on this next week!)
OLS is fantastic if our data meets several assumptions, and before we make any inferences, we should always check:
In order to make inference:
1. The linear model is suitable
2. The conditional standard deviation is the same for all levels of X (homoscedasticity)
3. Error terms are normally distributed for all levels of X
4. The sample is selected randomly
5. There are no severe outliers
6. There is no autocorrelation
7. No multicollinearity
8. Our sample is representative of the population (all estimation)
Regression Inference
• In order to test several of our assumptions, we need to observe the
residuals in our estimation
• These allow us to both check OLS assumptions AND provide
significance testing
• Plotting the residuals against the explanatory (X) variable is helpful in checking these conditions because a residual plot magnifies patterns. You should ALWAYS look at this
Least squares: Sum of the squared
error term – getting our error terms
• A measure of how far the line is from the observations, is the
sum of all errors: The smaller it is, the closer is the line to the
observations (and thus, the better our model.)
• In order to avoid positive and negative errors to cancel out in the
calculation, we square them:
term
•The
Theerror
sum
of all for observation i: e i 2  Yi - Ŷi 2

The Residual sum
of squares (RSS)
n
n

RSS   ei   Yi - Ŷi
i 1
2
i 1


2
38
90
Residual Sum of squares – a simple visual
80
'actual Yi'
residuals
50
60
70
'best fit' line
0

ei  Yi - Ŷi
2
5
10
Unemply1012
15

2
Back to our exercise:

. reg happiness health
(output as above: Residual SS = .805119395 with 18 df; Root MSE = .21149)

RSS = Σ(Y − Ŷ)²
TSS = Σ(Y − Ȳ)²

σ̂ = √( Σ(Y − Ŷ)² / (n − 2) ) = √( RSS / (n − 2) )

This is the typical deviation around the line (i.e. the conditional standard deviation), reported in the output as the root mean square error.
Getting our standard errors for beta
• The standard error (s.e.) tells us something about the average deviation between our observed values and our predicted values.
• The s.e. for a slope coefficient is then calculated from the square root of the RSS divided by the number of observations minus the number of parameters:

Se = √( RSS / (n − K) ) = √( Σ(Y − Ŷ)² / (n − K) )

where RSS = Residual Sum of Squares, a.k.a. Sum of Squared Errors
n = number of cases
K = number of parameters (in bivariate regression – intercept and b-coefficient = 2)
Getting our standard errors for b
• The precision of b depends (among other things) on the variation around the slope – i.e. how large the spread is around the line.
• This spread, we have assumed, is constant for all levels of X – but how is it calculated?
• See earlier: the sum of squared deviations from the line is given by:

RSS = Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²

• As we just saw, the typical deviation around the line (i.e. the conditional standard deviation) is then given by:

σ̂ = √( RSS / (n − 2) ) = √( Σ(Y − Ŷ)² / (n − 2) )
Standard errors for b
• The standard error of b is then defined as the conditional standard deviation divided by the (square root of the) variation in X:

σ̂_b = σ̂ / √( Σ(X − X̄)² ) = σ̂ / ( s_X · √(n − 1) )

• Factors affecting the standard error of beta:
1. The spread around the line, σ – the smaller σ, the smaller the standard error
2. Sample size, n – the larger n, the smaller the standard error
3. The variation of X, Σ(X − X̄)² – the greater the variation, the smaller the standard error
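A quick, approximate check of this formula against the happiness/health output shown earlier (the numbers below are rounded, so small discrepancies with Stata's exact values are expected):

σ̂ = √(RSS / (n − 2)) = √(0.805 / 18) ≈ 0.2115   (the Root MSE in the output)
Σ(X − X̄)² = (n − 1)·s_X² ≈ 19 × 0.241² ≈ 1.10
σ̂_b ≈ 0.2115 / √1.10 ≈ 0.20   (close to the .2009521 reported for health)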
Standard errors for b: the 'ideal'
[Figure illustrating the three factors above]
Back to our example (see the Excel sheet for the 'by hand' calculations..)

. reg happiness health
(same output as before)

Standard error of b: .2009521 (for health); standard error of a: .4429099 (for _cons)
Standard errors for b
• The standard errors can then be used for hypothesis testing. By dividing our slope coefficients by their s.e.'s we obtain their t-values.
• The t-value can then be used as a measure of statistical significance and allows us to calculate a p-value (what is this??)
• Old school: one can consult a t-table, where the degrees of freedom (the number of observations minus the number of estimated parameters) and your level of security (p < .10, .05 or .01 for ex.) decide whether your coefficient is significantly different from zero or not (t-tables can be found as appendices in statistics books, like Wooldridge)
• New school: rely on statistical programs (like STATA, SPSS)
• H0: β1 = 0
• H1: β1 ≠ 0
Hypothesis testing & confidence interval for β
Hypothesis test of independence between the variables (H0: β = 0):

t = (b − 0) / σ̂_b = b / σ̂_b

Calculation of the t-value:

t = b / SE

95 pct. confidence interval:

β ≈ b ± t·(σ̂_b)
Confidence intervals for β
H0: b = 0; H1: b ≠ 0, tested against ±1.96 at a 95% confidence level (the large-sample critical value)
90% confidence interval: t = 1.645
95% confidence interval: t = 1.96
99% confidence interval: t = 2.576
Forming a 95% confidence interval for a single slope coefficient:
b_x ± t(SE_bx)
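Applied to the happiness/health output from earlier (with 18 degrees of freedom the critical t-value is about 2.10 rather than 1.96, since the sample is small):

t = b / SE = .7780828 / .2009521 ≈ 3.87
95% CI: .778 ± 2.10 × .201 ≈ [.356, 1.200]   (matching the interval Stata reports)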
Back to our example..

. reg happiness health
(same output as before)

Annotations: number of observations and F-test of the model (top right), R-squared and the mean squared error; for each coefficient: its standard error, the t-test (of sig.), the p-value from the t-test, and the 95% upper/lower confidence limits.
TAKING A CLOSER LOOK ’UNDER THE HOOD
OF THE CAR’…
OPEN EXCEL FILE: REGRESSION HAPPINES_VS_HEALTH
Basic OLS model diagnostics:
1. R²
2. F-Test
3. MSE
1. R2 : EXPLAINED VARIANCE
• A.k.a. "coefficient of determination"
• R² ranges from 0 to 1 (0 ≤ R² ≤ 1)
• R² measures how close the observed values (the dots) lie to the estimated regression line
• R² is a direct measure of linearity but is interpreted as explained variance. When R² = 1, we've explained all variation in Y; when R² = 0, we've explained nothing… a good way to compare models!
• In many (social science) research models building on survey data (individual level), R² is often a low value (rarely exceeds .40)
• It is calculated using three sum-of-squares formulas
Calculating R² - FIRST COMPONENT

TSS = Σ(Yᵢ − Ȳ)²

Total sum of squares (TSS) – as we've used in other equations, this is the sum over observations of each value of the dependent variable minus the mean value of the dependent variable, squared. E.g., this is the total variation in the dependent variable, σ²
Calculating R² - SECOND COMPONENT

ESS = Σ(Ŷᵢ − Ȳ)²

Explained sum of squares (ESS) – the sum of the predicted values of the dependent variable for each observation minus the mean value of the dependent variable, squared.
If our regression does not explain any variation in the dependent variable, ESS = 0: our best prediction is simply the mean value of Y. If our model has any explanatory power, ESS > 0 and the model adds something beyond the mean to our understanding of the outcome (Y).
This is also called the 'regression sum of squares (RSS)' (confusing, right??)
Calculating R² - THIRD COMPONENT

RSS = Σ eᵢ² = Σ(Yᵢ − Ŷᵢ)²

Residual sum of squares (RSS) – which we covered a few slides ago: each observation's value on the dependent variable minus the predicted value, squared.
This is the variation our model cannot explain and is therefore labeled the error term (or residual).
This is also called the error sum of squares (ESS) (huh, wtf??)
EXPLAINED VARIANCE
• As noted, R² is defined as:

R² = ESS / TSS = (TSS − RSS) / TSS = 1 − RSS / TSS

• The total variation in Y – TSS – can be divided into two parts: the closer ESS is to TSS, or the lower RSS is relative to TSS, the higher the R² value
• Therefore, R² is commonly interpreted as the part of the variation in Y explained by X
• Note! R² will be lower if our variables have a non-linear relationship!
. reg happiness health
(output as before)

ESS = explained (model) sum of squares = Σ(Ŷᵢ − Ȳ)² = .670585718
RSS = residual sum of squares = Σ(Yᵢ − Ŷᵢ)² = .805119395
TSS = total sum of squares = Σ(Yᵢ − Ȳ)² = 1.47570511

Amount of variance in happiness explained by health:

R² = 1 − RSS/TSS = 1 − (0.805/1.476) = 0.45
A visual of how R² works:
R² = 1 − (RSS/TSS)
[Venn diagram: the full circle for our DV is the total sum of squares (TSS); the part of the DV circle not overlapping our IV is the residual sum of squares (RSS); the overlap between the DV and our IV is the explained variation (ESS)]
B values vs. R² values – an important distinction!
[Two scatterplots with roughly the same slope, b ≈ 4.33, but one with R² ≈ 0.10 and the other with R² ≈ 0.90]
R² doesn't say ANYTHING about the effect size
2. Testing model significance: the F-test
• If our null is of the form H0: β1 = β2 = . . . = βk = 0, then we can write the test statistic in the following way:

F = [ (RSS₁ − RSS₂) / (P₂ − P₁) ] / [ RSS₂ / (n − P₂) ]

• This tests whether the betas we put in a model explain variation significantly better than an empty model with just a constant
• It is basically the explained variance over the residual variance
• Degrees of freedom: n is the number of observations, P₂ is the number of parameters in the larger ('unrestricted') model, while P₁ is the number in the smaller ('restricted') model – here just a constant. Note that P₂ > P₁, and RSS₁ comes from the restricted model, RSS₂ from the unrestricted one.
• This can also be used to test 'nested models' (more later…)
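Checking this against the bivariate happiness/health model (the restricted model is constant-only, so RSS₁ = TSS):

F = [ (1.4757 − 0.8051) / (2 − 1) ] / [ 0.8051 / (20 − 2) ] = 0.6706 / 0.0447 ≈ 14.99

which matches the F(1, 18) = 14.99 reported by Stata.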
2. Testing model significance: F-test
• H0: β1 = β2 = …. = βk = 0
• Ha: At least one β is different from 0
• If p < 0.05, we reject the null hypothesis in favor of Ha
• Note! A significant F-value does not necessarily mean we have a good model. However, if we cannot reject H0, our model is indeed bad!
. reg happiness health
(same output as before; F(1, 18) = 14.99, Prob > F = 0.0011)
Mean Squared Error (MSE)
• The (root) MSE tells us how 'off' the model is on average, in the units of the DV. Some units are less 'intuitive' than others; when that is the case, compare the MSE with the std. dev. of the DV
• Also useful for comparing different models with the same DV
• The Root MSE here tells us that our predictions on average are 'off' by about 0.21
. reg happiness health
(same output as before; Root MSE = .21149)
ADDING ADDITIONAL VARIABLES:
MULTIPLE REGRESSION
Last week
• Regression introduction
• Basics of OLS –
• calculations of beta, alpha, error term, etc.
• bivariate analysis
• Basic model diagnostics: R², F-tests, MSE
• Today:
• multivariate regression
• Assumptions of OLS, detection of violations and what to do about it..
Back to Sir Galton….
Multiple regression
• So far, we've just kept it simple with bivariate regression models:
Y = β0 + β1X + e
With multiple regression, we're of course adding more variables ('parameters') to our model. In stats terms, the bivariate model is a 'restricted' version of this larger, 'unrestricted' model:
Y = β0 + β1X + β2Z + e
We're thus able to account for a greater number of explanations as to why Y varies.
Additional variables can be included for a number of reasons:
controls, additional theory, interactions (later)
How now to interpret our coefficients?
βn = the change in Y for a one-unit change in Xn, holding all other variables constant (or 'all things being equal', or 'ceteris paribus').
In other words, the average marginal effect across all values of the additional X's in the model
α (intercept) – the estimated value of Y when all other X's are held at '0'. This may or may not be realistic
[Venn diagram of y, x1 and x2]
Circle Y: the total variation in the dependent variable (TSS)
Circle x1: the total variation in the first independent variable
Circle x2: the total variation in the second independent variable
A: the unique covariance between the independent variable x1 and y
B: the unique covariance between the independent variable x2 and y
C: shared covariance between all three variables
D: covariance between the two independent variables not including the dependent variable
E and F: variation in x1 (E) and x2 (F) respectively that is not associated with y or the other independent variable
G: the variation in the dependent variable NOT explained by the independent variables – this is the variation that could be explained by additional independent variables (RSS)
B coefficients in multiple regression, cont.
Regression of y (dependent) on x2 (independent):

y = c1 + c2·x2 + w

Areas C and B are predicted by the equation:

ŷ = c1 + c2·x2

Areas A and G end up in w (the error), which equals:

w = (y − ŷ)

Areas A and G are thus retained in y through w. Now, we can calculate the unique effect of x1 on y under control for x2.
Calculation of the b coefficients in multiple regression
y = B1 + B2x2 + B3x3 + e
"B1" = intercept
Starting simple: dummy variables in regression
• If an independent variable is nominal, we can still use it by creating dummy variables (if >2 categories)
• A dummy variable is a dichotomous variable coded 0 and 1 (based on an original nominal or ordinal variable)
• The number of dummy variables needed depends on the number of categories of the original variable
• Number of categories of the original variable minus 1 = number of dummy variables.
• Ex. Party affiliation: Alliansen, R-G, SD – we would include dummies for 2 groups and these betas are compared with the third (omitted) group (see the sketch below)
• We can also do this for ordinal IVs, like low, middle and high f/e.
• In any regression, the intercept equals the mean of the dependent variable when the X's = 0; thus for a dummy variable this = Ȳ for the reference category (RC).
• The coefficients show each category's difference in the mean relative to the RC
• If we add other independent variables to our model, the interpretation of the intercept is when ALL independent variables are 0.
• Still, the interpretation of the coefficients for the dummy variables should be in relation to the reference category, but under control for the additional variables we entered into the model.
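A minimal Stata sketch of the party example above (the variable names 'party' and 'support' are hypothetical, just for illustration):

* hypothetical 3-category variable: 1 = Alliansen, 2 = R-G, 3 = SD
tab party, gen(party_d)          // creates dummies party_d1, party_d2, party_d3
reg support party_d2 party_d3    // party_d1 (Alliansen) is the omitted reference group
* or let Stata handle the reference category with factor notation:
reg support i.party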
Example: support for EU integration, EES data (on GUL)
• Let's say we're interested in explaining why support for further EU integration varies at the individual level in Sweden.
• DV: "Some say European unification should be pushed further. Others say it already has gone too far. What is your opinion? Please indicate your views using a scale from 0 to 10, where '0' means unification 'has already gone too far' and '10' means it 'should be pushed further'."
• 3 IVs: gender (0 = male, 1 = female), education (1 = some post-secondary+, 0 otherwise) and European identity (attachment, 0-3, where 0 = very unattached, 3 = very attached)

EU Support = β0 + β1(female) + β2(education) + β3(Euro identity) + e
Summary stats
[Histogram of Supp_EU_int, 0-10]
sum Supp_EU_int female some_college EU_attach
• DV ranges from 0-10
• 2 binary IVs
• 1 ordinal IV

    Variable |    Obs       Mean    Std. Dev.   Min   Max
-------------+---------------------------------------------
 Supp_EU_int |  1,112   4.644784    2.561694      0    10
      female |  1,144   .4318182    .4955461      0     1
some_college |  1,144   .6975524    .4595189      0     1
   EU_attach |  1,131   2.228117    .7741086      0     3
reg Supp_EU_int female some_college EU_attach

 Supp_EU_int |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
      female |  -.4034422   .1486337    -2.71   0.007    -.6950803   -.1118041
some_college |   .3745538   .1601731     2.34   0.020     .0602739    .6888338
   EU_attach |    1.05227   .0947614    11.10   0.000     .8663361    1.238204
       _cons |   2.209169   .2469881     8.94   0.000     1.724547    2.693791
1. Intercept: the predicted level of the DV when all variables = 0 (men, without college, who are strongly detached from Europe)
2. Female: the effect of gender is significant. Holding education and European identity constant, females support further EU integration by 0.4 less than males, on average.
3. Education: the effect is also significant. Having some post-secondary education increases support for EU integration by 0.37, holding gender and European identity constant.
4. European attachment: is significant. Holding education and gender constant, a one-unit increase in attachment results in an increase in support (the DV) of 1.05 on average.
A visual with gender and identity
[Plot of predicted levels of support (linear prediction) across European attachment (0-3), with separate lines for males and females: the vertical gap between the lines is the effect of gender, the slope of the lines is the effect of European attachment]
Some predictions from our model
• EU Support = 2.21 − 0.40(female) + 0.37(education) + 1.05(Euro identity) + e
• What is the predicted level of support for further EU integration for a:
1. Male with some university and a strong European identity (3)?
= 2.21 − 0.40(0) + 0.37(1) + 1.05(3) = 5.73
2. Female with no university and a very weak European attachment (0)?
= 2.21 − 0.40(1) + 0.37(0) + 1.05(0) = 1.81
Comparing marginal effects
• Significance values are not always interesting... most everything tends to become significant with many observations, as in large survey data…
• Another great feature of OLS is that we can compare both the marginal and total effects of all B's
• When you are about to publish your results, you often want to say which variables have the greatest impact in this model
• Here we can show both the marginal effects (shown in the regression output) – these effects/b-values only show the change in Y caused by a one-unit change in X – AND the total effects (min-to-max effect, or the effect within a certain range), for which one has to consider the scale.
• Question: what is the marginal and total effect of our 3 variables?
Answer..
• For binary variables, the marginal and total effects are the same
• For ordinal/continuous variables, we can do a few things to check this:
1. 'Normalize' (re-scale) the variable to 0/1 (see the do-file for this)
2. Compare standardized coefficients (just add the option 'beta')
3. Alternative – use the 'margins' command (more later..)

        Supp_EU_int |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------------+-----------------------------------------------------------------
             female |  -.4034422   .1486337    -2.71   0.007    -.6950803   -.1118041
       some_college |   .3745538   .1601731     2.34   0.020     .0602739    .6888337
normal_EU_attach0_1 |    3.15681   .2842841    11.10   0.000     2.599008    3.714611
              _cons |   2.209169   .2469881     8.94   0.000     1.724547    2.693791
For our model….

variable      Marginal effect   Total (max-min) effect
female        -0.4              -0.4
education     0.37              0.37
Euro attach   1.05              3.15
Direct comparison: Standardized coefficients
• Standardized coefficients can be used to make direct comparisons of the effects of IVs
• When standardized coefficients are used (beta values), the scale unit of all variables is deviations from the mean – number of standard deviations
• Thus, we gain comparability but lose the intuitive feel of our interpretation of the results; but we can always report both 'regular' and standardized betas.
reg Supp_EU_int female some_college EU_attach, beta

 Supp_EU_int |      Coef.   Std. Err.      t    P>|t|        Beta
-------------+----------------------------------------------------
      female |  -.4034422   .1486337    -2.71   0.007   -.0779299
some_college |   .3745538   .1601731     2.34   0.020    .0672729
   EU_attach |    1.05227   .0947614    11.10   0.000    .3167175
       _cons |   2.209169   .2469881     8.94   0.000           .
STANDARDIZED COEFFICIENTS (BETAS)
• The standardization of b:
• Standardized scores are also known as z-scores, so they are often labeled with a 'z'
In STATA (run 'sum' first so that r(mean) and r(sd) refer to the right variable):
• sum y
• gen zy=(y - r(mean))/r(sd)
• sum x
• gen zx=(x - r(mean))/r(sd)
• The standardized beta equals b·(sd of x / sd of y), i.e. the coefficient from regressing zy on zx
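A quick, approximate check against the output above, using the descriptive stats reported earlier (the standard deviations come from slightly different samples, so the match is not exact):

beta(EU_attach) ≈ 1.05227 × (.7741086 / 2.561694) ≈ 0.318   (Stata reports .3167175)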
Another way of reporting comparative effects… (Bauhr and Charron 2017)
[Dot plot comparing the min/max change and the interquartile change (25th to 75th percentile) in the DV for each IV: Europe attach., Econ. sat., Immigration, Trust EU, Corruption, Vote EU Skep., EU integration, Gal-Tan, Age, Education, Female, Econ. Left-Right, Population, Income, Nat. attach.]
ORDINARY LEAST SQUARES REGRESSION
DAY 2
NICHOLAS CHARRON
ASSOCIATE PROF.
DEPT. OF POLITICAL SCIENCE
OLS is ’BLUE’
• What is this?
• It is the Best Linear Unbiased Estimator
• A.k.a. the 'Gauss–Markov theorem', which
• "states that in a linear regression model in which the errors have expectation zero and are uncorrelated and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares (OLS) estimator. Here 'best' means giving the lowest variance of the estimate, as compared to other unbiased, linear estimators."
Assumptions of OLS
OLS is fantastic if our data meets several assumptions, and before we make any inferences, we should always check:
In order to make inference:
1. Correct model specification - the linear model is suitable
2. No severe multicollinearity
3. The conditional standard deviation is the same for all levels of X (homoscedasticity)
4. Error terms are normally distributed for all levels of X
5. The sample is selected randomly
6. There are no severe outliers
7. There is no autocorrelation
1) Model specification
a) Causality in the relationships - not so much a problem for the statistical model but rather a theoretical problem.
Better data and modelling - use panel data, experiments or theory!
b) Is the relationship between the DV and IV LINEAR?
If not - OLS regression will give biased results
c) All theoretically relevant variables should be included.
• If they are not, this will lead to "omitted variable bias" - if an important variable is left out of a model, this will influence the coefficients of the other variables in the model.
Remedy? Theory, previous literature. Motivate all variables. Some statistical tests/checks
Linear model is suitable
• When 1 or more IVs has a non-linear effect on the DV, a relationship exists, but it cannot be properly detected in standard OLS
• This one is probably one of the easiest to detect:
1. Bivariate scatterplot: if the scatter plot doesn't show an approximately linear pattern, the fitted line may be almost useless.
2. Ramsey RESET test (F-test)
3. Theory
• If X and Y do not fit a linear pattern, there are several measures you can take
Checking for this: health and happiness (in GUL)
• The scatter looks OK, but let's check more formally with the Ramsey RESET test. 3 steps:
1. Run the regression in STATA
2. Run the command linktest
3. Run the command ovtest
The linktest re-estimates your DV with the model's prediction and squared prediction as IVs (shown as _hat and _hatsq in the output). A significant squared term, or a significant RESET F-stat, implies that the model is incorrectly specified.
ovtest, H0: the model is specified correctly
If sig., make an adjustment and re-run the regression & tests
[Scatterplot of happiness against health]
• The 3 steps – what do you see?
Example with health and happiness data

reg happiness health

   happiness |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
      health |   .7799197   .2008376     3.88   0.001     .3579756    1.201864
       _cons |   .2380261   .4428564     0.54   0.598    -.6923806    1.168433

linktest

   happiness |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
        _hat |   3.503184    5.25501     0.67   0.514    -7.583918    14.59029
      _hatsq |  -.6307252   1.322438    -0.48   0.639    -3.420826    2.159376
       _cons |  -2.461618   5.186894    -0.47   0.641    -13.40501    8.481771

ovtest

Ramsey RESET test using powers of the fitted values of happiness
Ho: model has no omitted variables
F(3, 15) =      0.10
Prob > F =    0.9606
Non-linearity can be detected
[Two scatterplots with fitted lines: (1) Control of Corruption estimate against Freedom House/Polity (0-10); (2) Corruption Perceptions Index against total population (in thousands)]
Issues with non-linearity
• Problems with curvilinear relationships - we will under- or overestimate the effect on the dependent variable for different values of the independent variable.
• However, this is a 'sexy' problem to have at times..
• OLS can be used for relationships that are not strictly linear in y and x by using non-linear functions of y and x
3 standard approaches depending on the data:
1. Natural log of x, y or both (i.e. a logarithm)
2. Quadratic forms of x or y
3. Interactions of x variables
• Or adding more data/observations…
– The natural logarithm will downplay extreme values and make the variable more normally distributed.
Variable transformation: natural loggorithm
• Log models are invariant to the scale of the variables, since they are now measuring percent changes.
• Sometimes done to constrain extreme outliers and downplay their effect in the model, making the distribution more 'compact'.
• Standard variables in social science that researchers tend to log:
1. Positive variables representing wealth (personal income, country GDP, etc.)
2. Other variables that take large values – population, geographic area size, etc.
• Important to note - the rank order does not change from the original scale!
Transforming your variables
• Using the natural logarithm (i.e. the inverse of the exponential function). Only for x > 0.
• Ex. corruption explained by country size (population)
[Two scatterplots of the Corruption Perceptions Index with fitted lines: against total population (in thousands), and against logged population]
In Stata:
reg DV IV
gen logIV = log(IV)
reg DV logIV
Interpretation of transformations with logs
1. Logged DV and non-logged IV: ln(y) = β0 + β1x + u
– β1 is approximately the percentage change in y given an absolute change in x: a 1-step increase in the IV gives a coefficient×100 percent increase in the DV. (%Δy = 100·β1)
2. Logged IV and non-logged DV: y = β0 + β1ln(x) + u
– β1 is approximately the absolute change in y for a percentage change in x: a 1 percent increase in the IV gives a coefficient/100 increase in the DV in absolute terms. (Δy = (β1/100)·%Δx)
3. Logged DV and IV: ln(y) = β0 + β1ln(x) + u
– β1 is the elasticity of y with respect to x (%Δy = β1·%Δx)
– β1 is thus the percentage change in y for a percentage change in x
NOTE: The interpretation is only applicable for log base e (natural log) transformations.
Rules for interpretation of Beta with log-transformed variables
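A quick numerical illustration of rule 1 above (made-up numbers, just for intuition):

If ln(y) = β0 + 0.05·x + u, then a one-unit increase in x raises y by roughly 100 × 0.05 = 5 percent (exactly: e^0.05 − 1 ≈ 5.1 percent).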
Quadratic forms (e.g. squared)
• Ex. Democracy versus corruption
• Explained later by an interaction with economic development
[Scatterplot of the Corruption Perceptions Index (0-10) against Freedom House/Polity, with a linear fitted line; the pattern is U-shaped]
Charron, N., & Lapuente, V. (2010). Does democracy produce quality of government? European Journal of Political Research, 49(4), 443-470.
Quadratic forms – capture diminishing or increasing returns
[Scatterplot of the Corruption Perceptions Index against Freedom House/Polity, with both a linear and a quadratic fitted line]
How to model this? Quite simple: add a squared term of the non-linear IV
Quadratic forms: interpretation
• Analyses including quadratic terms can be viewed as a special case of interactions (more on Friday on this topic)
• Include both the original variable and the squared term in your model:

y = β0 + β1x + β2x² + u

• For 'U'-shaped curves, β1 should be negative, while β2 should be positive
• Including the squared term means that β1 can't be interpreted alone as measuring the change in y for a unit change in x; we need to take β2 into account as well, since:

Slope = Δy/Δx ≈ β1 + 2β2x
In Stata
• 2 approaches:
1. Generate a new squared variable:
gen democracy2 = democracy*democracy
2. Tell STATA in the regression with the '#' sign.
For continuous or ordinal variables we need to add the 'c.' prefix to the variable:
Ex. reg corruption c.democracy c.democracy#c.democracy
Comparing the results, we see..

reg wbgi_cce fh_polity2

      Source |       SS       df       MS              Number of obs =     163
-------------+------------------------------           F(1, 161)     =   77.97
       Model |  53.3873279     1  53.3873279           Prob > F      =  0.0000
    Residual |  110.238236   161  .684709542           R-squared     =  0.3263
-------------+------------------------------           Adj R-squared =  0.3221
       Total |  163.625564   162  1.01003435           Root MSE      =  .82747

    wbgi_cce |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
  fh_polity2 |   .1852964   .0209846     8.83   0.000     .1438558     .226737
       _cons |  -1.319789   .1476037    -8.94   0.000    -1.611278      -1.0283

reg wbgi_cce c.fh_polity2 c.fh_polity2#c.fh_polity2

      Source |       SS       df       MS              Number of obs =     163
-------------+------------------------------           F(2, 160)     =   84.79
       Model |  84.1888456     2  42.0944228           Prob > F      =  0.0000
    Residual |  79.4367185   160  .496479491           R-squared     =  0.5145
-------------+------------------------------           Adj R-squared =  0.5085
       Total |  163.625564   162  1.01003435           Root MSE      =  .70461

                 wbgi_cce |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------------------+------------------------------------------------------------------
               fh_polity2 |  -.4563782   .0834032    -5.47   0.000    -.6210913     -.291665
c.fh_polity2#c.fh_polity2 |   .0565682   .0071819     7.88   0.000     .0423847     .0707516
                    _cons |  -.0634611   .2030729    -0.31   0.755      -.46451     .3375879
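Using the coefficients from the quadratic model above together with the slope formula from the previous slide, we can (approximately) locate where the marginal effect of democracy changes sign:

Slope ≈ β1 + 2β2x = −0.456 + 2(0.0566)x, which equals zero at x = 0.456 / (2 × 0.0566) ≈ 4

so below roughly 4 on the Freedom House/Polity scale the slope is negative, and above it the slope is positive – consistent with the U-shape in the scatterplot.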
Quadratic forms – getting concrete model predictions using the margins command

reg wbgi_cce c.fh_polity2 c.fh_polity2#c.fh_polity2
(output as above)

margins, at(fh_polity2 =(0(1)10))
marginsplot

• Slope = Δy/Δx ≈ β1 + 2β2x
[Marginsplot: adjusted predictions (linear prediction of wbgi_cce) with 95% CIs at each value of Freedom House/Polity from 0 to 10, tracing the U-shaped relationship]
Other things to watch for under assumption 1
1. The sample is a simple random representative sample (SRS) from the population.
2. The model has correct values
3. The data are valid and accurately measure the concepts
4. No omitted variables (exogeneity)
No omitted IV’s - exogeneity
• The error term has zero population mean (E(εi) = 0).
• The error term is not correlated with the X's ('exogeneity'): E(εi|X1i, X2i, …, XNi) = 0
• This assumption is also called 'exogeneity'. It basically means that the X's are not correlated with the error term in any systematic way.
• Violations result from omitted variable bias
• Can be checked via correlations and scatterplots of the residual against the IVs – if a correlation/pattern exists, this can lead to bias (more later on this)
2. No severe multicollinearity
• What is multicollinearity?
• 'Perfect' multicollinearity is when two variables X1 and X2 are correlated at 1 (or -1), but it is also a problem when X1 and X2 are highly correlated, say above/below 0.6 or -0.6.
• Example: if we estimate one's shoe size with height, and include measures of height in both cm and inches
Cont.
• Since an inch = 2.54 cm, we know that if someone is 63 inches then they are 160 cm, for example
• What happens?

shoe sizeᵢ = β0 + β1·height_inches + β2·height_cm + eᵢ

• What is the effect of β1 on Y?
• The effect of inches on shoe size when holding cm (β2) constant – but inches don't vary when holding cm constant! So the β's will be 0 / undefined..
Multicollinearity
• Other examples:
• Nominal/categorical variables: employment (1. private sector, 2. public sector, 3. not working) – must exclude one category as a 'reference'
• But these examples are mainly errors by us..
• What happens if X1 and X2 are just highly correlated?
• OLS BLUE is not violated, estimates are still unbiased, but they become less EFFICIENT (higher standard errors)
[Venn diagram: the DV circle overlaps X1 and X2; the areas 'Effect of X1 only' and 'Effect of X2 only' are the unique overlaps, the joint overlap is the 'Effect of X1 & X2', and the remainder of the DV circle is the RSS]
Detecting multicollinearity
1. You run a model where none of the X's are sig., but the overall F-test is significant
2. Look at a Pearson's correlation table – if any variables are correlated above (rule of thumb) 0.6 or below -0.6, then this could be an issue
3. A post-regression VIF (variance inflation factor) test. This tests whether any/all X's in the model are in linear combination with any other X:

VIF_j = 1 / (1 − R²_j)

If there is no correlation between Xj and any other X's, then R²j = 0, and thus VIFj = 1, which is the lowest value.
You will get a VIF for each X and for the model as a whole. Any value above 10 (rule of thumb) is considered a problem.
In STATA, post regression: estat vif
What to do about multicollinearity??
1. If X1 and X2 are highly correlated, drop one (the least important) –
- this points to a possible trade-off between BIAS and EFFICIENCY
2. Increase N; multicollinearity has a larger impact on smaller sample sizes
3. Combine the variables into an index. This can be done via principal component or factor analysis, for example
4. Do nothing and just be clear about the problem
Short exercise
• Open the dataset on GUL: practicedata.dta
• Explain the share of women in parliament (DV) as a function of corruption, population, and spending on primary education
• Check scatterplots, correlations, and do a multivariate regression
• Interpret all coefficients, check the model statistics
• Test/examine whether a linear relationship is appropriate for all IVs
• Make the proper transformation if necessary
• Run the regression with the transformed variable. Compare the results in terms of betas, p-values and R² with the non-transformed regression output – what do you see?
• Check for multicollinearity
• Based on correlation tables & a VIF test – what do you see?
Assumptions of OLS
OLS is fantastic if our data meets several assumptions, and before we
make any inferences, we should always check:
In order to make inference:
1. Correct model specification - the linear model is suitable
2. No severe multicollinearity
3. Error terms are normally distributed for all levels of X
4. The conditional standard deviation is the same for all levels of X
(homoscedasticity)
5. There are no severe outliers
6. There is no autocorrelation
7. The sample is selected randomly/ is representative
3. No extreme outliers
• Outliers?
• Outliers, if undetected, can have a severe impact on your beta estimates. You must check for these, especially where Y or the X's are continuous.
Three ways we should think about outlying observations:
1. Leverage outlier – an observation far from the mean of Y or X (for ex., 2 or 3+ st. deviations from the mean)
2. Residual outlier – an observation that 'goes against our prediction' (e.g. has a lot of error)
3. Influence: if we take this observation out, do the results change significantly?
A leverage outlier is not necessarily a problem (if it is in line with our predictions). However, a leverage outlier makes things very misleading if it is also a big residual outlier, meaning it will be an influential observation.
use http://www.ats.ucla.edu/stat/stata/dae/crime, clear
• Run a regression explaining crime in a state (# of violent crimes/100,000 people):
3 IVs:
• % metro area
• poverty rate %
• % of single-parent households
• Interpretation?
regress crime pctmetro poverty single

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(3, 47)      =   82.16
       Model |  8170480.21     3   2723493.4           Prob > F      =  0.0000
    Residual |  1557994.53    47  33148.8199           R-squared     =  0.8399
-------------+------------------------------           Adj R-squared =  0.8296
       Total |  9728474.75    50  194569.495           Root MSE      =  182.07

       crime |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
    pctmetro |   7.828935   1.254699     6.24   0.000     5.304806    10.35306
     poverty |   17.68024    6.94093     2.55   0.014     3.716893     31.6436
      single |   132.4081   15.50322     8.54   0.000     101.2196    163.5965
       _cons |  -1666.436    147.852   -11.27   0.000    -1963.876   -1368.996
Detection of the influence of obs: lvr2plot
• A simple leverage-residual plot can give us a clear visual
• We do this after a regression in STATA
• Y-axis = leverage
• X-axis = normalized residual squared
• Any observation near the top-right corner can especially bias results!
[Leverage vs. normalized residual squared plot with state abbreviations as labels; dc stands out from the rest]
outliers via ‘studentized’ residuals
• We can check with normal residuals, but they are dependent on their scale, which makes it hard to compare different models.
• As our model is an estimate of the 'true' relationship, so are the errors
• The issue is that although the variance of the error term is assumed equal (homoskedastic), the estimated residuals are often not equal in variance for all levels of X. The variance might decrease as X increases, f/e.
• Studentized residuals are adjusted. They are re-calculated residuals whereby the regression line is re-calculated leaving out each observation, one at a time.
• We then compare the first estimates (all obs) with the estimates removing each obs, for each obs. For obs where the line moves a lot, the obs has a larger studentized residual..
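A minimal sketch of how the studentized residuals used on the following slides could be generated in Stata (assuming the crime regression from above):

regress crime pctmetro poverty single
predict r, rstudent     // studentized residuals, stored in a new variable r
histogram r, normal     // compare their distribution with a normal curve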
Normal (raw) vs. studentized residuals
[Two histograms: the raw residuals (roughly -600 to 400, on the scale of the DV) and the studentized residuals (roughly -4 to 4); the studentized residuals can be related to the z-score, where 95% of the residuals fall within ± 2 std. dev.]
Looking at obs on the extremes of the distribution

hilo r state, show(5)
• Command 'hilo'
• Specify with 'show(#)' how many you want to see (default = 10)
• Any obs below -2 or above +2 (esp. -3 or +3) should be looked at further

5 lowest observations on r:        5 highest observations on r:
  -3.570789   ms                     1.151702   il
  -1.838577   la                     1.293477   id
  -1.685598   ri                     1.589644   ia
  -1.303919   wa                     2.619523   fl
   -1.14833   oh                     3.765847   dc
Influence of each observation: Cook’s D
• In STATA, after a regression: predict d, cooksd
If Cook's d = 0 for an obs, then the obs has no influence; the higher the d value, the greater the influence. It is calculated via an F-test, testing whether Xi = Xi(minus obs i)
The 'rule of thumb' for observations with possibly troublesome influence is d > 4/n
To list them (and avoid including observations with missing data), specify: if d > 4/51
Compare the outliers' stats on the variables with the sample

list state d crime pctmetro poverty single if d>4/51

      | state          d   crime   pctmetro   poverty   single |
  9.  |    fl    .173629    1206         93      17.8     10.6 |
 18.  |    la   .1592638    1062         75      26.4     14.9 |
 25.  |    ms    .602106     434       30.7      24.7     14.7 |
 51.  |    dc   3.203429    2922        100      26.4     22.1 |

. sum crime pctmetro poverty single

    Variable |    Obs       Mean    Std. Dev.    Min     Max
-------------+------------------------------------------------
       crime |     51   612.8431    441.1003      82    2922
    pctmetro |     51    67.3902    21.95713      24     100
     poverty |     51   14.25882    4.584242       8    26.4
      single |     51   11.32549    2.121494     8.4    22.1
Measuring influence for each IV: DFBETA
• dfbeta is a statistic of the influence of each obs for each IV in the model
• It tells us how many standard errors the coefficient WOULD CHANGE if we removed the obs
• A new variable is generated for each IV
• Ex. DC increases the beta of % single parent by 3.13*se (or 3.13*15.5) compared to the regression without DC
• Dependent on the scale of Y and X!
• Caution for any dfbeta number above 2/√n = 2/√51 = 0.28

list _dfbeta_1 state if _dfbeta_1>.28
      | _dfbet~1   state |
  9.  |   .64175      fl |
 25.  | 1.006877      ms |

. list _dfbeta_2 state if _dfbeta_2>.28
      | _dfbet~2   state |
  9.  | .5959252      fl |

. list _dfbeta_3 state if _dfbeta_3>.28
      | _dfbet~3   state |
 51.  | 3.139084      dc |
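A minimal sketch of how these DFBETAs could be generated after the crime regression (the _dfbeta_* names follow Stata's default numbering of the regressors):

regress crime pctmetro poverty single
dfbeta                                                  // creates _dfbeta_1, _dfbeta_2, _dfbeta_3
list state _dfbeta_3 if abs(_dfbeta_3) > 2/sqrt(51)     // screen against the 2/sqrt(n) cutoff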
What to do about outliers?
Again, it depends on what type of 'outlier' an observation is!
There is no "right" answer here; just be aware of whether they exist and how much effect they have on the estimates, BUT:
1. Check for data errors!
2. Create an obs. dummy for the outliers:
gen outlier = 1 if ccode== x
replace outlier=0 if outlier==.
3. *Take out the obs & re-run the model & see if there are any differences; run 'lfit' and compare the R² stats.. Report any differences…
4. Use a new functional form (log, normalize variables)
5. Do nothing, leave them in and footnote
6. Use weighted observations
[Two scatterplots of y against x, each with a fitted line]
Robust regression (rreg)
• Robust regression can be used in any situation in which you would use OLS
• It can also be helpful in dealing with outliers
• After we decide that we have no compelling reason to exclude them from the analysis.
• In normal OLS, all observations are weighted equally. The idea of robust regression is to weight the observations differently based on how "well behaved" they are. Basically, it is a form of weighted and reweighted OLS (WLS)
Robust regression (rreg)
• Stata's rreg command implements a version of robust regression.
• It runs the OLS regression and gets Cook's D for each observation. Obs. with small residuals get a higher weight; any obs. with a Cook's distance greater than 1 (severe influence) is dropped.
• Using the Stata defaults, robust regression is about 95% as efficient as OLS (Hamilton, 1991). In short, the most influential points are dropped, and then cases with large absolute residuals are down-weighted.
• Looking at our example data on women in parliament…
reg ipu_l_sw une_eep ti_cpi logpop

    ipu_l_sw |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
     une_eep |   3.869044   1.299639     2.98   0.004      1.29563    6.442458
      ti_cpi |   .2091336   .0511606     4.09   0.000     .1078304    .3104368
      logpop |   .7506063   .5774238     1.30   0.196    -.3927504    1.893963
       _cons |  -2.813357   6.907481    -0.41   0.685    -16.49086    10.86415

rreg ipu_l_sw une_eep ti_cpi logpop

Robust regression                               Number of obs =     123
                                                F(  3,   119) =    8.31
                                                Prob > F      =  0.0000

    ipu_l_sw |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
     une_eep |   4.050256   1.329502     3.05   0.003     1.417709    6.682803
      ti_cpi |   .2208518   .0523362     4.22   0.000     .1172209    .3244828
      logpop |   1.054711    .590692     1.79   0.077    -.1149176    2.224341
       _cons |   -7.11359   7.066203    -1.01   0.316    -21.10538    6.878198
Short exercise
• Open the 'practicedata' dataset again, and we'll do the same regression as in example 1
• Again, examine scatterplots between the DV and each IV. Run the regression
• Search for outliers:
1. Visual residual-leverage plot: lvr2plot, mlabel(cname)
2. Cook's d
3. Dfbeta (you can look at all 3, or the dfbeta for each IV one at a time if easier):
   e.g. list cname _dfbeta_1 if _dfbeta_1 > 2/sqrt(n)
   (**don't forget to calculate 2/√n first - see the sketch below)
What do you see?
Do any observations break our 4/n Cook's d rule, or the 2/√n dfbeta rule? Which countries are they?
What would you do about this?
Do your adjustments change your regression results?
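If it helps, here is a hedged sketch of how the two cut-offs can be computed right after whatever regression you ran above:

. predict d, cooksd          // Cook's D for each observation
. display 4/e(N)             // Cook's D rule of thumb: 4/n
. display 2/sqrt(e(N))       // DFBETA rule of thumb: 2/sqrt(n)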
ASSUMPTIONS WHOSE VIOLATIONS CONCERN THE ERROR TERM:
-NORMALITY
-HOMOSKEDASTICITY
-NO AUTOCORRELATION
-INDEPENDENCE OF OBSERVATIONS
4. The errors have mean 0 and are normally distributed for all levels of X
Key issues:
1. There is a probability distribution of Y for each level of X. A 'hard' assumption is that this distribution is normal (bell shaped)
2. Given that µy is the mean value of Y, the standard form of the model is
   y = f(x) + ε
   where ε is a random variable with a normal distribution with mean 0 and standard deviation σ.
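Spelled out in LaTeX for the simple bivariate case (this is just a restatement of the assumption above, not a new model):

\[ Y_i = \alpha + \beta_1 X_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \ \text{for every level of } X \]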
Normality of the error terms
• Violations of any of the three former assumptions (1. model specification - linearity, 2. no extreme observations, 3. no strong multicollinearity) could potentially result in bias in the estimated coefficients.
• Violations of the assumptions concerning the residuals (4. absence of autocorrelation, 5. normally distributed residuals and 6. homoskedasticity) do not necessarily affect the estimated coefficients, but they may reduce your ability to perform inference and hypothesis testing. They can affect the coefficients too, so it's always good to check!
• This is because the distribution of the residuals is the foundation for significance tests of the coefficients - it is the distribution that underlies the calculation of t- and p-values. This is especially true for smaller samples.
• A prerequisite in small samples is therefore that the residuals are normally distributed.
Analysis of Residuals
• Always important to do – for several assumptions
• To examine whether the regression model is appropriate for the
data being analyzed, we can check the residual plots.
• Later we can do more ‘advanced’ tests to see if we’ve violated
some assumptions
• Residual plots:
1. histogram of the residuals
2. Scatterplot residuals against the fitted values (y-hat).
3. Scatterplot residuals against the independent variables (x).
4. Scatterplot residuals over time if the data are chronological (more later
in time series analysis).
Plotting the residuals
• Use the academic performance data and regress academic performance (api00) on the % of students on free meals (meals), the % of English language learners (ell), and the % of teachers with emergency credentials (emer)
• use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2
• regress api00 meals ell emer
• Then predict the residuals:
• predict r, resid
• Plot the density of the residuals against a normal bell curve - how closely are they matched?
• kdensity r, normal
• A qnorm plot (plots the quantiles of a variable against the quantiles of a normal distribution)
• qnorm r
[Figures: kernel density plot of the residuals with a normal density overlaid (kernel = epanechnikov, bandwidth = 15.5162) and a qnorm plot of the residuals against the inverse normal]
More 'formal' tests
1. Shapiro-Wilk W test for normality. Tests the proximity of our residual distribution to the normal bell curve. Ho: residuals are normally distributed

. swilk r

                   Shapiro-Wilk W test for normal data

    Variable |       Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
           r |       381    0.99698      0.795     -0.544    0.70691
5. Homoskedasticity
• Homoskedasticity: the error has a constant variance around our regression line
• The opposite of this is:
• Heteroskedasticity: the variance of the error depends on the values of the Xs
What does heteroskedasticity look like?
• Plotting the residuals against X, we should not see the variance change systematically around the fitted line
Consequences
• If you find heteroskedasticity, it will - like multicollinearity - affect the EFFICIENCY of the model.
• The calculation of the standard errors, and thus the p-values, becomes uncertain, since the dispersion of the residuals depends on the level of the variables.
• The effect of X on Y might be very significant at some levels of X and less so at others, which makes an overall significance calculation impossible.
• Heteroskedasticity does not necessarily result in biased parameter estimates, but OLS is no longer BLUE.
• The risk of Type I or Type II error will increase (what are these??)
• E.g. 'false positive' & 'false negative'
How to check for heteroskedasticity
1. A visual plot of the residuals over the fitted values of Y: rvfplot, yline(0)
Here we do not want to see any pattern - just a random, insignificant scattering of dots..
Use the 'academic performance data' and regress academic performance (api00) on the % of students on free meals (meals), the % of English language learners (ell), and the average education of parents (avg_ed)

reg api00 meals ell avg_ed
------------------------------------------------------------------------------
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -3.006897   .1891693   -15.90   0.000    -3.378856   -2.634937
         ell |  -.8303102   .1946917    -4.26   0.000    -1.213128   -.4474925
      avg_ed |   27.65032   6.867322     4.03   0.000     14.14726    41.15337
       _cons |   781.9566   27.12323    28.83   0.000     728.6248    835.2883
------------------------------------------------------------------------------
rvfplot, yline(0)

[Figure: residuals plotted against the fitted values]

• What do we observe?
• It looks kind of random, but the error term seems to narrow as the fitted values get higher..
More 'formal' tests

2. Breusch-Pagan / Cook-Weisberg test
- Regresses the squared errors on the X's
*Good at detecting linear heteroskedasticity, but not non-linear forms
Ho: no heteroskedasticity (constant variance)

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of api00
         chi2(1)      =    12.60
         Prob > chi2  =   0.0004

3. Cameron & Trivedi's IM test
- Similar, but also includes the squared X's in the regression
Ho: no heteroskedasticity
**Both tests are sensitive and will often be significant even with only slight heteroskedasticity…

. estat imtest

Cameron & Trivedi's decomposition of IM-test
---------------------------------------------------
              Source |       chi2     df      p
---------------------+-----------------------------
  Heteroskedasticity |      23.55      9    0.0051
            Skewness |       6.16      3    0.1040
            Kurtosis |       0.39      1    0.5305
---------------------+-----------------------------
               Total |      30.10     13    0.0046
---------------------------------------------------
If we find something, we might check residual plots for individual IVs, and look at the correlations between the IVs and the error:

[Figure: residuals plotted against english language learners (ell)]

. pwcorr r meals ell emer

             |        r    meals      ell     emer
-------------+------------------------------------
           r |   1.0000
       meals |   0.0000   1.0000
         ell |  -0.0000   0.7724   1.0000
        emer |  -0.0000   0.5330   0.4722   1.0000
What to do about this?
• You don't always have to do anything, but if it is severe:
1. Try transforming the X's (non-linear, logged, un-logged) to make the relationship more linear
2. Remove variables that are suspect or insignificant and re-run the regression
3. Add more variables
4. "Weighted least squares" (WLS) regression, where certain observations (maybe those that deviate most?) are weighted less than others, thus affecting the standard errors (see the sketch below)
5. Use a stricter alpha as the significance cut-off for p-values - 0.01 instead of 0.05 - to reduce the risk of Type I error
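A minimal Stata sketch of option 4, using the academic performance model from above; the weight variable w is hypothetical, and how you construct it depends on your application:

. regress api00 meals ell avg_ed [aweight = w]   // weighted least squares via analytic weights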
6) Autocorrelation
The same unobservable forces might be influencing the dependent variable at successive time points
– *It is defined as the correlation between two values of the same variable at times t and t-1
• For example, the factors that predict defense spending/voting/economic development, etc. in 1971 are likely to also predict 1972, and therefore whatever error remains from our estimation of Y in 1971 will persist in 1972
• Can lead to BIASED and/or INEFFICIENT estimates with OLS
6) Autocorrelation
• The problem also occurs when the observations are ordered by geographical location: serial vs. spatial autocorrelation.
• The consequences are quite serious. Positive autocorrelation will tend to increase the variation of the sampling distributions of the estimated coefficients - which can then show up as great variation across different models.
• In a simple model the result would, on the contrary, be an underestimation of the standard errors of the estimates, so we risk concluding that a given coefficient is significant when it is not. The same goes for R², which may also be overestimated.
6) Autocorrelation
• Detection:
• The Durbin-Watson statistic ranges from 0 to 4, where 0 indicates high positive autocorrelation and 4 high negative autocorrelation, while 2 indicates the absence of autocorrelation
– In Stata: estat dwatson (see the sketch below)
• Solutions
• In order to correct for autocorrelation one has to use time-series regression, but in OLS one could consider lagging or differencing the dependent variable (Yt - Yt-1) and thereby removing the non-independent information in the variable (this only works for time-series autocorrelation)
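A minimal sketch of the detection and correction steps in Stata; the variable names y, x and year are hypothetical placeholders:

. tsset year           // declare the time variable
. regress y x
. estat dwatson        // Durbin-Watson statistic (values near 2 suggest no autocorrelation)
. regress y x L.y      // one option: include the lagged DV as a regressor
. regress D.y x        // another option: difference the DV (Yt - Yt-1)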
MORE ON AUTOCORRELATION IN STEFAN'S TIME-SERIES SECTION!
7) Independence of Errors
• This assumption states that the error from one observation is independent of the error from another observation.
• Actually, it is not the dependency by itself that matters; it is whether the errors are correlated that matters..
• Dependency in the errors often occurs in financial and economic time-series data and in cross-country multilevel data (e.g. survey data from multiple countries)
• Multilevel data - affects coefficients & significance
• TSCS data - affects mainly significance
• A Hausman test can be used to assess this.
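A hedged sketch of the standard Stata workflow for a Hausman test, here comparing fixed- and random-effects panel estimates; y, x, country and year are hypothetical placeholders:

. xtset country year
. xtreg y x, fe
. estimates store fixed
. xtreg y x, re
. estimates store random
. hausman fixed random    // Ho: the difference in coefficients is not systematic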
What needs to be considered depends on your data!
1. Model specification - linearity
   - Always important
2. No extreme observations
   - More important in small samples
3. No strong multicollinearity
   - More important in small samples
4. No autocorrelation
   - More important in time-series or cross-section data
5. Errors have zero mean with a normal distribution
   - More important in small samples
6. Errors have constant variance
   - More important in large samples; in small samples, outliers are more severe
7. Observations shall be independent of each other
   - More important in time-series or multi-level cross-section data
Interaction terms
• Back to our discussion about the assumption of 'proper model specification'
• Sometimes our X variable has a non-linear effect due to an interaction with another IV. When testing this, we test a 'conditional hypothesis'
• A conditional hypothesis is simply one in which a relationship between two or more variables depends on the value of one or more other variables.
– Ex. An increase in X is associated with an increase in Y when condition Z is met, but not when condition Z is absent.
– Ex. The effect of education on income is stronger for men than for women
• In technical terms, we compare the following two models:
Additive multiple regression model
Y = α + β1x + β2z + µ
Multiplicative multiple regression model
Y = α + β1x + β2z + β3xz + µ
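In Stata these two models can be estimated with factor-variable notation; y, x and z here are generic placeholders, so this is only a sketch:

. regress y c.x c.z              // additive model
. regress y c.x c.z c.x#c.z      // multiplicative (interaction) model
. regress y c.x##c.z             // shorthand: ## adds both constitutive terms and the interaction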
Types of interaction terms
• Our X variables can take different shapes depending on their measurement
These are the combinations, in order of complexity to interpret:
1. Two dummy variables: ex. gender*unemployed
2. One dummy, one continuous/ordinal variable: ex. gender*age
3. Two continuous/ordinal variables: ex. age*income
The first interaction can also be modelled as 3 dummy variables in relation to one reference category, in this case employed males:
Y = α + β1(female_unemployed) + β2(male_unemployed) + β3(female_employed) + µ
One dummy, one continuous/ordinal: visual example
[Figure: three panels plotting Y against Z, each contrasting X=1 (high) and X=0 (low). In this example Z = 0 and 1.]
Taken from Brambor et al. (2006)
Interaction Interpretations
• Multiplicative multiple regression model
• Y = α + β1x + β2z + β3xz + µ
• When condition Z (a dummy variable) is absent (i.e. Z = 0), the equation above simplifies to:
• Y = α + β1x + µ
• where β1 is the effect of X for observations that take 0 on Z
• When condition Z is present (i.e. Z = 1), the effect of X on Y becomes:
• Y = (α + β2) + (β1 + β3)x + µ
• Now we see that β1 cannot be interpreted independently of β3
4 important points
1. Interaction models should be used whenever the hypothesis you want to test is conditional in nature
2. Include all "constitutive terms". These are just the two variables that make up the interaction (e.g. X and Z)
3. Do not interpret constitutive terms as unconditional marginal effects
4. Calculate substantively meaningful marginal effects and standard errors

Brambor, Thomas, William Roberts Clark, & Matt Golder. 2006. "Understanding Interaction Models: Improving Empirical Analyses." Political Analysis 14: 63-82.
Include All Constitutive Terms
• No matter what form the interaction term takes, all constitutive terms should be included. Thus, X should be included when the interaction term is X², and X, Z, J, XZ, XJ, and ZJ should be included when the interaction term is XZJ.
• b1 does not represent the average effect of X on Y; it only indicates the effect of X when Z is zero
• b2 does not represent the average effect of Z on Y; it only indicates the effect of Z when X is zero
• Excluding X or Z is equivalent to assuming that b1 or b2, respectively, is zero.
Taken from Brambor et al. (2006)
Include All Constitutive Terms
The constitutive term for Z (b2) captures the difference in the intercepts between the regression line for the case in which condition Z is present and the line for the case in which condition Z is absent - omitting Z amounts to constraining the two regression lines to meet on the Y axis.
Taken from Brambor et al. (2006)
Multicollinearity
• Just as we discussed with the quadratic term, the coefficients in interaction models no longer indicate the average effect of a variable as they do in an additive model. As a result, they are almost certain to change with the inclusion of an interaction term, and this should not be interpreted as a sign of multicollinearity.
• Even if there really is high multicollinearity and this leads to large standard errors on the model parameters, it is important to remember that these standard errors are never in any sense "too" large - they are always the "correct" standard errors.
• High multicollinearity simply means that there is not enough information in the data to estimate the model parameters accurately, and the standard errors rightfully reflect this.
Multicollinearity
• 'Solutions' have been posited: re-scaling the variables, 'centering'
• Centering the IVs around their means does not solve the problem (Aiken and West 1991)
• Regardless of the complexity of the regression equation, centering has no effect at all on the coefficients of the highest-order terms, but it may drastically change those of the lower-order terms in the equation.
• Centering unstandardized IVs usually does not affect anything of interest. Simple slopes will be the same in centered as in un-centered equations, their standard errors and t-tests will be the same, and interaction plots will look exactly the same, but with different values on the x-axis.
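For reference, mean-centering an IV in Stata looks like this (x is a hypothetical variable; as noted above, this may change the lower-order coefficients but not the highest-order term):

. summarize x, meanonly
. generate x_c = x - r(mean)     // mean-centered version of x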
3. Do Not Interpret Constitutive Terms as Unconditional Marginal Effects
• When we have an interaction, the effect of the independent variable X on the dependent variable Y depends on some third variable Z (and vice versa).
• The coefficient on X only captures the effect of X on Y when Z is zero. Similarly, the coefficient on Z only captures the effect of Z on Y when X is zero.
• It is, therefore, incorrect to say that a positive and significant coefficient on X (or Z) indicates that an increase in X (or Z) is expected to lead to an increase in Y.
• Also, whether X modifies Z or vice versa cannot be determined by the model, only by the researcher and the theory behind it!
4. Calculate Substantively Meaningful Marginal Effects and Standard Errors
Typical results tables report only the marginal effect of X when the conditioning variable is zero, i.e., b1. Similarly, Stata tables report only the standard error for this particular effect. As a result, the only inference we can draw is whether X has a significant effect on Y when Z = 0.
Basically, we want to know WHERE and HOW MUCH Z conditions X's effect on Y, and at what significance level.
Results tables are often quite uninformative in this respect. Even a 'significant' interaction coefficient might not be that interesting, while an 'insignificant' one can still imply an effect that is significant at certain levels of Z (or X).
This is where the margins command in Stata is very helpful (help margins)
Example: back to explaining % women in parliament
• This time let's try a few different variables:
• IVs - level of democracy (0-10) and the % of Protestants in a country (0-100)

reg ipu_l_sw c.fh_polity2 c.lp_protmg80

      Source |       SS           df       MS      Number of obs   =       152
-------------+----------------------------------   F(2, 149)       =     12.59
       Model |  2248.21552         2  1124.10776   Prob > F        =    0.0000
    Residual |  13306.2439       149  89.3036502   R-squared       =    0.1445
-------------+----------------------------------   Adj R-squared   =    0.1331
       Total |  15554.4594       151  103.009665   Root MSE        =    9.4501

------------------------------------------------------------------------------
    ipu_l_sw |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  fh_polity2 |   .6576315    .263696     2.49   0.014     .1365647    1.178698
 lp_protmg80 |   .1399659   .0419661     3.34   0.001     .0570404    .2228914
       _cons |   10.65832   1.767326     6.03   0.000     7.166058    14.15058
------------------------------------------------------------------------------
Example: back to explaining % women in parliament
• This time let's try a few different variables:
• IVs - level of democracy (0-10) and the % of Protestants in a country (0-100) - now w/ interaction
• What does this tell us, generally speaking?

reg ipu_l_sw c.fh_polity2 c.lp_protmg80 c.lp_protmg80#c.fh_polity2

      Source |       SS           df       MS      Number of obs   =       152
-------------+----------------------------------   F(3, 148)       =     12.30
       Model |  3104.63365         3  1034.87788   Prob > F        =    0.0000
    Residual |  12449.8258       148  84.1204443   R-squared       =    0.1996
-------------+----------------------------------   Adj R-squared   =    0.1834
       Total |  15554.4594       151  103.009665   Root MSE        =    9.1717

--------------------------------------------------------------------------------------------
                  ipu_l_sw |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------------------+----------------------------------------------------------------
                fh_polity2 |   .3311227   .2756287     1.20   0.232    -.2135533    .8757987
               lp_protmg80 |   -.397334    .173249    -2.29   0.023    -.7396953   -.0549728
c.lp_protmg80#c.fh_polity2 |   .0607897   .0190519     3.19   0.002     .0231409    .0984386
                     _cons |   13.24061   1.896612     6.98   0.000     9.492677    16.98855
--------------------------------------------------------------------------------------------
Using margins for interpretation
• We can show this interaction a number of ways:
• 1. The marginal effect (of a 1-unit increase) of democracy over a range of % Protestant

margins, dydx(fh_polity2) at(lp_protmg80=(0 13 52 97))

• Where 'dydx' means we want to see a marginal effect (ΔY from a 1-unit increase in X)
• The numbers (0 13 52 97) after the % Protestant variable are just the min, mean, mean + 2 s.d. and max values. I got these from just doing the 'sum' command
• To see a visual plot, just type

marginsplot
[Figure (marginsplot): Average Marginal Effects of fh_polity2 with 95% CIs, at Religion: Protestant = 0, 13, 52, 97]
Using margins for interpretation
• 2. Compare predicted levels of % women in parliament for 2 'meaningful' values of democracy over a range of % Protestant

margins, at(lp_protmg80=(0 13 52 97) fh_polity2=(0 10))

• Note that the 'modifying variable' (e.g. the one on the x-axis) goes 1st after 'at'.
• The numbers (0 13 52 97) after the % Protestant variable are just the min, mean, mean + 2 s.d. and max values. 0 and 10 for democracy are just the min and max values. I got these from just doing the 'sum' command
• This is also what you'd do if you had a binary variable in the interaction (e.g. instead of 0 and 10, just type 0 1)…
• To see a visual plot, just type

marginsplot
[Figure (marginsplot): Adjusted Predictions with 95% CIs for fh_polity2 = 0 and fh_polity2 = 10, at Religion: Protestant = 0, 13, 52, 97]
We will do more in the next section with
this command!
Next time: models for limited
dependent variables: logit and probit