Download Lecture notes for 11/21/00 - University of Maryland

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts

Regression analysis wikipedia, lookup

Linear regression wikipedia, lookup

Data assimilation wikipedia, lookup

Choice modelling wikipedia, lookup

Time series wikipedia, lookup

Least squares wikipedia, lookup

Coefficient of determination wikipedia, lookup

Transcript
Sociology 601 Class 21: November 10, 2009
• Review
– formulas for b and se(b)
– stata regression commands & output
• Violations of Model Assumptions, and their effects (9.6)
• Causality (10)
1
Formulas for b, a, r, and se(b)
(X  X )(Y  Y )
sx
b
;
a

Y

bX
;r

b
2
(X  X )
sy
Yˆ  a  bX;
SSE  (Y  Yˆ )
SSE
n

2
se(b) 
sx
n 1
2
2
Stata Example of Inference about a Slope
. summarize murder poverty
Variable |
Obs
Mean Std. Dev.
Min
Max
-------------+-------------------------------------------------------murder |
51 8.727451 10.71758
1.6
78.5
poverty |
51 14.25882 4.584242
8
26.4
. regress murder poverty
Source |
SS
df
MS
Number of obs =
51
-------------+-----------------------------F( 1, 49) = 23.08
Model | 1839.06931 1 1839.06931
Prob > F
= 0.0000
Residual | 3904.25223 49 79.6786169
R-squared = 0.3202
-------------+-----------------------------Adj R-squared = 0.3063
Total | 5743.32154 50 114.866431
Root MSE
= 8.9263
-----------------------------------------------------------------------------murder |
Coef. Std. Err.
t P>|t| [95% Conf. Interval]
-------------+---------------------------------------------------------------poverty | 1.32296 .2753711 4.80 0.000 .7695805 1.876339
_cons | -10.1364 4.120616 -2.46 0.017 -18.41708 -1.855707
----------------------------------------------------------------------------3
Stata Example of Inference about a Slope
. correlate murder poverty
(obs=51)
| murder poverty
-------------+-----------------murder | 1.0000
poverty | 0.5659 1.0000
. correlate murder poverty, covariance
(obs=51)
| murder poverty
-------------+-----------------murder | 114.866
poverty | 27.8024 21.0153
sqrt(114.866) = 14.26 = sd(y);
sqrt (21.0153) = 8.73 = sd(x)
4
Alternative Formula for b
(X  X )(Y  Y )
b
2
(X  X )
(X  X )(Y  Y ) /(N 1)

2
(X  X ) /(N 1)
cov ariance(x, y)

var iance(x)
b = 27.8024 / 21.0153 = 1.323
5
Stata Example of Inference about a Slope
scatter murder poverty || lfit murder poverty
6
Stata Example of Inference about a Slope
. regress murder poverty if state!="DC"
Source |
SS
df
MS
Number of obs =
50
-------------+-----------------------------F( 1, 48) = 31.36
Model | 307.342297 1 307.342297
Prob > F
= 0.0000
Residual | 470.406476 48 9.80013492
R-squared = 0.3952
-------------+-----------------------------Adj R-squared = 0.3826
Total | 777.748773 49 15.8724239
Root MSE
= 3.1305
-----------------------------------------------------------------------------murder |
Coef. Std. Err.
t P>|t| [95% Conf. Interval]
-------------+---------------------------------------------------------------poverty | .5842405 .104327 5.60 0.000 .3744771 .7940039
_cons | -.8567153 1.527798 -0.56 0.578 -3.92856 2.215129
------------------------------------------------------------------------------
7
Assumptions Needed to make Population Inferences for
slopes.
• The sample is selected randomly.
• X and Y are interval scale variables.
• The mean of Y is related to X by the linear equation
E{Y} =  + X.
• The conditional standard deviation of Y is identical at
each X value. (no heteroscedasticity)
• The conditional distribution of Y at each value of X is
normal.
• There is no error in the measurement of X.
8
Common Ways to Violate These Assumptions
•
•
The sample is selected randomly.
o
Cluster sampling (e.g., census tracts / neighborhoods) causes
observations in any cluster to be more similar than to observations
outside the cluster.
o
Autocorrelation (spatial and temporal)
o
Two or more siblings in the same family.
o
Sample = populations (e.g., states in the U.S.)
X and Y are interval scale variables.
o
Ordinal scale attitude measures
o
Nominal scale categories (e.g., race/ethnicity, religion)
9
Common Ways to Violate These Assumptions (2)
•
•
The mean of Y is related to X by the linear equation
E{Y} =  + X.
o
U-shape: e.g., Kuznets inverted-U curve (inequality <- GDP/capita)
o
Thresholds:
o
Logarithmic (e.g., earnings <- education)
The conditional standard deviation of Y is identical at each
X value. (no heteroscedasticity)
o
earnings <- education
o
hours worked <- years
o
adult child occupational status <- parental occupational status
10
Common Ways to Violate These Assumptions (3)
•
The conditional distribution of Y at each value of X is
normal.
o
earnings (skewed) <- education
o
Y is binary
o
Y is a %
• There is no error in the measurement of X.
o
almost everything
o
what is the effect of measurement error in x on b?
11
Things to watch out for: extrapolation.
Extrapolation beyond observed values of X is dangerous.
• The pattern may be nonlinear.
• Even if the pattern is linear, the standard errors become
increasingly wide.
• Be especially careful interpreting the Y-intercept: it may lie
outside the observed data.
o e.g., year zero
o e.g., zero education in the U.S.
o e.g., zero parity
12
Things to watch out for: outliers
• Influential observations and outliers may unduly influence
the fit of the model.
•
The slope and standard error of the slope may be affected
by influential observations.
•
This is an inherent weakness of least squares regression.
•
You may wish to evaluate two models; one with and one
without the influential observations.
13
Things to watch out for: truncated samples
Truncated samples cause the opposite problems of influential
observations and outliers.
•
Truncation on the X axis reduces the correlation coefficient
for the remaining data.
•
Truncation on the Y axis is a worse problem, because it
violates the assumption of normally distributed errors.
•Examples: Topcoded income data, health as measured by
number of days spent in a hospital in a year.
14
Causality
• We never prove that x causes y
• Research and theory make it increasingly likely
• Criteria:
• association
• time order
• no alternative explanations
• is the relationship spurious?
15
Alternative Explanations
Example: Neighborhood poverty -> Low Test Scores
16
Alternative Explanations
Example: Neighborhood poverty -> Low Test Scores
Possible solutions:
• multivariate models
• e.g., control for parents’ education, income
• controls for other measureable differences
• fixed effects models
• e.g., changes in poverty -> changes in test scores
• controls for constant, unmeasured differences
• instrumental variables
• find an instrument that affects x1 but not y
• experiments
• e.g., Moving to Opportunity
• randomize increases in $
17
Alternative Explanations
Example: Fertility -> Lower Mothers’ LFP
Possible solutions:
18
Alternative Explanations
Example: Fertility -> Lower Mothers’ LFP
Possible solutions:
• multivariate models
• e.g., control for gender attitudes
• controls for other measureable differences
• fixed effects models
• e.g., changes in # children -> dropping out
• controls for constant, unmeasured differences
• instrumental variables
• find an instrument that affects x1 but not y
• e.g., mothers of two same sex children
• experiments
• not feasible (or ethical)
19
Types of 3-variable Causal Models
• Spurious
• x2 causes both x1 and y
• e.g., religion causes fertility and women’s lfp
• Intervening
• x1 causes x2 which causes y
• e.g., fertility raises time spent on children which
lowers time in the labor force
• What is the statistical difference between these?
20
Another type of 3-varaible relationship:
Statistical Interaction Effects
Example: Fertility -> Lower Mothers’ LFP
The relationship between x1 and y depends on the value of
another variable, x2
• e.g., marital status -> earnings depends on gender
21