Download Lecture notes for 11/21/00 - University of Maryland

Sociology 601 Class 21: November 10, 2009
• Review
– formulas for b and se(b)
– stata regression commands & output
• Violations of Model Assumptions, and their effects (9.6)
• Causality (10)
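Formulas for b, a, r, and se(b)
$$b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}; \quad a = \bar{Y} - b\bar{X}; \quad r = b\,\frac{s_x}{s_y}$$
$$\hat{Y} = a + bX; \quad SSE = \sum (Y - \hat{Y})^2$$
$$se(b) = \frac{\sqrt{SSE/(n-2)}}{s_x\sqrt{n-1}}$$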
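The formulas for b, a, r, and se(b) can be checked by hand on a tiny data set. The sketch below is only an illustration: the function name `simple_ols` and the five data points are made up for this example and do not come from the lecture.

```python
import math


def simple_ols(x, y):
    """Return slope b, intercept a, correlation r, and se(b) for y regressed on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)          # sum of (X - Xbar)^2
    syy = sum((yi - y_bar) ** 2 for yi in y)          # sum of (Y - Ybar)^2
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b = sxy / sxx
    a = y_bar - b * x_bar
    s_x = math.sqrt(sxx / (n - 1))                    # sample sd of X
    s_y = math.sqrt(syy / (n - 1))                    # sample sd of Y
    r = b * s_x / s_y

    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(sse / (n - 2)) / (s_x * math.sqrt(n - 1))
    return b, a, r, se_b


# Made-up illustrative data (an assumption, not the murder/poverty data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b, a, r, se_b = simple_ols(x, y)
print(b, a, r, se_b)
```

For these five points the slope works out to about 0.6 and the intercept to about 2.2, which can be confirmed with pencil and paper.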
Stata Example of Inference about a Slope
. summarize murder poverty

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      murder |        51    8.727451    10.71758        1.6       78.5
     poverty |        51    14.25882    4.584242          8       26.4

. regress murder poverty

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  1,    49) =   23.08
       Model |  1839.06931     1  1839.06931           Prob > F      =  0.0000
    Residual |  3904.25223    49  79.6786169           R-squared     =  0.3202
-------------+------------------------------           Adj R-squared =  0.3063
       Total |  5743.32154    50  114.866431           Root MSE      =  8.9263

------------------------------------------------------------------------------
      murder |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |    1.32296   .2753711     4.80   0.000     .7695805    1.876339
       _cons |   -10.1364   4.120616    -2.46   0.017    -18.41708   -1.855707
------------------------------------------------------------------------------
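Several entries in the `regress murder poverty` output are simple functions of the sums of squares, so they can be recomputed as a consistency check. The sketch below copies the numbers from that output; the variable names are chosen here for illustration.

```python
import math

# Values copied from the "regress murder poverty" output (all 51 observations).
ss_model, df_model = 1839.06931, 1
ss_resid, df_resid = 3904.25223, 49
coef, se = 1.32296, 0.2753711          # slope for poverty and its standard error

ss_total = ss_model + ss_resid
ms_resid = ss_resid / df_resid         # residual mean square

r_squared = ss_model / ss_total
adj_r_squared = 1 - (1 - r_squared) * (df_model + df_resid) / df_resid
f_stat = (ss_model / df_model) / ms_resid
root_mse = math.sqrt(ms_resid)
t_stat = coef / se                     # with one predictor, t squared equals F

print(round(r_squared, 4), round(adj_r_squared, 4))
print(round(f_stat, 2), round(root_mse, 4), round(t_stat, 2))
```

With a single predictor the F statistic is the square of the slope's t statistic, which is why both tests give the same p-value.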
Stata Example of Inference about a Slope
. correlate murder poverty
(obs=51)

             |   murder  poverty
-------------+------------------
      murder |   1.0000
     poverty |   0.5659   1.0000

. correlate murder poverty, covariance
(obs=51)

             |   murder  poverty
-------------+------------------
      murder |  114.866
     poverty |  27.8024  21.0153
sqrt(114.866) = 10.72 = sd(y);
sqrt(21.0153) = 4.58 = sd(x)
Alternative Formula for b
$$b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{\sum (X - \bar{X})(Y - \bar{Y})/(N-1)}{\sum (X - \bar{X})^2/(N-1)} = \frac{\mathrm{covariance}(x, y)}{\mathrm{variance}(x)}$$
b = 27.8024 / 21.0153 = 1.323
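The covariance form of the slope can be checked directly against the covariance matrix reported by `correlate murder poverty, covariance`. The short sketch below also recovers the correlation from the same three numbers; the variable names are chosen here for illustration.

```python
import math

# Entries of the covariance matrix from "correlate murder poverty, covariance".
var_y = 114.866    # variance of murder (y)
cov_xy = 27.8024   # covariance of murder and poverty
var_x = 21.0153    # variance of poverty (x)

b = cov_xy / var_x                 # slope = covariance(x, y) / variance(x)
sd_x = math.sqrt(var_x)
sd_y = math.sqrt(var_y)
r = cov_xy / (sd_x * sd_y)         # correlation = covariance / (sd_x * sd_y)

print(round(b, 3), round(sd_x, 2), round(sd_y, 2), round(r, 4))
```

The results agree with the slope in the regression output, the standard deviations in the `summarize` output, and the correlation in the `correlate` output.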
Stata Example of Inference about a Slope
scatter murder poverty || lfit murder poverty
Stata Example of Inference about a Slope
. regress murder poverty if state!="DC"

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  1,    48) =   31.36
       Model |  307.342297     1  307.342297           Prob > F      =  0.0000
    Residual |  470.406476    48  9.80013492           R-squared     =  0.3952
-------------+------------------------------           Adj R-squared =  0.3826
       Total |  777.748773    49  15.8724239           Root MSE      =  3.1305

------------------------------------------------------------------------------
      murder |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |   .5842405    .104327     5.60   0.000     .3744771    .7940039
       _cons |  -.8567153   1.527798    -0.56   0.578     -3.92856    2.215129
------------------------------------------------------------------------------
Assumptions Needed to Make Population Inferences for Slopes
• The sample is selected randomly.
• X and Y are interval scale variables.
• The mean of Y is related to X by the linear equation E{Y} = α + βX.
• The conditional standard deviation of Y is identical at each X value. (no heteroscedasticity)
• The conditional distribution of Y at each value of X is normal.
• There is no error in the measurement of X.
Common Ways to Violate These Assumptions
• The sample is selected randomly.
  o Cluster sampling (e.g., census tracts / neighborhoods) makes observations within a cluster more similar to each other than to observations outside the cluster.
  o Autocorrelation (spatial and temporal)
  o Two or more siblings in the same family.
  o Sample = population (e.g., states in the U.S.)
• X and Y are interval scale variables.
  o Ordinal scale attitude measures
  o Nominal scale categories (e.g., race/ethnicity, religion)
Common Ways to Violate These Assumptions (2)
• The mean of Y is related to X by the linear equation E{Y} = α + βX.
  o U-shape: e.g., Kuznets inverted-U curve (inequality <- GDP/capita)
  o Thresholds:
  o Logarithmic (e.g., earnings <- education)
• The conditional standard deviation of Y is identical at each X value. (no heteroscedasticity)
  o earnings <- education
  o hours worked <- years
  o adult child occupational status <- parental occupational status
Common Ways to Violate These Assumptions (3)
• The conditional distribution of Y at each value of X is normal.
  o earnings (skewed) <- education
  o Y is binary
  o Y is a %
• There is no error in the measurement of X.
  o almost everything
  o what is the effect of measurement error in x on b?
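One way to explore the question about measurement error in x is a small simulation. The sketch below assumes classical measurement error (random noise added to x, unrelated to everything else); the true slope of 2.0, the noise variances, the seed, and all names are arbitrary choices for illustration. Under these assumptions the estimated slope is pulled toward zero by roughly the factor var(x) / (var(x) + var(error)).

```python
import random


def slope(x, y):
    """Least squares slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(601)   # fixed seed so the run is repeatable
n = 20000

true_x = [rng.gauss(0, 1) for _ in range(n)]              # x without error, variance 1
y = [2.0 * xi + rng.gauss(0, 1) for xi in true_x]         # true slope is 2.0
noisy_x = [xi + rng.gauss(0, 1) for xi in true_x]         # x observed with error, error variance 1

b_true_x = slope(true_x, y)     # should be near the true slope
b_noisy_x = slope(noisy_x, y)   # attenuated: near 2.0 * 1 / (1 + 1)

print(b_true_x, b_noisy_x)
```

In this setup the slope estimated from the error-laden x is about half the true slope, so random error in x biases b toward zero rather than simply adding noise.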
Things to watch out for: extrapolation
Extrapolation beyond the observed values of X is dangerous.
• The pattern may be nonlinear.
• Even if the pattern is linear, the standard errors of predictions grow larger the farther X is from the observed data.
• Be especially careful interpreting the Y-intercept: X = 0 may lie outside the observed data.
  o e.g., year zero
  o e.g., zero education in the U.S.
  o e.g., zero parity
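The growth in uncertainty away from the data can be illustrated with the murder and poverty numbers. The sketch below uses the standard textbook formula for the standard error of the estimated mean of Y at a given value x0, namely s * sqrt(1/n + (x0 - mean of X)^2 / sum of (X - mean of X)^2); that formula does not appear on the slides and is brought in here as an assumption. The function name `se_fitted_mean` is made up for this example.

```python
import math

# Numbers copied from the summarize, correlate, and regress output (51 observations).
root_mse = 8.9263        # s, the Root MSE
n = 51
x_bar = 14.25882         # mean of poverty
var_x = 21.0153          # variance of poverty
sxx = (n - 1) * var_x    # sum of squared deviations of X from its mean


def se_fitted_mean(x0):
    """Standard error of the estimated mean of Y at X = x0 (standard formula)."""
    return root_mse * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)


se_at_mean = se_fitted_mean(x_bar)   # smallest possible, at the mean of X
se_at_zero = se_fitted_mean(0.0)     # X = 0 is below the observed minimum of 8

print(round(se_at_mean, 3), round(se_at_zero, 3))
```

The value at X = 0 reproduces the standard error of `_cons` in the regression output (about 4.12), more than three times the standard error at the mean of X. The observed poverty rates start at 8, so the intercept is itself an extrapolation.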
Things to watch out for: outliers
• Influential observations and outliers may unduly influence the fit of the model.
• The slope and standard error of the slope may be affected by influential observations.
• This is an inherent weakness of least squares regression.
• You may wish to evaluate two models: one with and one without the influential observations.
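Fitting the model with and without a suspect observation takes only a few lines. The sketch below uses ten made-up points, nine ordinary ones plus one extreme case, loosely in the spirit of the DC example; the data and the function name `fit_line` are assumptions for illustration only.

```python
def fit_line(x, y):
    """Return the least squares intercept a and slope b for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Made-up data: the last point is extreme on both X and Y.
x = [8, 9, 10, 11, 12, 13, 14, 15, 16, 26]
y = [3, 4, 4, 5, 5, 6, 6, 7, 7, 78]

a_all, b_all = fit_line(x, y)                # all ten observations
a_trim, b_trim = fit_line(x[:-1], y[:-1])    # dropping the extreme observation

print(b_all, b_trim)
```

A single point that is extreme on both axes is enough to multiply the slope several times over, which is the pattern seen when DC is included in or excluded from the murder regression.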
Things to watch out for: truncated samples
Truncated samples cause the opposite problem from influential observations and outliers.
• Truncation on the X axis reduces the correlation coefficient for the remaining data.
• Truncation on the Y axis is a worse problem, because it violates the assumption of normally distributed errors.
• Examples: top-coded income data; health as measured by number of days spent in a hospital in a year.
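The effect of truncation on the X axis can be seen in a simulation. The sketch below generates data from a linear model with normal errors, then keeps only the cases whose X falls in a narrow band; the model, the cutoff of 0.5, the seed, and the names are arbitrary assumptions for illustration.

```python
import math
import random


def corr(x, y):
    """Pearson correlation of x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(21)   # fixed seed so the run is repeatable
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]   # linear relationship plus normal error

r_full = corr(x, y)

# Truncate on X: keep only observations with X in a narrow range.
kept = [(xi, yi) for xi, yi in zip(x, y) if abs(xi) < 0.5]
r_trunc = corr([p[0] for p in kept], [p[1] for p in kept])

print(r_full, r_trunc)
```

The relationship between X and Y is the same in both samples, but the restricted range of X leaves less variation in Y to be explained by X, so the correlation drops sharply.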
Causality
• We never prove that x causes y
• Research and theory make it increasingly likely
• Criteria:
  o association
  o time order
  o no alternative explanations
  o is the relationship spurious?
Alternative Explanations
Example: Neighborhood poverty -> Low Test Scores
Possible solutions:
• multivariate models
  o e.g., control for parents’ education, income
  o controls for other measurable differences
• fixed effects models
  o e.g., changes in poverty -> changes in test scores
  o controls for constant, unmeasured differences
• instrumental variables
  o find an instrument that affects x1 but has no direct effect on y
• experiments
  o e.g., Moving to Opportunity
  o randomize increases in $
Alternative Explanations
Example: Fertility -> Lower Mothers’ LFP
Possible solutions:
• multivariate models
  o e.g., control for gender attitudes
  o controls for other measurable differences
• fixed effects models
  o e.g., changes in # children -> dropping out
  o controls for constant, unmeasured differences
• instrumental variables
  o find an instrument that affects x1 but has no direct effect on y
  o e.g., mothers of two same-sex children
• experiments
  o not feasible (or ethical)
Types of 3-variable Causal Models
• Spurious
  o x2 causes both x1 and y
  o e.g., religion causes fertility and women’s LFP
• Intervening
  o x1 causes x2 which causes y
  o e.g., fertility raises time spent on children, which lowers time in the labor force
• What is the statistical difference between these?
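A simulation gives one possible angle on the question of how the two models differ statistically. The sketch below builds one data set from a spurious structure and one from an intervening structure, then compares the raw x1-y correlation with the partial correlation controlling for x2. All coefficients, the seed, and the names are arbitrary assumptions, and the conclusion drawn is a suggestion rather than the lecturer's stated answer.

```python
import math
import random


def corr(u, v):
    """Pearson correlation of u and v."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def partial_corr(x1, y, x2):
    """Correlation of x1 and y after linearly controlling for x2."""
    r1y, r12, r2y = corr(x1, y), corr(x1, x2), corr(x2, y)
    return (r1y - r12 * r2y) / math.sqrt((1 - r12 ** 2) * (1 - r2y ** 2))


rng = random.Random(10)   # fixed seed so the run is repeatable
n = 20000

# Spurious: x2 causes both x1 and y.
s_x2 = [rng.gauss(0, 1) for _ in range(n)]
s_x1 = [v + rng.gauss(0, 1) for v in s_x2]
s_y = [v + rng.gauss(0, 1) for v in s_x2]

# Intervening: x1 causes x2, which in turn causes y.
i_x1 = [rng.gauss(0, 1) for _ in range(n)]
i_x2 = [v + rng.gauss(0, 1) for v in i_x1]
i_y = [v + rng.gauss(0, 1) for v in i_x2]

spurious_raw = corr(s_x1, s_y)
spurious_partial = partial_corr(s_x1, s_y, s_x2)
intervening_raw = corr(i_x1, i_y)
intervening_partial = partial_corr(i_x1, i_y, i_x2)

print(spurious_raw, spurious_partial)
print(intervening_raw, intervening_partial)
```

In both simulated structures x1 and y are clearly correlated, and in both the association essentially disappears once x2 is controlled. This suggests the two models cannot be told apart from the statistics alone; the distinction rests on time order and theory.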
Another type of 3-variable relationship:
Statistical Interaction Effects
Example: Fertility -> Lower Mothers’ LFP
The relationship between x1 and y depends on the value of another variable, x2
• e.g., marital status -> earnings depends on gender
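A statistical interaction can be illustrated by fitting the slope of y on x1 separately within each category of x2. The sketch below simulates data in which the slope is 1.0 when x2 = 0 and 3.0 when x2 = 1; the coefficients, the seed, and the names are arbitrary assumptions and do not correspond to any data from the lecture.

```python
import random


def slope(x, y):
    """Least squares slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(2009)   # fixed seed so the run is repeatable
n = 10000

x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.randint(0, 1) for _ in range(n)]   # binary moderator, 0 or 1

# The effect of x1 on y is 1.0 when x2 = 0 and 1.0 + 2.0 = 3.0 when x2 = 1.
y = [1.0 * a + 2.0 * a * g + rng.gauss(0, 1) for a, g in zip(x1, x2)]

group0 = [(a, v) for a, g, v in zip(x1, x2, y) if g == 0]
group1 = [(a, v) for a, g, v in zip(x1, x2, y) if g == 1]

b_group0 = slope([p[0] for p in group0], [p[1] for p in group0])
b_group1 = slope([p[0] for p in group1], [p[1] for p in group1])

print(b_group0, b_group1)
```

A single pooled slope would average over the two groups and describe neither well, which is why an interaction calls for group-specific slopes or a product term in the model.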