Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Sociology 601 Class 21: November 10, 2009 • Review – formulas for b and se(b) – stata regression commands & output • Violations of Model Assumptions, and their effects (9.6) • Causality (10) 1 Formulas for b, a, r, and se(b) (X X )(Y Y ) sx b ; a Y bX ;r b 2 (X X ) sy Yˆ a bX; SSE (Y Yˆ ) SSE n 2 se(b) sx n 1 2 2 Stata Example of Inference about a Slope . summarize murder poverty Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------murder | 51 8.727451 10.71758 1.6 78.5 poverty | 51 14.25882 4.584242 8 26.4 . regress murder poverty Source | SS df MS Number of obs = 51 -------------+-----------------------------F( 1, 49) = 23.08 Model | 1839.06931 1 1839.06931 Prob > F = 0.0000 Residual | 3904.25223 49 79.6786169 R-squared = 0.3202 -------------+-----------------------------Adj R-squared = 0.3063 Total | 5743.32154 50 114.866431 Root MSE = 8.9263 -----------------------------------------------------------------------------murder | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------poverty | 1.32296 .2753711 4.80 0.000 .7695805 1.876339 _cons | -10.1364 4.120616 -2.46 0.017 -18.41708 -1.855707 ----------------------------------------------------------------------------3 Stata Example of Inference about a Slope . correlate murder poverty (obs=51) | murder poverty -------------+-----------------murder | 1.0000 poverty | 0.5659 1.0000 . correlate murder poverty, covariance (obs=51) | murder poverty -------------+-----------------murder | 114.866 poverty | 27.8024 21.0153 sqrt(114.866) = 14.26 = sd(y); sqrt (21.0153) = 8.73 = sd(x) 4 Alternative Formula for b (X X )(Y Y ) b 2 (X X ) (X X )(Y Y ) /(N 1) 2 (X X ) /(N 1) cov ariance(x, y) var iance(x) b = 27.8024 / 21.0153 = 1.323 5 Stata Example of Inference about a Slope scatter murder poverty || lfit murder poverty 6 Stata Example of Inference about a Slope . regress murder poverty if state!="DC" Source | SS df MS Number of obs = 50 -------------+-----------------------------F( 1, 48) = 31.36 Model | 307.342297 1 307.342297 Prob > F = 0.0000 Residual | 470.406476 48 9.80013492 R-squared = 0.3952 -------------+-----------------------------Adj R-squared = 0.3826 Total | 777.748773 49 15.8724239 Root MSE = 3.1305 -----------------------------------------------------------------------------murder | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------poverty | .5842405 .104327 5.60 0.000 .3744771 .7940039 _cons | -.8567153 1.527798 -0.56 0.578 -3.92856 2.215129 ------------------------------------------------------------------------------ 7 Assumptions Needed to make Population Inferences for slopes. • The sample is selected randomly. • X and Y are interval scale variables. • The mean of Y is related to X by the linear equation E{Y} = + X. • The conditional standard deviation of Y is identical at each X value. (no heteroscedasticity) • The conditional distribution of Y at each value of X is normal. • There is no error in the measurement of X. 8 Common Ways to Violate These Assumptions • • The sample is selected randomly. o Cluster sampling (e.g., census tracts / neighborhoods) causes observations in any cluster to be more similar than to observations outside the cluster. o Autocorrelation (spatial and temporal) o Two or more siblings in the same family. o Sample = populations (e.g., states in the U.S.) X and Y are interval scale variables. o Ordinal scale attitude measures o Nominal scale categories (e.g., race/ethnicity, religion) 9 Common Ways to Violate These Assumptions (2) • • The mean of Y is related to X by the linear equation E{Y} = + X. o U-shape: e.g., Kuznets inverted-U curve (inequality <- GDP/capita) o Thresholds: o Logarithmic (e.g., earnings <- education) The conditional standard deviation of Y is identical at each X value. (no heteroscedasticity) o earnings <- education o hours worked <- years o adult child occupational status <- parental occupational status 10 Common Ways to Violate These Assumptions (3) • The conditional distribution of Y at each value of X is normal. o earnings (skewed) <- education o Y is binary o Y is a % • There is no error in the measurement of X. o almost everything o what is the effect of measurement error in x on b? 11 Things to watch out for: extrapolation. Extrapolation beyond observed values of X is dangerous. • The pattern may be nonlinear. • Even if the pattern is linear, the standard errors become increasingly wide. • Be especially careful interpreting the Y-intercept: it may lie outside the observed data. o e.g., year zero o e.g., zero education in the U.S. o e.g., zero parity 12 Things to watch out for: outliers • Influential observations and outliers may unduly influence the fit of the model. • The slope and standard error of the slope may be affected by influential observations. • This is an inherent weakness of least squares regression. • You may wish to evaluate two models; one with and one without the influential observations. 13 Things to watch out for: truncated samples Truncated samples cause the opposite problems of influential observations and outliers. • Truncation on the X axis reduces the correlation coefficient for the remaining data. • Truncation on the Y axis is a worse problem, because it violates the assumption of normally distributed errors. •Examples: Topcoded income data, health as measured by number of days spent in a hospital in a year. 14 Causality • We never prove that x causes y • Research and theory make it increasingly likely • Criteria: • association • time order • no alternative explanations • is the relationship spurious? 15 Alternative Explanations Example: Neighborhood poverty -> Low Test Scores 16 Alternative Explanations Example: Neighborhood poverty -> Low Test Scores Possible solutions: • multivariate models • e.g., control for parents’ education, income • controls for other measureable differences • fixed effects models • e.g., changes in poverty -> changes in test scores • controls for constant, unmeasured differences • instrumental variables • find an instrument that affects x1 but not y • experiments • e.g., Moving to Opportunity • randomize increases in $ 17 Alternative Explanations Example: Fertility -> Lower Mothers’ LFP Possible solutions: 18 Alternative Explanations Example: Fertility -> Lower Mothers’ LFP Possible solutions: • multivariate models • e.g., control for gender attitudes • controls for other measureable differences • fixed effects models • e.g., changes in # children -> dropping out • controls for constant, unmeasured differences • instrumental variables • find an instrument that affects x1 but not y • e.g., mothers of two same sex children • experiments • not feasible (or ethical) 19 Types of 3-variable Causal Models • Spurious • x2 causes both x1 and y • e.g., religion causes fertility and women’s lfp • Intervening • x1 causes x2 which causes y • e.g., fertility raises time spent on children which lowers time in the labor force • What is the statistical difference between these? 20 Another type of 3-varaible relationship: Statistical Interaction Effects Example: Fertility -> Lower Mothers’ LFP The relationship between x1 and y depends on the value of another variable, x2 • e.g., marital status -> earnings depends on gender 21