Research Design & Analysis 2: Class 19
• Review the least squares regression line and the relation between the regression coefficient and the correlation coefficient
• Changes in regression lines as correlation changes
• Philippe Rushton: (women less brainy than men)
• Correlational and ex post facto designs
• Cautions interpreting causation
• Simpson's paradox

Mental Rotation: Shepard & Metzler

Calculating I.Q. and GPA Correlation

Formulas
If using standard scores and 2 variables [1 IV], the regression coefficient (b) [or raw-score regression weight] = the standardized regression weight (β) = the correlation coefficient (r):
b = β = r

Reminder: Implications of Formulas
If standard scores (z-scores) are plotted, the slope of the least squares regression line = r.
r = the change in S.D. units in Y' (the predicted value of Y) associated with a change of 1 S.D. in X.

For Perfect Correlations (r = ±1.0)
1) Every participant who obtained a given value of X obtained one, and only one, value of Y: there are no differences among Y scores for a given X.
2) Y scores are perfectly predictable from X scores: the data points for a given X are all on top of one another, and all data points fall along the regression line.

Regression Lines: r = 1

For Intermediate Correlations: 0 < r < 1
1) There are different values of Y for each X; however, these different Ys are relatively close in value (the variability in Y associated with a given X is less than the overall variability in Y).
2) Knowing X allows prediction of approximately what Y will be: data points will fall near the regression line but not on it.

Regression Lines: 0 < r < 1

For Zero Correlation: r = 0
1) Y scores are as variable at any given value of X as in the entire sample (i.e., across all Xs).
2) The best prediction of Y, regardless of X, will be the average of Y, and there will be no regression solution.
Regression Lines: r = 0

Implications of Formulas
As the correlation grows less strong, Y' moves less in response to a given change in X (the slope b approaches 0). If r = 0, the best predictor of Y from X is the mean of Y, and the best predictor of X from Y is the mean of X. If r = ±1.0, then the regression line from regressing Y on X and the regression line from regressing X on Y are the same.

Implications of Formulas
As the correlation between X and Y weakens, the predicted value Y' for Zx = 1 will be Zy' = r < 1, and the predicted value X' for Zy = 1 will be Zx' < 1. The regression lines predicting Y' from X and X' from Y diverge with decreasing correlation until, at r = 0.0, they are perpendicular: horizontal and vertical lines passing through the means of Y and X respectively. This can lead to regression artifact...

Are Male Brains Bigger than Female Brains?

Regression Lines: 0 < r < 1

Cautions for Regression Data
Same as for correlations:
• Regression assumes linear relations
• Truncated ranges
• Outliers
• Heteroscedasticity
• Combining data from different groups
Also (if a correlational design):
1) Subjects are not randomly assigned.
2) No attempt (in correlational designs) to control variables.
3) Different levels of the IV are not contrasted while concurrently holding other variables constant.

Anscombe's quartet

Correlation Versus Ex Post Facto Designs
These are very similar, and you can convert one to the other: e.g., assign dummy coding to the categorical (nominal) variable (if there is one) and calculate a point-biserial correlation coefficient. Interpretation problems are not related to the statistical choice; rather, they are due to the design.

Death Sentences for Murder in Southern U.S.
Paradox
• Once convicted of murder, Whites are more likely to be sentenced to death than are Blacks.
• Yet for both Black and White victims, Black murderers are more likely to be sentenced to death. ?

Death Sentences for Murder in Southern U.S.

Explaining the Paradox
How does this help us explain the paradox?
• Victim's race is a confound.
• There is a tendency to murder members of one's own race.
• Whites are more likely to murder Whites, and this is treated as a more serious crime (in terms of likelihood of the death penalty).
• Relative risk ratio = (30/214) ÷ (6/112) ≈ 2.6 [Murderers are 2.6 times as likely to be sentenced to death for killing a White victim vs. a Black victim.]

Simpson's Paradox
Classify two groups with respect to the incidence of one attribute; if the groups are then separated into categories (subgroups), the group with the higher overall incidence can have a lower incidence within each (every) category (subgroup), and vice versa.

Simpson's Paradox - 2nd Example
There is a negative correlation between starting salary for people with economics degrees and the level of degree they have obtained; i.e., Ph.D.s in economics earn less than M.A.s, who earn less than B.A.s. Does this make sense? No! Break down these data by type of employment (industry, government, teaching): in every type of job there was a positive correlation between degree and starting salary.

Simpson's Paradox
• Employment selection is the confounding third variable:
• Teachers get paid less than government workers, who get paid less than those in private industry.
• People with higher degrees are more likely to end up teaching, and those with B.A.s are very unlikely to be teachers.
• These are examples of the danger of combining data from several distinct groups (with respect to the relation between two variables) when calculating correlations.
• How could sampling avoid these cases of Simpson's paradox?
• Use stratified sampling.
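The relative risk ratio on the "Explaining the Paradox" slide is a one-line calculation; a short script using the counts given there (30 of 214 murderers of White victims, 6 of 112 murderers of Black victims, sentenced to death):

```python
# Relative risk of a death sentence by victim's race, using the slide's counts.
death_white_victim, total_white_victim = 30, 214  # White-victim cases
death_black_victim, total_black_victim = 6, 112   # Black-victim cases

risk_white = death_white_victim / total_white_victim  # ~0.140
risk_black = death_black_victim / total_black_victim  # ~0.054

relative_risk = risk_white / risk_black
print(round(relative_risk, 1))  # -> 2.6
```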
• If equal numbers of people are sampled from the categories, the overall relationship will be an average of the relations in the subcategories.