Research Design & Analysis 2: Class 19
• Review the least squares regression line and the relation between the
regression coefficient and the correlation coefficient
• Changes in regression lines as correlation changes
• Philippe Rushton: the claim that women are less brainy than men
• Correlational and Ex Post Facto designs
• Cautions in interpreting causation
• Simpson's paradox
Mental Rotation: Shepard & Metzler
Calculating I.Q. and GPA Correlation
Formulas
If using standard scores and 2 variables [1 IV], the regression coefficient (b)
[or raw-score regression weight] = the standardized regression weight (β)
= the correlation coefficient (r):
b = β = r
Reminder: Implications of Formulas
If standard scores (z-scores) are plotted, the slope of the least squares
regression line = r
r = the change (in S.D. units) in Y' (the predicted value of Y) associated with
a change of 1 S.D. in X.
For perfect correlations (r = ±1.0)
1) Every participant who obtained a given value of X obtained one, and
only one, value of Y: there are no differences among Y scores for a
given X.
2) Y scores are perfectly predictable from X scores: the data points for a
given X are all on top of one another and all data points fall along the
regression line.
Regression Lines: r = 1
For Intermediate Correlations: 0 < r < 1
1) There are different values of Y for each X; however, these different Ys
are relatively close in value (the variability in Y associated with a given X
is less than the overall variability in Y).
2) Knowing X allows prediction of approximately what Y will be: data
points will fall near the regression line but not on it.
Regression Lines: 0 < r < 1
For Zero Correlation: r = 0
1) Y scores are as variable at any given value of X as in the entire
sample (i.e., across all Xs).
2) The best prediction of Y, regardless of X, will be the average of Y, and
there will be no regression solution.
Regression Lines: r = 0
Implications of Formulas
As the correlation grows weaker, Y' moves less in response to a given
change in X (the slope, b, approaches 0).
If r = 0, the best predictor of Y from X is the mean of Y, and the best
predictor of X from Y is the mean of X.
If r = ±1.0, the regression line from regressing Y on X and the
regression line from regressing X on Y are the same.
Implications of Formulas
As the correlation between X and Y weakens, the predicted value of Y' for
Zx = 1 will be Zy' = r < 1, and the predicted value of X' for Zy = 1 will be
Zx' = r < 1.
The regression lines predicting Y' from X and X' from Y diverge with
decreasing correlation until, at r = 0.0, they are perpendicular: horizontal
and vertical lines passing through the means of Y and X, respectively.
This can lead to a regression artifact (regression toward the mean)...
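A small simulation can show both points (the data are invented and the
population correlation is set near .5 purely for illustration): a case 1 S.D.
above the mean on X is predicted to be only r S.D.s above the mean on Y,
and the two regression lines coincide only when |r| = 1.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(size=500) * np.sqrt(1 - 0.5**2)   # population r about .5

r = np.corrcoef(x, y)[0, 1]
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

slope_y_on_x = np.polyfit(zx, zy, 1)[0]   # Zy' = r * Zx  (slope r in the zx-zy plane)
slope_x_on_y = np.polyfit(zy, zx, 1)[0]   # Zx' = r * Zy  (slope 1/r when drawn in that same plane)

# A case 1 S.D. above the mean on X is predicted to be only r S.D.s above on Y:
print(round(r, 3), round(slope_y_on_x, 3), round(slope_x_on_y, 3))
# As r approaches 0, both slopes approach 0 and the best prediction is the mean (z = 0).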
Are Male Brains Bigger than Female Brains?
Regression Lines: 0 < r < 1
Cautions for Regression Data
Same as for correlations:
• Regression assumes linear relations
• Truncated ranges
• Outliers
• Heteroscedasticity
• Combining data from different groups
Also (if a correlational design):
1) Subjects are not randomly assigned;
2) No attempt is made to control variables;
3) Different levels of the IV are not contrasted while concurrently holding other
variables constant.
Anscombe’s quartet
Correlational Versus Ex Post Facto Designs
These are very similar, and you can convert one to the other,
e.g., assign dummy coding to the categorical (nominal) variable (if
there is one) and calculate a point-biserial correlation coefficient, as in
the sketch below.
Interpretation problems are due not to the statistical choice but to the
design.
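A minimal sketch of that dummy-coding conversion (the group and score data
are hypothetical and the names illustrative): the point-biserial correlation
is simply Pearson's r computed with a 0/1 group variable.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group = np.repeat([0, 1], 50)                        # dummy-coded nominal IV (0 = group A, 1 = group B)
score = rng.normal(10, 2, size=100) + 1.5 * group    # hypothetical DV with a group difference

r_pearson = stats.pearsonr(group, score)[0]
r_pb = stats.pointbiserialr(group, score)[0]
print(round(r_pearson, 4), round(r_pb, 4))           # identical values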
Death Sentences for Murder in Southern U.S.
Paradox
• Once convicted of murder, Whites are more likely to be sentenced to
death than are Blacks
• Yet for both Black and White victims, Black murderers are more likely
to be sentenced to death.
Death Sentences for Murder in Southern U.S.
Explaining the Paradox
How does this help us explain the paradox?
• Victim’s race is a confound
• Tendency to murder members of own race.
• Whites are more likely to murder Whites, and this is treated as a more
serious crime (in terms of the likelihood of the death penalty)
• Relative risk ratio = (30/214) ÷ (6/112) = 2.6
[Murderers are 2.6 times as likely to be sentenced to death for killing a
White victim as for killing a Black victim.]
Simpson's Paradox
Classify two groups with respect to the incidence of one attribute;
if the groups are then separated into categories (subgroups),
the group with the higher overall incidence can have the lower incidence
within each (every) category (subgroup),
and vice versa.
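A minimal numeric sketch of exactly this reversal (the counts below are
invented for illustration, not the class data):

# Each group: [(cases with the attribute, total n) for subgroup 1, subgroup 2]
counts = {
    "Group A": [(8, 10), (20, 100)],
    "Group B": [(70, 100), (1, 10)],
}

for group, cells in counts.items():
    sub_rates = [s / n for s, n in cells]
    overall = sum(s for s, _ in cells) / sum(n for _, n in cells)
    print(group, [round(v, 2) for v in sub_rates], round(overall, 2))

# Group A [0.8, 0.2] 0.25   <- higher incidence within BOTH subgroups...
# Group B [0.7, 0.1] 0.65   <- ...but lower incidence overall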
Simpson's Paradox - 2nd Example
There is a negative correlation between starting salary for people with
economics degrees and the level of degree they have obtained,
i.e., Ph.D.s in economics earn less than M.A.s, who earn less than
B.A.s.
Does this make sense? No!
Break down these data in terms of the type of employment (industry,
government, teaching).
In every type of job there was a positive correlation between degree and
starting salary.
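A minimal sketch of this pattern with invented numbers (the base pay levels,
effect sizes, and degree mixes below are made up for illustration): within
each job type the degree-salary correlation is positive, but pooling the
three types flips the sign because the lowest-paying type employs the most
Ph.D.s.

import numpy as np

rng = np.random.default_rng(4)

def job_type(n, base_pay, degree_probs):
    """One employment category: within it, a higher degree means a higher salary."""
    degree = rng.choice([1, 2, 3], size=n, p=degree_probs)   # 1 = B.A., 2 = M.A., 3 = Ph.D.
    salary = base_pay + 4 * degree + rng.normal(0, 2, n)     # positive relation within the category
    return degree, salary

jobs = {
    "teaching":   job_type(100, 30, [0.1, 0.2, 0.7]),   # lowest pay, mostly Ph.D.s
    "government": job_type(100, 45, [0.3, 0.4, 0.3]),
    "industry":   job_type(100, 60, [0.7, 0.2, 0.1]),   # highest pay, mostly B.A.s
}

for name, (deg, sal) in jobs.items():
    print(name, round(np.corrcoef(deg, sal)[0, 1], 2))       # positive within every job type

deg_all = np.concatenate([d for d, _ in jobs.values()])
sal_all = np.concatenate([s for _, s in jobs.values()])
print("pooled", round(np.corrcoef(deg_all, sal_all)[0, 1], 2))  # negative when the groups are combined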
Simpson's Paradox
• Employment selection is the confounding third variable:
• Teachers are paid less than government workers, who are paid less
than those in private industry.
• People with higher degrees are more likely to end up teaching, and
those with B.A.s are very unlikely to be teachers.
• These are examples of the danger of combining data from several distinct
groups (with respect to the relation between two variables) when
calculating correlations.
• How could sampling avoid these cases of Simpson’s paradox?
• Use stratified sampling.
• If equal numbers of people are sampled from the categories, the
overall relationship will be an average of the relations in the
subcategories.
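A minimal sketch of equal-n stratified sampling, assuming a pandas
DataFrame with hypothetical "job_type", "degree", and "salary" columns
(the column names and sample size are illustrative, not from the class data):

import pandas as pd

def stratified_sample(df, stratum_col, n_per_stratum, seed=0):
    """Draw the same number of cases from every category of the stratifying variable."""
    return df.groupby(stratum_col).sample(n=n_per_stratum, random_state=seed)

# e.g., equal numbers from each job type before examining the degree-salary relation:
# balanced = stratified_sample(df, "job_type", n_per_stratum=50)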